====== taXaminer ======
Whole genome shotgun data are a great resource for reconstructing the genome of any target species. At the same time, they are a rich source of contamination, i.e. reads from taxa other than the one you had in mind. Possible sources of contamination are
* taxa living in close association with your target species. Most prominent examples are bacteria of the gut or skin microbiome, or of symbiotic partners.
* reads representing the genome of the person who handled the data, i.e. human contamination
* contaminated reagents used for extracting or sequencing the DNA
* contamination in the sequencer
Contamination can be detected on two main levels, either on the **level of the genome assembly**, e.g. with the help of BlobTools, or on the **gene set level**. We will focus on the latter, since it allows us to investigate the nature of the contamination.
We will use the software **taXaminer** (Fig. {{ref>taXaminer}}) to characterize the gene set of //C. hominis//. In addition to performing the taxonomic assignment using **Diamond searches against the NCBI nrProt database**, taXaminer determines **values for a number of other gene features**, such as read coverage, standard deviation of read coverage from contig mean, gene length, position of the gene (terminal in contig or not), etc. **taXaminer then runs a PCA** on these feature vectors and returns, among other information, a **3D plot of the taxonomically labeled PCA** in HTML format. To make full use of the taXaminer output, we have developed the tX-dashboard, which you can install locally on your computer.
{{ :general:bioseqanalysis:images:taxaminer-cremanei.gif?600 |}}
Result of a taXaminer analysis on the gene set of a nematode. Each dot in the PCA represents a gene that is annotated in the nematode genome. Each gene is annotated with a number of features such as 'number of genes on contig', 'position on contig', 'GC content' and the like, and a PCA was performed to project the multi-dimensional vectors into 3D space. The dot color represents the taxonomic assignment. If you click on the image, the GIF will be animated.
===== taXaminer analysis =====
**What you need**
- The genome sequence in fasta format. We will be using ''/home/ubuntu/Share/Assemblies/crypto_BCM2021_v2.fasta''
- The annotation file in gff3 format. We will be using ''/home/ubuntu/Share/Analysis/taxaminer/results/metaeuk/Crypto_Metaeuk.sorted.gff3''
- **optionally:** the protein fasta file. :!: taXaminer will extract the protein sequences from the gff file if not provided.
- **optionally:** read mapping information: One BAM file per library ''/home/ubuntu/fritz/sv-detection/for_ingo/illumina_pairs.mapped.sort.bam''
- **optionally** a local installation of the [[https://github.com/bionf/taxaminer-dashboard|taxaminer-dashboard]].
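Before starting, it can help to confirm that the input files listed above actually exist on your machine. The following is a minimal sketch, not part of taXaminer: the function name ''check_inputs'' is made up for this page, and the two paths are simply the ones given in the list.

```shell
# Report which of the given input files are present and non-empty.
# The exit status is the number of missing or empty files (0 = all good).
# Note: check_inputs is a hypothetical helper, not a taXaminer command.
check_inputs() {
    local missing=0
    local f
    for f in "$@"; do
        if [ -s "$f" ]; then
            echo "found:   $f"
        else
            echo "missing: $f"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Check the two mandatory inputs used in this tutorial.
check_inputs \
    /home/ubuntu/Share/Assemblies/crypto_BCM2021_v2.fasta \
    /home/ubuntu/Share/Analysis/taxaminer/results/metaeuk/Crypto_Metaeuk.sorted.gff3 \
    || echo "at least one input is missing or empty"
```

Optional inputs, such as the BAM file, can be appended to the same call as further arguments.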
**What you get**
- a taxonomic assignment for each gene based on a modified version of the DIAMOND Last Common Ancestor algorithm((We modified it a bit because we have a strong prior on which organism the gene should come from))
- a file with feature vectors for each gene in CSV
- an HTML file with the PCA as a 3D plotly plot
- a file with the proteins encoded by the annotated genes
- a text file with the diamond hits
==== Running taXaminer ====
- Check for the presence of taXaminer on your system. To do so:
- activate the conda environment: ''conda activate /home/ubuntu/miniconda3/envs/taxaminer''
- issue the following command to test if you can run taXaminer: ''taxaminer.run -h''
- if it is installed, proceed with the next steps
- **if it is not**:
* install taXaminer from the [[https://github.com/BIONF/taXaminer|GitHub page]] according to the guidelines.
* download the non-redundant protein database from NCBI: ''wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz''
- Optionally, install the [[https://github.com/BIONF/taxaminer-dashboard/|taXaminer-dashboard]] on your local computer((This is not necessary, but it is more fun to look at the data via the dashboard =) ))
- create a working directory for the analysis: ''mkdir -p $HOME/Analysis/taxaminer''
- change into the working directory: ''cd $HOME/Analysis/taxaminer''
- copy or soft-link the following files into the working directory
* The genome sequence in fasta format
* The genome annotation in gff3 format
* any read mapping information in BAM format
- we will be using the UniRef50 database for the Diamond search: ''$HOME/Share/DBs/uniref50/db.dmnd''
- edit the config file (''config.yml'') according to your needs
## Input and output options ##
# this section is the minimum information that is required and must be stated
fasta_path: "AddYourInputDataHere" # path to assembly FASTA
gff_path: "AddYourInputDataHere" # path to GFF
output_path: "AddYourInputDataHere" # directory to save results to
taxon_id: "AddYourInputDataHere" # NCBI Taxon ID of query species. Cryptosporidium parvum has the taxon id 5807
database_path: "$HOME/Share/DBs/uniref50/db.dmnd"
#############################################################
###### from here onwards, the info is optional ##############
## Coverage options ##
# state one of the following files to include coverage information
bam_path_1: "" # path to BAM file; omit to use default location in output directory
bam_path_2: ""
# to add further coverage sets duplicate the parameter you need and increase the number in the suffix
## Taxonomic assignment options ##
taxon_exclude: "TRUE" # exclude query taxon from taxonomic assignment [TRUE/FALSE]
assignment_mode: "exhaustive" # mode for taxonomic assignment [exhaustive/quick]; see Documentation for details
## PCA options ##
# gene descriptors to be used in the PCA; see Documentation for details on options
input_variables: "c_name,c_num_of_genes,c_len,c_genelenm,c_genelensd,g_len,g_lendev_c,g_abspos,g_terminal,c_cov,c_covsd,g_cov,g_covsd,g_covdev_c,c_pearson_r,g_pearson_r_o,g_pearson_r_c"
## Plot output options ##
update_plots: "FALSE" # only update the plots (use if you changed settings below) [TRUE/FALSE]
num_groups_plot: "25" # number of taxa to display in plot (taxa are automatically merged at higher ranks) [X/all]
merging_labels: [] # influence the merging of taxa; see Documentation for details on options
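If you prefer to generate the config from the command line, the required section of the template above can be written with a here-document. This is only a sketch under a few assumptions: the function name ''write_min_config'' and the ''output'' subdirectory are invented for illustration, the input paths are the ones used on this page, and it relies on the template's own statement that everything after the first section is optional.

```shell
# Write a config containing only the required keys from the template.
# Note: write_min_config is a hypothetical helper, not a taXaminer command.
write_min_config() {
    local workdir="$1"
    mkdir -p "$workdir"
    # The heredoc delimiter is unquoted, so the shell expands ${workdir}
    # and ${HOME} while writing; the file then contains absolute paths.
    cat > "$workdir/config.yml" <<EOF
fasta_path: "/home/ubuntu/Share/Assemblies/crypto_BCM2021_v2.fasta"
gff_path: "/home/ubuntu/Share/Analysis/taxaminer/results/metaeuk/Crypto_Metaeuk.sorted.gff3"
output_path: "${workdir}/output"   # assumption: any writable directory will do
taxon_id: "5807"                   # Cryptosporidium parvum
database_path: "${HOME}/Share/DBs/uniref50/db.dmnd"
EOF
}

# Demonstration in a throwaway directory, so no existing config is overwritten.
demo_dir=$(mktemp -d)
write_min_config "$demo_dir"
cat "$demo_dir/config.yml"
```

For the actual analysis you would call ''write_min_config "$HOME/Analysis/taxaminer"'' and then add any optional settings (BAM paths, PCA variables, plot options) by hand.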
- run taXaminer by issuing the following command((make sure that the correct conda environment is active)): ''taxaminer.run config.yml'' :!: Make sure that you are either in the directory where the config.yml is located, or provide the path to it.
Once the taXaminer run has completed, you can {{:physaliacg:2024:crypa_bcm2021-renamed_metaeuk.tar.gz |download the information}} to your local computer. Then you can either open the 3D_plot.html directly in a web browser, or use the [[https://github.com/BIONF/taXaminer-dashboard|taXaminer-dashboard]] to first import the output folder and then load the data.
- :?: Are the results in line with your expectations?
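The preparation steps that precede the run command (linking the inputs into the working directory and checking that the executable is available) can also be scripted. The sketch below is one possible way to do this; ''link_inputs'' and ''have_taxaminer'' are hypothetical helper names, and only standard shell tools are used.

```shell
# Soft-link existing input files into a working directory; skip missing ones.
# Pass absolute source paths, otherwise the links may not resolve.
# Note: link_inputs and have_taxaminer are hypothetical helpers.
link_inputs() {
    local workdir="$1"
    shift
    mkdir -p "$workdir"
    local src
    for src in "$@"; do
        if [ -e "$src" ]; then
            ln -sf "$src" "$workdir/$(basename "$src")"
        else
            echo "skipping missing file: $src" >&2
        fi
    done
}

# Succeeds if the taxaminer.run executable can be found on the PATH.
have_taxaminer() {
    command -v taxaminer.run >/dev/null 2>&1
}

if have_taxaminer; then
    echo "taxaminer.run found"
else
    echo "taxaminer.run not found; activate the conda environment first"
fi
```

A call such as ''link_inputs "$HOME/Analysis/taxaminer" /home/ubuntu/Share/Assemblies/crypto_BCM2021_v2.fasta'' creates the working directory if needed and places a soft link to the assembly inside it.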
- :?: Do you find anything suspicious?
{{ :physaliacg:2024:chrpa-bcm2021_metaeuk.mp4 |}}
* each dot in the PCA represents one protein coding gene as it was annotated by MetaEuk2 in the Cryptosporidium parvum genome assembly CryPa_BCM2021a.fasta
* the position of the dot is determined by a multidimensional feature vector capturing for each gene information such as position in the assembly, word frequency, GC content, number of genes on contig, etc.
* the color code informs about the taxonomic assignment of the gene
* the plot is interactive such that detailed information can be retrieved and downloaded for each gene
Check out the [[https://github.com/BIONF/taXaminer-dashboard|taXaminer-dashboard]]: {{ :physaliacg:2024:crypa_bcm2021-renamed_metaeuk.tar.gz |example input}}
----
* [[:physaliacg|Physalia main page]]
* [[:general:bioseqanalysis:genesetanalysis|Physalia gene set characterization page]]
* [[:general:bioseqanalysis:genesetanalysis:busco|Proceed with BUSCO]]