====== taXaminer ======
Whole genome shotgun data are a great resource for reconstructing the genome of any target species. At the same time, they are a rich source of contamination, i.e. reads from taxa other than the one you had in mind. Possible sources of contamination are
   * taxa living in close association with your target species. The most prominent examples are bacteria of the gut or skin microbiome, or symbiotic partners.
   * reads representing the genome of the person who handled the sample, i.e. human contamination
   * contaminated reagents used for extracting or sequencing the DNA
   * contamination in the sequencer
Such contaminations can be detected at two main levels: either at the **level of the genome assembly**, e.g. with the help of BlobTools, or at the **gene set level**. We will focus on the latter, since it allows us to investigate the nature of the contaminations.

<WRAP round box>
We will use the software **taXaminer** (Fig. {{ref>taXaminer}}) to characterize the gene set of //C. hominis//. In addition to performing the taxonomic assignment using **DIAMOND searches against the NCBI nr protein database**, taXaminer determines **values for a number of other gene features**, such as read coverage, standard deviation of read coverage from the contig mean, gene length, position of the gene (terminal in contig or not), etc. **taXaminer then runs a PCA** on these feature vectors and returns, among other information, a **3D plot of the taxonomically labeled PCA** in HTML format. To make full use of the taXaminer output, we have developed the taXaminer-dashboard, which you can install locally on your computer.
</WRAP>

<figure taXaminer>
{{ :general:bioseqanalysis:images:taxaminer-cremanei.gif?600 |}}
<caption><fs>Result of a taXaminer analysis on the gene set of a nematode. Each dot in the PCA represents a gene that is annotated in the nematode genome. Each gene is described by a number of features such as 'number of genes on contig', 'position on contig', 'GC content' and the like, and a PCA was performed to project the multi-dimensional vectors into 3D space. The dot color represents the taxonomic assignment. Click on the image to animate the GIF.</fs></caption>
</figure>
===== taXaminer analysis =====
**What you need**
  - The genome sequence in FASTA format. We will be using ''/home/ubuntu/Share/Assemblies/crypto_BCM2021_v2.fasta''
  - The annotation file in GFF3 format. We will be using ''/home/ubuntu/Share/Analysis/taxaminer/results/metaeuk/Crypto_Metaeuk.sorted.gff3''
  - **Optionally:** the protein FASTA file. :!: If it is not provided, taXaminer will extract the protein sequences based on the GFF file.
  - **Optionally:** read mapping information, one BAM file per library. We will be using ''/home/ubuntu/fritz/sv-detection/for_ingo/illumina_pairs.mapped.sort.bam''
  - **Optionally:** a local installation of the [[https://github.com/bionf/taxaminer-dashboard|taXaminer-dashboard]].
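
A quick sanity check of the input files before starting can save a failed run. The following is only a minimal sketch and not part of taXaminer: the helper functions ''check_inputs'', ''count_contigs'' and ''count_genes'' are hypothetical names introduced here, and the gene count assumes that the GFF3 file marks genes with ''gene'' in its third column. The paths are the ones listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking the taXaminer inputs; not part of taXaminer.

# check_inputs FILE...: report each file that is missing or empty.
# Returns 1 if at least one file fails the check, 0 otherwise.
check_inputs() {
    local rc=0 path
    for path in "$@"; do
        if [ -s "$path" ]; then
            echo "ok:      $path"
        else
            echo "missing: $path" >&2
            rc=1
        fi
    done
    return "$rc"
}

# count_contigs FASTA: number of sequences (header lines starting with '>').
count_contigs() {
    grep -c '^>' "$1"
}

# count_genes GFF3: number of non-comment rows whose third column is 'gene'.
count_genes() {
    awk -F'\t' '!/^#/ && $3 == "gene" { n++ } END { print n + 0 }' "$1"
}

GENOME=/home/ubuntu/Share/Assemblies/crypto_BCM2021_v2.fasta
GFF=/home/ubuntu/Share/Analysis/taxaminer/results/metaeuk/Crypto_Metaeuk.sorted.gff3
BAM=/home/ubuntu/fritz/sv-detection/for_ingo/illumina_pairs.mapped.sort.bam

if check_inputs "$GENOME" "$GFF" "$BAM"; then
    echo "contigs: $(count_contigs "$GENOME")"
    echo "genes:   $(count_genes "$GFF")"
else
    echo "At least one input file is missing or empty." >&2
fi
```

Since each dot in the PCA plot represents one annotated gene, the number printed by ''count_genes'' gives a rough idea of how many dots to expect.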

**What you get**
  - a taxonomic assignment for each gene based on a modified version of the DIAMOND lowest common ancestor (LCA) algorithm((We modified it slightly because we have a strong prior on which organism the gene should come from))
  - a CSV file with the feature vectors for each gene
  - an HTML file with the PCA as a 3D plotly plot
  - a file with the proteins encoded by the annotated genes
  - a text file with the DIAMOND hits
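
A first look at the per-gene CSV file can be taken on the command line, e.g. by tallying how many genes received each taxonomic label. The sketch below is a generic helper and makes several assumptions: ''count_by_column'' is a hypothetical function name, the file and column names in the usage comment are placeholders (check the header of your own output file), and the naive comma split only works if no field contains a comma inside quotes.

```shell
#!/usr/bin/env bash
# count_by_column FILE COLUMN: tally the values of one named CSV column.
# Assumes a header line and no commas inside quoted fields (naive split).
# Prints "<count><TAB><value>", most frequent value first.
count_by_column() {
    awk -F',' -v col="$2" '
        NR == 1 {
            # locate the requested column in the header line
            for (i = 1; i <= NF; i++) { gsub(/"/, "", $i); if ($i == col) idx = i }
            if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
            next
        }
        { v = $idx; gsub(/"/, "", v); n[v]++ }
        END { for (k in n) print n[k] "\t" k }
    ' "$1" | sort -k1,1nr
}

# Usage (placeholder file and column names, adjust to your output):
#   count_by_column gene_table.csv taxon_label
```

A label that is carried by only a handful of genes and is taxonomically far from the target species is a natural candidate for closer inspection in the 3D plot.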

==== Running taXaminer ====
  - Check for the presence of taXaminer on your system. To do so:
    - activate the conda environment: <WRAP>
<code>conda activate /home/ubuntu/miniconda3/envs/taxaminer</code>
</WRAP>
    - issue the following command to test if you can run taxaminer: <WRAP>
<code>taxaminer.run -h </code>
</WRAP>
      - if it is installed, proceed with the next steps
      - **if it is not**:<WRAP>
<hidden InstallTaxaminer>
  * install taXaminer from the [[https://github.com/BIONF/taXaminer|GitHub page]] according to the guidelines.
  * download the non-redundant protein database (nr) from NCBI: ''wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz''
</hidden>
</WRAP>
  - <wrap important>Optionally</wrap> install the [[https://github.com/BIONF/taxaminer-dashboard/|taXaminer-dashboard]] on your local computer((This is not necessary, but it is more fun to look at the data via the dashboard =) ))
  - create a working directory for the analysis: <code>mkdir -p $HOME/Analysis/taxaminer</code>
  - change into the working directory: <code>cd $HOME/Analysis/taxaminer</code>
  - copy or soft-link the following files into the working directory
    * the genome sequence in FASTA format
    * the genome annotation in GFF3 format
    * any read mapping information in BAM format
  - we will be using the UniRef50 database for the DIAMOND search: ''$HOME/Share/DBs/uniref50/db.dmnd''
  - edit the config file according to your needs<WRAP>
<hidden ConfigScriptExample>
<code>
## Input and output options ##
# this section is the minimum information that is required and must be stated
fasta_path: "AddYourInputDataHere" # path to assembly FASTA
gff_path: "AddYourInputDataHere" # path to GFF
output_path: "AddYourInputDataHere" # directory to save results to
taxon_id: "AddYourInputDataHere" # NCBI Taxon ID of query species. Cryptosporidium parvum has the taxon id 5807
database_path: "$HOME/Share/DBs/uniref50/db.dmnd"

#############################################################
###### from here onwards, the info is optional ##############
## Coverage options ##
# state one of the following files to include coverage information
bam_path_1: "" # path to BAM file; omit to use default location in output directory
bam_path_2: ""
# to add further coverage sets duplicate the parameter you need and increase the number in the suffix

## Taxonomic assignment options ##
taxon_exclude: "TRUE" # exclude query taxon from taxonomic assignment [TRUE/FALSE]
assignment_mode: "exhaustive" # mode for taxonomic assignment [exhaustive/quick]; see Documentation for details

## PCA options ##
# gene descriptors to be used in the PCA; see Documentation for details on options
input_variables: "c_name,c_num_of_genes,c_len,c_genelenm,c_genelensd,g_len,g_lendev_c,g_abspos,g_terminal,c_cov,c_covsd,g_cov,g_covsd,g_covdev_c,c_pearson_r,g_pearson_r_o,g_pearson_r_c"

## Plot output options ##
update_plots: "FALSE" # only update the plots (use if you changed settings below) [TRUE/FALSE]
num_groups_plot: "25" # number of taxa to display in plot (taxa are automatically merged at higher ranks) [X/all]
merging_labels: [] # influence the merging of taxa; see Documentation for details on options
</code>
</hidden>
</WRAP>
  - run taXaminer by issuing the following command((make sure that the correct conda environment is active))<WRAP><code>taxaminer.run config.yml</code>:!: Make sure that you are either in the directory where config.yml is located, or provide the path to it.
</WRAP>
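
The preparation steps above (working directory, soft links, config file) can be sketched as two small shell functions. This is an illustration under stated assumptions, not an official taXaminer helper: ''prepare_workdir'' and ''write_minimal_config'' are hypothetical names, and the generated config contains only the keys that the example config marks as required, so add the optional sections as needed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of taXaminer.

# prepare_workdir DIR FILE...: create DIR and soft-link each existing FILE into it.
# Paths are made absolute first so that the links resolve from inside DIR.
prepare_workdir() {
    local workdir=$1 src abs
    shift
    mkdir -p "$workdir" || return 1
    for src in "$@"; do
        if [ -e "$src" ]; then
            abs="$(cd "$(dirname "$src")" && pwd)/$(basename "$src")"
            ln -sf "$abs" "$workdir/"
        else
            echo "skipping missing file: $src" >&2
        fi
    done
}

# write_minimal_config OUT FASTA GFF OUTDIR TAXID DB: write the required keys only.
write_minimal_config() {
    cat > "$1" <<EOF
fasta_path: "$2" # path to assembly FASTA
gff_path: "$3" # path to GFF
output_path: "$4" # directory to save results to
taxon_id: "$5" # NCBI Taxon ID of query species
database_path: "$6"
EOF
}

# Example with the paths used on this page (adjust to your setup):
#   W="$HOME/Analysis/taxaminer"
#   prepare_workdir "$W" /home/ubuntu/Share/Assemblies/crypto_BCM2021_v2.fasta \
#       /home/ubuntu/Share/Analysis/taxaminer/results/metaeuk/Crypto_Metaeuk.sorted.gff3
#   write_minimal_config "$W/config.yml" "$W/crypto_BCM2021_v2.fasta" \
#       "$W/Crypto_Metaeuk.sorted.gff3" "$W/results" 5807 "$HOME/Share/DBs/uniref50/db.dmnd"
#   cd "$W" && taxaminer.run config.yml
```

The shell expands ''$HOME'' before the config file is written, so the file ends up containing absolute paths.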
Once the taXaminer run has completed, you can {{:physaliacg:2024:crypa_bcm2021-renamed_metaeuk.tar.gz |download the results}} to your local computer. You can then either open ''3D_plot.html'' directly in a web browser, or use the [[https://github.com/BIONF/taXaminer-dashboard|taXaminer-dashboard]] to first import the output folder and then load the data.
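
On the local computer, unpacking the downloaded archive and locating the plot could look as follows. This is a minimal sketch: ''unpack_results'' is a hypothetical helper name, and it assumes that the archive is a gzip-compressed tar file containing a ''3D_plot.html'' somewhere below its top-level folder.

```shell
#!/usr/bin/env bash
# unpack_results ARCHIVE DEST: extract a .tar.gz archive into DEST and print
# the path of every 3D_plot.html found below it.
unpack_results() {
    mkdir -p "$2" && tar -xzf "$1" -C "$2" && find "$2" -type f -name '3D_plot.html'
}

# Example, assuming the archive was saved under its original name:
#   unpack_results crypa_bcm2021-renamed_metaeuk.tar.gz taxaminer_results
# Open the printed file in a web browser, or import the extracted folder
# into the taXaminer-dashboard.
```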

  - :?: Are the results in line with your expectations?
  - :?: Do you find anything suspicious?

<hidden ForTheImpatient>
{{ :physaliacg:2024:chrpa-bcm2021_metaeuk.mp4 |}}
  * each dot in the PCA represents one protein-coding gene as annotated by MetaEuk2 in the //Cryptosporidium parvum// genome assembly CryPa_BCM2021a.fasta
  * the position of the dot is determined by a multidimensional feature vector capturing, for each gene, information such as position in the assembly, word frequency, GC content, number of genes on the contig, etc.
  * the color code informs about the taxonomic assignment of the gene
  * the plot is interactive such that detailed information can be retrieved and downloaded for each gene
<wrap important></wrap>Check out the [[https://github.com/BIONF/taXaminer-dashboard|taXaminer-dashboard]]: {{ :physaliacg:2024:crypa_bcm2021-renamed_metaeuk.tar.gz |example input}}  
</hidden>
----

<WRAP tabs>
  * [[:physaliacg|Physalia main page]]
  * [[:general:bioseqanalysis:genesetanalysis|Physalia gene set characterization page]]
  * [[:general:bioseqanalysis:genesetanalysis:busco|Proceed with BUSCO]]
</WRAP>