taXaminer

Whole genome shotgun data are a great resource for reconstructing the genome of any target species. At the same time, they are a rich source of contaminations, i.e. reads from taxa other than the one you had in mind. Possible contaminations are:

taxa living in close association with your target species. Most prominent examples are bacteria of the gut or skin microbiome, or of symbiotic partners.
reads representing the genome of the person who handled the data, i.e. human contamination
contaminated reagents used for extracting or sequencing the DNA
contamination in the sequencer

Such contaminations can be detected on two main levels: on the level of the genome assembly, e.g. with the help of BlobTools, or on the level of the gene set. We will focus on the latter, since it allows us to investigate the nature of the contaminations.

We will use the software taXaminer (Fig. 1) to characterize the gene set of C. hominis. In addition to performing the taxonomic assignment using DIAMOND searches against the NCBI nrProt database, taXaminer determines values for a number of other gene features, such as read coverage, standard deviation of read coverage from the contig mean, gene length, position of the gene (terminal in contig or not), etc. taXaminer then runs a PCA on these feature vectors and returns, among other information, a 3D plot of the taxonomically labeled PCA in HTML format. To make full use of the taXaminer output, we have developed the tX-dashboard, which you can install locally on your computer.

taXaminer analysis

What you need

The genome sequence in fasta format. We will be using /home/ubuntu/Share/Assemblies/crypto_BCM2021_v2.fasta

The annotation file in gff3 format. We will be using /home/ubuntu/Share/Analysis/taxaminer/results/metaeuk/Crypto_Metaeuk.sorted.gff3

optionally: the protein fasta file. If it is not provided, taXaminer will extract the protein sequences using the gff file.
optionally: read mapping information, one BAM file per library

What you get

a per-base coverage file (only if mapping information was provided)
a taxonomic assignment for each gene based on a modified version of the DIAMOND Last Common Ancestor algorithm¹⁾
a file with the feature vectors for each gene in CSV format
an HTML file with the PCA as a 3D plotly plot

Running taXaminer

Check for the presence of taXaminer on your system. To do so:
1. activate the conda environment:
```
conda activate /home/ubuntu/anaconda3/envs/taxaminer
```
2. issue the following command to test if you can run taxaminer:
```
taxaminer.run -h 
```
  1. if it is installed, proceed with the next steps
  2. if it is not:
    InstallTaxaminer
    
    install taXaminer from the GitHub page according to the guidelines.
    
    download the NCBI non-redundant protein database: wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
Optionally: install the taXaminer-dashboard on your local computer²⁾
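The presence check from step 2 can also be scripted so that it reports a clear result instead of printing the help text. This is only a minimal sketch: it assumes the conda environment has already been activated and that the executable is called `taxaminer.run`, as in the command above. The variable name `status` is chosen here for illustration.

```shell
# Look for taxaminer.run on the PATH of the active environment
if command -v taxaminer.run >/dev/null 2>&1; then
    status="installed"
else
    status="missing"
fi
echo "taxaminer.run: $status"
```

If the script reports `missing`, continue with the installation steps above.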
create a working directory for the analysis:
```
mkdir -p $HOME/Analysis/taxaminer
```
change into the working directory:
```
cd $HOME/Analysis/taxaminer
```
copy or soft-link the following files into the working directory
- The genome sequence in fasta format
- The genome annotation in gff3 format
- any read mapping information in BAM format
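
The linking step could look like the sketch below. It uses the assembly and annotation paths listed under "What you need". The BAM location is not given on this page, so that line is left commented out with a placeholder path that you would have to replace.

```shell
# Working directory created in the previous steps
workdir="$HOME/Analysis/taxaminer"
mkdir -p "$workdir"

# Soft-link the assembly and the annotation into the working directory
ln -sf /home/ubuntu/Share/Assemblies/crypto_BCM2021_v2.fasta "$workdir/"
ln -sf /home/ubuntu/Share/Analysis/taxaminer/results/metaeuk/Crypto_Metaeuk.sorted.gff3 "$workdir/"

# Optional: link your BAM files (placeholder path, adjust to your own data)
# ln -sf /path/to/your/mapping/library1.bam "$workdir/"
```

Soft links avoid duplicating the large input files. Use `cp` instead if you need independent copies.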

edit the config-script according to your needs

ConfigScriptExample

## Input and output options ##
# this section is the minimum information that is required and must be stated
fasta_path: "AddYourInputDataHere" # path to assembly FASTA
gff_path: "AddYourInputDataHere" # path to GFF
output_path: "AddYourInputDataHere" # directory to save results to
taxon_id: "AddYourInputDataHere" # NCBI Taxon ID of query species. Cryptosporidium parvum has the taxon id 5807

#############################################################
###### from here onwards, the info is optional ##############
## Coverage options ##
# state one of the following files to include coverage information
bam_path_1: "" # path to BAM file; omit to use default location in output directory
bam_path_2: ""
# to add further coverage sets duplicate the parameter you need and increase the number in the suffix

## Taxonomic assignment options ##
taxon_exclude: "TRUE" # exclude query taxon from taxonomic assignment [TRUE/FALSE]
assignment_mode: "exhaustive" # mode for taxonomic assignment [exhaustive/quick]; see Documentation for details

## PCA options ##
# gene descriptors to be used in the PCA; see Documentation for details on options
input_variables: "c_name,c_num_of_genes,c_len,c_genelenm,c_genelensd,g_len,g_lendev_c,g_abspos,g_terminal,c_cov,c_covsd,g_cov,g_covsd,g_covdev_c,c_pearson_r,g_pearson_r_o,g_pearson_r_c"

## Plot output options ##
update_plots: "FALSE" # only update the plots (use if you changed settings below) [TRUE/FALSE]
num_groups_plot: "25" # number of taxa to display in plot (taxa are automatically merged at higher ranks) [X/all]
merging_labels: [] # influence the merging of taxa; see Documentation for details on options
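
As an illustration, the required section of the template could be filled in from the shell as sketched below. The fasta and gff paths are the ones listed under "What you need". The output directory name `output` is an assumption made for this example. The taxon id 5807 is the Cryptosporidium parvum id quoted in the template, so replace it if your query species differs. All optional parameters are omitted here.

```shell
# Write a minimal config.yml into the current directory.
# output_path is a hypothetical location; taxon_id assumes C. parvum (5807).
cat > config.yml <<EOF
fasta_path: "/home/ubuntu/Share/Assemblies/crypto_BCM2021_v2.fasta" # path to assembly FASTA
gff_path: "/home/ubuntu/Share/Analysis/taxaminer/results/metaeuk/Crypto_Metaeuk.sorted.gff3" # path to GFF
output_path: "$HOME/Analysis/taxaminer/output" # directory to save results to
taxon_id: "5807" # NCBI Taxon ID of query species
EOF
```

Run this inside the working directory so that `config.yml` ends up next to the linked input files.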

activate the taxaminer conda environment unless you have already done so³⁾:
```
conda activate /home/ubuntu/anaconda3/envs/taxaminer
```
run taXaminer by issuing the following command⁴⁾
```
taxaminer.run config.yml
```
Make sure that you are either in the directory where config.yml is located, or provide the path to the file.
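
To catch a wrong path before the run starts, the call can be wrapped in a small helper, as in the sketch below. The function name `run_taxaminer` is made up for this example and is not part of taXaminer. It only checks that the config file exists and then calls `taxaminer.run` exactly as shown above.

```shell
# Hypothetical helper: verify the config file exists, then start taXaminer
run_taxaminer() {
    config="${1:-config.yml}"   # default to config.yml in the current directory
    if [ ! -f "$config" ]; then
        echo "Config file not found: $config" >&2
        return 1
    fi
    taxaminer.run "$config"
}
```

It could be called as `run_taxaminer "$HOME/Analysis/taxaminer/config.yml"`, or without an argument from inside the working directory.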

Once the taXaminer run has completed, you can download the results to your local computer. You can then either open 3D_plot.html directly in a web browser, or use the taXaminer-dashboard to first import the output folder and then load the data.

Are the results in line with your expectations?
Do you find anything suspicious?

ForTheImpatient

each dot in the PCA represents one protein coding gene as it was annotated by MetaEuk2 in the Cryptosporidium parvum genome assembly CryPa_BCM2021a.fasta
the position of the dot is determined by a multidimensional feature vector capturing for each gene information such as position in the assembly, word frequency, GC content, number of genes on the contig, etc.
the color code informs about the taxonomic assignment of the gene
the plot is interactive such that detailed information can be retrieved and downloaded for each gene

Check out the taXaminer-dashboard: example input

¹⁾

We modified it slightly because we have a strong prior about which organism the gene should come from

²⁾

This is not necessary, but it is more fun to look at the data via the dashboard

³⁾

It is a good idea to first deactivate the environment that you are currently in

⁴⁾

make sure that the correct conda environment is active
