TOGA is a method that integrates gene annotation, inferring orthologs, and classifying genes as intact or lost. We want to use TOGA to detect genetic innovations in Lasallia pustulata. Therefore, we are searching for genes that are present in Lasallia hispanica but lost in Lasallia pustulata.
conda create --name toga python=3.7.3 conda activate toga conda install -c bioconda nextflow conda install -c anaconda biopython
git clone https://github.com/hillerlab/TOGA.git cd TOGA python3 -m pip install -r requirements.txt --user ./configure.sh # ignore the warnings ./run_test.sh micro
Your output should look something like this in the end:
#### STEP 11: Cleanup: merge parallel steps output files Saved results to /home/ecoevo01/TOGA/TOGA/micro_test_out Done! Estimated time: 0:01:40.027813 Program finished with exit code 0 JH567521 463144 506100 ENST00000262455.1169 1000 - 463144 506100 0,200,255 8 102,103,142,112,117,58,116,185,0,1982,30295,31351,36911,38566,41322,42771, JH567521 395878 449234 ENST00000259400.1169 1000 + 395878 449234 0,0,200 7 123,66,226,116,51,87,240, 0,11871,38544,45802,45994,52305,53116, JH567521 299723 336583 ENST00000618101.1169 1000 + 299723 336583 0,0,200 7 28,923,130,173,200,179,248, 0,1256,6085,6677,19146,21311,36612, Success!
TOGA uses at least 3 different input data:
We need to create the required inputs for TOGA from our assemblies. This require different tools, including lastz, axtChain, chainNet, chainSort, chainCleaner, faToTwoBit, gff3ToGenePred, GenePredToBed, RepeatFiller and some self-written scripts. The tools from UCSC (such as faToTwoBit, axtChain,..) are pre-compiled and can be downloaded here using the command below. Please note that you need to find the correct name for the tools that are displayed on that website, for example lastz-1.04.00 for lastz.
rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/<tool_name> ./
We have downloaded and installed all of them for you Add the following folder to you variable $PATH by adding this line to your
~/.bashrc file:
export PATH=/share/gluster/Projects/vinh/ecoevo/src/for_toga:$PATH
Then, apply the change with source ~/.bashrc (or restart the terminal) and re-activate toga conda environment.
Now, we can generate the input files for TOGA.
faToTwoBit L_hispanica_assembly.fasta L_hispanica.2bit faToTwoBit L_pustulata_assembly.fasta L_pustulata.2bit
lastz L_hispanica.2bit L_pustulata.2bit --action:target=multiple --hspthresh=2400 --gappedthresh=3000 --format=axt --out=hispanica_pustulata_aln.axt
axtChain hispanica_pustulata_aln.axt L_hispanica.2bit L_pustulata.2bit hispanica_pustulata_aln.chain -linearGap=medium
python /share/gluster/Projects/vinh/ecoevo/src/for_toga/fasta_len.py <LasHis.fa> <LasHis.sizes> python /share/gluster/Projects/vinh/ecoevo/src/for_toga/fasta_len.py <LasPus.fa> <LasPus.sizes> # make sure that the identifiers in your .sizes files match the ones in your .axt file
# run chainSort chainSort hispanica_pustulata_aln.chain hispanica_pustulata_aln.sorted.chain # run chainNet chainNet hispanica_pustulata_aln.sorted.chain L_hispanica.sizes L_pustulata.sizes hispanica.net pustulata.net
chainCleaner hispanica_pustulata_aln.chain L_hispanica.2bit L_pustulata.2bit hispanica_pustulata_aln.cleaned.chain hispanica_pustulata_aln.bed -net=hispanica.net linearGap=medium
RepeatFiller.py -c hispanica_pustulata_aln.cleaned.chain -T2 L_hispanica.2bit -Q2 L_pustulata.2bit | tee hispanica_pustulata_aln.cleaned.filled.chain
hispanica_pustulata_aln.cleaned.filled.chain) is the chain file that we want to use as an input for TOGA. Sometimes the RepeatFiller output includes empty lines which must be deleted. Use the following command: sed -i '/^$/d' hispanica_pustulata_aln.cleaned.filled.chain
gff3ToGenePred L_hispanica.gff3 L_hispanica.genePred genePredToBed L_hispanica.genePred L_hispanica.bed
L_hispanica.bedhispanica_pustulata_aln.cleaned.filled.chainL_hispanica.2bit and L_pustulata.2bitUse a Slurm script with 12 CPUs and 8GB memory to run TOGA
python3 /path/to/your/downloaded/TOGA/toga.py hispanica_pustulata_aln.cleaned.filled.chain L_hispanica.bed L_hispanica.2bit L_pustulata.2bit
TOGA produces different output files that you can look up here. The most interesting output file for us is loss_summ_data.tsv, which contains the classifications for each projection, transcript, and gene. TOGA identifies the following classes:
N - no data due to technical reasons (such as CESAR memory requirements)
PG - no orthologous chains identified, TOGA projected transcripts via paralogous chains and cannot make any conclusion
PM - partial & missing. Most of the projection lies outside scaffold borders
L - clearly lost (several inactivating mutations)
M - missing, assembly gaps mask >50% of the prediction CDS
G - "grey", there are inactivating mutations but not enough evidence for "clearly lost" class. In other words: neither lost nor intact
PI - partially intact: some fraction of CDS is missing, but most likely this is intact
I - clearly intact
UL - uncertain loss (few inactivating mutations localized in a single exon)
We are interested in the genes that are present in L. hispanica but lost in L. pustulata not by technical artifacts (e.g. errors from gene annotation process).
loss_summ_data.tsv. These four sets of genes will form our genes of interestFrom our previous analysis, we obtained a list of genes, which are missing in L. pustulata but present in L. hispanica. But,
In order to answer those questions, we are going to study their phylogenetic profiles across a wide range of species by using fDOG and PhyloProfile. fDOG is a directed ortholog search tool, which can look for orthologs of a single gene of interest in a set of search taxa. These search organisms can be diverse, from closely related to distantly taxa to our reference species, from which the seed genes are originated.
The taxon set we will use for this analysis consists of L. hispanica, L. hispanica, the close related species U. muehlenbergii, and the set of 78 QfO taxa (what is the “QfO”? Why do we work with these 78 taxa?).