general:legacy_backup [Teaching]

This is an old revision of the document!

Task 1 - Projecting gene annotations with TOGA

TOGA is a method that integrates gene annotation, inferring orthologs, and classifying genes as intact or lost. We want to use TOGA to detect genetic innovations in Lasallia pustulata. Therefore, we are searching for genes that are present in Lasallia hispanica but lost in Lasallia pustulata.

Installation

Create an own environment for TOGA and install nextflow:

conda create --name toga python=3.7.3
conda activate toga  
conda install -c bioconda nextflow 
conda install -c anaconda biopython

Clone TOGA from github and install it:

git clone https://github.com/hillerlab/TOGA.git
cd TOGA
python3 -m pip install -r requirements.txt --user
./configure.sh
# ignore the warnings
./run_test.sh micro

Your output should look something like this in the end:

#### STEP 11: Cleanup: merge parallel steps output files
Saved results to /home/ecoevo01/TOGA/TOGA/micro_test_out
Done! Estimated time: 0:01:40.027813
Program finished with exit code 0

JH567521	463144	506100	ENST00000262455.1169	1000	-	463144	506100	0,200,255	8	102,103,142,112,117,58,116,185,0,1982,30295,31351,36911,38566,41322,42771,
JH567521	395878	449234	ENST00000259400.1169	1000	+	395878	449234	0,0,200	7	123,66,226,116,51,87,240,	0,11871,38544,45802,45994,52305,53116,
JH567521	299723	336583	ENST00000618101.1169	1000	+	299723	336583	0,0,200	7	28,923,130,173,200,179,248,	0,1256,6085,6677,19146,21311,36612,
Success!

Input preparation

TOGA uses at least 3 different input data:

Gene annotation of the reference genome, in our case from L.hispanica, as a bed-12 file
Genome alignment between the reference (L.hispanica) and query genome (L.pustulata) as a chain file
Reference and query genome sequences as 2bit files

We need to create the required inputs for TOGA from our assemblies. This require different tools, including lastz, axtChain, chainNet, chainSort, chainCleaner, faToTwoBit, gff3ToGenePred, GenePredToBed, RepeatFiller and some self-written scripts. The tools from UCSC (such as faToTwoBit, axtChain,..) are pre-compiled and can be downloaded here using the command below. Please note that you need to find the correct name for the tools that are displayed on that website, for example lastz-1.04.00 for lastz.

rsync://hgdownload.soe.ucsc.edu/genome/admin/exe/linux.x86_64/<tool_name> ./

We have downloaded and installed all of them for you Add the following folder to you variable $PATH by adding this line to your ~/.bashrc file:

export PATH=/share/gluster/Projects/vinh/ecoevo/src/for_toga:$PATH

Then, apply the change with source ~/.bashrc (or restart the terminal) and re-activate toga conda environment.

Now, we can generate the input files for TOGA.

First of all, we convert our assemblies from L. pustulata and L. hispanica into 2bit files using faToTwoBit:

faToTwoBit L_hispanica_assembly.fasta L_hispanica.2bit
faToTwoBit L_pustulata_assembly.fasta L_pustulata.2bit

As the next step, we do the whole-genome alignment with the workflow mentioned in Pippel et. al. 2020 that requires different steps and tools

Do genome alignment with lastz. This will take about 6 hours, please use Slurm with 1 CPU and 4GB Ram.

While lastz is running, you can do step 2c and 3

lastz L_hispanica.2bit L_pustulata.2bit --action:target=multiple --hspthresh=2400 --gappedthresh=3000 --format=axt --out=hispanica_pustulata_aln.axt

Compute chain file from axt alignments using axtChain

axtChain hispanica_pustulata_aln.axt L_hispanica.2bit L_pustulata.2bit hispanica_pustulata_aln.chain -linearGap=medium

Get length of each sequence in the L_hispanica and L_pustulata assembly using BASH or the scripts below. You will need to:

python /share/gluster/Projects/vinh/ecoevo/src/for_toga/fasta_len.py <LasHis.fa> <LasHis.sizes>
python /share/gluster/Projects/vinh/ecoevo/src/for_toga/fasta_len.py <LasPus.fa> <LasPus.sizes>
# make sure that the identifiers in your .sizes files match the ones in your .axt file

Compile chain nets with chainNet

# run chainSort
chainSort hispanica_pustulata_aln.chain hispanica_pustulata_aln.sorted.chain
# run chainNet
chainNet hispanica_pustulata_aln.sorted.chain L_hispanica.sizes L_pustulata.sizes hispanica.net pustulata.net

Improve genome alignment using chainCleaner

chainCleaner hispanica_pustulata_aln.chain L_hispanica.2bit L_pustulata.2bit hispanica_pustulata_aln.cleaned.chain hispanica_pustulata_aln.bed -net=hispanica.net linearGap=medium

Comprehensively align repetitive genomic regions with RepeatFiller (using Slurm again with 1 CPU and 4GB memory. This will take about 1.5 hours)

RepeatFiller.py -c hispanica_pustulata_aln.cleaned.chain -T2 L_hispanica.2bit -Q2 L_pustulata.2bit | tee hispanica_pustulata_aln.cleaned.filled.chain

The output from RepeatFiller (saved in hispanica_pustulata_aln.cleaned.filled.chain) is the chain file that we want to use as an input for TOGA. Sometimes the RepeatFiller output includes empty lines which must be deleted. Use the following command:
```
sed -i '/^$/d' hispanica_pustulata_aln.cleaned.filled.chain
```

Convert the annotation gff3 file of the reference genome L. hispanica into bed file using gff3ToGenePred and GenePredToBed
```
gff3ToGenePred L_hispanica.gff3 L_hispanica.genePred
genePredToBed L_hispanica.genePred L_hispanica.bed
```
After those steps, you will have a lot of files. We suggest that you copy these 4 input files for TOGA to an additional folder before running the tool:
1. The annotation in bed format L_hispanica.bed
2. The genome alignment as chain file hispanica_pustulata_aln.cleaned.filled.chain
3. 2 genome sequences in 2bit format L_hispanica.2bit and L_pustulata.2bit

Run TOGA

Use a Slurm script with 12 CPUs and 8GB memory to run TOGA

python3 /path/to/your/downloaded/TOGA/toga.py hispanica_pustulata_aln.cleaned.filled.chain L_hispanica.bed L_hispanica.2bit L_pustulata.2bit

TOGA produces different output files that you can look up here. The most interesting output file for us is loss_summ_data.tsv, which contains the classifications for each projection, transcript, and gene. TOGA identifies the following classes:

    N - no data due to technical reasons (such as CESAR memory requirements)
    PG - no orthologous chains identified, TOGA projected transcripts via paralogous chains and cannot make any conclusion
    PM - partial & missing. Most of the projection lies outside scaffold borders
    L - clearly lost (several inactivating mutations)
    M - missing, assembly gaps mask >50% of the prediction CDS
    G - "grey", there are inactivating mutations but not enough evidence for "clearly lost" class. In other words: neither lost nor intact
    PI - partially intact: some fraction of CDS is missing, but most likely this is intact
    I - clearly intact
    UL - uncertain loss (few inactivating mutations localized in a single exon)

We are interested in the genes that are present in L. hispanica but lost in L. pustulata not by technical artifacts (e.g. errors from gene annotation process).

Search for TRANSCRIPTs and identify all gene IDs that are marked as L (lost), M (missing), PM (partial missing) or UL (uncertain loss) in the file loss_summ_data.tsv. These four sets of genes will form our genes of interest
Use genome viewer of g-nom to investigate some of those genes and their gene neighborhood in the L. hispanica contig assembly

Task 2 - Phylogenetic profile analysis with fDOG

From our previous analysis, we obtained a list of genes, which are missing in L. pustulata but present in L. hispanica. But,

What are they, or, what are their characteristics?
How old are they, or do they loss only in L. pustulata or also in other taxonomic clades? In other words, are they L. pustulata specific losses or L. hispanica specific gains?

In order to answer those questions, we are going to study their phylogenetic profiles across a wide range of species by using fDOG and PhyloProfile. fDOG is a directed ortholog search tool, which can look for orthologs of a single gene of interest in a set of search taxa. These search organisms can be diverse, from closely related to distantly taxa to our reference species, from which the seed genes are originated.

The taxon set we will use for this analysis consists of L. hispanica, L. hispanica, the close related species U. muehlenbergii, and the set of 78 QfO taxa (what is the “QfO”? Why do we work with these 78 taxa?).

Install fDOG and PhyloProfile

fDOG has been installed in the environment /home/vinh/anaconda3/envs/ecoevo. And PhyloProfile has probably already been installed after your analysis with fCAT.

Perform fDOG search

Collecting data

Seed proteins: Identify the 4 different types of candidate L. hispanica proteins and save each sequence in a separate fasta file - each group in their own directory. They will be the seed genes/proteins for the fDOG search. Hint: You need the protein file of L. hispanica from Funannotate to get the sequences!
```
# please find the solution by yourself ;) You will need at least these functions: for, grep, cut and less / cat
```

Taxon data (you can check the wiki of fDOG to understand the data types as well as the data structure of fDOG):

Create your own folder for storing the data for fDOG
```
mkdir /your/working/directory/fdog_data
```

Add U. muehlenbergii to that data folder

fdog.addTaxon -f /share/gluster/CoreSets/lecanoromycetes/genome_dir/UmbMu@87280@1/UmbMu@87280@1.fa -i 87280 -v ecoevo22 -c -a --cpus 4 --replace -o /path/to/your/fdog_data

Check in your fDOG data folder (can be found at the end of fdog.addTaxon command above) if you can find UMBMU@87280@ecoevo22 in genome_dir and blast_dir and if those folder are not empty. Then, add the FAS annotation for this species to the weight_dir folder
```
mkdir /your/fdog_data/weight_dir
ln -s /share/gluster/CoreSets/lecanoromycetes/weight_dir/UmbMu@87280@1.json /your/fdog_data/weight_dir/UMBMU@87280@ecoevo22.json
```

Do the same for your own L. hispanica protein set, as well as the protein set of L. pustulata that you have downloaded from NCBI

fdog.addTaxon -f /path/to/your/l_hispanica_proteins.fa -i 580046 -v ecoevo22 -c -a --cpus 4 --replace -o /path/to/your/fdog_data
ln -s /path/to/your/FAS/annotation/of/l_hispanica.json /path/to/your/fdog_data/weight_dir/LASHI@580046@ecoevo22.json
fdog.addTaxon -f /path/to/your/l_pustulata_proteins.fa -i 136370 -v ecoevo22 -c -a --cpus 4 --replace -o /path/to/your/fdog_data
ln -s /path/to/your/FAS/annotation/of/l_pustulata.json /path/to/your/fdog_data/weight_dir/LASPU@136370@ecoevo22.json

We also need the data for the 78 QfO taxa

cd /your/fdog_data/
ln -s /share/gluster/Projects/vinh/ecoevo/fdog_data/weight_dir/* weight_dir/
ln -s /share/gluster/Projects/vinh/ecoevo/fdog_data/genome_dir/* genome_dir/
ln -s /share/gluster/Projects/vinh/ecoevo/fdog_data/blast_dir/* blast_dir/

Now you can run fDOG (using Slurm with 8 CPUs and 8GB memory). Please make sure that the input folder for fdogs.run contains nothing but only the respective fasta files of L. hispanica candidate genes you got from step 1a

Check the manual of fDOG for the meaning of each option you are using here

# add one command like this for each group of candidate genes to your SLURM script (or you can make 4 SLURM scripts, each for one seed directory)
fdogs.run --input /path/to/folder/containing/seed/proteins --jobName <your-job-name> --refspec LASHI@580046@ecoevo22 --blastpath /path/to/your/fdog/data/blast_dir --searchpath /path/to/your/fdog/data/genome_dir --weightpath /path/to/your/fdog/data/weight_dir --CorecheckCoorthologsRef --checkCoorthologsRef --force --cpu 8

The output for fDOG will be your-job-name.extended.fa, your-job-name.phyloprofile, your-job-name_forward.domains and your-job-name_reverse.domains. Now you can study them using PhyloProfile tool
1. Upload *.phyloprofile and *_forward.domains into PhyloProfile
2. Select Lasallia hispanica as the reference taxon and plot the profiles
3. Apply clustering to bring the similar profiles together
4. You can use the function “Gene age estimation” to estimate the evolutionary age of your candidate proteins
5. In the “Detailed plot” you can find a link to UniProt database, where you can get the the characteristics of the proteins
6. Also from the “Detailed plot” you can generate the domain architecture plot to see the comparison between your protein of interest and the ortholog reference protein from L. hispanica

OPTIONAL but highly recommended

This is an example for analysing one specific gene of interest

Searching for DHFR (Dihydrofolate reductase) in L. pustulata and L. hispanica. The story behinds this analysis is that, this protein has been observed to be present in L. hispanica but absent in L. pustulata. Here we are trying to check this finding using fDOG.
First, we need a DHFR protein as the seed sequence. Please search for this protein in human using the UniProt database. Get the UniProt ID of this protein and check if you have it in the human gene set of your fDOG data. The human gene set of fDOG can be found at /path/to/your/fdog/data/genome_dir/HUMAN@9606@3. Get this protein sequence and save it as a file in FASTA format. It will be the input seed sequence for fDOG run.
Make sure that you have already added L. pustulata and L. hispanica to fDOG data set (check for LASHI@580046@ecoevo22 and LASPU@136370@ecoevo22 in /your/fdog_data/genome_dir, blast_dir and weight_dir)

Now we can run fDOG with the DHFR protein. NOTE: for one seed sequence, we need fdog.run, not fdogs.run

fdog.run --seqFile /path/to/human/DHFR.fasta --seqName dhfr --refspec HUMAN@9606@3 --blastpath /path/to/your/fdog_data/blast_dir --searchpath /path/to/your/fdog_data/genome_dir  --weightpath /path/to/your/fdog_data/weight_dir --CorecheckCoorthologsRef --checkCoorthologsRef --force --cpu 8

Finally, check the phylogenetic profile for DHFR protein if you can find any ortholog in L. hispanica and in L. pustulata

Tools

menus and quick search

quick search

site status

Page Tools

meta data for this page