meta data for this page
This is an old revision of the document!
Gene set completeness with fCAT
Assessing the completeness of sets of proteins, transcripts, but also of genomes is a routine task in genome-scale analyses. The amount of missing gene, i.e. of genes that are expected to be present, but which are not, is used as a standard quality criterion, where more missing genes are interpreted as lower quality data. BUSCO, a standard software for such analyses uses a set of nearly ubiquitously represented single copy genes as the underlying data set. A unidirectional search determines then if sufficiently similar sequences exist in the test set such that the gene can be considered present. (Read here for more on BUSCO).
BUSCO leaves open some gaps that we address with the tool fCAT. There are four main advantages of fCAT.
- users can design their own set of core genes
- fCAT uses ortholog assignments instead of only a unidirectional search, and because of is not restricted to single copy genes
- fCAT has four scoring modes (still in beta) that consider next to length differences also differences in the domain architecture of the orthologs
- the results can be visually interpreted and integrated with the presence/absence pattern of the orthologs in other taxa
fCAT is still in beta testing, so we will use you a bit as guinea pigs to try out the software. We hope that you like it
fCAT Core sets
For the course, we have prepared one core set of proteins that are prevalent1) in eukaryotes
Core set eukaryota
This core set uses group identifier from OrthoDB v10
Species | NCBI Taxonomy ID | kingdom | Internal name |
---|---|---|---|
Cryptococcus neoformans | 215684 | fungi | CRYNE@214684@2 |
Rhizopus delemar | 246409 | fungi | RHIDE@246409@2 |
Chlamydomonas reinhardtii | 3055 | chlorophyta | CHLRE@3055@2 |
Arabidopis thaliana | 3702 | streptophyta | ARATH@3702@2 |
Amphimedon queenslandica | 400682 | metazoa | AMPQU@400682@2 |
Nematostella vectensis | 45351 | metazoa | NEMVE@45351@2 |
Sorghum bicolor | 4558 | streptophyta | SORBI@4558@2 |
Caenorhabditis elegans | 6239 | metazoa | CAEEL@6239@2 |
Homo sapiens | 9606 | metazoa | HOMSA@9606@2 |
Zymoseptoria tritici | 336722 | fungi | ZYMTR@336722@2 |
Saccharomyces cerevisiae | 559292 | fungi | SACCE@559292@2 |
Drosophila melanogaster | 7227 | metazoa | DROME@7227@2 |
Preparing the fCAT run
Prior to your fCAT analysis, you should make sure that the software is installed, and that all necessary files are present and in the correct format.
Checking the software
- confirm that fCAT is installed on the system
- activate the conda environment:
conda activate fdog
- type
fcat -h
- If fCAT is not installed follow the installation guide lines
- check for the correct setting of the environmental variable
COILSDIR
echo $COILSDIR
- If this does not provide you with a path, then this variable is not set and fdog will not run properly.
Preparing the fCAT run
- create a working directory for your fCAT analysis:
mkdir -p $HOME/Analysis/fcat
- change into the new directory:
cd $HOME/Analysis/fcat
- soft-link2) the protein file you want to analyse into your working directory:
ln -sf /home/ubuntu/Share/Analysis/GeneAnnotation/Results/metaeuk/Crypto_Metaeuk.fas .
- We have already preformed this step for you.
- copy the directory harbouring the feature architecture annotations into your Analysis directory
cd $HOME/Analysis/fcat cp -r /home/ubuntu/Share/ProteinSets/coredir/annotation_dir .
- Annotate the protein domains in the gene set of interest:
fdog.addTaxon -f Crypto_Metaeuk.fas -n Crypa_metaeuk -i 5807 -o $HOME/Analysis/fCAT --annopath $HOME/Analysis/fcat/annotation_dir/ --replace
The second command will annotate protein features, such as PFAM and SMART domains, low complexity regions, transmembrane domains, etc. If you do not want to run the annotation, which will take a couple of minutes using 8 cores, copy it from
/home/ubuntu/Share/ProteinSets/fcat/CRYPA_METAEUK2@5807@240209.json
into $HOME/Analysis/fcat/annotation_dir/
Running fCAT
To perform the fCAT analysis, perform the following steps
- run the fCAT analysis on the AWS with the following core set
- eukaryota
- for the eukaryota core set and the MetaEuk gene prediction on the CryPa_BCM2021a assembly, invoke the analysis with the following command3):
fcat --coreDir $HOME/Share/ProteinSets/coredir/ --coreSet eukaryota --refspecList "HOMSA@9606@2" --querySpecies Crypto_Metaeuk.fas --taxid 5807 --annoQuery $HOME/Analysis/fcat/annotation_dir/CRYPA_METAEUK\@5807\@240209.json
The analysis will run for about 380 sec when using 4 cores.
fCAT analysis - Output visualization and interpretation
fCAT in combination with PhyloProfile allows to visualize and explore the results of the geneset completeness analysis. Follow the steps below to download the data to your local computer and
to open it in PhyloProfile.
Downloading the data
Download the following three files from the fcat output folder, e.g. $HOME/Analyses/fcat/fcatOutput/eukaryota/CRYHO@237895@220307/phyloprofileOutput
for the eukaryota dataset.
- *.phyloprofile
These files contains the information about the presence/absence of orthologs to the genes in your coreset together with the domain architecture similarity scores. You will find the information for both your taxon of interest and the core taxa. It is the main input file for PhyloProfile.
Choose the one that is represents the fCAT scoring mode you are interested in.
- *.mod.fa
This file contains the sequences of the orthologs in FASTA format
- *.domains
This file contains the feature annotations for the core genes and the orthologs. You will need this for visualization of the feature architectures in PhyloProfile
Opening the data in PhyloProfile
open the results for the eukaryota dataset in PhyloProfile. To do so, perform the following steps
- open a shell on your local computer
- startup R by typing R
- in the R terminal load the package PhyloProfile: library (PhyloProfile)
- start PhyloProfile: runPhyloProfile()
- now, the start page of PhyloProfile should open in your default browser
- upload the file *.phyloprofile4) in the top left of the page.
PhyloProfile will check the input.
- If PhyloProfile sees a taxon for the first time, it will ask you to fetch the information from the NCBI taxonomy database. Once this is completed, please reload the page and upload the data again
- upload the *domains file into the field at the lower left
- specify the origin of group IDs you are using
- Dataset alveolata: select OMA
- Dataset eukaryota: select OrthoDB
- plot the results by clicking on ‘’Plot’’
Exploring the data
Remember, you are doing this analysis because you want to know
which genes are missing
how the missing genes are represented in the core species
which genes display an unexpected domain architecture
Inspect
- the overall pattern of the phylogenetic profiles and reconcile this with the text output of fCAT
- you can click on individual dots in the profile to gain more information about the detected orthologs. This gives you the option to look up sequence and orthogroup information in the public database5), and you can expect the domain architectures of the seed protein and the respective ortholog.
- check out the tab ‘’Functions’’ in the top menu. It gives you, among others, the option to cluster your phylogenetic profiles based on a variety of distance measures. Try this!6)
- go back to the clustering function and use the mouse to select a clade in the clustering graph7). You will find that the corresponding genes appear in a table to the right. Check the box ‘’Add to custom plot’’ and inspect your selection in tab custom profile
- redo the selection, this time selecting all genes from the eukaryota dataset that are present in all core species but are absent in your C. parvum gene set8)
- if you do not find a single clade comprising all the genes that are missing in C. parvum do the following:
- Look for the file
missing.txt
in your fCat output folder - go to the tab
Customised profile
- upload the file
missing.txt
- select Homo sapiens as the taxon of interest9)
Download the data for the next analysis step
Download the information about the missing genes. We will need this for the last analysis
- in the custom plot, select only Homo sapiens to be shown
- go to the ‘’Download’’ tab and download the information in FASTA format
- inspect the file, and if all is good, upload it to your home directory at the AWS10)
Navigation
Follow the links below to
Use the following links to navigate through the course