Assessing the completeness of protein sets, transcript sets, and genomes is a routine task in genome-scale analyses. The number of missing genes, i.e. genes that are expected to be present but are not, serves as a standard quality criterion: the more genes are missing, the lower the inferred data quality. BUSCO, a standard software for such analyses, uses a set of nearly ubiquitous single-copy genes as its underlying data set. A unidirectional search then determines whether the test set contains sequences similar enough for a gene to be considered present. (Read here for more on BUSCO).
BUSCO leaves open some gaps that we address with the tool fCAT. There are four main advantages of fCAT.
fCAT is still in beta testing, so you will serve a bit as guinea pigs for trying out the software. We hope that you like it.
For the course, we have prepared one core set of proteins that are prevalent in eukaryotes.
This core set uses group identifiers from OrthoDB v10.
Species | NCBI Taxonomy ID | kingdom | Internal name |
---|---|---|---|
Cryptococcus neoformans | 214684 | fungi | CRYNE@214684@2 |
Rhizopus delemar | 246409 | fungi | RHIDE@246409@2 |
Chlamydomonas reinhardtii | 3055 | chlorophyta | CHLRE@3055@2 |
Arabidopsis thaliana | 3702 | streptophyta | ARATH@3702@2 |
Amphimedon queenslandica | 400682 | metazoa | AMPQU@400682@2 |
Nematostella vectensis | 45351 | metazoa | NEMVE@45351@2 |
Sorghum bicolor | 4558 | streptophyta | SORBI@4558@2 |
Caenorhabditis elegans | 6239 | metazoa | CAEEL@6239@2 |
Homo sapiens | 9606 | metazoa | HOMSA@9606@2 |
Zymoseptoria tritici | 336722 | fungi | ZYMTR@336722@2 |
Saccharomyces cerevisiae | 559292 | fungi | SACCE@559292@2 |
Drosophila melanogaster | 7227 | metazoa | DROME@7227@2 |
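The internal names in the table appear to follow the pattern ABBREVIATION@TAXID@VERSION. As a minimal sketch under that assumption, the helper below (`parse_taxon_name` is a hypothetical name, not part of fCAT) splits such a name into its three fields, which is handy for checking that the taxonomy ID in a name matches the one you expect.

```shell
# Split an internal taxon name of the assumed form ABBREVIATION@TAXID@VERSION
# into three tab-separated fields. Uses only POSIX parameter expansion.
parse_taxon_name() {
    name=$1
    abbr=${name%%@*}      # everything before the first '@'
    rest=${name#*@}       # everything after the first '@'
    taxid=${rest%%@*}     # middle field
    version=${rest#*@}    # last field
    printf '%s\t%s\t%s\n' "$abbr" "$taxid" "$version"
}

parse_taxon_name "HOMSA@9606@2"
```

For `HOMSA@9606@2` this prints the abbreviation `HOMSA`, the taxonomy ID `9606`, and the version `2`, separated by tabs.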
Prior to your fCAT analysis, you should make sure that the software is installed, and that all necessary files are present and in the correct format.
conda activate fdog
fcat -h
Check that the environment variable COILSDIR is set:
echo $COILSDIR
mkdir -p $HOME/Analysis/fcat
cd $HOME/Analysis/fcat
ln -sf /home/ubuntu/Share/Analysis/GeneAnnotation/Results/metaeuk/Crypto_Metaeuk.fas .
cd $HOME/Analysis/fcat
cp -r /home/ubuntu/Share/ProteinSets/coredir/annotation_dir .
fdog.addTaxon -f Crypto_Metaeuk.fas -n Crypa_metaeuk -i 5807 -o $HOME/Analysis/fcat --annopath $HOME/Analysis/fcat/annotation_dir/ --replace
The fdog.addTaxon command annotates protein features such as PFAM and SMART domains, low-complexity regions, and transmembrane domains, among others. The annotation takes a couple of minutes on 8 cores. If you do not want to run it, copy the precomputed file /home/ubuntu/Share/ProteinSets/fcat/CRYPA_METAEUK2@5807@240209.json into $HOME/Analysis/fcat/annotation_dir/
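Before starting the analysis, it can help to confirm that the input files are really in place. The following is only a sketch: `check_file` and `count_fasta_records` are hypothetical helper names, and the check covers nothing more than file existence and the number of FASTA header lines.

```shell
# Count FASTA records: every record starts with a header line beginning with '>'.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Report whether a required file exists and is non-empty.
check_file() {
    if [ -s "$1" ]; then
        echo "OK       $1"
    else
        echo "MISSING  $1"
        return 1
    fi
}

# Working directory used in this course.
workdir="$HOME/Analysis/fcat"

if check_file "$workdir/Crypto_Metaeuk.fas"; then
    echo "Sequences: $(count_fasta_records "$workdir/Crypto_Metaeuk.fas")"
fi
```

A sequence count of zero, or a MISSING line, suggests that the symbolic link to the MetaEuk result was not created correctly.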
To run the fCAT analysis, execute the following command:
fcat --coreDir $HOME/Share/ProteinSets/coredir/ --coreSet eukaryota --refspecList "HOMSA@9606@2" --querySpecies Crypto_Metaeuk.fas --taxid 5807 --annoQuery $HOME/Analysis/fcat/annotation_dir/CRYPA_METAEUK\@5807\@240209.json
The analysis will run for about 380 sec when using 4 cores.
In combination with PhyloProfile, fCAT lets you visualize and explore the results of the gene set completeness analysis. Follow the steps below to download the data to your local computer and to open it in PhyloProfile.
Download the following three files from the fcat output folder for the eukaryota dataset, e.g. $HOME/Analysis/fcat/fcatOutput/eukaryota/CRYPA@5807@250408/phyloprofileOutput.
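One way to transfer the files is `scp`, run from your local computer. The sketch below only prints the command (a dry run) so that you can check it before executing it. The login `user@course-server` and the `path/to/` prefix are placeholders, not real values from the course, and `build_download_cmd` is a hypothetical helper name.

```shell
# Print (dry run) an scp command that copies a remote folder recursively.
# Arguments: remote login, remote folder, local target folder.
build_download_cmd() {
    remote=$1
    remote_dir=$2
    local_dir=$3
    printf 'scp -r %s:%s %s\n' "$remote" "$remote_dir" "$local_dir"
}

# Placeholders: replace the login and the path prefix with your own values.
build_download_cmd "user@course-server" \
    "path/to/fcatOutput/eukaryota/CRYPA@5807@250408/phyloprofileOutput" \
    "./phyloprofileOutput"
```

Copying the whole phyloprofileOutput folder avoids having to name each file individually. Once the printed command looks right, paste it into a terminal on your local computer.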
Open the results for the eukaryota dataset in PhyloProfile. To do so, perform the following steps:
Remember, you are doing this analysis because you want to know
Inspect missing.txt in your fCAT output folder.
Customised profile
missing.txt
Download the information about the missing genes. We will need this for the last analysis.
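For a quick overview on the command line, you can count the entries in missing.txt. This sketch assumes that the file lists one missing core group per line, which you should verify by looking at the file first. `count_entries` and the `MISSING_FILE` variable are hypothetical names introduced here.

```shell
# Count the non-empty lines of a text file.
# Assumption: missing.txt holds one missing core group per line.
count_entries() {
    grep -c -v '^[[:space:]]*$' "$1"
}

# Point MISSING_FILE at missing.txt in your own fCAT output folder;
# by default the file is looked up in the current directory.
missing_file="${MISSING_FILE:-missing.txt}"

if [ -f "$missing_file" ]; then
    echo "Missing core groups: $(count_entries "$missing_file")"
    head -n 5 "$missing_file"   # preview the first entries
else
    echo "File not found: $missing_file"
fi
```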