Differences

This shows you the differences between two versions of the page.

general:bioseqanalysis:genesetanalysis:fcat [2024/02/14 15:17] – [Checking the software] ingo
general:bioseqanalysis:genesetanalysis:fcat [2025/04/08 16:34] (current) – [fCAT analysis - Output visualization and interpretation] ingo
Line 10: Line 10:
 </WRAP> </WRAP>
 ===== fCAT Core sets ===== ===== fCAT Core sets =====
-For the course, we have prepared two core sets of proteins that are prevalent((missing in less than 10% of the core taxa)) in **eukaryotes**, and in all **alveolates**. The latter set is obviously closer to //C. parvum//
+For the course, we have prepared one core set of proteins that are prevalent((missing in less than 10% of the core taxa)) in **eukaryotes**
  
 ==== Core set eukaryota ==== ==== Core set eukaryota ====
Line 46: Line 46:
 To solve this issue temporarily for the current shell, type<WRAP> To solve this issue temporarily for the current shell, type<WRAP>
 <code> <code>
-export COILSDIR=/home/ubuntu/tools/annotation_tools/COILS2/coils
+export COILSDIR=/home/ubuntu/Share/fdog/annotation_tools/COILS2/coils
 </code>To make this change permanent, add this line to your bash configuration file ''.bashrc''. It will become active upon the next login, or after typing ''source ~/.bashrc''. <wrap important>After 'sourcing' the ''.bashrc'', you will have to re-activate the conda environment: ''conda activate /home/ubuntu/anaconda3/envs/fdog''</wrap> </code>To make this change permanent, add this line to your bash configuration file ''.bashrc''. It will become active upon the next login, or after typing ''source ~/.bashrc''. <wrap important>After 'sourcing' the ''.bashrc'', you will have to re-activate the conda environment: ''conda activate /home/ubuntu/anaconda3/envs/fdog''</wrap>
 </WRAP> </WRAP>
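A minimal sketch of how the export line could be added to ''.bashrc'' without duplicating it when the step is repeated. The helper name ''persist_line'' is hypothetical and not part of fDOG or fCAT; the COILS path in the usage comment is the one given above.

```shell
# Hypothetical helper (not part of fDOG/fCAT): append a line to a file
# only if that exact line is not already present.
persist_line() {
  file=$1
  line=$2
  touch "$file"
  # -F: fixed string, -x: match the whole line, -q: no output
  if ! grep -Fxq -- "$line" "$file"; then
    printf '%s\n' "$line" >> "$file"
  fi
}

# Usage (uncomment to apply to your own configuration file):
# persist_line "$HOME/.bashrc" 'export COILSDIR=/home/ubuntu/Share/fdog/annotation_tools/COILS2/coils'
```

Because the helper checks for the exact line first, running it twice leaves only one copy in the file. Re-activating the conda environment after ''source ~/.bashrc'' is still required.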
Line 54: Line 54:
   - create a working directory for your fCAT analysis: <code>mkdir -p $HOME/Analysis/fcat</code>   - create a working directory for your fCAT analysis: <code>mkdir -p $HOME/Analysis/fcat</code>
   - change into the new directory: <code>cd $HOME/Analysis/fcat</code>   - change into the new directory: <code>cd $HOME/Analysis/fcat</code>
-  - copy the directory harbouring the feature architecture annotations into your Analysis directory<WRAP>
+  - soft-link((ln -s)) the protein file you want to analyse into your working directory: <code>ln -sf /home/ubuntu/Share/Analysis/GeneAnnotation/Results/metaeuk/Crypto_Metaeuk.fas .</code>
 +  - <wrap important>We have already performed this step for you.</wrap>
 +    - copy the directory harbouring the feature architecture annotations into your Analysis directory<WRAP>
 <code> <code>
 cd $HOME/Analysis/fcat cd $HOME/Analysis/fcat
 cp -r /home/ubuntu/Share/ProteinSets/coredir/annotation_dir . cp -r /home/ubuntu/Share/ProteinSets/coredir/annotation_dir .
-</code>
-  - soft-link((ln -s)) the protein file you want to analyse into your working directory: <code>ln -sf /home/ubuntu/Share/Analysis/GeneAnnotation/Results/metaeuk/Crypto_Metaeuk.fas .</code>
-  - Annotate the protein domains in the gene set of interest:<WRAP>
+</code></WRAP>
+    - Annotate the protein domains in the gene set of interest:<WRAP>
 <code> <code>
 fdog.addTaxon -f Crypto_Metaeuk.fas -n Crypa_metaeuk -i 5807 -o $HOME/Analysis/fCAT --annopath $HOME/Analysis/fcat/annotation_dir/ --replace fdog.addTaxon -f Crypto_Metaeuk.fas -n Crypa_metaeuk -i 5807 -o $HOME/Analysis/fCAT --annopath $HOME/Analysis/fcat/annotation_dir/ --replace
Line 74: Line 75:
   - run the fCAT analysis on the AWS with the following core set   - run the fCAT analysis on the AWS with the following core set
     - eukaryota     - eukaryota
-  - for the **eukaryota core set** and the MetaEuk gene prediction on the CryPa_BCM2021a assembly, invoke the analysis with the following command((This assumes that you are in $HOME/Analysis/fCAT)): <code>fcat --coreDir ../../../Share/ProteinSets/coredir/ --coreSet eukaryota --refspecList "HOMSA@9606@2" --querySpecies Crypto_Metaeuk.fas --taxid 5807 --annoQuery ../../../Share/ProteinSets/coredir/annotation_dir/CRYPA_METAEUK@5807@240209.json</code>:!: The analysis will run for about **380 sec** when using 4 cores.<WRAP>
+  - for the **eukaryota core set** and the MetaEuk gene prediction on the CryPa_BCM2021a assembly, invoke the analysis with the following command((This assumes that you are in $HOME/Analysis/fcat)): <code>fcat --coreDir $HOME/Share/ProteinSets/coredir/ --coreSet eukaryota --refspecList "HOMSA@9606@2" --querySpecies Crypto_Metaeuk.fas --taxid 5807 --annoQuery $HOME/Analysis/fcat/annotation_dir/CRYPA_METAEUK\@5807\@240209.json </code>:!: The analysis will run for about **380 sec** when using 4 cores.<WRAP>
 <hidden Spoiler> <hidden Spoiler>
 <code> <code>
Line 80: Line 81:
 Mode 1: Mode 1:
 genomeID similar dissimilar duplicated missing ignored total genomeID similar dissimilar duplicated missing ignored total
-CRYPA@5807@240206 149 89 0 86 8 333
+CRYPA@5807@240206 149 89 0 86 8 332
  
 Mode 2: Mode 2:
 genomeID similar dissimilar duplicated missing ignored total genomeID similar dissimilar duplicated missing ignored total
-CRYPA@5807@240206 141 97 0 86 8 333
+CRYPA@5807@240206 141 97 0 86 8 332
  
 Mode 3: Mode 3:
 genomeID similar dissimilar duplicated missing ignored total genomeID similar dissimilar duplicated missing ignored total
-CRYPA@5807@240206 215 23 0 86 8 333
+CRYPA@5807@240206 215 23 0 86 8 332
  
 Mode 4: Mode 4:
 genomeID complete fragmented duplicated missing ignored total genomeID complete fragmented duplicated missing ignored total
-CRYPA@5807@240206 217 21 0 86 8 333
+CRYPA@5807@240206 217 21 0 86 8 332
  
 </code> </code>
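In each mode of the summary above, the five category counts add up to 332, which is the corrected total (for Mode 1: 149 + 89 + 0 + 86 + 8 = 332). The following sketch shows one way to check this for a pasted summary. The function name ''check_totals'' is hypothetical, and the sketch assumes whitespace-separated columns in the order shown above; it is not part of fCAT.

```shell
# Hypothetical helper (not part of fCAT): read a summary table on stdin and
# verify that columns 2-6 add up to the "total" column for every data row.
check_totals() {
  awk '
    # Data rows have 7 fields; skip the header row, which starts with "genomeID".
    NF == 7 && $1 != "genomeID" {
      sum = $2 + $3 + $4 + $5 + $6
      status = (sum == $7) ? "OK" : "MISMATCH"
      if (sum != $7) bad = 1
      printf "%s sum=%d total=%d %s\n", $1, sum, $7, status
    }
    # Exit status is non-zero if any row did not add up.
    END { exit bad }
  '
}

# Usage with the Mode 1 row from the summary above:
check_totals <<'EOF'
Mode 1:
genomeID similar dissimilar duplicated missing ignored total
CRYPA@5807@240206 149 89 0 86 8 332
EOF
# prints: CRYPA@5807@240206 sum=332 total=332 OK
```

The same call with a total of 333 would report ''MISMATCH'' and return a non-zero exit status.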
Line 100: Line 101:
  
 ==== fCAT analysis - Output visualization and interpretation ==== ==== fCAT analysis - Output visualization and interpretation ====
-fCAT in combination with PhyloProfile allows you to visualize and explore the results of the geneset completeness analysis. Follow the steps below to :!:  {{ :physaliacg:2024:crypa_bcm2021a_metaeuk_fcat-eukaryota.tar.gz |download the data}} to your local computer and :!: to open it in PhyloProfile.
+fCAT in combination with [[https://bioconductor.org/packages/release/bioc/html/PhyloProfile.html|PhyloProfile]] allows you to visualize and explore the results of the geneset completeness analysis. Follow the steps below to :!:  {{ :physaliacg:2025:data:CRYPA_Metaeuk-fcat.tar.gz|download the data}} to your local computer and :!: to open it in PhyloProfile.
 <hidden PrecomputedFiles> <hidden PrecomputedFiles>
 You will find all pre-computed fCAT results at ''/home/ubuntu/Share/Analysis/fCAT/fcatOutput/eukaryota''. Use these, if your analysis did not complete in time. You will find all pre-computed fCAT results at ''/home/ubuntu/Share/Analysis/fCAT/fcatOutput/eukaryota''. Use these, if your analysis did not complete in time.
 </hidden> </hidden>
 === Downloading the data === === Downloading the data ===
-Download the following three files from the fcat output folder, e.g. ''$HOME/Analyses/fcat/fcatOutput/eukaryota/CRYHO@237895@220307/phyloprofileOutput'' for the //eukaryota// dataset.
+Download the following three files from the fcat output folder, e.g. ''$HOME/Analyses/fcat/fcatOutput/eukaryota/CRYPA@5807@250408/phyloprofileOutput'' for the //eukaryota// dataset.
   - *.phyloprofile :!: These files contain the information about the presence/absence of orthologs to the genes in your coreset together with the domain architecture similarity scores. You will find the information for both your taxon of interest **and** the core taxa. **It is the main input file for PhyloProfile**. :!: Choose the one that represents the fCAT scoring mode you are interested in.   - *.phyloprofile :!: These files contain the information about the presence/absence of orthologs to the genes in your coreset together with the domain architecture similarity scores. You will find the information for both your taxon of interest **and** the core taxa. **It is the main input file for PhyloProfile**. :!: Choose the one that represents the fCAT scoring mode you are interested in.
   - *.mod.fa :!: This file contains the sequences of the orthologs in FASTA format   - *.mod.fa :!: This file contains the sequences of the orthologs in FASTA format
   - *.domains :!: This file contains the feature annotations for the core genes and the orthologs. You will need this for visualization of the feature architectures in PhyloProfile   - *.domains :!: This file contains the feature annotations for the core genes and the orthologs. You will need this for visualization of the feature architectures in PhyloProfile
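Before opening PhyloProfile, it can help to confirm that all three file types arrived on your local computer. The sketch below checks a download directory for at least one file of each kind. The function name ''check_phyloprofile_inputs'' and the directory in the usage comment are hypothetical; only the three file patterns are taken from the list above.

```shell
# Hypothetical helper: report which of the three PhyloProfile input file
# types are present in a directory; returns non-zero if any is missing.
check_phyloprofile_inputs() {
  dir=$1
  missing=0
  for pattern in '*.phyloprofile' '*.mod.fa' '*.domains'; do
    # Look only at regular files directly inside the directory.
    hit=$(find "$dir" -maxdepth 1 -type f -name "$pattern" 2>/dev/null | head -n 1)
    if [ -z "$hit" ]; then
      echo "missing: $pattern"
      missing=1
    else
      echo "found:   $pattern"
    fi
  done
  return "$missing"
}

# Usage (the directory name is only an example):
# check_phyloprofile_inputs "$HOME/Downloads/phyloprofileOutput"
```

Note that several ''*.phyloprofile'' files may be present, one per scoring mode; the check only confirms that at least one exists.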
 === Opening the data in PhyloProfile === === Opening the data in PhyloProfile ===
-open the results for the //eukaryota// dataset in PhyloProfile. To do so, perform the following steps
+open the results for the //eukaryota// dataset in [[https://bioconductor.org/packages/release/bioc/html/PhyloProfile.html|PhyloProfile]]. To do so, perform the following steps
   - open a shell on your local computer   - open a shell on your local computer
   - startup //**R**// by typing R   - startup //**R**// by typing R
Line 120: Line 121:
   - upload the *domains file into the field at the lower left   - upload the *domains file into the field at the lower left
   - specify the origin of group IDs you are using   - specify the origin of group IDs you are using
-    - Dataset //alveolata//: select **OMA** 
     - Dataset //eukaryota//: select **OrthoDB**     - Dataset //eukaryota//: select **OrthoDB**
   - plot the results by clicking on ‘’Plot’’   - plot the results by clicking on ‘’Plot’’
Line 132: Line 132:
   - you can click on individual dots in the profile to gain more information about the detected orthologs. This gives you the option to look up sequence and orthogroup information in the public database((of course, this is possible only for groups and sequences for which a public database entry exists. Currently, we support OMA, orthoDB and NCBI)), and you can inspect the domain architectures of the seed protein and the respective ortholog.   - you can click on individual dots in the profile to gain more information about the detected orthologs. This gives you the option to look up sequence and orthogroup information in the public database((of course, this is possible only for groups and sequences for which a public database entry exists. Currently, we support OMA, orthoDB and NCBI)), and you can inspect the domain architectures of the seed protein and the respective ortholog.
   - check out the tab ‘’Functions’’ in the top menu. It gives you, among others, the option to cluster your phylogenetic profiles based on a variety of distance measures. Try this!((you will have to recheck the box ‘’Sort sequences by ID’’ in the PhyloProfile landing page, though))   - check out the tab ‘’Functions’’ in the top menu. It gives you, among others, the option to cluster your phylogenetic profiles based on a variety of distance measures. Try this!((you will have to recheck the box ‘’Sort sequences by ID’’ in the PhyloProfile landing page, though))
-    * once your data is clustered, check the box ‘’apply clustering to main plot’’ and inspect the sorted phyloprofile 
   - go back to the clustering function and use the mouse to select a clade in the clustering graph((you may have to increase its height)). You will find that the corresponding genes appear in a table to the right. Check the box ‘’Add to custom plot’’ and inspect your selection in tab custom profile   - go back to the clustering function and use the mouse to select a clade in the clustering graph((you may have to increase its height)). You will find that the corresponding genes appear in a table to the right. Check the box ‘’Add to custom plot’’ and inspect your selection in tab custom profile
   - redo the selection, this time selecting all genes from the //eukaryota// dataset that are present in all core species but are absent in your //C. parvum// gene set((This requires some experimenting to find the correct clade in the tree, unfortunately))   - redo the selection, this time selecting all genes from the //eukaryota// dataset that are present in all core species but are absent in your //C. parvum// gene set((This requires some experimenting to find the correct clade in the tree, unfortunately))
 +  - if you do not find a single clade comprising all the genes that are missing in //C. parvum// do the following:
 +    - Look for the file ''{{ :physaliacg:2025:data:crypa_metaeuk-fcat_missing.txt.gz |missing.txt}}'' in your fCAT output folder
 +    - go to the tab ''Customised profile''
 +    - find the button to upload a gene list for selecting a gene set of interest<WRAP>
 +<figure PhyloProfile>
 +{{:general:bioseqanalysis:images:phyloprofile-custom.png?400|}}
 +</figure>
 +</WRAP>
 +    - upload the file ''missing.txt''
 +    - select //Homo sapiens// as the taxon of interest((you can play around with the selection of taxa))
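Before uploading the gene list, you may want to see how many gene IDs it contains. The sketch below counts the unique non-empty lines of a plain or gzip-compressed list. It assumes one ID per line, which is an assumption about the layout of ''missing.txt'' and not something stated above; the function name ''count_gene_ids'' is hypothetical.

```shell
# Hypothetical helper: count unique, non-empty lines in a gene list.
# Assumes one gene ID per line; handles plain and gzip-compressed files.
count_gene_ids() {
  file=$1
  case $file in
    *.gz) gzip -dc -- "$file" ;;   # decompress to stdout
    *)    cat -- "$file" ;;
  esac |
    grep -v '^[[:space:]]*$' |     # drop blank lines
    sort -u |                      # collapse duplicate IDs
    wc -l |
    tr -d ' '                      # some wc versions pad with spaces
}

# Usage (file name as downloaded from the link above):
# count_gene_ids crypa_metaeuk-fcat_missing.txt.gz
```

If the count differs clearly from the ''missing'' column of the fCAT summary, the file layout probably does not match the one-ID-per-line assumption.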
 === Download the data for the next analysis step === === Download the data for the next analysis step ===
 Download the information about the missing genes. We will need this for the last analysis Download the information about the missing genes. We will need this for the last analysis