Differences

This shows you the differences between two versions of the page.

--- general:bioseqanalysis:genesetanalysis:busco [2022/03/09 19:18] – ingo
+++ general:bioseqanalysis:genesetanalysis:busco [2024/02/14 14:24] (current) – [Busco Results] ingo
@@ Line 7: / Line 7: @@
 <caption><fs 0.8em>BUSCO set categories as provided via the [[https://busco.ezlab.org/|BUSCO web pages]]. For each category, several more specialized BUSCO sets are available.
 </fs></caption></figure>
-The way to compile [[https://busco-data.ezlab.org/v4/data/lineages/|these catalogs]] is straightforward: Take a set of species, let's say fungi, such that the full phylogenetic diversity of this systematic group is represented. Use a standard ortholog search tool, such as [[https://www.orthodb.org/|OrthoDB]] or [[https://omabrowser.org/oma/home/|OMA]], and identify orthologous groups. Subsequently, extract those orthologous groups that contain exactly one sequence for most (>90%) of the species in your taxon set. For the corresponding genes, you can now conclude the following:
+The way to compile [[https://busco-data.ezlab.org/v5/data/lineages/|these catalogs]] is straightforward: In brief, take a set of species, let's say fungi, such that the full phylogenetic diversity of this systematic group is represented. Use a standard ortholog search tool, such as [[https://www.orthodb.org/|OrthoDB]] or [[https://omabrowser.org/oma/home/|OMA]], and identify orthologous groups. Subsequently, extract those orthologous groups that contain exactly one sequence for most (>90%) of the species in your taxon set. For the corresponding genes, you can now conclude the following:
   * the gene is at least as old as the last common ancestor of your taxon collection
   * since the gene is represented in most of the individual species, it seems to be so essential that it is typically not subject to lineage-specific loss; hence you consider it //universal//
   * since most species are represented by only a single sequence, the gene is typically not subject to lineage-specific duplication. In other words, you have identified a gene with //universal single copy orthologs//
 Busco then performs a number of downstream post-processing steps, such as assessing the length variation of the proteins within each Busco group. On this basis, the tool later classifies whether a particular gene is represented completely or only partially in a test gene set.
+For a more thorough introduction, please refer to the corresponding [[https://busco.ezlab.org/busco_userguide.html#lineage-datasets|BUSCO webpages]].
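The selection criterion described above can be sketched in a few lines of Python. This is only an illustration, not the procedure BUSCO itself uses: the function name ''select_universal_single_copy'', the group identifiers and the copy-number table are hypothetical, and the 90% cutoff is taken from the text.

```python
from typing import Dict, List


def select_universal_single_copy(
    groups: Dict[str, Dict[str, int]],
    species: List[str],
    min_fraction: float = 0.9,
) -> List[str]:
    """Return the IDs of orthologous groups in which more than
    `min_fraction` of the species contribute exactly one sequence."""
    selected = []
    for group_id, copies_per_species in groups.items():
        # Count the species represented by exactly one sequence in this group.
        single_copy = sum(1 for sp in species if copies_per_species.get(sp, 0) == 1)
        if single_copy / len(species) > min_fraction:
            selected.append(group_id)
    return selected


# Hypothetical toy data: 10 species and 3 orthologous groups.
species = [f"species_{i}" for i in range(10)]
groups = {
    # exactly one sequence in every species
    "og_single": {sp: 1 for sp in species},
    # duplicated in three species, so only 7 of 10 are single copy
    "og_duplicated": {sp: (2 if i < 3 else 1) for i, sp in enumerate(species)},
    # lost in two species, so only 8 of 10 are single copy
    "og_lost": {sp: 1 for sp in species[2:]},
}

selected = select_universal_single_copy(groups, species)
print(selected)
```

Only the first group passes the filter: the other two fail because of lineage-specific duplication and lineage-specific loss, respectively.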
 -----
-==== Task set ====
+===== Outline =====
 **What you need**
-  * The set of annotated proteins for your genome assembly((you can use any prediction)): ''/home/ubuntu/Share/AnalysisResults/GeneAnnotation/braker2/augustus.hints_iter1.aa''
+  * The set of annotated proteins for your genome assembly((you can use any prediction)):
+    * /home/ubuntu/Share/Analysis/GeneAnnotation/Results/braker2/crypto_BCM2021_v2.braker2.proteins
+    * /home/ubuntu/Share/Analysis/GeneAnnotation/Results/metaeuk/Crypto_Metaeuk.fas
   * **optionally** the genome assembly in FASTA format: ''/home/ubuntu/Share/Assemblies/crypto_BCM2021_v2.fasta''
@@ Line 26: / Line 30: @@
     * missing
-=== Running BUSCO ===
+===== Running BUSCO =====
 <wrap important>Busco can be run on different input data</wrap>
   * directly on the genome assembly
   * on transcript sets
-  * on protein sets.
+  * on protein sets
-While the analysis on the genome sequence level has the advantage of being independent of the sensitivity of a preceding gene prediction, it is substantially more computationally intensive. In this course, we will therefore restrict the analysis to the set of predicted proteins. We have preinstalled //Busco// for you in the **/home/ubuntu/anaconda3/envs/compgen** environment.
+While the analysis on the genome sequence level has the advantage of being independent of the sensitivity of a preceding gene prediction, it is substantially more computationally intensive. In this course, we will therefore restrict the analysis to the set of predicted proteins. We have preinstalled //Busco// for you in the **/home/ubuntu/miniconda3/envs/busco** environment.
   - make sure that Busco is installed by typing<WRAP>
 <code>
 conda deactivate
-conda activate compgen
+conda activate busco
 busco -h
 </code>
-</WRAP>If Busco is not installed, please perform the installation via the [[physaliacg:software|conda package management system]]. <wrap important></wrap>If Busco is asking to downgrade packages in your compgen environment, it is a good idea to stop the installation, create a separate environment for Busco and start the installation in this environment. Otherwise, this likely leads to inconsistencies in your environment.
+</WRAP>
-  - create a sub-directory **//busco//** in your project directory: <code>mkdir -p $HOME/Analysis/busco</code>
+If Busco is not installed, please perform the installation via the [[physaliacg:software|conda package management system]].
+  - create a sub-directory **//busco//** in your project directory<WRAP>
+<code>mkdir -p $HOME/Analysis/busco</code>
+</WRAP>
   - change into the new directory: <code>cd $HOME/Analysis/busco</code>
   - make a directory **//data//** that will then take up all data for your subsequent analyses:<code>mkdir data</code>
@@ Line 47: / Line 54: @@
   - check the available datasets from Busco by typing<WRAP>
 <code>busco --list-datasets</code>
-</WRAP>and check for the sets //alveolata// and //eukaryota//
+</WRAP>
+and check for the sets //alveolata// and //eukaryota//
   - run the Busco analysis by typing for example<WRAP>
-<code>busco -i predsResults.fas -c 1 -o Busco_metaeuk_alveolata -m prot -l alveolata
+<code>busco -i Crypto_Metaeuk.fas -c 1 -o Busco_metaeuk_eukaryota -m prot -l eukaryota
 </code>
-</WRAP>Note that the option **//-c 1//** tells Busco to use only 1 CPU core for the analysis. On our cloud, the number of processors is limited, but feel free to increase this number on your own system.
+</WRAP>
-  - Monitor the outcome of your analysis. <wrap hint></wrap>The output directory is quite nested. For the //ubuntu user//, the files of interest are located in ''/home/ubuntu/Shared/busco/busco_crypto_alveolata/run_alveolata_odb10''  What do you conclude from the findings? Pay particular attention to the number of missing, fragmented and duplicated Busco genes, and **compare the results from the //eukaryota// and the //alveolata// data sets**.<WRAP>
+Note that the option ''-c 1'' tells Busco to use only 1 CPU core for the analysis. On our cloud, the number of processors is limited, but feel free to increase this number on your own system.
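Since the results of the //eukaryota// and the //alveolata// sets are compared later, it can help to generate both command lines consistently. The following is a minimal sketch that only assembles and prints the commands without running them. The helper ''build_busco_command'' and the output naming scheme ''Busco_<label>_<lineage>'' are assumptions modeled on the example call above; only the options ''-i'', ''-c'', ''-o'', ''-m'' and ''-l'' shown in that call are used.

```python
import shlex
from typing import List


def build_busco_command(input_fasta: str, label: str, lineage: str, cpus: int = 1) -> List[str]:
    """Assemble a Busco call in protein mode (hypothetical helper)."""
    out_name = f"Busco_{label}_{lineage}"  # naming scheme assumed from the example call
    return [
        "busco",
        "-i", input_fasta,   # input protein set
        "-c", str(cpus),     # number of CPU cores
        "-o", out_name,      # name of the output directory
        "-m", "prot",        # protein mode
        "-l", lineage,       # Busco lineage dataset
    ]


# One command per lineage dataset for the same protein set.
commands = [
    build_busco_command("Crypto_Metaeuk.fas", "metaeuk", lineage)
    for lineage in ("eukaryota", "alveolata")
]

for cmd in commands:
    # Print each command so that it can be inspected before it is run in the shell.
    print(shlex.join(cmd))
```

The printed lines can be pasted into the terminal once the ''busco'' environment is active.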
+===== Busco Results =====
+  - Monitor the outcome of your analysis. <wrap hint></wrap>The output directory is quite nested. For the //ubuntu user//, the files of interest are located in ''/home/ubuntu/Share/Analysis/busco/braker2/Busco_braker2_eukaryota/run_eukaryota_odb10/''
+  - What do you conclude from the findings? Pay particular attention to the number of missing, fragmented and duplicated Busco genes, and **compare the results from the //eukaryota// and the //alveolata// data sets**.<WRAP>
 <hidden Spoiler>
-<WRAP spoiler>
 <code>
-	--------------------------------------------------
+    ---------------------------------------------------
-	|Results from dataset alveolata_odb10             |
+    |Results from dataset alveolata_odb10              |
-	--------------------------------------------------
+    ---------------------------------------------------
-	|C:99.4%[S:99.4%,D:0.0%],F:0.6%,M:0.0%,n:171      |
+    |C:100.0%[S:100.0%,D:0.0%],F:0.0%,M:0.0%,n:171     |
-	|170	Complete BUSCOs (C)                       |
+    |171    Complete BUSCOs (C)                        |
-	|170	Complete and single-copy BUSCOs (S)       |
+    |171    Complete and single-copy BUSCOs (S)        |
-	|0	Complete and duplicated BUSCOs (D)        |
+    |0    Complete and duplicated BUSCOs (D)           |
-	|1	Fragmented BUSCOs (F)                     |
+    |0    Fragmented BUSCOs (F)                        |
-	|0	Missing BUSCOs (M)                        |
+    |0    Missing BUSCOs (M)                           |
-	|171	Total BUSCO groups searched               |
+    |171    Total BUSCO groups searched                |
-	--------------------------------------------------
+    ---------------------------------------------------
+    ---------------------------------------------------
+    |Results from dataset eukaryota_odb10              |
+    ---------------------------------------------------
+    |C:51.0%[S:50.6%,D:0.4%],F:8.2%,M:40.8%,n:255      |
+    |130    Complete BUSCOs (C)                        |
+    |129    Complete and single-copy BUSCOs (S)        |
+    |1    Complete and duplicated BUSCOs (D)           |
+    |21    Fragmented BUSCOs (F)                       |
+    |104    Missing BUSCOs (M)                         |
+    |255    Total BUSCO groups searched                |
+    ---------------------------------------------------
 </code>
-</WRAP>
 </hidden>
 </WRAP>
-    - <wrap important></wrap>If you want to look up the orthologous group behind a BUSCO group, you have to look it up at [[https://www.orthodb.org/|OrthoDB]]((see the example in the hidden section above)). To look up a gene in the assembly, however, check the fasta files in the directory ''fragmented_busco_sequences''. You will find the deviating gene from your assembly there. Use its identifier to search in the web browser.<WRAP>
+  - <wrap important></wrap>If you want to look up the orthologous group behind a BUSCO group, you have to look it up at [[https://v10-1.orthodb.org/|OrthoDB]]((see the example in the hidden section above)). To look up a gene in the assembly, check the fasta files in the directory ''fragmented_busco_sequences''. You will find the deviating gene from your assembly there. Use its identifier to search in the web browser.<WRAP>
-<hidden YouArePatient?>
+</WRAP>
--) You are patient?? Good, then let's wait for the fCAT analysis
+  - **Optional**: repeat the analysis with the genome sequence as input, but :?: do you think that this makes sense for our purpose? Also keep in mind that a BUSCO search in an unannotated genome sequence is computationally more demanding than a search in protein sequences((Why?))
-</hidden>
-    - **Optional**: repeat the analysis with the genome sequence as input, but :?: do you think that this makes sense for our purpose? Also keep in mind that a BUSCO search in an unannotated genome sequence is computationally more demanding than a search in protein sequences((Why?))
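To compare the //alveolata// and //eukaryota// results programmatically, the one-line summary from the Busco report can be parsed. The sketch below assumes that the summary line has exactly the layout shown in the results above; the function name ''parse_busco_summary'' is hypothetical, and the two input lines are copied from those results.

```python
import re
from typing import Dict

# Layout assumed from the report: C:..%[S:..%,D:..%],F:..%,M:..%,n:...
SUMMARY_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line: str) -> Dict[str, float]:
    """Extract the percentages (C, S, D, F, M) and the set size n from a summary line."""
    match = SUMMARY_PATTERN.search(line)
    if match is None:
        raise ValueError(f"not a Busco summary line: {line!r}")
    result = {key: float(value) for key, value in match.groupdict().items()}
    result["n"] = int(match.group("n"))  # number of Busco groups searched
    return result


alveolata = parse_busco_summary("|C:100.0%[S:100.0%,D:0.0%],F:0.0%,M:0.0%,n:171     |")
eukaryota = parse_busco_summary("|C:51.0%[S:50.6%,D:0.4%],F:8.2%,M:40.8%,n:255      |")

# Report the categories side by side for both lineage datasets.
for key in ("C", "S", "D", "F", "M"):
    print(f"{key}: alveolata {alveolata[key]:5.1f}%  eukaryota {eukaryota[key]:5.1f}%")
```

Parsing both lines makes the large difference in missing Busco genes between the two lineage datasets easy to see at a glance.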
+===== Final remarks =====
 <WRAP round box>
-==== Final remarks ====
 Busco is a widely used and valuable tool for assessing the completeness of a genome in a standardized manner. However, one should keep in mind that the results should not be over-interpreted.
   * Busco sets can be quite small; in other words, you test for the presence of only a small fraction of the entire gene set. Thus, you should think carefully about whether it is justified to generalize the insights from the Busco analysis to the entire gene set.
