Differences
This shows you the differences between two versions of the page.
general:bioseqanalysis:genesetanalysis:busco [2022/03/07 17:38] – [Task set] ingo | general:bioseqanalysis:genesetanalysis:busco [2024/02/14 14:24] (current) – [Busco Results] ingo | ||
---|---|---|---|
Line 7: | Line 7: | ||
< | < | ||
</ | </ | ||
- | The way to compile [[https:// | + | The way to compile [[https:// |
* the gene is at least as old as the last common ancestor of your taxon collection | * the gene is at least as old as the last common ancestor of your taxon collection | ||
* since the gene is represented in most of the individual species, it seems to be so essential that it is typically not subjected to lineage-specific loss, hence you think its // | * since the gene is represented in most of the individual species, it seems to be so essential that it is typically not subjected to lineage-specific loss, hence you think its // | ||
* since most species are represented only by a single sequence, the gene is typically not subjected to a lineage-specific duplication. In other words, you have identified a gene with //universal single-copy orthologs// | * since most species are represented only by a single sequence, the gene is typically not subjected to a lineage-specific duplication. In other words, you have identified a gene with //universal single-copy orthologs// | ||
Busco then performs a number of downstream post-processing steps, such as assessing the length variation of the proteins within each Busco group. On this basis, the tool later classifies whether a particular gene is represented completely or only partially in a test gene set. | Busco then performs a number of downstream post-processing steps, such as assessing the length variation of the proteins within each Busco group. On this basis, the tool later classifies whether a particular gene is represented completely or only partially in a test gene set. | ||
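The selection criteria listed above can be sketched as a small filter. This is only an illustration and not the procedure Busco itself uses: the input layout (one line per orthologous group and species, with a copy count), the function name ''select_universal_single_copy'', and the cutoff value are all assumptions made for this example.

```shell
#!/usr/bin/env bash
# Hypothetical input format: "<group> <species> <copies>", one line per
# group and species; a species that lacks the gene simply has no line.
select_universal_single_copy() {
  # $1 = total number of species in the taxon collection
  # $2 = minimum fraction of species with exactly one copy (assumed cutoff)
  awk -v n="$1" -v min="$2" '
    { seen[$1] = 1 }              # remember every group we encounter
    $3 == 1 { single[$1]++ }      # count species with exactly one copy
    END {
      for (g in seen)
        if (single[g] / n >= min) print g
    }' | sort
}

# Toy data: OG1 is single-copy everywhere, OG2 is absent in one species
# and duplicated in another, OG3 is duplicated in only one species.
printf '%s\n' \
  "OG1 sp1 1" "OG1 sp2 1" "OG1 sp3 1" "OG1 sp4 1" \
  "OG2 sp1 1" "OG2 sp2 3" "OG2 sp3 1" \
  "OG3 sp1 1" "OG3 sp2 1" "OG3 sp3 1" "OG3 sp4 2" |
  select_universal_single_copy 4 0.75
```

Requiring a single copy in most species covers both criteria at once: a group that is missing in many species, or duplicated in many species, falls below the cutoff either way.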
+ | |||
+ | For a more thorough introduction, | ||
----- | ----- | ||
- | ==== Task set ==== | + | ===== Outline ===== |
**What you need** | **What you need** | ||
- | * The set of annotated proteins for your genome assembly((you can use any prediction)): | + | * The set of annotated proteins for your genome assembly((you can use any prediction)): |
+ | * / | ||
+ | * / | ||
* **optionally** the genome assembly in FASTA format: ''/ | * **optionally** the genome assembly in FASTA format: ''/ | ||
Line 26: | Line 30: | ||
* missing | * missing | ||
+ | ===== Running BUSCO ===== | ||
<wrap important> | <wrap important> | ||
* directly on the genome assembly | * directly on the genome assembly | ||
* on transcript sets | * on transcript sets | ||
- | * on protein sets. | + | * on protein sets |
- | While the analysis on the genome sequence level has the advantage of being independent of the sensitivity of a preceding gene prediction, it is substantially more computationally intensive. In this course, we will therefore restrict the analysis to the set of predicted proteins. We have preinstalled //Busco// for you in the **/ | + | While the analysis on the genome sequence level has the advantage of being independent of the sensitivity of a preceding gene prediction, it is substantially more computationally intensive. In this course, we will therefore restrict the analysis to the set of predicted proteins. We have preinstalled //Busco// for you in the **/ |
- make sure that Busco is installed by typing< | - make sure that Busco is installed by typing< | ||
< | < | ||
conda deactivate | conda deactivate | ||
- | conda activate | + | conda activate |
busco -h | busco -h | ||
</ | </ | ||
- | </ | + | </ |
- | - create a sub-directory **// | + | If Busco is not installed, please perform the installation via the [[physaliacg: |
+ | - create a sub-directory **// | ||
+ | < | ||
+ | </WRAP> | ||
- change into the new directory: < | - change into the new directory: < | ||
- make a directory **// | - make a directory **// | ||
Line 46: | Line 54: | ||
- check the available datasets from Busco by typing< | - check the available datasets from Busco by typing< | ||
< | < | ||
- | </ | + | </ |
+ | and check for the sets // | ||
- run the Busco analysis by typing for example< | - run the Busco analysis by typing for example< | ||
- | < | + | < |
</ | </ | ||
- | </ | + | </ |
- | - Monitor the outcome of your analysis. <wrap hint></ | + | Note, the option |
+ | |||
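The steps of this section can be summarized in a short sketch. The file name, output name, and helper function below are placeholders invented for illustration; they are not the paths or names used in the course. The options ''-i'', ''-l'', ''-o'' and ''-m'' are standard Busco options, but the exact command on this page may differ.

```shell
#!/usr/bin/env bash
# All values are placeholders, not the actual course paths.
proteins="proteins.faa"        # hypothetical protein FASTA file
lineage="alveolata_odb10"      # dataset that appears in the results on this page
outname="busco_${lineage}"     # hypothetical name for the output directory

build_busco_cmd() {
  # -i input file, -l lineage dataset, -o output name, -m analysis mode
  printf 'busco -i %s -l %s -o %s -m proteins\n' "$1" "$2" "$3"
}

# Print the command instead of running it, so it can be checked first.
build_busco_cmd "$proteins" "$lineage" "$outname"
```

Printing the command first makes it easy to verify the input file and lineage before starting a run; once it looks right, the printed line can be pasted into the shell where the Busco conda environment is active. Repeating it with a second lineage only requires changing the ''lineage'' variable.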
+ | ===== Busco Results ===== | ||
+ | - Monitor the outcome of your analysis. <wrap hint></ | ||
+ | - What do you conclude from the findings? Pay particular attention to the number of missing, partial and duplicated Busco genes, and **compare the results from the // | ||
<hidden Spoiler> | <hidden Spoiler> | ||
- | <WRAP spoiler> | ||
< | < | ||
- | -------------------------------------------------- | + | --------------------------------------------------- |
- | |Results from dataset alveolata_odb10 | + | |Results from dataset alveolata_odb10 |
- | -------------------------------------------------- | + | --------------------------------------------------- |
- | |C:99.4%[S:99.4%, | + | |C:100.0%[S:100.0%, |
- | |170 Complete BUSCOs (C) | + | |171 |
- | |170 Complete and single-copy BUSCOs (S) | + | |171 |
- | |0 Complete and duplicated BUSCOs (D) | | + | |0 Complete and duplicated BUSCOs (D) |
- | |1 Fragmented BUSCOs (F) | + | |0 |
- | |0 Missing BUSCOs (M) | | + | |0 Missing BUSCOs (M) |
- | |171 Total BUSCO groups searched | + | |171 Total BUSCO groups searched |
- | -------------------------------------------------- | + | --------------------------------------------------- |
+ | --------------------------------------------------- | ||
+ | |Results from dataset eukaryota_odb10 | ||
+ | --------------------------------------------------- | ||
+ | |C: | ||
+ | |130 Complete BUSCOs (C) | | ||
+ | |129 Complete and single-copy BUSCOs (S) | | ||
+ | |1 Complete and duplicated BUSCOs (D) | | ||
+ | |21 Fragmented BUSCOs (F) | | ||
+ | |104 Missing BUSCOs (M) | | ||
+ | |255 Total BUSCO groups searched | ||
+ | --------------------------------------------------- | ||
</ | </ | ||
- | </ | ||
</ | </ | ||
</ | </ | ||
- | | + | |
- | - **Optional** repeat the analysis with the genome sequence as input. But :?: do you think that this makes sense for our purpose? Also keep in mind that a BUSCO search in an unannotated genome sequence is computationally more demanding than a search in protein sequences((Why? | + | </ |
+ | - **Optional** repeat the analysis with the genome sequence as input. But :?: do you think that this makes sense for our purpose? Also keep in mind that a BUSCO search in an unannotated genome sequence is computationally more demanding than a search in protein sequences((Why? | ||
+ | |||
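When comparing the two lineage datasets, it helps to convert the counts from the result tables into percentages of the groups searched, since the two sets differ in size (171 versus 255 groups). The helper below is a minimal sketch; the function name ''busco_percent'' is made up for this example, and the counts are taken from the tables above.

```shell
#!/usr/bin/env bash
busco_percent() {
  # $1 = number of Busco groups in one category
  # $2 = total number of Busco groups searched
  awk -v c="$1" -v n="$2" 'BEGIN { printf "%.1f\n", 100 * c / n }'
}

# alveolata_odb10 (171 groups searched)
busco_percent 171 171   # complete
# eukaryota_odb10 (255 groups searched)
busco_percent 130 255   # complete
busco_percent 21 255    # fragmented
busco_percent 104 255   # missing
```

Seen this way, the same protein set looks complete against the narrow lineage set but has a large share of missing groups against the broad one, which is worth keeping in mind when answering the question above.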
+ | ===== Final remarks ===== | ||
<WRAP round box> | <WRAP round box> | ||
- | ==== Final remarks ==== | ||
Busco is a common and valuable tool for assessing the completeness of a genome in a standardized manner. However, its results should not be over-interpreted. | Busco is a common and valuable tool for assessing the completeness of a genome in a standardized manner. However, its results should not be over-interpreted. | ||
* Busco sets can be quite small; in other words, you test for the presence of only a small subset of the entire gene set. Thus, you should think carefully about whether it is feasible to generalize the insights from the Busco analysis to the entire gene set. | * Busco sets can be quite small; in other words, you test for the presence of only a small subset of the entire gene set. Thus, you should think carefully about whether it is feasible to generalize the insights from the Busco analysis to the entire gene set. | ||
Line 82: | Line 106: | ||
The use of Busco is not limited to assessing the completeness of gene set reconstructions (or genome assemblies). It can also provide valuable information for the initial training of gene prediction software, and thus shows up now and then in protocols that concentrate on genome annotation. | The use of Busco is not limited to assessing the completeness of gene set reconstructions (or genome assemblies). It can also provide valuable information for the initial training of gene prediction software, and thus shows up now and then in protocols that concentrate on genome annotation. | ||
</ | </ | ||
+ | |||
+ | ---- | ||
+ | <WRAP tabs> | ||
+ | * [[: | ||
+ | * [[: | ||
+ | * [[: | ||
+ | </ | ||
+ | |