
Differences

This shows you the differences between two versions of the page.

general:bioseqanalysis:shotgunassembly [2021/10/19 21:02] ingo
general:bioseqanalysis:shotgunassembly [2021/10/19 21:48] (current) – [Method selection] ingo
Line 17:
  
 <figure KMER2>
-{{ :ecoevo_molevol:wiki:mee:figures:kmer2.png?400 |}}
+{{ :ecoevo_molevol:wiki:mee:figures:kmer2.png?600 |}}
 <caption><fs 0.8em>The kmer coverage histogram depends on the chosen value of k. The left plot shows the histogram resulting from a fungal-algal metagenome for a k of 151. The genomic kmers from the alga overlap in frequency with the kmers introduced by sequencing errors. They are hence ignored during genome assembly, which is why the assembly covers only 40% of the metagenome. The right plot gives the kmer histogram for the same data, this time using a k of 51. The algal kmers are now clearly separated in frequency from the kmers introduced by sequencing errors and are thus used for the genome reconstruction. The resulting assembly, though less contiguous (N50 = 22 Kb), now covers almost the entire metagenome.</fs></caption></figure>
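
The dependence of the kmer coverage histogram on k, as described in the caption above, can be illustrated with a minimal Python sketch. It is not the procedure used to produce the figure: the function name `kmer_histogram` and the toy reads are assumptions made for illustration, and the sketch ignores reverse complements and base qualities, which real kmer counters usually handle.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Map each kmer multiplicity to the number of distinct kmers seen that often.

    Hypothetical helper for illustration; forward strand only.
    """
    counts = Counter()
    for read in reads:
        # Slide a window of width k over the read; reads shorter than k add nothing.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    # Collapse per-kmer counts into a histogram of multiplicities.
    return dict(Counter(counts.values()))


if __name__ == "__main__":
    # Toy reads (assumed data); the second read differs in its last base.
    reads = ["ACGTACGT", "ACGTACGA"]
    for k in (4, 6):
        print(k, sorted(kmer_histogram(reads, k).items()))
```

Rerunning the function with different values of k on the same reads shows how the multiplicities shift: with a larger k, each read yields fewer kmers and a single sequencing error affects a larger share of them, so genomic and erroneous kmers become harder to separate by frequency.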
  
Line 27:
 A plethora of WGS assemblers now exists, and it is hard to decide a priori which assembler performs best for a given genome and WGS data set. Determining how good an assembly is can itself be very difficult, and there is even a competition, the Assemblathon, that aims to benchmark current state-of-the-art methods in genome assembly (Earl, et al. 2011; Bradnam, et al. 2013). Still, it remains unclear to what extent the insights from these benchmarks can be generalized to any particular assembly problem. Given the complexity of the assembly problem, it is easily conceivable that an algorithm that performs suboptimally on any of the benchmark data sets happens to be superior for your particular assembly problem.
 For this reason, separate benchmarks are generated for particular subsets of genomes (e.g. Abbas, et al. 2014). As an alternative, Greshake et al. (2016) proposed the idea of simulated twin sets: a WGS read set is simulated so that it closely resembles that of the actual sequencing experiment, hence the name ‘twin set’. Assemblers can then be custom-benchmarked on the twin sets, resulting in a more informed assembler choice.
  
-In our exercises on de novo whole genome shotgun assembly, we will concentrate on SPades (Bankevich, et al. 2012). SPades constructs multi-sized de Bruijn graphs with different values for //k// But of course, you are free to further explore the methods space.
+===== Task list =====
+
+<WRAP tabs>
+   * [[ecoevo_molevoll:topics:genome_assembly|Task list]]
+</WRAP>