meta data for this page
  •  

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revisionPrevious revision
Next revision
Previous revision
general:bioseqanalysis:shotgunassembly [2021/10/19 21:02] ingogeneral:bioseqanalysis:shotgunassembly [2021/10/19 21:48] (current) – [Method selection] ingo
Line 17: Line 17:
  
 <figure KMER2> <figure KMER2>
-{{ :ecoevo_molevol:wiki:mee:figures:kmer2.png?400 |}}+{{ :ecoevo_molevol:wiki:mee:figures:kmer2.png?600 |}}
 <caption><fs 0.8em>The kmer coverage histogram depends on the chosen value of k. The left plot shows the histogram resulting from a fungal-algal metagenome for a k of 151. The genomic kmers from the alga overlap in frequency with the kmers introduced by the sequencing error. They are hence ignored during genome assembly. It is for this reason that the assembly covers only 40% of the metagenome. The right plot gives the kmer histogram for the same data, this time using a k of 51. The algal kmer frequency is now clearly separated from the kmers introduced by the sequencing error, and are thus used for the genome reconstruction. The resulting assembly, though having a lower contiguity (N50 = 22 Kb), covers now almost the entire metagenome.</fs></caption></figure> <caption><fs 0.8em>The kmer coverage histogram depends on the chosen value of k. The left plot shows the histogram resulting from a fungal-algal metagenome for a k of 151. The genomic kmers from the alga overlap in frequency with the kmers introduced by the sequencing error. They are hence ignored during genome assembly. It is for this reason that the assembly covers only 40% of the metagenome. The right plot gives the kmer histogram for the same data, this time using a k of 51. The algal kmer frequency is now clearly separated from the kmers introduced by the sequencing error, and are thus used for the genome reconstruction. The resulting assembly, though having a lower contiguity (N50 = 22 Kb), covers now almost the entire metagenome.</fs></caption></figure>
  
 Such kmers with a low coverage – and also those with a <wrap important>kmer coverage</wrap> below the expectation given the read coverage – will be considered as sequencing errors. To save memory, they will either be removed from the list, or alternatively are subjected to an inherent error correction. Subsequently, the initial de Bruijn graph will be generated from the remaining kmers. K-1 mers will serve as nodes in the graph, and a directed edge is drawn between two nodes once their k-2 overlap of prefix and suffix result in an observed kmer (Fig. 4). This pre-graph is then subsequently simplified by correcting for sequencing errors and repeats both resulting in reticulations of the graph, and subsequent nodes connected by unambiguous paths through the graph are condensed into longer nodes. All long nodes with lengths above a user-defined minimum contig length are finally output as contigs. Note, in individual assemblers a scaffolding procedure is directly implemented. It is, thus, advisable to make sure at what stage of the assembly procedure your program ends. Such kmers with a low coverage – and also those with a <wrap important>kmer coverage</wrap> below the expectation given the read coverage – will be considered as sequencing errors. To save memory, they will either be removed from the list, or alternatively are subjected to an inherent error correction. Subsequently, the initial de Bruijn graph will be generated from the remaining kmers. K-1 mers will serve as nodes in the graph, and a directed edge is drawn between two nodes once their k-2 overlap of prefix and suffix result in an observed kmer (Fig. 4). This pre-graph is then subsequently simplified by correcting for sequencing errors and repeats both resulting in reticulations of the graph, and subsequent nodes connected by unambiguous paths through the graph are condensed into longer nodes. All long nodes with lengths above a user-defined minimum contig length are finally output as contigs. Note, in individual assemblers a scaffolding procedure is directly implemented. It is, thus, advisable to make sure at what stage of the assembly procedure your program ends.
 +<figure Bruijn> 
 +{{ :ecoevo_molevol:wiki:mee:figures:bruijn_graph.png?600 |}} 
 +<caption><fs 0.8em>A simple de Bruijn graph using a kmer size of 3. k-1 mers serve as nodes in the graph, and an edge is drawn to connect two nodes if their k-2 terminal overlap forms a kmer that is observed in the data. It is an inherent assumption that each kmer occurs only once in the template genome.</fs></caption></figure>
 ==== Method selection ==== ==== Method selection ====
 Meanwhile a plethora of different WGS assemblers exist, and it is hard to decide a priori which assembler performs best for a given genome and WGS data set. However, determining how good an assembly is, can be very difficult and there’s even a competition – the Assemblathon – which intends to benchmark current state-of-the-art methods in genome assembly (Earl, et al. 2011; Bradnam, et al. 2013). Still the problem exists, to what extent the insights from these benchmarks can be generalized to any particular assembly problem. Given the complexity of the assembly problem, it is easily conceivable that an algorithm that performs non-optimal on any of the benchmark data sets happens to be superior for your particular assembly problem. It is, thus, that separate benchmarks are generated for particular subsets of genomes (e.g. Abbas, et al. 2014). As an alternative, Greshake et al. (2016) recently proposed the idea of simulated twin sets. The idea here is, to simulate a WGS read set that closely resembles that of the actual sequencing experiment, hence its name ‘twin set’. Assemblers can then be custom-benchmarked on the twin sets resulting in a more informed assembler choice.  Meanwhile a plethora of different WGS assemblers exist, and it is hard to decide a priori which assembler performs best for a given genome and WGS data set. However, determining how good an assembly is, can be very difficult and there’s even a competition – the Assemblathon – which intends to benchmark current state-of-the-art methods in genome assembly (Earl, et al. 2011; Bradnam, et al. 2013). Still the problem exists, to what extent the insights from these benchmarks can be generalized to any particular assembly problem. Given the complexity of the assembly problem, it is easily conceivable that an algorithm that performs non-optimal on any of the benchmark data sets happens to be superior for your particular assembly problem. It is, thus, that separate benchmarks are generated for particular subsets of genomes (e.g. Abbas, et al. 2014). As an alternative, Greshake et al. (2016) recently proposed the idea of simulated twin sets. The idea here is, to simulate a WGS read set that closely resembles that of the actual sequencing experiment, hence its name ‘twin set’. Assemblers can then be custom-benchmarked on the twin sets resulting in a more informed assembler choice. 
  
-In our exercises on de novo whole genome shotgun assembly, we will concentrate on SPades (Bankevich, et al. 2012). SPades constructs multi-sized de Bruijn graphs with different values for //k// But of course, you are free to further explore the methods space.+===== Task list =====
        
-<figure Bruijn+<WRAP tabs
-{{ :ecoevo_molevol:wiki:mee:figures:bruijn_graph.png?600 |}} +   * [[ecoevo_molevoll:topics:genome_assembly|Task list]] 
-<caption><fs 0.8em>A simple de Bruijn graph using a kmer size of 3. k-1 mers serve as nodes in the graph, and an edge is drawn to connect two nodes if their k-2 terminal overlap forms a kmer that is observed in the data. It is an inherent assumption that each kmer occurs only once in the template genome.</fs></caption></figure>+</WRAP>