general:bioseqanalysis:shotgunassembly — 2021/10/19 21:48 (current) – ingo
====== Whole Genome Shotgun Assembly ======
===== General outline =====
How can a genome sequence be reconstructed from scratch based on a set of whole genome shotgun sequencing data? Traditionally,
<figure Assembly>
General approaches to the read assembly problem. a) Sequence reads are generated from a circular (or linear) genome. b) Overlap layout based assemblers generate an overlap graph from reads whose terminal overlaps display a minimal length and sequence similarity. The overlap is considered evidence that the reads partly cover the same section of the template genome. Word (kmer) based approaches use the sequence reads as resources to extract words that occur in the genome sequence. From the word lists, graphs can be reconstructed using either the kmers as nodes
===== de Bruijn graph based approaches =====
==== Functional concepts ====
The general procedure of a de Bruijn graph based assembly is quite similar across the various implementations. Initially, the user has to decide on a length (or a range of lengths) for k. The software then compiles a kmer catalogue from the read set. The number of reads a given kmer is represented in is called the kmer coverage (Figs. {{ref>KMER2}} and {{ref>Bruijn}}).
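The cataloguing step can be sketched in a few lines of Python; the read set, the choice of k = 4, and the function name are purely illustrative and not taken from any particular assembler:

```python
from collections import Counter

def kmer_catalogue(reads, k):
    """Compile the catalogue of all kmers of length k found in a read set.

    The count of a kmer across all reads serves as its kmer coverage.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Toy read set sampled from the short template ACGTACGGA (illustrative only)
reads = ["ACGTACG", "GTACGGA", "ACGTACG"]
catalogue = kmer_catalogue(reads, k=4)
print(catalogue["GTAC"])  # kmer coverage of GTAC: 3
```

In real assemblers this step is heavily engineered (hashed, disk-backed, parallel), since the catalogue of a large genome contains billions of kmers.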
<figure KMER2>
{{ :
<
Such kmers with a low coverage – and also those with a **kmer coverage** below the expectation given the read coverage – will be considered sequencing errors. To save memory, they will either be removed from the list or be subjected to an inherent error correction. Subsequently,
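A minimal sketch of such a coverage-based pruning step, assuming a user-chosen cutoff (the cutoff of 3 and the toy catalogue below are illustrative):

```python
from collections import Counter

def prune_kmers(catalogue, min_coverage):
    """Remove kmers whose coverage falls below the cutoff.

    Such low-coverage kmers are treated as putative sequencing errors.
    """
    return Counter({kmer: cov for kmer, cov in catalogue.items()
                    if cov >= min_coverage})

# Toy catalogue: CGTT occurs only once and is a likely sequencing error
catalogue = Counter({"ACGT": 25, "CGTA": 24, "CGTT": 1})
pruned = prune_kmers(catalogue, min_coverage=3)
print(sorted(pruned))  # ['ACGT', 'CGTA']
```

Choosing the cutoff is a trade-off: too low keeps error kmers that inflate the graph, too high discards genuine kmers from low-coverage regions.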
<figure Bruijn>
{{ :
<
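Using the (k-1)-mers as nodes and the kmers as edges, graph construction and the read-out of a non-branching path can be sketched as follows. This is a deliberate simplification that ignores reverse complements, branch resolution, and coverage:

```python
from collections import defaultdict

def de_bruijn_graph(kmers):
    """Nodes are (k-1)-mers; each kmer adds an edge prefix -> suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk(graph, start):
    """Spell out a contig by following unambiguous (non-branching) edges."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, [])) == 1:
        nxt = graph[node][0]
        contig += nxt[-1]           # each step appends one base
        if nxt in seen:             # stop once the walk closes a cycle
            break
        seen.add(nxt)
        node = nxt
    return contig

# Toy kmer catalogue (k = 4) derived from the circular genome ACGT
kmers = ["ACGT", "CGTA", "GTAC", "TACG"]
graph = de_bruijn_graph(kmers)
print(walk(graph, "ACG"))  # ACGTACG
```

In a real assembler, contigs end wherever a node has more than one outgoing edge; repeats longer than k therefore fragment the assembly.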
==== Method selection ====
Meanwhile, a plethora of different WGS assemblers exists, and it is hard to decide a priori which assembler performs best for a given genome and WGS data set. Determining how good an assembly is can be very difficult, and there is even a competition – the Assemblathon – which benchmarks current state-of-the-art methods in genome assembly (Earl, et al. 2011; Bradnam, et al. 2013). Still, the problem remains to what extent the insights from these benchmarks can be generalized to any particular assembly problem. Given the complexity of the assembly problem, it is easily conceivable that an algorithm that performs non-optimally on the benchmark data sets happens to be superior for your particular assembly problem. Separate benchmarks are therefore generated for particular subsets of genomes (e.g. Abbas, et al. 2014). As an alternative,
===== Task list =====
<WRAP tabs>
  * [[ecoevo_molevoll:topics:genome_assembly|Task list]]
</WRAP>