Whole Genome Shotgun Assembly
General outline
How does one reconstruct a genome sequence from scratch based on a set of whole genome shotgun sequencing data? Traditionally, two strategies for DNA sequence assembly are distinguished: (i) overlap-layout-consensus (OLC) based methods, and (ii) de Bruijn graph based methods (Fig. 1). Overlap-layout-consensus approaches are rather easy to explain. Each read represents a consecutive stretch of the sequenced genome. It follows that two overlapping reads, i.e. reads where the end of one read matches the start of the second read, can be joined to represent a longer segment of the sequenced genome. The concept of de Bruijn graph based approaches is fundamentally different. These algorithms aim at reconstructing the shortest superstring that can be built from a list of words of length k, so-called k-mers. Sequencing reads, in a first approximation, serve only to identify the k-mers occurring in the genome of interest and are, in principle, no longer considered as representations of consecutive stretches of the genome under study. This is of course an oversimplification, but it may help to initially grasp the difference between an overlap-layout-consensus based assembler and a de Bruijn graph based assembler.
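The core operation of the overlap-layout-consensus idea, joining two reads whose suffix and prefix match, can be sketched in a few lines. The function name, the toy reads, and the minimum overlap length below are illustrative choices, not part of any particular assembler:

```python
def merge_overlapping(read_a, read_b, min_overlap=3):
    """Join read_b onto read_a if a suffix of read_a matches a prefix
    of read_b. Returns the merged string, or None if no sufficiently
    long overlap exists. (Exact matching only; real assemblers also
    tolerate sequencing errors in the overlap.)"""
    for olen in range(min(len(read_a), len(read_b)), min_overlap - 1, -1):
        if read_a.endswith(read_b[:olen]):
            return read_a + read_b[olen:]
    return None

# Two reads drawn from the hypothetical genome "ATGGCGTGCA":
merged = merge_overlapping("ATGGCGT", "GCGTGCA")
# merged == "ATGGCGTGCA"
```

The layout and consensus steps of a real OLC assembler then arrange many such pairwise overlaps into a global ordering and call a consensus base at each position.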

de Bruijn graph based approaches
Functional concepts
The general procedure of a de Bruijn graph based assembly is quite similar across the various approaches. Initially, the user has to decide on a length (or a range of lengths) for k. The software then compiles a k-mer catalogue from the read set. The number of reads in which a given k-mer occurs is called the k-mer coverage (Figs. 2 and 3).
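Compiling the k-mer catalogue amounts to sliding a window of length k over every read and counting occurrences. A minimal sketch, with hypothetical toy reads (real tools such as Jellyfish do this far more memory-efficiently):

```python
from collections import Counter

def kmer_catalogue(reads, k):
    """Count how often each k-mer occurs across a set of reads."""
    counts = Counter()
    for read in reads:
        # every position i starts one k-mer of the read
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

reads = ["ATGGCGT", "TGGCGTG", "GGCGTGC"]  # illustrative reads
cov = kmer_catalogue(reads, k=4)
# cov["GGCG"] == 3 : the 4-mer GGCG is found in all three reads
```

A histogram over these counts is what assemblers inspect to separate trusted k-mers from likely sequencing errors.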


K-mers with a low coverage, i.e. a k-mer coverage well below the expectation given the read coverage, are considered sequencing errors. To save memory, they are either removed from the list or subjected to an inherent error correction. Subsequently, the initial de Bruijn graph is generated from the remaining k-mers. The (k-1)-mers serve as nodes in the graph, and a directed edge is drawn between two nodes whenever the (k-2)-overlap between the suffix of the first and the prefix of the second spells out an observed k-mer (Fig. 4). This pre-graph is then simplified: sequencing errors and repeats, both of which result in reticulations of the graph, are corrected for, and consecutive nodes connected by unambiguous paths through the graph are condensed into longer nodes. All nodes with lengths above a user-defined minimum contig length are finally output as contigs. Note that some assemblers directly implement a scaffolding procedure. It is, thus, advisable to check at what stage of the assembly procedure your program ends.
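The two central steps above, building the graph from (k-1)-mer nodes and condensing an unambiguous path into a contig, can be sketched as follows. This is a simplified illustration on error-free k-mers; the function names and the toy input are assumptions for the example, not the interface of any real assembler:

```python
from collections import defaultdict

def de_bruijn_graph(kmers):
    """Nodes are (k-1)-mers; each k-mer contributes a directed edge
    from its prefix to its suffix (the two nodes overlap by k-2)."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def condense_simple_path(graph, start):
    """Follow edges from `start` while the path is unambiguous
    (out-degree 1, next node with in-degree 1) and spell out the
    condensed node, i.e. a contig."""
    indeg = defaultdict(int)
    for node, succs in graph.items():
        for s in succs:
            indeg[s] += 1
    contig, node = start, start
    while len(graph.get(node, [])) == 1:
        nxt = graph[node][0]
        if indeg[nxt] != 1:       # reticulation: stop condensing
            break
        contig += nxt[-1]         # each step adds one new base
        node = nxt
    return contig

kmers = ["ATGG", "TGGC", "GGCG", "GCGT"]  # error-free 4-mers of ATGGCGT
g = de_bruijn_graph(kmers)
# condense_simple_path(g, "ATG") == "ATGGCGT"
```

With sequencing errors or repeats present, the graph would contain tips and bubbles at exactly the points where this walk stops, which is what the simplification stage of a real assembler resolves.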

Method selection
By now, a plethora of different WGS assemblers exists, and it is hard to decide a priori which assembler performs best for a given genome and WGS data set. However, determining how good an assembly is can be very difficult, and there is even a competition, the Assemblathon, which intends to benchmark current state-of-the-art methods in genome assembly (Earl, et al. 2011; Bradnam, et al. 2013). Still, the problem remains to what extent the insights from such benchmarks can be generalized to any particular assembly problem. Given the complexity of the assembly problem, it is easily conceivable that an algorithm that performs suboptimally on the benchmark data sets happens to be superior for your particular assembly problem. For this reason, separate benchmarks have been generated for particular subsets of genomes (e.g. Abbas, et al. 2014). As an alternative, Greshake et al. (2016) recently proposed the idea of simulated twin sets. The idea is to simulate a WGS read set that closely resembles that of the actual sequencing experiment, hence the name 'twin set'. Assemblers can then be custom-benchmarked on the twin sets, resulting in a more informed assembler choice.