Base quality information
All DNA sequencing approaches have one thing in common: the sequencer has to interpret a signal that indicates the presence of a given nucleotide at a given position. The resulting raw data then serve as input for a program, the basecaller, which performs the actual readout. In other words, it translates the signal into a nucleotide sequence.
Base calling is not error-free, because some signals suffer from a lot of noise (Trace A in figure 1), while others are clear and easy to interpret (Trace B in figure 1). Note that Traces A and B cover the same stretch of the template sequence. Because the four nucleotide letters alone cannot carry any information about the confidence in a particular base call, base quality values were introduced to pass information about sequence quality on to downstream applications. The general procedure is outlined in figure 1, using Sanger sequencing as an example, but the same principle applies to other sequencing technologies.

Base quality file formats
Base qualities are traditionally represented on the Phred scale: the quality value Q is ten times the negative decadic logarithm of the error probability P of a base call, i.e. Q = -10 * log10(P).
- Q=10 represents an error probability of one in 10
- Q=20 represents an error probability of one in 100
- Q=30 represents an error probability of one in 1000
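The relationship in the list above can be sketched in a few lines of Python. This is a minimal illustration assuming the standard Phred definition Q = -10 * log10(P); the function names are made up for this example and do not belong to any particular library.

```python
import math


def phred_to_error_probability(q):
    """Return the error probability for a Phred-scaled quality value."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Return the Phred-scaled quality value for an error probability."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Reproduce the examples from the list: Q=10, Q=20, Q=30 (and Q=40).
for q in (10, 20, 30, 40):
    print(f"Q={q}: error probability {phred_to_error_probability(q):g}")
```

Because the scale is logarithmic, every increase of 10 in Q corresponds to a tenfold lower error probability.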
Base qualities for individual sequencing reads typically range from 0 to a maximum of 40, although the range differs slightly between technologies (Figure 2). Base qualities higher than 40 are typically assigned only to consensus sequences that are supported by overlapping regions of more than one read.
Traditionally, base quality values were stored as integers. As DNA sequencing became cheaper and the amount of generated sequence data consequently started to grow exponentially, it became common to store the information more compactly as ASCII characters (Fig. 2).
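How a list of integer qualities maps onto an ASCII string can be sketched as follows. The example assumes the widely used Phred+33 convention, in which 33 is added to each quality value to obtain the character code; other offsets (e.g. 64 in older Illumina data) exist, so the offset is kept as a parameter. The function names are hypothetical.

```python
PHRED_OFFSET = 33  # assumption: Sanger-style "Phred+33" encoding


def encode_qualities(qualities, offset=PHRED_OFFSET):
    """Turn a list of integer quality values into an ASCII quality string."""
    return "".join(chr(q + offset) for q in qualities)


def decode_qualities(quality_string, offset=PHRED_OFFSET):
    """Turn an ASCII quality string back into a list of integers."""
    return [ord(char) - offset for char in quality_string]


scores = [0, 10, 20, 30, 40]
encoded = encode_qualities(scores)
print(encoded)                    # prints !+5?I
print(decode_qualities(encoded))  # prints [0, 10, 20, 30, 40]
```

Each quality value takes exactly one character in this representation, whereas a text file of integers needs up to two digits plus a separator per value.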

Relevance
Genome sequencing
Quality values used to be relevant for all applications of DNA sequencing. Nowadays, however, DNA sequencing has become so cheap that read coverage, i.e. the number of reads covering a nucleotide in the template sequence, has taken away much of their importance for the de novo sequencing of genomes: at high coverage, an erroneous base call in one read is usually outvoted by the other reads covering the same position.
Transcriptomics / Genotyping
Base qualities remain highly relevant for applications where uniformly high coverage is not guaranteed, e.g. transcriptome sequencing, and for applications that search for genetic variants, e.g. genotyping.
Storing base quality information
Base quality values are stored differently, depending on the sequence file format.
The FASTA format cannot store sequence and base quality information within one file. In this case, each FASTA sequence file is accompanied by a corresponding base quality file. The headers of the corresponding sequence and base quality string must then be identical. Moreover, there must be a one-to-one relationship between nucleotides in the sequence and base quality values in the quality string.
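The two consistency requirements (identical headers, one quality value per nucleotide) can be checked with a short script. The sketch below assumes that the quality file uses the same ">" header lines as the FASTA file and stores the qualities as whitespace-separated integers, as in the common .qual layout. The function names and the example reads are invented for illustration.

```python
def parse_records(text):
    """Parse FASTA-style text into a dict mapping header -> list of body lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return records


def check_pair(fasta_text, qual_text):
    """Return a list of problems found when pairing sequences with qualities."""
    seqs = {h: "".join(body) for h, body in parse_records(fasta_text).items()}
    quals = {h: [int(v) for v in " ".join(body).split()]
             for h, body in parse_records(qual_text).items()}
    problems = []
    for header in sorted(set(seqs) | set(quals)):
        if header not in quals:
            problems.append(f"{header}: no quality record")
        elif header not in seqs:
            problems.append(f"{header}: no sequence record")
        elif len(seqs[header]) != len(quals[header]):
            problems.append(f"{header}: {len(seqs[header])} bases but "
                            f"{len(quals[header])} quality values")
    return problems


# Made-up example data: read2 is missing one quality value.
fasta_text = ">read1\nACGT\n>read2\nGGCA\n"
qual_text = ">read1\n30 32 28 15\n>read2\n40 40 12\n"
print(check_pair(fasta_text, qual_text))
# prints ['read2: 4 bases but 3 quality values']
```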
The FASTQ format stores sequence and quality information within one file. See figure 3 for an example.
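As a rough sketch of how such a file can be read, the following Python generator parses FASTQ text under two assumptions that are not stated above: each record occupies exactly four lines (header starting with "@", sequence, separator starting with "+", quality string), and qualities are Phred+33 encoded. The record used here is invented and is not the one shown in figure 3.

```python
def read_fastq(text, offset=33):
    """Yield (header, sequence, qualities) tuples from four-line FASTQ text."""
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of four non-empty lines")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"{header}: sequence and quality lengths differ")
        # Decode each quality character into its integer value.
        yield header[1:], sequence, [ord(char) - offset for char in quality]


example = "@read1\nACGT\n+\nII5!\n"
for name, sequence, qualities in read_fastq(example):
    print(name, sequence, qualities)
# prints read1 ACGT [40, 40, 20, 0]
```

Real FASTQ files may wrap sequence and quality lines; a parser for such files would need to be more careful than this four-line sketch.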