All DNA sequencing approaches are alike in that the sequencer has to interpret a signal indicating the presence of a given nucleotide at a given position. The resulting raw data then serve as input to a program, the basecaller, which performs the actual readout: it translates the signal into a nucleotide sequence.
Base calling is not error-free: some signals are noisy (Trace A in Figure 1), whereas others are clear and easy to interpret (Trace B in Figure 1). Note that Traces A and B cover the same stretch of the template sequence. Because the four nucleotide letters carry no information about the confidence in a given base call, base quality values were introduced to propagate information about sequence quality to downstream applications. Figure 1 outlines the general procedure using Sanger sequencing as an example, but the same principle applies to other sequencing technologies.
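Base qualities are traditionally represented on a logarithmic scale, as ten times the negative decadic logarithm of the error probability P of a base call (Q = -10 * log10(P)). A base quality of 20 thus corresponds to an error probability of 1 in 100, and a base quality of 40 to an error probability of 1 in 10,000.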
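The relationship between error probability and base quality can be sketched in a few lines of Python. This is only an illustration that assumes the common Phred scaling, Q = -10 * log10(P); the function names are hypothetical and not taken from any particular basecaller.

```python
import math


def phred_from_error_prob(p: float) -> float:
    """Return the base quality for an error probability p (assumes Q = -10 * log10(p))."""
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10.0 * math.log10(p)


def error_prob_from_phred(q: float) -> float:
    """Return the error probability for a base quality q (inverse of the function above)."""
    return 10.0 ** (-q / 10.0)


if __name__ == "__main__":
    # Print the error probability for a few typical quality values.
    for q in (10, 20, 30, 40):
        print(q, error_prob_from_phred(q))
```

Under this scaling, every increase of 10 in the quality value corresponds to a tenfold decrease in the error probability.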
Base qualities for individual sequencing reads typically range from 0 to 40, although the range differs slightly between technologies (Figure 2). Base qualities above 40 are typically assigned only to consensus sequences that are supported by overlapping regions of more than one read.
Traditionally, base quality values were stored as integers. As DNA sequencing became cheaper and the amount of generated sequence data consequently began to grow exponentially, it became common to store the information more compactly as ASCII characters, one character per base (Figure 2).
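The ASCII representation can be illustrated with a short Python sketch. It assumes that each quality value is stored as the character whose ASCII code equals the quality plus a fixed offset. The offset of 33 used here is an assumption (it is the one commonly used for Sanger-style qualities); other encodings use a different offset, so it is exposed as a parameter. The function names are hypothetical.

```python
def encode_qualities(qualities, offset=33):
    """Encode a list of integer base qualities as an ASCII string.

    Assumption: each quality q is stored as the character with code q + offset.
    """
    return "".join(chr(q + offset) for q in qualities)


def decode_qualities(quality_string, offset=33):
    """Decode an ASCII quality string back into integer base qualities."""
    return [ord(char) - offset for char in quality_string]


if __name__ == "__main__":
    encoded = encode_qualities([0, 20, 40])
    print(encoded)                    # one character per base
    print(decode_qualities(encoded))  # round trip back to integers
```

Stored this way, each quality value occupies exactly one character, whereas a text file of integers needs up to two digits plus a separator per base.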
Quality values used to be relevant for all applications of DNA sequencing. Nowadays, however, sequencing has become so cheap that high read coverage, i.e. the number of reads covering a nucleotide in the template sequence, has reduced their importance for the de novo sequencing of genomes.
Base qualities remain highly relevant for applications in which a uniformly high coverage is not guaranteed, e.g. transcriptome sequencing, and for the detection of genetic variants, e.g. in the context of genotyping.
Base quality values are stored differently, depending on the sequence file format.
The FASTA format cannot store sequence and base quality information within one file. Each FASTA sequence file is therefore accompanied by a corresponding base quality file. The header of a sequence and the header of its base quality string must be identical. Moreover, there must be a one-to-one relationship between the nucleotides in the sequence and the values in the quality string.
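The two consistency requirements, identical headers and one quality value per nucleotide, can be checked with a sketch like the following. It assumes that the quality file mirrors the FASTA layout, i.e. each record starts with a ">" header line and is followed by whitespace-separated integer quality values. The function names and the example records are hypothetical.

```python
def parse_records(text):
    """Parse '>'-headed records into a dict mapping header -> list of body lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return records


def check_pairing(fasta_text, qual_text):
    """Return a list of problems found when pairing sequences with qualities."""
    sequences = {h: "".join(body) for h, body in parse_records(fasta_text).items()}
    # Assumption: quality values are whitespace-separated integers.
    qualities = {
        h: [int(value) for value in " ".join(body).split()]
        for h, body in parse_records(qual_text).items()
    }
    problems = []
    for header in sorted(set(sequences) | set(qualities)):
        if header not in sequences or header not in qualities:
            problems.append(f"{header}: header present in only one file")
        elif len(sequences[header]) != len(qualities[header]):
            problems.append(
                f"{header}: {len(sequences[header])} nucleotides but "
                f"{len(qualities[header])} quality values"
            )
    return problems


if __name__ == "__main__":
    fasta = ">read1\nACGT\n"
    qual = ">read1\n30 30 25 10\n"
    print(check_pairing(fasta, qual))  # an empty list means the files are consistent
```

An empty result list indicates that every sequence has a matching quality record of the same length.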
The FASTQ format stores sequence and quality information within one file. See Figure 3 for an example.
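Reading such a file can be sketched as follows. The sketch assumes the common four-line FASTQ layout (an "@" header line, the sequence, a "+" separator line, and the quality string) and an ASCII offset of 33 for the qualities. Both are assumptions for illustration, as are the function name and the example record; multi-line FASTQ records are not handled.

```python
def parse_fastq(text, offset=33):
    """Parse four-line FASTQ records into (header, sequence, qualities) tuples.

    Assumptions: one record spans exactly four lines, and each quality is
    stored as the character with ASCII code quality + offset.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("number of non-empty lines is not a multiple of four")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality_string = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality_string):
            raise ValueError(f"{header[1:]}: sequence and quality lengths differ")
        qualities = [ord(char) - offset for char in quality_string]
        records.append((header[1:], sequence, qualities))
    return records


if __name__ == "__main__":
    example = "@read1\nACGT\n+\n!5II\n"
    for name, sequence, qualities in parse_fastq(example):
        print(name, sequence, qualities)
```

Because sequence and qualities sit in the same record, the one-to-one relationship between nucleotides and quality values can be verified while reading, without a second file.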