The first indicator of the quality of your sequencing data is the per base sequence quality of your raw reads. Often you will see the quality decrease with increasing base position, just as in the FastQC image below (Fig. 1). But what is the reason for this, and what are the consequences?
Figure 1: Per base sequence quality control with typical decrease of the quality over the read.
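If you want to reproduce the numbers behind such a plot yourself, the following minimal Python sketch computes the mean quality score per read position. It is not how FastQC is implemented. It assumes Phred+33 encoded qualities and simple four-line FASTQ records without line wrapping, and the function name per_base_mean_quality and the two example reads are made up for illustration.

```python
import io
from collections import defaultdict


def per_base_mean_quality(fastq_handle, offset=33):
    """Return a list of mean Phred scores, one per read position.

    Assumes four-line FASTQ records and Phred+33 encoding (offset=33).
    """
    totals = defaultdict(int)  # position -> sum of quality scores
    counts = defaultdict(int)  # position -> number of reads covering it
    for line_no, line in enumerate(fastq_handle):
        if line_no % 4 != 3:   # only the 4th line of each record holds qualities
            continue
        for pos, char in enumerate(line.rstrip("\n")):
            totals[pos] += ord(char) - offset
            counts[pos] += 1
    return [totals[p] / counts[p] for p in sorted(counts)]


# Two made-up 4-base reads whose quality drops at the last position.
example = io.StringIO(
    "@read1\nACGT\n+\nIIH5\n"
    "@read2\nACGT\n+\nIGF#\n"
)
print(per_base_mean_quality(example))  # [40.0, 39.0, 38.0, 11.0]
```

For a real file, pass an open file handle instead of the in-memory example, for instance the result of open() on your FASTQ file.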
First of all, don’t panic! It is a normal and well-known phenomenon. The reason for the decreasing sequence quality lies in Illumina's sequencing technology.
Illumina relies on the sequencing by synthesis procedure. During each cycle of the process, the sequencer washes chemicals that include variants of all four nucleotides over the flow cell (which carries many clusters, each made up of identical DNA fragments). The nucleotides have a blocker (terminator cap) so that only one base gets added to each molecule of DNA at a time. After the coupled fluorescence signal has been detected, the blocker is removed and the cycle can start again. This way, the DNA fragments in each cluster get sequenced synchronously, each cycle emitting a specific fluorescence signal.
Different errors can occur during the sequencing process. The main reason for the decreasing sequence quality is so-called phasing. Phasing means that the blocker of a nucleotide is not correctly removed after signal detection. In the next cycle, no new nucleotide can bind to this DNA fragment, so the old nucleotide is detected one more time, and its fluorescence signal (probably) differs from the synchronous signal of the other fragments (Fig. 2). From now on this DNA fragment will be one cycle behind the rest (out of phase), polluting the light signal that the sequencer's camera has to read. A similar effect occurs if a nucleotide has a defective terminator cap (prephasing). In this case two nucleotides can bind in one cycle, so the fragment will be one cycle ahead of the rest.
These errors occur with a low probability in any single cycle. But over time (with increasing read length) they add up and pollute the light signal more and more, so the signal of a cluster becomes increasingly asynchronous. Since the light signal is used to calculate the quality scores, the asynchronous signal results in a decreasing sequence quality score.
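The accumulation described above can be illustrated with a small toy simulation. The sketch below is a simplified model, not Illumina's actual signal processing: it assumes that in every cycle each fragment independently falls one base behind with probability phasing or jumps one base ahead with probability prephasing. The function name in_phase_fraction and both default rates are illustrative assumptions, not measured values.

```python
def in_phase_fraction(cycles, phasing=0.005, prephasing=0.002):
    """Return, for each cycle, the fraction of fragments in a cluster
    that are at the expected position (offset 0).

    Toy model: the per-cycle rates are assumed constant and independent.
    """
    # offset (in bases) relative to the expected position -> fraction of fragments
    dist = {0: 1.0}
    steps = ((-1, phasing), (+1, prephasing), (0, 1.0 - phasing - prephasing))
    fractions = []
    for _ in range(cycles):
        new_dist = {}
        for offset, frac in dist.items():
            for shift, prob in steps:
                key = offset + shift
                new_dist[key] = new_dist.get(key, 0.0) + frac * prob
        dist = new_dist
        fractions.append(dist.get(0, 0.0))
    return fractions


# Print the in-phase fraction at a few cycles of a hypothetical 150-cycle run.
result = in_phase_fraction(150)
for cycle in (1, 50, 100, 150):
    print(f"cycle {cycle:3d}: {result[cycle - 1]:.3f} of fragments in phase")
```

Even with per-cycle error rates well below one percent, the in-phase fraction in this model shrinks steadily from cycle to cycle, which mirrors the gradual loss of signal purity and hence of quality towards the end of the read.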
As we now know, the decreasing base sequence quality is due to an unwanted but unavoidable process. It limits the length of high-quality reads. New sequencing chemistries are largely intended to minimize the phasing problem, increasing the read length that can be reached before the quality begins to decrease.
ecSeq is a bioinformatics solution provider with solid expertise in the analysis of high-throughput sequencing data. We can help you to get the most out of your sequencing experiments by developing data analysis strategies and expert consulting. We organize public workshops and conduct on-site trainings on NGS data analysis.
Last updated on January 20, 2017