RNA-Seq has replaced microarrays for many applications in biomarker discovery. Prices have fallen substantially in recent years, and the sequence data yields more information than gene expression alone. RNA-Seq also does not require a reference genome to exist. However, analyzing the resulting data is much more challenging and requires more resources than other approaches.
One of the most resource-intensive steps in an NGS data analysis is the alignment of the sequence reads to the reference genome. A common question is therefore which NGS alignment tool is the best. As we show in the referenced article, finding the best tool is not possible without in-depth examination of your use case.
Finding an optimal alignment for an NGS sequence read is already a challenging task, and for RNA sequencing data it has to be carried out millions of times. Compared to the alignment of DNA sequences, tools that align sequences from RNA transcripts also have to cope with introns: a read that spans an exon-exon junction aligns to the genome with a large gap where the intron was spliced out.
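To illustrate what such a gap looks like in practice, here is a minimal sketch that summarizes a SAM CIGAR string. In the SAM format, the `N` operation marks a skipped region of the reference, which spliced aligners use for introns. The function name `summarize_cigar` and the example CIGAR strings are made up for illustration and are not taken from the benchmark.

```python
import re
from typing import List, Tuple

# One CIGAR element: a length followed by a SAM operation code.
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance the position on the reference (per the SAM spec).
REF_CONSUMING = set("MDN=X")


def summarize_cigar(cigar: str) -> Tuple[int, List[int]]:
    """Return (reference span, list of intron-like gap lengths) for a CIGAR string."""
    ref_span = 0
    gaps: List[int] = []
    for length, op in CIGAR_ELEMENT.findall(cigar):
        n = int(length)
        if op in REF_CONSUMING:
            ref_span += n
        if op == "N":  # skipped reference region, e.g. a spliced-out intron
            gaps.append(n)
    return ref_span, gaps


# Hypothetical spliced read: 50 bases matched, a 1200-base gap, 50 bases matched.
spliced = summarize_cigar("50M1200N50M")
# Hypothetical unspliced read for comparison.
unspliced = summarize_cigar("100M")
print(spliced, unspliced)
```

A 100-base read with one such gap covers 1,300 bases of the reference, which is why aligners designed for DNA reads tend to struggle with junction-spanning RNA reads.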
In order to compare different short read aligners, we used a published, real-life RNA-Seq dataset, the publicly available SRR534289. All optimal alignments (including multiple mapping loci) of 100,000 read pairs of each sample were calculated with the full-sensitivity mapping tool RazerS 3. In the benchmark shown below, we measured how well different NGS mappers, run with default parameters, find all optimal hits. True positives are reads with up to 10 multiple mapping loci, allowing up to 10 errors (mismatches and indels). Note that we explicitly want to find all multiple mapping loci in this benchmark, not only unique mapping loci or just one random hit of several. Please find more information in the benchmark details here.
The following comparison addresses the question: how accurately do the tools report alignments when compared to the known truth? On-target hits is the number of reported alignments that map to one of the true locations for the sequence. False positives is the number of reported alignments that do not map to any of the true positions.
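The two counts defined above can be sketched as follows. This is only an illustration under stated assumptions, not the benchmark's actual evaluation code: the data layout (a locus as chromosome, start position, and strand), the function name `score_alignments`, and the `tolerance` parameter for position matching are all hypothetical, since the exact matching rule is not given here.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumed representation of a mapping locus: (chromosome, start position, strand).
Locus = Tuple[str, int, str]


def score_alignments(
    reported: Dict[str, Iterable[Locus]],
    truth: Dict[str, Set[Locus]],
    tolerance: int = 0,
) -> Tuple[int, int]:
    """Return (on-target hits, false positives) over all reported alignments.

    A reported alignment is on target if it matches any true locus of the same
    read on chromosome and strand, with start positions at most `tolerance`
    bases apart (the tolerance is an assumption of this sketch).
    """
    on_target = 0
    false_positives = 0
    for read_id, loci in reported.items():
        true_loci = truth.get(read_id, set())
        for chrom, pos, strand in loci:
            matches_truth = any(
                chrom == t_chrom and strand == t_strand and abs(pos - t_pos) <= tolerance
                for t_chrom, t_pos, t_strand in true_loci
            )
            if matches_truth:
                on_target += 1
            else:
                false_positives += 1
    return on_target, false_positives


# Toy data, invented for illustration only.
truth = {
    "read1": {("chr1", 100, "+"), ("chr2", 500, "-")},
    "read2": {("chr3", 42, "+")},
}
reported = {
    "read1": [("chr1", 100, "+"), ("chr5", 7, "+")],
    "read2": [("chr3", 44, "+")],
}
print(score_alignments(reported, truth))
print(score_alignments(reported, truth, tolerance=5))
```

Because every reported alignment is counted, a tool that reports many loci per read can score both more on-target hits and more false positives than a tool that reports a single hit.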
Next, we tracked the computational resources used by the different tools. Note that several tools need significantly more memory than a typical desktop computer has.
* The time shown includes the index loading step, which dominates the runtime for some tools. This step will be less influential, or even negligible, when mapping real-life datasets (>10 million reads).
** By default, BBMap takes as much memory as the system provides. The minimum requirement for the genome used here is 24 GB.
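For readers who want to collect similar numbers for their own data, here is a minimal sketch of measuring wall-clock time, CPU time, and peak memory of a command. It is not the procedure used for this benchmark. It assumes a Unix-like system (Python's `resource` module is not available on Windows), the helper name `measure` is hypothetical, and the command shown is a harmless placeholder that you would replace with your aligner invocation.

```python
import resource
import subprocess
import sys
import time
from typing import Dict, List


def measure(cmd: List[str]) -> Dict[str, float]:
    """Run `cmd` and return its wall-clock time, CPU time, and peak memory.

    Peak memory comes from ru_maxrss of all waited-for child processes, so it
    is only meaningful if this is the largest child run so far in this process.
    The unit is kilobytes on Linux and bytes on macOS.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall_seconds = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu_seconds = (after.ru_utime + after.ru_stime) - (before.ru_utime + before.ru_stime)
    return {
        "wall_s": wall_seconds,
        "cpu_s": cpu_seconds,
        "peak_rss": float(after.ru_maxrss),
    }


# Placeholder command; substitute the aligner call you want to profile.
stats = measure([sys.executable, "-c", "print(sum(range(1000)))"])
print(stats)
```

As the first footnote points out, a measurement on a small read set is dominated by index loading for some tools, so timings from a small test run should not be extrapolated linearly to a full dataset.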
Further criteria that people commonly use for selecting an aligner are
ecSeq is a bioinformatics solution provider with solid expertise in the analysis of high-throughput sequencing data. We can help you get the most out of your sequencing experiments by developing data analysis strategies and providing expert consulting. We organize public workshops and conduct on-site training on NGS data analysis.
Last updated on November 11, 2017