Next-Generation Sequencing (NGS) has become an essential tool in modern life-science research, generating vast amounts of data that provide insights into genetic variation, gene expression, epigenetic modifications, and much more. However, analyzing NGS data can be challenging, especially for those new to the field. This article aims to provide an overview of relevant topics for those unsure of how to begin with NGS data analysis.
The first rule for a data scientist is to thoroughly understand how the data you are going to work with is generated. When analyzing NGS data, this means first gaining a solid understanding of the relevant machines and their respective sequencing methods. Today, the vast majority of sequencing data is generated by second-generation sequencers from vendors such as Illumina or MGI, because the sequencing method they use (sequencing by synthesis, or SBS) is extremely cost-effective and accurate. However, third-generation sequencers, which produce long reads from single molecules, are increasingly entering the market. Particularly noteworthy is the supplier Oxford Nanopore, whose devices stand out for extremely long read lengths, real-time sequencing, exceptional portability and accessibility (being very small and inexpensive), and the ability to sequence native DNA or RNA.
Before the sequencer begins generating data, the samples need to be prepared for sequencing. This process, typically referred to as NGS library generation, involves steps such as DNA or RNA extraction, (in the case of RNA) reverse transcription, fragmentation, end-repair, adapter ligation, and amplification. The goal is to produce a library of DNA fragments (or cDNA in the case of RNA sequencing) that can be loaded onto the sequencer. NGS library generation is a crucial step that impacts the data quality and subsequent analysis steps.
Like any other field, NGS analysis requires a strong understanding of fundamental concepts and terminology. It would be challenging to understand relevant publications and communicate with other researchers without familiarity with key terms such as read, alignment, reference genome, coverage, depth, paired-end, and so on.
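To make two of these terms more concrete, the following Python sketch estimates the expected mean depth from the sequencing yield and derives per-base depth and breadth of coverage from a list of alignment intervals. The function names, the 0-based half-open interval convention, and the toy numbers are assumptions made for this illustration. Usage of the terms "coverage" and "depth" also varies between authors, and in practice these values would be computed from an alignment file with dedicated tools.

```python
from typing import List, Sequence, Tuple


def expected_mean_depth(num_fragments: int, read_length: int,
                        genome_size: int, paired_end: bool = False) -> float:
    """Expected mean depth = total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    # A paired-end fragment contributes two reads, a single-end fragment one.
    reads_per_fragment = 2 if paired_end else 1
    return num_fragments * reads_per_fragment * read_length / genome_size


def per_base_depth(alignments: Sequence[Tuple[int, int]],
                   reference_length: int) -> List[int]:
    """Count how many alignments overlap each reference position.

    Alignments are (start, end) pairs, 0-based and end-exclusive
    (an assumption of this sketch).
    """
    depth = [0] * reference_length
    for start, end in alignments:
        # Clip each interval to the reference boundaries.
        for pos in range(max(start, 0), min(end, reference_length)):
            depth[pos] += 1
    return depth


def breadth_of_coverage(depth: Sequence[int], min_depth: int = 1) -> float:
    """Fraction of reference positions covered by at least min_depth reads."""
    if not depth:
        return 0.0
    return sum(1 for d in depth if d >= min_depth) / len(depth)


if __name__ == "__main__":
    # Toy example: two overlapping reads on an 8 bp reference.
    toy_depth = per_base_depth([(0, 4), (2, 6)], reference_length=8)
    print(toy_depth, breadth_of_coverage(toy_depth))
    # Hypothetical run: one million single-end 100 bp reads on a 10 Mbp genome.
    print(expected_mean_depth(1_000_000, 100, 10_000_000))
```

The per-base loop is deliberately naive so that the definition stays visible. For real genomes a difference array or an established tool would be far more efficient.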
Today, there are many options available for the analysis of NGS data. These range from the direct use of command-line tools such as BWA, STAR, Bowtie or samtools on your local computer, to cloud-based one-click analysis platforms like BaseSpace. If you would like to avoid writing code, you can use GUI tools such as CLC Genomics Workbench or Galaxy. You can use bioinformatics workflow languages like Nextflow or Snakemake to write scalable and shareable pipelines, or write those in your programming language of choice. You can also use one of the integrated cloud-based informatics platforms such as SevenBridges, or rely directly on the computational resources of major cloud platforms such as AWS or Google Cloud.
By understanding these general options and tools, you can select the approach that best suits your needs and technical skills. As no single approach is ideal for all types of data or research questions, you may find that a combination of different tools and methods is required.
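As a rough illustration of the "programming language of choice" route, the sketch below assembles a minimal paired-end alignment pipeline from BWA and samtools commands in Python. It only builds and prints the commands; it does not execute them. All file names are hypothetical, the helper names are invented for this example, and the sketch assumes that the reference has already been indexed with bwa index and that both tools are installed.

```python
import shlex
from typing import List


def build_alignment_commands(reference: str, reads_1: str, reads_2: str,
                             output_bam: str, threads: int = 4) -> List[List[str]]:
    """Return the three commands of a minimal align, sort, index pipeline."""
    # bwa mem writes SAM records to standard output.
    align = ["bwa", "mem", "-t", str(threads), reference, reads_1, reads_2]
    # samtools sort reads from standard input ("-") and writes a sorted BAM.
    sort = ["samtools", "sort", "-@", str(threads), "-o", output_bam, "-"]
    # samtools index creates the index next to the sorted BAM file.
    index = ["samtools", "index", output_bam]
    return [align, sort, index]


def as_shell_pipeline(commands: List[List[str]]) -> str:
    """Render the commands as a single, safely quoted shell line."""
    align, sort, index = commands
    return f"{shlex.join(align)} | {shlex.join(sort)} && {shlex.join(index)}"


if __name__ == "__main__":
    # Hypothetical input and output files.
    commands = build_alignment_commands("ref.fa", "sample_R1.fastq.gz",
                                        "sample_R2.fastq.gz", "sample.sorted.bam")
    print(as_shell_pipeline(commands))
```

A hand-written script like this is quick to start with, but workflow languages such as Nextflow or Snakemake add features like resuming failed runs and parallel execution, which matter once the number of samples grows.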
The first step in NGS data analysis is typically quality control and pre-processing of the raw data. This step involves assessing the quality of the raw sequencing data, filtering out low-quality reads, and trimming adapter sequences. For many NGS applications like DNA resequencing or RNA-Seq, the next step is to map the reads to a reference genome or transcriptome. It's important to select the appropriate tools based on the application and the research question being addressed. Common types of analyses include variant calling, differential gene expression analysis, epigenetic analysis, and metagenomic analysis, each with its own set of specific tasks and tools.
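The pre-processing idea can be sketched in a few lines of Python: trim a 3' adapter from a read, then discard the read if it has become too short or its mean base quality is too low. This is a simplified illustration, not how dedicated trimming tools work internally. It only finds exact adapter matches (no mismatches or indels), the thresholds are arbitrary example values, the function names are invented here, and the adapter string in the usage example is just a placeholder for whatever adapter your library kit uses.

```python
from typing import List, Optional, Tuple


def trim_adapter(sequence: str, qualities: List[int], adapter: str,
                 min_overlap: int = 3) -> Tuple[str, List[int]]:
    """Remove a 3' adapter, including a partial adapter at the read end."""
    # Case 1: the complete adapter occurs somewhere in the read.
    cut = sequence.find(adapter)
    if cut == -1:
        # Case 2: the read ends with a prefix of the adapter.
        # Try the longest possible overlap first.
        longest = min(len(adapter) - 1, len(sequence))
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                cut = len(sequence) - overlap
                break
    if cut == -1:
        return sequence, qualities  # no adapter found
    return sequence[:cut], qualities[:cut]


def preprocess_read(sequence: str, qualities: List[int], adapter: str,
                    min_length: int = 5,
                    min_mean_quality: float = 20.0
                    ) -> Optional[Tuple[str, List[int]]]:
    """Trim the adapter, then return the read or None if it fails the filters."""
    sequence, qualities = trim_adapter(sequence, qualities, adapter)
    if not sequence or len(sequence) < min_length:
        return None  # too short after trimming
    if sum(qualities) / len(qualities) < min_mean_quality:
        return None  # mean Phred quality below threshold
    return sequence, qualities


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"  # placeholder adapter for this example
    read = "ACGTACGT" + adapter[:10]  # read running into the adapter
    print(preprocess_read(read, [30] * len(read), adapter))
```

Filtering after trimming, rather than before, avoids discarding reads whose low mean quality is caused only by the adapter-contaminated tail.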
Having a solid understanding of the various NGS data file formats is crucial for effective NGS data analysis. After some years of trial and error, the community has now settled on a number of widely used standard file formats such as FASTA, FASTQ, SAM/BAM, VCF, GTF, and others. Choose the appropriate file format to enable compatibility between different tools and platforms. It is also important to be aware of the advantages and limitations of each file format for your downstream analysis.
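To give a feel for one of these formats, the sketch below reads FASTQ records and decodes their quality strings in Python. It assumes the common four-lines-per-record layout and the Phred+33 quality encoding. Files with wrapped sequence lines or older encodings would need extra handling, and the names used here are chosen for this example only.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # decoded Phred scores, one per base


def parse_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ input (header, sequence, '+', quality)."""
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped, None)
        separator = next(stripped, None)
        quality = next(stripped, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # The read name is the header text up to the first space.
        name = header[1:].split(" ", 1)[0]
        # Each quality character encodes a Phred score as ASCII code minus offset.
        yield FastqRecord(name, sequence, [ord(c) - offset for c in quality])


def phred_to_error_probability(quality: int) -> float:
    """Convert a Phred score Q to the error probability 10^(-Q/10)."""
    return 10 ** (-quality / 10)


if __name__ == "__main__":
    example = ["@read1 sample=demo\n", "ACGT\n", "+\n", "II5!\n"]
    for record in parse_fastq(example):
        print(record.name, record.sequence, record.qualities)
```

Because parse_fastq accepts any iterable of lines, it can be used directly with an open text file handle, including one returned by gzip.open(path, "rt") for compressed files.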
In the diverse and rapidly evolving field of NGS, beginners and experienced researchers alike will eventually encounter issues or have questions about their analysis workflows. Luckily, there are several ways to find relevant information and get help online. You can search for answers using your favorite search engine. Important information and analysis protocols are often found in relevant publications (e.g. Briefings in Bioinformatics, Nucleic Acids Research, Nature Protocols). Therefore, it also makes sense to use specialized search engines such as Google Scholar or PubMed.
Online forums and communities can be great sources of information and support when you get stuck with problems. There are several communities where you can ask questions, share knowledge, and collaborate with other researchers in the field, including Biostars and SEQanswers. Also, many NGS data analysis tools provide comprehensive documentation, tutorials and best-practice guides. However, you should double-check that this information is up-to-date.
Collecting all the necessary information from publications, analysis protocols, and forums can be time-consuming, especially since it is often hard to judge how current and relevant the information is. There have been cases where well-known tools that were widely used just three years ago have since been superseded by alternative approaches.
A good approach to acquiring up-to-date knowledge and experience is to work with experts in the field. For example, as a PhD student or trainee you could join a research group that focuses on an interesting field or uses the methods you want to learn more about. Sometimes experts from bioinformatics core facilities or service providers are available for consultation, which can be helpful when you encounter difficulties.
Workshops and training courses can be a great way to learn about NGS data analysis and gain hands-on experience with different tools and workflows. In these workshops, you not only learn how to perform the analysis yourself with the help of experienced trainers, but also receive firsthand assistance with any issues you encounter and learn the little tricks that make your life as a data analyst easier. Some universities and research institutions offer internal training courses. Additionally, there are many online courses and webinars available. However, it's important to note that these online options often lack hands-on learning opportunities.
ecSeq is a bioinformatics solution provider with solid expertise in the analysis of high-throughput sequencing data. We can help you to get the most out of your sequencing experiments by developing data analysis strategies and expert consulting. We organize public workshops and conduct on-site trainings on NGS data analysis.
Last updated on June 16, 2023