Important file formats for NGS data analysis

The interpretation and analysis of Next-Generation Sequencing (NGS) data rely on several specialized file formats, each of which serves a distinct purpose. This article provides an overview of the most important file formats in NGS data analysis: FASTA, FASTQ, SAM/BAM/CRAM, BED/GTF, and bedgraph.

Plain Sequences: FASTA

The FASTA format, named after the FASTA sequence alignment software package, is primarily used for storing nucleotide or amino acid sequences. Each record in a FASTA file starts with a single-line description, the header line, which begins with a ">" symbol. The sequence data follow on the subsequent lines, typically wrapped at a fixed width such as 60 or 80 characters. A single file can hold several records, one after another. Simplicity and human readability are the primary advantages of the FASTA format; beyond the free-text header, it carries no additional information such as base qualities.
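
The record layout described above can be read with a few lines of Python. The following is a minimal sketch, not a replacement for an established parser: the function name `parse_fasta` and the example records are made up for illustration, and the whole input is assumed to fit into memory.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records into a {header: sequence} dictionary."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the ">" marker
            records[header] = ""
        elif header is not None:
            records[header] += line  # join the wrapped sequence lines
    return records


# Hypothetical example data: two records, the first one line-wrapped.
example = """>seq1 hypothetical example record
ACGTACGTAC
GTACGT
>seq2
TTGACA
"""

records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq))
```

The sketch keeps the full header line as the key. Many tools treat only the text up to the first whitespace as the sequence name, so that choice may need to be adapted.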

Sequences with Quality Scores: FASTQ

The FASTQ format stores nucleotide sequences together with their per-base quality scores. Each record in a FASTQ file contains four lines: the sequence identifier with an optional description (starting with an "@" symbol), the raw sequence, a separator line (beginning with a "+" symbol, optionally followed by the identifier again), and the quality string. The quality string holds one ASCII character per base. Each character encodes a Phred score Q, which expresses the estimated probability of a sequencing error at that base as 10^(-Q/10), so Q = 30 corresponds to an error probability of 1 in 1000. The output from sequencing machines typically comes in FASTQ format.
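
To make the quality encoding concrete, here is a small sketch that reads four-line records and decodes the quality string. It assumes the Phred+33 offset used by Sanger and current Illumina output (older Illumina files used an offset of 64) and unwrapped records of exactly four lines. The names `parse_fastq` and `error_probability` and the example read are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ encoding


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (identifier, sequence, phred_scores) for each four-line record."""
    rows = [line.rstrip("\r\n") for line in lines]
    rows = [row for row in rows if row]  # ignore empty lines
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(rows), 4):
        ident, seq, sep, qual = rows[i:i + 4]
        if not ident.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record starting at row {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {ident}")
        # One ASCII character per base: subtract the offset to get Q.
        yield ident[1:], seq, [ord(ch) - PHRED_OFFSET for ch in qual]


def error_probability(q: int) -> float:
    """Convert a Phred score Q into the estimated error probability."""
    return 10 ** (-q / 10)


# Hypothetical example read with qualities I, I, 5 and !.
example = "@read1 hypothetical example\nACGT\n+\nII5!\n"
reads = list(parse_fastq(example.splitlines()))
for ident, seq, scores in reads:
    print(ident, seq, scores, [error_probability(q) for q in scores])
```

With the Phred+33 offset, the character "I" (ASCII 73) decodes to Q = 40 and "!" (ASCII 33) to Q = 0, the lowest possible score.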

Alignments: SAM/BAM/CRAM

Sequence Alignment/Map (SAM), Binary Alignment/Map (BAM), and Compressed Reference-oriented Alignment Map (CRAM) are formats designed to store sequence alignment information. SAM is a human-readable, tab-delimited text format that contains an optional header section with metadata, followed by one line per alignment. BAM is the binary equivalent of SAM; it holds the same information in a compressed form that takes less storage space and can be processed faster. CRAM is a more recent development that compresses further by storing only the differences between the aligned sequences and a reference sequence. This drastically reduces the required storage space, but reading such a file generally requires access to the same reference sequence that was used to write it.
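
Because SAM is plain tab-delimited text, a single alignment line can be taken apart without any special library. The sketch below splits a line into the eleven mandatory SAM columns and decodes a few bits of the FLAG field. The helper names and the example alignment are hypothetical, and only four of the flag bits are shown. BAM and CRAM are binary and need a dedicated tool or library instead.

```python
from typing import Dict, List, Union

# The eleven mandatory columns of a SAM alignment line, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> Dict[str, Union[str, int, List[str]]]:
    """Split one SAM alignment line into named fields."""
    if line.startswith("@"):
        raise ValueError("header lines are not alignment records")
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) < len(SAM_FIELDS):
        raise ValueError("an alignment line needs at least 11 columns")
    record: Dict[str, Union[str, int, List[str]]] = dict(zip(SAM_FIELDS, parts))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    record["TAGS"] = parts[len(SAM_FIELDS):]  # optional TAG:TYPE:VALUE fields
    return record


def describe_flag(flag: int) -> Dict[str, bool]:
    """Decode a subset of the bitwise FLAG field."""
    return {
        "paired": bool(flag & 0x1),
        "unmapped": bool(flag & 0x4),
        "reverse_strand": bool(flag & 0x10),
        "secondary": bool(flag & 0x100),
    }


# Hypothetical alignment: a 4 bp read on the reverse strand of chr1.
example_line = "read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
rec = parse_sam_line(example_line)
print(rec["RNAME"], rec["POS"], rec["CIGAR"], describe_flag(rec["FLAG"]))
```

POS is 1-based in SAM, so the example read starts at the 100th base of chr1.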

Genomic Annotations: BED/GTF

Browser Extensible Data (BED) and Gene Transfer Format (GTF) are formats used for storing gene and feature annotations. A BED file is a tab-delimited text file in which each data line represents one feature (such as a gene or transcript). The first three fields give the chromosome, start, and end of the feature; further optional fields hold annotations such as a name, score, and strand. GTF serves a similar purpose but is more rigidly structured: every line has nine fixed fields, the last of which holds a list of attributes. It is specifically designed to describe genomic features such as exons, genes, and transcripts together with their locations.
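
One practical difference between the two formats is the coordinate system: BED intervals are 0-based and half-open, whereas GTF intervals are 1-based and inclusive. The following sketch converts a GTF line into a six-column BED tuple under that convention. The function names and the example line are hypothetical, and the attribute parser is deliberately simple (it would break on semicolons inside quoted values).

```python
from typing import Dict, Tuple


def parse_gtf_attributes(text: str) -> Dict[str, str]:
    """Parse the ninth GTF column, e.g. 'gene_id "G1"; transcript_id "T1";'."""
    attrs: Dict[str, str] = {}
    for item in text.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def gtf_to_bed6(line: str) -> Tuple[str, int, int, str, str, str]:
    """Convert one GTF line to (chrom, start, end, name, score, strand)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GTF line must have exactly 9 columns")
    seqname, _source, _feature, start, end, score, strand, _frame, attributes = fields
    attrs = parse_gtf_attributes(attributes)
    bed_start = int(start) - 1  # 1-based inclusive -> 0-based start
    bed_end = int(end)          # the inclusive end equals the half-open end
    name = attrs.get("gene_id", ".")
    bed_score = score if score != "." else "0"  # assumption: map "." to 0
    return seqname, bed_start, bed_end, name, bed_score, strand


# Hypothetical exon covering bases 101 to 200 of chr1.
gtf_line = 'chr1\texample\texon\t101\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";'
bed = gtf_to_bed6(gtf_line)
print("\t".join(str(value) for value in bed))
print("feature length:", bed[2] - bed[1])
```

In BED coordinates the feature length is simply end minus start, which is one reason the half-open convention is convenient for interval arithmetic.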

Coverage Data: bedgraph

The bedgraph format (usually written bedGraph) is used to store continuous-valued data, such as gene expression levels or coverage data across the genome. It is a variant of the BED format and offers a compact text representation of genome-wide numeric data sets. Like the BED format, it is a tab-delimited text file; each line defines a chromosomal region (chromosome, start, end) followed by the numeric value that applies to that region.
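
As an illustration of how such region/value lines can be used, the sketch below parses bedgraph lines and computes the length-weighted mean value over all listed regions. It assumes the same 0-based, half-open coordinates as BED; the function names and the example values are made up for this example.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int, float]


def parse_bedgraph(lines: Iterable[str]) -> List[Interval]:
    """Read (chrom, start, end, value) tuples, skipping track/browser lines."""
    intervals: List[Interval] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        intervals.append((chrom, int(start), int(end), float(value)))
    return intervals


def weighted_mean(intervals: Iterable[Interval]) -> float:
    """Mean value per base, weighting each region by its length."""
    total_bases = 0
    total_signal = 0.0
    for _chrom, start, end, value in intervals:
        length = end - start  # half-open interval: no +1 needed
        total_bases += length
        total_signal += length * value
    return total_signal / total_bases if total_bases else 0.0


# Hypothetical coverage track: 100 bases at 2x, then 50 bases at 4x.
example = """track type=bedGraph name="hypothetical example"
chr1\t0\t100\t2.0
chr1\t100\t150\t4.0
"""

intervals = parse_bedgraph(example.splitlines())
print(len(intervals), "regions, mean value", round(weighted_mean(intervals), 3))
```

Weighting by region length matters because bedgraph merges runs of identical values into a single line, so a plain average over lines would overrepresent short regions.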

In conclusion, an understanding of these file formats is critical for NGS data analysis. Each format serves its unique function in the analysis workflow.

About us

ecSeq is a bioinformatics solution provider with solid expertise in the analysis of high-throughput sequencing data. We can help you to get the most out of your sequencing experiments by developing data analysis strategies and expert consulting. We organize public workshops and conduct on-site trainings on NGS data analysis.

Last updated on June 15, 2023