
 

Next Generation Sequencing Glossary: Methods, Terms and Definitions

This website is a glossary that briefly introduces different methods and applications in next generation sequencing. In addition, commonly used terms related to next generation sequencing are explained.

 

Adaptor

Adaptors are short oligonucleotides used in next generation sequencing. They are attached to the ends of the target DNA, either by ligation or by amplification with primers that contain the adaptor sequence. Adaptors are needed for DNA template enrichment (amplification) and for sequencing. Depending on the sequencing technology, the adaptors hybridize to complementary oligos that are coupled to beads or bound to the flow cell. Adaptor sequences differ between sequencing platforms.
 

Alignment (Sequence alignment)

Sequence alignment in bioinformatics is the arrangement of two or more DNA, RNA or protein sequences in order to identify similarities and differences.
Alignment is a crucial step in NGS data analysis and is used for SNP detection, variant analysis, genotyping etc. To characterize a new sequence, it is aligned to a reference sequence with a known genotype and/or phenotype. Sequencing reads, contigs or complete genome sequences can be used for alignment.
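
The idea of aligning a read to a reference in order to find differences can be sketched in a few lines of Python. This is only a minimal illustration under strong assumptions: no gaps are allowed, the sequences are invented, and the function name best_ungapped_alignment is hypothetical. Real NGS aligners use indexed, gap-aware algorithms.

```python
def best_ungapped_alignment(read, reference):
    """Slide the read along the reference and return (offset, mismatches)
    for the placement with the fewest mismatches. Gaps are not considered."""
    if not read or len(read) > len(reference):
        raise ValueError("read must be non-empty and no longer than the reference")
    best_offset, best_mismatches = 0, None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        # Record (reference position, reference base, read base) for each difference.
        mismatches = [
            (offset + i, ref_base, read_base)
            for i, (ref_base, read_base) in enumerate(zip(window, read))
            if ref_base != read_base
        ]
        if best_mismatches is None or len(mismatches) < len(best_mismatches):
            best_offset, best_mismatches = offset, mismatches
    return best_offset, best_mismatches


# Invented example sequences (assumption, not real data).
reference = "ACGTTAGCCGATAGGCTA"
read = "AGCCGTTAG"
offset, variants = best_ungapped_alignment(read, reference)
print(offset, variants)
```

In this toy example the read is placed at reference position 5 and differs from the reference at a single position, which is the kind of difference that SNP detection builds on.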

 

Allele

An allele is an alternative form of a gene. Diploid organisms have two sets of chromosomes, and each chromosome of a pair can carry a different form of a gene, so the organism can have two different alleles of that gene. If the alleles on both chromosomes are the same, the organism is homozygous for the gene; if the alleles are different, the organism is heterozygous for the gene.
If an organism has two different alleles of a gene, this can result in different phenotypes, for example in pigmentation. However, different alleles do not necessarily lead to different phenotypes.

 

Amplicon

An amplicon is a DNA fragment that was amplified by polymerase chain reaction (PCR) or by any other process that produces multiple copies of the fragment. Amplification (duplication of genes or genome sequences) can occur naturally or can be induced artificially.

 

Amplicon Sequencing

Amplicon sequencing is based on amplification (copying) of DNA fragments (genes or genome regions of interest) prior to sequencing. The amplification is usually performed by PCR.
Amplicon sequencing enhances the sensitivity of next generation sequencing. This is important if NGS is applied in diagnostics. Furthermore, amplicon sequencing enables the detection of rare genetic variants and haplotypes.
Amplicon sequencing can be used for random amplification of any DNA of interest or for targeted sequencing of specific genes or genome regions.
By using different primer pairs with multiple barcodes, amplicon sequencing can amplify multiple targeted regions or genes in parallel, for example in targeted gene panels for hereditary disease and cancer screening and diagnostics.

 

Annotation

Annotation is the attachment of data (text, comments, others) to a specific part of the original data. In genetics this means adding information to a specific genome sequence. There are different levels of genome annotation. Genome annotation can mean the identification of genes, coding sequences, introns, exons or other encoded features.

 

Assembly (sequence assembly)

Sequence assembly is the alignment and merging of short sequence fragments (reads) in order to reconstruct the original sequence. Assembly of sequence reads is based on read overlaps and, where available, on alignment to a reference sequence. The reads are generated by DNA shotgun sequencing with different NGS technologies. Depending on the NGS technology used, the read length can vary between 50 and 100,000 nucleotides. Factors that can influence assembly efficiency include read length, read quality, coverage rate and the reference sequences. For sequencing of unknown genomes, such as in pathogen detection, de novo assembly is used, which does not rely on reference sequences.
Several publications describe and evaluate methods and software tools for assembly, for example Assemblathon 2.
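
How read overlaps lead to a longer sequence can be illustrated with a small Python sketch. It assumes error-free reads that are already in the correct order and on the same strand, and it uses invented reads and hypothetical function names (overlap_length, merge_reads). Real assemblers handle sequencing errors, repeats and both strands.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.
    Returns 0 if no overlap of at least `min_overlap` bases exists."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left, right, min_overlap=3):
    """Merge two reads on their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


# Invented, error-free reads in left-to-right order (assumption).
reads = ["ACGTTAGC", "TAGCCGAT", "CGATAGGC"]
contig = reads[0]
for read in reads[1:]:
    merged = merge_reads(contig, read)
    if merged is not None:
        contig = merged
print(contig)
```

Each read extends the growing sequence by the bases that lie beyond the overlap, so three reads of eight bases give one longer fragment.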


Bacterial 16S rRNA Sequencing

The 16S ribosomal RNA is a component of the 30S subunit of prokaryotic ribosomes. The RNA is encoded by the 16S ribosomal DNA. The 16S rDNA gene is relatively conserved and is therefore often used for phylogenetic analysis of bacteria.
Different commercial solutions and protocols exist that target specific regions within the 16S rDNA. This enables phylogenetic classification of bacteria. These protocols are examples of targeted gene sequencing, as they are based on amplification with sequence-specific primers.
Most targeted sequencing protocols for 16S rRNA sequencing amplify only parts of the 16S rRNA gene. This does not always allow conclusive classification of bacteria because of the high sequence homology between certain bacterial strains.

 

Barcode

A barcode is a machine-readable code added to an object or a dataset. In genetics this means a specific oligo sequence that is added to the DNA sequence of interest. A barcode sequence can vary in length and should have at least four nucleotides. The longer the barcode, the better the discrimination between barcodes, because sequencing errors can occur within barcode sequences as well. Longer barcodes also enable a higher multiplexing capacity. If the barcodes are too long, however, they reduce the amount of real sequence information obtained per read, because the barcode takes up part of the read length. This is especially important for short read sequencing. Barcodes enable multiplexing and pooling of samples in a single sequencing library and sequencing run. Different tools exist for barcode generation and barcode-based sequence data analysis. Different commercial providers offer barcodes for NGS applications.
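
The trade-off between barcode discrimination and sequencing errors can be sketched in Python. The barcodes, sample names and function names below are invented for illustration, and the sketch assumes that the barcode sits at the very start of each read. It assigns a read to a sample if exactly one barcode matches with at most one mismatch, then trims the barcode from the read.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read, barcodes, max_mismatches=1):
    """Return (sample, trimmed_read), or (None, read) if no barcode or more
    than one barcode matches within the allowed number of mismatches."""
    hits = []
    for sample, barcode in barcodes.items():
        prefix = read[:len(barcode)]
        if len(prefix) == len(barcode) and hamming(prefix, barcode) <= max_mismatches:
            hits.append((sample, barcode))
    if len(hits) != 1:
        return None, read
    sample, barcode = hits[0]
    return sample, read[len(barcode):]


# Hypothetical 6-nt barcodes (assumption, not from any commercial kit).
BARCODES = {"sample_1": "ACGTAC", "sample_2": "TGCATG", "sample_3": "GATCGA"}

# Smallest pairwise distance: the larger it is, the more errors can be tolerated.
min_distance = min(hamming(a, b) for a, b in combinations(BARCODES.values(), 2))

print(min_distance)
print(assign_sample("ACGTACTTGACCA", BARCODES))   # exact barcode match
print(assign_sample("TGCTTGGGATCCA", BARCODES))   # one sequencing error in the barcode
print(assign_sample("CCCCCCAAAA", BARCODES))      # no matching barcode
```

The trimmed reads are six bases shorter than the raw reads, which shows why very long barcodes cost usable read length in short read sequencing.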

 

Bias

A bias is a particular tendency, trend, feeling or opinion. A bias can influence results in an unjustified or unfair way. Specific circumstances can affect results and lead to a bias. For example, asking only women in a survey will bias the outcome, as no men were asked.
In genetics and sequencing, different biases result from different workflow steps, technologies and data analysis tools. For example, bias can arise during targeted amplification, fragmentation, fragment purification, sequencing and data analysis. The bias can be based on the structure and stability of the nucleic acid, such as its GC content. In 454 sequencing technology, problems can occur when sequencing homopolymer regions (runs of a single repeated nucleotide).

 

Binning

Binning means grouping a number of specific values or data points into a smaller number of bins, classes or clusters. In sequencing this means reducing the amount of data by grouping the reads or contigs and assigning them to specific genome regions. Binning can be used to reduce read noise and file size. Different software tools for binning of sequencing data are available.
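
One simple form of binning, counting reads per fixed-size genome window, can be sketched as follows. The read start positions, the bin size and the function name bin_read_starts are assumptions chosen for illustration. Binning tools for sequencing data may group reads or contigs by other criteria.

```python
from collections import Counter


def bin_read_starts(read_starts, bin_size):
    """Count reads per genome window. Keys are half-open (start, end) intervals."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    counts = Counter(start // bin_size for start in read_starts)
    return {
        (index * bin_size, (index + 1) * bin_size): count
        for index, count in sorted(counts.items())
    }


# Invented 0-based read start positions on a reference (assumption).
read_starts = [3, 17, 42, 95, 101, 130, 199, 250]
bins = bin_read_starts(read_starts, bin_size=100)
print(bins)
```

Eight individual positions are reduced to three window counts, which is the data reduction that binning aims for.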

 

Bioinformatics

Bioinformatics is a field within informatics that develops methods, applications and software tools for the analysis of biological data such as protein structure data, sequence data, statistics and many other kinds of biological data. In NGS, bioinformatics deals with the storage, conversion and analysis of sequence information and sequence annotations. Bioinformatics is still one of the greatest challenges within genetics, as there are numerous data file formats, multiple databases for data storage and a large number of software tools for data analysis. Effective use of sequencing data, free access to it and simple, effective data analysis workflows are often missing.

 

Bisulfite Sequencing

Bisulfite sequencing is used to determine the methylation pattern of DNA templates by treating the DNA with bisulfite prior to sequencing. DNA methylation is an important factor in epigenetics. DNA methylation takes place on the nucleotide base cytosine (C), leading to 5-methylcytosine.

If the DNA template of interest is treated with bisulfite, cytosine is converted into uracil, whereas 5-methylcytosine is not converted. Uracil pairs like thymine (T), so the sequence of bisulfite-treated DNA reads as thymine instead of cytosine at this position. The untreated DNA template is sequenced as cytosine at this position.
In order to analyse the results of bisulfite treatment, it is necessary to sequence the DNA sample of interest once with bisulfite treatment and once without treatment. Both sequences are compared at the cytosine positions. Cytosines that are unchanged after bisulfite treatment are methylated (protected from conversion to uracil), whereas cytosines that are replaced by uracil (read as T) are not methylated.
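
The comparison of treated and untreated sequences can be sketched in Python. This is a minimal illustration that assumes both sequences come from the same strand, are already aligned position by position and contain no sequencing errors. The sequences and the function name call_methylation are invented.

```python
def call_methylation(untreated, treated):
    """Compare each cytosine of the untreated sequence with the same position
    in the bisulfite-treated sequence and return {position: call}."""
    if len(untreated) != len(treated):
        raise ValueError("sequences must be aligned and of equal length")
    calls = {}
    for position, (before, after) in enumerate(zip(untreated, treated)):
        if before != "C":
            continue  # only cytosine positions are informative
        if after == "C":
            calls[position] = "methylated"      # protected from conversion
        elif after == "T":
            calls[position] = "unmethylated"    # converted to uracil, read as T
        else:
            calls[position] = "ambiguous"       # unexpected base, e.g. an error
    return calls


# Invented example sequences (assumption, not real data).
untreated = "ACGTCCGA"
treated = "ACGTTTGA"
print(call_methylation(untreated, treated))
```

In this toy example the cytosine at position 1 stays a C and is called methylated, while the cytosines at positions 4 and 5 read as T and are called unmethylated.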

 

Bridge Amplification

Bridge amplification is a specific method developed and used by Illumina to amplify the target DNA on the flow cell in the next generation sequencing device. The method is called bridge amplification because the DNA fragment is attached at both ends to the flow cell via specific adaptors, forming a DNA bridge. This method is used to amplify the DNA target of interest in clusters on the flow cell. For a fuller explanation, see the video introducing Illumina sequencing technology.


Cancer Screening

Different biomarkers, including genetic markers, are known to be associated with the occurrence of cancer or with an elevated risk of developing certain cancers.
Cancer screening is used to check for certain cancer-related genetic markers before a patient shows symptoms of cancer. Early detection of cancer improves the chances of treating and curing it.
NGS is widely used for screening of genetic markers in cancer research and medicine.
It has to be considered that a number of biomarkers are related to different cancers, and genetic screening might miss some of these markers, leading to a false negative result.
If a patient already shows symptoms, NGS should be used in combination with other confirmatory diagnostic tests.

 

Chip

A chip is a specific consumable on which the sequencing and detection take place. Different NGS platforms use different chips that vary in size, design and sequencing read capacity. Not to be confused with ChIP sequencing.

 

ChIP Sequencing

Chromatin immunoprecipitation (ChIP) is a technology that makes use of the interaction of proteins with DNA in a cell. Specific proteins are known to interact with specific genomic regions; for example, transcription factors bind to promoter regions.

The technology can be used to purify specific DNA regions that are bound to proteins. These proteins are used for the precipitation. After purification of the DNA/protein complexes, the DNA can be isolated (separated from the protein) and sequenced.
In specific applications the protein/DNA complex is treated by sonication or nuclease digestion. This enables sequencing of the DNA regions that are protected by proteins from shearing and digestion. This approach is related to the DNase footprinting assay and gives information about the exact binding sites of the proteins.

 

Consensus sequence

If a number of different overlapping sequences, sequencing reads or contigs are aligned in order to create a single master sequence, the resulting sequence is called the consensus sequence. The consensus sequence is sometimes also described as the most likely sequence. When different overlapping sequences are aligned, different nucleotides can be found at the same position in the individual sequences. Normally, positions with different nucleotides are labeled with an N (unknown) in the consensus sequence, according to the IUPAC code. In order to obtain a more useful consensus sequence, these Ns are replaced by the nucleotide that is found most often at this position.
A consensus sequence represents only a likely version of the real sequence. For example, when sequencing viral or bacterial populations, different sequences (individuals), also termed quasispecies, might be present in the population. In this case the consensus sequence might reflect the dominant quasispecies for certain genome regions or genes.
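
The replacement of ambiguous positions by the most frequent nucleotide can be sketched as a majority vote over aligned reads. The sketch below assumes that the reads are already aligned and of equal length, uses "-" for gaps, and keeps an N where the two most frequent nucleotides are tied. The reads and the function name consensus are invented for illustration.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority-vote consensus of equal-length, pre-aligned sequences.
    Gaps ('-') are ignored; ties and empty columns are reported as 'N'."""
    if not aligned_reads:
        raise ValueError("at least one sequence is required")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all sequences must have the same length")
    result = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != "-")
        ranked = counts.most_common(2)
        if not ranked or (len(ranked) == 2 and ranked[0][1] == ranked[1][1]):
            result.append("N")  # no data or no clear majority at this position
        else:
            result.append(ranked[0][0])
    return "".join(result)


# Invented aligned reads (assumption, not real data).
reads = ["ACGTTAGC", "ACGTAAGC", "ACGTTAGC", "ACCTTAGG"]
print(consensus(reads))
print(consensus(["ACGT", "ACCT"]))
```

The minority nucleotides at three positions of the first example disappear from the consensus, which is why a consensus sequence can hide the less frequent variants in a mixed population.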

 

Cluster

A cluster is a group of similar things or items that are located closely together. In genetics and sequencing, a cluster of similar DNA fragments is generated by bridge amplification on the Illumina flow cell during the DNA template amplification step prior to sequencing.

 

Contig

A contig (from contiguous) merges separate fragments into as few pieces as possible. In genetics this means that single overlapping reads are assembled into a larger sequence fragment called a contig. This is the first level of higher-order sequence structure obtained during assembly of the sequencing raw data. A variety of software tools for assembly and contig generation exist.

 

Copy Number Variation (CNV)

A copy in genetics means that an identical nucleotide sequence is present in one or more duplicates. As a result, certain sequences can be found in different copy numbers in a genome. These sequence copies occur naturally and their number can vary greatly. Copy number variations can derive from different structural changes such as insertions, duplications or deletions. CNV can lead to different phenotypes and can be associated with different genetic defects leading to disorders or diseases.

 

(Sequence) Coverage

Coverage describes the extent to which a reference sequence is covered by the sequencing reads of a sequencing run. The term has two meanings. The first expresses the coverage of the reference sequence as a percentage and can indicate that the reference sequence is not completely covered by the sequencing run. This often happens when detecting known or unknown pathogens at very low titers in field samples. The second meaning is the coverage rate, sometimes also called depth or coverage depth. In this case the coverage is expressed as the average number of reads in which each nucleotide of the reference sequence is present. The coverage rate of different regions within a reference sequence can vary greatly within one sequencing run and depends on different factors such as bias introduced during sample and library preparation or sequencing.
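
The two meanings of coverage can be made concrete with a small Python sketch. It assumes that each aligned read is given as a 0-based, half-open (start, end) interval on the reference. The reference length, the intervals and the function name coverage_stats are invented for illustration.

```python
def coverage_stats(reference_length, alignments):
    """Return (breadth in percent, mean depth, per-position depth list).
    `alignments` holds 0-based, half-open (start, end) read intervals."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    depth = [0] * reference_length
    for start, end in alignments:
        for position in range(max(start, 0), min(end, reference_length)):
            depth[position] += 1
    covered = sum(1 for d in depth if d > 0)
    breadth_percent = 100.0 * covered / reference_length  # first meaning
    mean_depth = sum(depth) / reference_length            # second meaning
    return breadth_percent, mean_depth, depth


# Three invented reads of 8 bases on a 20-base reference (assumption).
breadth, mean_depth, depth = coverage_stats(20, [(0, 8), (4, 12), (6, 14)])
print(breadth, mean_depth)
print(depth)
```

In this toy example only 70 percent of the reference is covered, and the per-position depth ranges from 0 to 3 although the mean depth is 1.2, which illustrates how unevenly reads can be distributed within one run.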

