Read Structures

Overview

A Read Structure is a string that describes how the bases in a sequencing run should be allocated into logical reads. It serves a similar purpose to the --use-bases-mask option in Illumina’s bcl2fastq (the OverrideCycles setting in bcl-convert), but provides additional capabilities.

A Read Structure is a sequence of <number><operator> pairs (called segments). The last segment may use + instead of a number to mean “whatever bases remain.” fgumi uses the read-structure crate for parsing and validation.
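As an illustration of the <number><operator> layout, the following minimal Python sketch splits a read structure into its segments. It is not the read-structure crate that fgumi uses; the helper name split_segments is hypothetical, and the function only tokenizes the string without enforcing every rule (for example, it does not check that + appears only in the last segment).

```python
import re

# Hypothetical helper for illustration; not part of fgumi or the read-structure crate.
# A segment is a length ("+" or a positive integer) followed by one operator letter.
_SEGMENT = re.compile(r"(\+|[1-9][0-9]*)([TBMCS])")


def split_segments(read_structure):
    """Return a list of (length, operator) pairs; length is None for '+'."""
    segments = []
    pos = 0
    while pos < len(read_structure):
        match = _SEGMENT.match(read_structure, pos)
        if match is None:
            raise ValueError(
                f"cannot parse a segment at offset {pos}: {read_structure[pos:]!r}"
            )
        length, operator = match.groups()
        segments.append((None if length == "+" else int(length), operator))
        pos = match.end()
    return segments


print(split_segments("6M+T"))  # [(6, 'M'), (None, 'T')]
```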

Read structures are used primarily in fgumi extract to specify where UMI bases, template bases, and other sequences are located in each FASTQ read.

Operators

Five kinds of operator are supported:

| Operator | Name | Meaning |
|----------|------|---------|
| T | Template | Reads of template (e.g. genomic DNA, RNA) |
| B | Sample Barcode | Index sequence for sample identification |
| M | Molecular Barcode | UMI sequence for identifying the source molecule |
| C | Cell Barcode | Index sequence for identifying the cell (single-cell) |
| S | Skip | Bases to skip or ignore (e.g. monotemplate from library prep) |

Rules

  • A read structure must contain at least one segment; there is no upper limit on the number of segments
  • The length of each segment must be a positive integer (>= 1) or +
  • Only the last segment in a read structure may use + for its length
  • Adjacent segments may use the same operator (e.g. 6B6B+T is valid if two sample indices are ligated separately)
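The rules above can be checked mechanically. The sketch below is one way to do so in Python; it is an assumption-laden illustration, not the validation performed by the read-structure crate. The function name validate_read_structure and the error messages are hypothetical, and rejecting lengths with leading zeros follows the formal grammar on this page rather than any confirmed crate behavior.

```python
import re

OPERATORS = "TBMCS"  # Template, sample Barcode, Molecular barcode, Cell barcode, Skip


def validate_read_structure(read_structure):
    """Return (length, operator) string pairs, or raise ValueError on a broken rule."""
    # Greedily pair each length token with the single character that follows it.
    segments = re.findall(r"(\+|[0-9]+)(.)", read_structure)
    consumed = "".join(length + operator for length, operator in segments)
    # Rule: at least one segment, and nothing left over between or after segments.
    if not segments or consumed != read_structure:
        raise ValueError("expected one or more <length><operator> segments")
    last = len(segments) - 1
    for index, (length, operator) in enumerate(segments):
        if operator not in OPERATORS:
            raise ValueError(f"unknown operator {operator!r}")
        if length == "+":
            # Rule: only the last segment may use + for its length.
            if index != last:
                raise ValueError("only the last segment may use + for its length")
        elif length.startswith("0"):
            # Rule: lengths are positive integers (this also rejects "0").
            raise ValueError("segment length must be a positive integer")
    return segments


for candidate in ["6B6B+T", "+T6M", "0T"]:
    try:
        validate_read_structure(candidate)
        print(candidate, "is valid")
    except ValueError as error:
        print(candidate, "is invalid:", error)
```

Adjacent segments with the same operator, such as 6B6B+T, pass because no rule forbids them.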

Examples

Simple paired-end (2x150bp, no indices)

Per-read structures: +T, +T

Paired-end with 8bp sample index

Per-read structures: +T, 8B, +T

Paired-end with inline 6bp UMI in R1

Per-read structures: 6M+T, 8B, +T

The first 6 bases of R1 are the UMI, followed by template.

Duplex sequencing with dual barcoding and UMI + monotemplate

Per-read structures: 10M5S+T, 8B, 8B, 10M5S+T

Both R1 and R2 start with a 10bp UMI followed by 5bp of monotemplate (skipped), then template.

Single-cell with cell barcodes and UMI

Per-read structures: 5C30S5C3S8M+T, 8B, +T

R1 contains two 5bp cell barcodes separated by a 30bp linker (skipped), followed by 3 skipped bases, an 8bp UMI, and then template.
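To make the examples above concrete, the following Python sketch slices a read's bases according to a read structure. It is an illustration only and does not reproduce how fgumi extract is implemented. The function name apply_read_structure and the read sequences are made up, the input is assumed to be an already valid read structure, and letting + match zero remaining bases is an assumption of this sketch.

```python
import re


def apply_read_structure(read_structure, bases):
    """Slice `bases` into (operator, subsequence) pairs, one per segment."""
    pieces = []
    offset = 0
    for length, operator in re.findall(r"(\+|[1-9][0-9]*)([TBMCS])", read_structure):
        # "+" takes whatever bases remain; a number takes exactly that many.
        end = len(bases) if length == "+" else offset + int(length)
        if end > len(bases):
            raise ValueError(
                f"read has {len(bases)} bases but segment {length}{operator} needs {end}"
            )
        pieces.append((operator, bases[offset:end]))
        offset = end
    return pieces


# R1 with an inline 6bp UMI followed by template (made-up sequence).
print(apply_read_structure("6M+T", "ACGTACGGATTACA"))
# [('M', 'ACGTAC'), ('T', 'GGATTACA')]

# 10bp UMI, 5bp skipped, then template (made-up sequence).
print(apply_read_structure("10M5S+T", "AAAAACCCCCGGGGGTTTTACGT"))
# [('M', 'AAAAACCCCC'), ('S', 'GGGGG'), ('T', 'TTTTACGT')]
```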

Formal Grammar

<read-structure>     ::= <fixed-structure> <segment>
<fixed-structure>    ::= "" | <fixed-length> <operator> <fixed-structure>
<segment>            ::= <any-length> <operator>
<operator>           ::= "T" | "B" | "M" | "C" | "S"
<fixed-length>       ::= <non-zero-digit>{<digit>}
<variable-length>    ::= "+"
<any-length>         ::= <fixed-length> | <variable-length>
<non-zero-digit>     ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<digit>              ::= "0" | <non-zero-digit>
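The grammar describes a regular language, so it can be transcribed into a single regular expression. The Python sketch below is one such transcription, offered as an illustration rather than as the parser used by fgumi; the names READ_STRUCTURE and matches_grammar are hypothetical.

```python
import re

# <fixed-structure>: zero or more <fixed-length><operator> pairs,
# followed by one final <segment> whose length may be fixed or "+".
READ_STRUCTURE = re.compile(r"(?:[1-9][0-9]*[TBMCS])*(?:\+|[1-9][0-9]*)[TBMCS]")


def matches_grammar(text):
    """Return True if the whole string is a <read-structure> per the grammar."""
    return READ_STRUCTURE.fullmatch(text) is not None


for candidate in ["+T", "6M+T", "5C30S5C3S8M+T", "+T6M", "0T", ""]:
    print(repr(candidate), matches_grammar(candidate))
```

The first three candidates match; the last three do not, because + appears before the last segment, the length is zero, or there is no segment at all.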