UMI Grouping

Overview

fgumi group assigns reads that appear to come from the same original molecule to the same group by writing a shared Molecular Identifier (MI) tag. Grouping relies on template-coordinate sort order.

This page describes:

  1. How reads and templates are filtered before grouping
  2. How mapping coordinates and UMIs identify reads from the same molecule
  3. Template-coordinate sort order
  4. Cell barcode support
  5. Metrics output

Filtering Reads and Templates

A read is a single sequenced strand. A template is all reads sharing the same query name (typically a read pair).

Concept | Definition | Example
Read | A single sequenced strand (R1 or R2) | @read123/1
Template | The full fragment, represented by both reads in a pair | @read123 includes both /1 and /2

Reads and templates are filtered before grouping to prevent splitting reads from a single molecule into separate groups.

Individual reads are filtered if:

  • Flagged as secondary (unless --include-secondary)
  • Flagged as supplementary (unless --include-supplementary)

All reads for a template are filtered if:

  • All reads for the template are unmapped (unless --allow-unmapped)
  • Any non-secondary, non-supplementary read has mapping quality < --min-map-q
  • Any UMI sequence contains one or more N bases
  • --min-umi-length is specified and the UMI does not meet the length requirement
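The filtering rules above can be sketched in Python. This is an illustrative sketch only, not fgumi's implementation: the `Read` class and `filter_template` function are hypothetical, the parameters merely mirror the command-line flags, and two details are assumptions not stated on this page (the mapping-quality check is skipped for unmapped reads, and UMI length is counted over bases, ignoring any hyphen).

```python
from dataclasses import dataclass


@dataclass
class Read:
    """Minimal stand-in for a BAM record; field names are hypothetical."""
    name: str
    mapped: bool
    mapq: int
    secondary: bool = False
    supplementary: bool = False


def filter_template(reads, umi, min_map_q, min_umi_length=None,
                    include_secondary=False, include_supplementary=False,
                    allow_unmapped=False):
    """Return the reads kept for grouping, or [] if the template is dropped."""
    primary = [r for r in reads if not (r.secondary or r.supplementary)]

    # Template-level rules: any failure removes every read of the template.
    if not allow_unmapped and not any(r.mapped for r in reads):
        return []
    # Assumption: the MAPQ check only applies to mapped primary reads.
    if any(r.mapped and r.mapq < min_map_q for r in primary):
        return []
    if "N" in umi.upper():
        return []
    # Assumption: length is measured over UMI bases, ignoring the hyphen.
    if min_umi_length is not None and len(umi.replace("-", "")) < min_umi_length:
        return []

    # Read-level rules: drop secondary/supplementary reads unless requested.
    return [r for r in reads
            if (include_secondary or not r.secondary)
            and (include_supplementary or not r.supplementary)]
```

For example, a pair with one supplementary alignment keeps only its two primary reads, while a pair where either primary read falls below `min_map_q` is dropped entirely.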

Grouping Unmapped Reads

By default, templates where all reads are unmapped are excluded from grouping. Pass --allow-unmapped to include them. This is useful for workflows where some templates genuinely fail to align (e.g. cell-free DNA fragments that fall outside the target region) but should still be counted and may share UMIs with mapped templates from the same molecule:

fgumi group \
  --input sorted.bam \
  --output grouped.bam \
  --strategy adjacency \
  --allow-unmapped

Grouping Strategies

Grouping is performed by one of four strategies:

identity

Only reads with identical UMI sequences are grouped together. This is simpler and faster than other strategies, but should usually be avoided because sequencing errors in the UMI will split reads from the same molecule into separate groups. Useful for data exploration.

edit

Reads are clustered into groups such that each read within a group has at least one other read in the group with <= --edits differences, and there are no inter-group pairings with <= --edits differences. Effective when there are small numbers of reads per UMI, but breaks down at very high UMI coverage.
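The clustering rule for the edit strategy amounts to single-linkage clustering: UMIs are chained together whenever a neighbour lies within the threshold. The sketch below illustrates that idea with a small union-find. It assumes "differences" means Hamming mismatches between equal-length UMIs, and the function names are hypothetical rather than fgumi's internals.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    if len(a) != len(b):
        raise ValueError("UMIs must have equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_clusters(umis, edits=1):
    """Single-linkage clusters of unique UMIs within `edits` mismatches."""
    unique = sorted(set(umis))
    parent = {u: u for u in unique}

    def find(u):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    # Merge the clusters of every pair of UMIs within the threshold.
    for a, b in combinations(unique, 2):
        if hamming(a, b) <= edits:
            parent[find(a)] = find(b)

    clusters = {}
    for u in unique:
        clusters.setdefault(find(u), []).append(u)
    return sorted(clusters.values())
```

With `edits=1`, the UMIs AAAA, AAAT, and AATT end up in one cluster even though AAAA and AATT differ at two positions. This chaining is why the strategy degrades at very high UMI coverage, where error-containing UMIs can bridge distinct molecules.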

adjacency

A version of the directed adjacency method described in umi_tools that allows for errors between UMIs but only when there is a count gradient. Recommended for most simplex and CODEC workflows.
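The count gradient can be illustrated with a short Python sketch of the directional method as published for UMI-tools, where a UMI absorbs a neighbour only if `count(parent) >= 2 * count(child) - 1`. Whether fgumi uses exactly this threshold is an assumption; the function names are hypothetical and distances are taken to be Hamming mismatches.

```python
from collections import Counter, deque


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def adjacency_groups(umis, edits=1):
    """Group UMIs with a directed-adjacency rule (sketch, not fgumi's code)."""
    counts = Counter(umis)
    # Visit the most abundant UMIs first; break ties alphabetically.
    ordered = sorted(counts, key=lambda u: (-counts[u], u))
    assigned = {}
    groups = []
    for root in ordered:
        if root in assigned:
            continue
        group_id = len(groups)
        groups.append([root])
        assigned[root] = group_id
        queue = deque([root])
        # Breadth-first walk along edges that satisfy the count gradient.
        while queue:
            parent = queue.popleft()
            for child in ordered:
                if child in assigned:
                    continue
                close = hamming(parent, child) <= edits
                gradient = counts[parent] >= 2 * counts[child] - 1
                if close and gradient:
                    assigned[child] = group_id
                    groups[group_id].append(child)
                    queue.append(child)
    return groups
```

In this sketch, a UMI seen 10 times absorbs a one-mismatch neighbour seen twice, but two one-mismatch UMIs seen 8 and 7 times stay separate, since similar counts suggest two real molecules rather than one molecule plus an error.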

paired

Similar to adjacency but for duplex sequencing where each template has two UMIs (one from each strand). Expects UMI sequences stored in a single tag separated by a hyphen (e.g. ACGT-CCGG). Allows one UMI to be absent (e.g. ACGT- or -ACGT).

The molecular IDs produced have structure: {base}/{A|B}. For example, UMI pairs AAAA-GGGG and GGGG-AAAA map to 1/A and 1/B respectively. See Tracking Reads for details. Recommended for duplex workflows.
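The {base}/{A|B} structure can be illustrated with a minimal Python sketch that ignores error tolerance and matches UMI pairs exactly. The function name is hypothetical, base IDs are numbered from 0 in order of first appearance, and the rule used here to decide which orientation is labelled A (the lexically ordered one) is an assumption chosen only to reproduce the example above; this page does not state how fgumi makes that choice.

```python
def paired_molecular_ids(umi_pairs):
    """Assign {base}/{A|B} IDs to hyphen-separated UMI pairs (exact match only)."""
    base_ids = {}
    molecular_ids = []
    for pair in umi_pairs:
        first, second = pair.split("-")
        # Both orientations of a pair share one orientation-independent key.
        key = tuple(sorted((first, second)))
        base = base_ids.setdefault(key, len(base_ids))
        # Assumption: the lexically ordered orientation is called strand A.
        strand = "A" if (first, second) == key else "B"
        molecular_ids.append(f"{base}/{strand}")
    return molecular_ids
```

Called with `["AAAA-GGGG", "GGGG-AAAA", "CCCC-TTTT"]`, the sketch returns `["0/A", "0/B", "1/A"]`: the first two pairs share a base ID and differ only in the strand suffix.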

The edit, adjacency, and paired strategies use the --edits parameter to control matching of non-identical UMIs.

Cell Barcode Support

When processing data with cell barcodes (e.g. single-cell sequencing), reads at the same genomic position are partitioned by cell barcode before UMI assignment. This ensures that reads from different cells are never grouped together, even if they share a UMI and mapping position.

The cell barcode is read from the standard CB tag. No correction or error-handling is performed on cell barcodes — they must be corrected upstream before grouping.
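The partitioning step can be pictured as follows. This is a hedged sketch with a hypothetical function name and input shape (one `(cell_barcode, umi)` tuple per template at a single position). Barcodes are compared as exact strings, which is why uncorrected barcodes would split a cell's reads.

```python
from collections import defaultdict


def partition_by_cell_barcode(templates):
    """Split (cell_barcode, umi) tuples at one position into per-cell lists."""
    partitions = defaultdict(list)
    for cell_barcode, umi in templates:
        # Exact string match: no barcode correction happens at this stage.
        partitions[cell_barcode].append(umi)
    return dict(partitions)
```

Each per-cell list would then be handed to the chosen UMI strategy on its own, so identical UMIs in different cells never meet.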

Cell barcodes are detected automatically across the entire pipeline — no additional flags are needed. The consensus callers validate that all source reads in a group share the same cell barcode and propagate it to the output consensus read.

Metrics Output

fgumi group can emit three types of metrics files. All three can be written at once with the --metrics prefix flag, and two of them can also be requested individually.

The -M/--metrics flag writes all three metrics files under a single prefix in one step:

fgumi group \
  --input sorted.bam \
  --output grouped.bam \
  --strategy adjacency \
  --metrics my_sample

This produces:

  • my_sample.family_sizes.txt — histogram of UMI family sizes
  • my_sample.grouping_metrics.txt — overall grouping statistics
  • my_sample.position_group_sizes.txt — histogram of UMI families per genomic position

Using individual flags

Two of the three metrics can also be written to explicit paths:

fgumi group \
  --input sorted.bam \
  --output grouped.bam \
  --strategy adjacency \
  --family-size-histogram family_sizes.txt \
  --grouping-metrics grouping_metrics.txt

Note: position_group_sizes.txt is only available via --metrics. The individual flags --family-size-histogram and --grouping-metrics can be used alongside --metrics.

Family sizes

The family_sizes.txt file is a histogram of how many reads belong to each UMI family. A large fraction of singleton families may indicate shallow sequencing relative to library complexity, sequencing errors in the UMI that split families, or UMI extraction errors.

Grouping metrics

The grouping_metrics.txt file contains summary statistics about the grouping run, including total reads, accepted reads, discarded reads by reason, and UMI assignment counts.

Position group sizes

The position_group_sizes.txt file is a histogram of how many distinct UMI families were observed at each unique genomic position (coordinate + strand). A distribution skewed toward large position groups may indicate high on-target duplication or UMI exhaustion.
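The family-size and position-group histograms described in this section can both be recomputed from grouped reads. The Python sketch below shows the counting logic only. The function names and input shapes are hypothetical, and the actual column layout of the files fgumi writes is not specified here.

```python
from collections import Counter, defaultdict


def family_size_histogram(mi_per_read):
    """Map family size (reads per MI) to the number of families of that size."""
    reads_per_family = Counter(mi_per_read)
    return dict(sorted(Counter(reads_per_family.values()).items()))


def position_group_histogram(position_and_mi):
    """Map families-per-position to the number of positions with that many."""
    families_at_position = defaultdict(set)
    for position_key, mi in position_and_mi:
        # A position key is assumed to combine coordinate and strand.
        families_at_position[position_key].add(mi)
    sizes = Counter(len(mis) for mis in families_at_position.values())
    return dict(sorted(sizes.items()))
```

For six reads with MI values 0, 0, 0, 1, 2, 2, the family-size histogram has one family each of size 1, 2, and 3.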

Template-Coordinate Sort Order

fgumi group requires its input to be template-coordinate sorted. The header must advertise SO:unsorted, GO:query, and SS:template-coordinate; without SS:template-coordinate the input is treated as queryname-grouped (e.g. FASTQ-order output from fgumi extract) and rejected with an actionable error pointing back here. fgumi group does not sort internally — pre-sort with:

fgumi sort --order template-coordinate --input aligned.bam --output sorted.bam

The streaming grouper relies on records that share a position key being consecutive in the input, which is what template-coordinate sort guarantees. Any other ordering (queryname, coordinate, FASTQ-order) would split each true molecule across many small groups and assign distinct MI values to reads that should share one.
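The header requirement described above can be checked before running fgumi group. The sketch below parses a raw @HD header line (for example, the first line of a SAM header dump) and tests the three fields named in this section. The function name is hypothetical, and the comparison is an exact string match on the values quoted above.

```python
def is_template_coordinate(hd_line):
    """Return True if an @HD line advertises template-coordinate order."""
    parts = hd_line.rstrip("\n").split("\t")
    if not parts or parts[0] != "@HD":
        return False
    # Remaining fields are TAG:VALUE pairs such as SO:unsorted.
    fields = dict(part.split(":", 1) for part in parts[1:] if ":" in part)
    return (
        fields.get("SO") == "unsorted"
        and fields.get("GO") == "query"
        and fields.get("SS") == "template-coordinate"
    )
```

A header that only declares `SO:queryname` or `SO:coordinate` fails this check, which corresponds to the inputs that fgumi group rejects.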

For single-cell data, the CB cell barcode tag is automatically incorporated in the sort key, keeping templates from different cells at the same locus separate. The sort command is unchanged and needs no extra flags:

fgumi sort --order template-coordinate --input aligned.bam --output sorted.bam

Template-coordinate order sorts reads by:

  1. The lower unclipped 5’ coordinate of the read pair
  2. The higher unclipped 5’ coordinate of the read pair
  3. Strand orientation
  4. The cellular barcode (CB tag, if present)
  5. The molecular identifier (MI tag, if present)
  6. Read name
  7. Library (from read group)
  8. Whether R1 has the lower coordinates of the pair

Reads grouped by fgumi group with the same MI will share the same outer start/stop coordinates. Because 5’ coordinates are strand-aware, reads from opposite strands with the same UMI and position will not be grouped together (they belong to different strands of the same duplex molecule).
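The strand-aware unclipped 5’ coordinate used in the sort key can be derived from a read's alignment start, CIGAR string, and strand. The Python sketch below follows the standard SAM conventions (1-based leftmost position; M, D, N, =, and X consume the reference; S and H are clips). The function name is hypothetical, and this is an illustration rather than fgumi's implementation.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")
CLIPS = set("SH")


def unclipped_five_prime(pos, cigar, reverse):
    """Unclipped 5' reference coordinate of a mapped read (1-based)."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    ref_len = sum(length for length, op in ops if op in REF_CONSUMING)

    # Clipped bases before the first aligned base.
    leading = 0
    for length, op in ops:
        if op not in CLIPS:
            break
        leading += length

    # Clipped bases after the last aligned base.
    trailing = 0
    for length, op in reversed(ops):
        if op not in CLIPS:
            break
        trailing += length

    if reverse:
        # The 5' end of a reverse-strand read is its rightmost base.
        return pos + ref_len - 1 + trailing
    return pos - leading
```

A forward read at position 100 with CIGAR 5S45M and one at position 95 with CIGAR 50M both yield 95, so soft-clipping differences do not separate reads from the same molecule. A reverse-strand read at position 100 with CIGAR 50M yields 149, so it sorts apart from forward reads that start at 100.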

See also: Consensus Calling, Duplex Consensus Calling, Best Practices