UMI Grouping
Overview
fgumi group assigns reads that appear to come from the same original molecule to the same group by writing a shared Molecular Identifier (MI) tag. Grouping relies on template-coordinate sort order.
This page describes:
- How reads and templates are filtered before grouping
- How mapping coordinates and UMIs identify reads from the same molecule
- Template-coordinate sort order
- Cell barcode support
- Metrics output
Filtering Reads and Templates
A read is a single sequenced strand. A template is all reads sharing the same query name (typically a read pair).
| Concept | Definition | Example |
|---|---|---|
| Read | A single sequenced strand (R1 or R2) | @read123/1 |
| Template | The full fragment, represented by both reads in a pair | @read123 includes both /1 and /2 |
Reads and templates are filtered before grouping to prevent splitting reads from a single molecule into separate groups.
Individual reads are filtered if:
- Flagged as secondary (unless --include-secondary)
- Flagged as supplementary (unless --include-supplementary)
All reads for a template are filtered if:
- All reads for the template are unmapped (unless --allow-unmapped)
- Any non-secondary, non-supplementary read has mapping quality < --min-map-q
- Any UMI sequence contains one or more N bases
- --min-umi-length is specified and the UMI does not meet the length requirement
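The filtering rules above can be sketched in Python. This is an illustration only, not fgumi's implementation: the `Read` record, its field names, and both functions are hypothetical, and two behaviours are assumptions (the mapping-quality check is applied to mapped reads only, and UMI length is measured without the hyphen).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Read:
    # Hypothetical minimal record; real BAM records carry far more state.
    mapped: bool
    mapq: int
    secondary: bool = False
    supplementary: bool = False


def filter_reads(reads: List[Read], include_secondary: bool = False,
                 include_supplementary: bool = False) -> List[Read]:
    """Read-level filter: drop secondary/supplementary reads unless included."""
    return [
        r for r in reads
        if (include_secondary or not r.secondary)
        and (include_supplementary or not r.supplementary)
    ]


def keep_template(reads: List[Read], umi: str, min_map_q: int,
                  allow_unmapped: bool = False,
                  min_umi_length: Optional[int] = None) -> bool:
    """Template-level filter: False means every read of the template is dropped."""
    # Rule 1: a fully unmapped template is dropped unless explicitly allowed.
    if not allow_unmapped and not any(r.mapped for r in reads):
        return False
    primary = [r for r in reads if not r.secondary and not r.supplementary]
    # Rule 2 (assumption: only mapped primary reads are checked for MAPQ).
    if any(r.mapped and r.mapq < min_map_q for r in primary):
        return False
    # Rule 3: any N base in the UMI rejects the template.
    if "N" in umi.upper():
        return False
    # Rule 4 (assumption: length counts UMI bases, ignoring any hyphen).
    if min_umi_length is not None and len(umi.replace("-", "")) < min_umi_length:
        return False
    return True
```

Note that a single low-quality primary read rejects the whole template, which is what keeps the two reads of a pair from landing in different groups.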
Grouping Unmapped Reads
By default, templates where all reads are unmapped are excluded from grouping. Pass --allow-unmapped
to include them. This is useful for workflows where some templates genuinely fail to align
(e.g. cell-free DNA fragments that fall outside the target region) but should still be counted
and may share UMIs with mapped templates from the same molecule:
fgumi group \
--input sorted.bam \
--output grouped.bam \
--strategy adjacency \
--allow-unmapped
Grouping Strategies
Grouping is performed by one of four strategies:
identity
Only reads with identical UMI sequences are grouped together. This is simpler and faster than other strategies, but should usually be avoided because sequencing errors in the UMI will split reads from the same molecule into separate groups. Useful for data exploration.
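A minimal Python sketch of exact-match grouping, assuming the reads at one position are represented simply by their UMI strings (the function name is hypothetical). It shows how a single sequencing error splits a family:

```python
from collections import defaultdict
from typing import Dict, List


def group_identity(umis: List[str]) -> List[List[int]]:
    """Group read indices by exact UMI string; no error tolerance at all."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for index, umi in enumerate(umis):
        groups[umi].append(index)
    # Groups are returned in order of first appearance of each UMI.
    return list(groups.values())


# The third read differs by one base, so it ends up in its own group.
print(group_identity(["ACGT", "ACGT", "ACGA"]))
```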
edit
Reads are clustered into groups such that each read within a group has at least one other read in the group with <= --edits differences, and there are no inter-group pairings with <= --edits differences. Effective when there are small numbers of reads per UMI, but breaks down at very high UMI coverage.
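The definition above amounts to single-linkage clustering over UMIs. The sketch below is one way to express it in Python, assuming equal-length UMIs and counting mismatches only (Hamming distance); the function names are hypothetical and fgumi's actual data structures are not shown in this page.

```python
from itertools import combinations
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def group_edit(umis: List[str], edits: int = 1) -> List[List[str]]:
    """Single-linkage clusters of unique UMIs joined by <= `edits` mismatches."""
    unique = sorted(set(umis))
    parent: Dict[str, str] = {u: u for u in unique}

    def find(u: str) -> str:
        # Union-find lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for a, b in combinations(unique, 2):
        if hamming(a, b) <= edits:
            parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for u in unique:
        clusters.setdefault(find(u), []).append(u)
    return sorted(clusters.values())
```

With `edits=1`, the UMIs AAAA, AAAT, and AATT form one cluster even though AAAA and AATT differ by two bases, because AAAT links them. This chaining is why the strategy degrades at very high UMI coverage.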
adjacency
A version of the directed adjacency method described in umi_tools that allows for errors between UMIs but only when there is a count gradient. Recommended for most simplex and CODEC workflows.
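For intuition, here is a Python sketch of directed adjacency using the count rule published for UMI-tools, where a UMI may absorb a neighbour only if count(parent) >= 2 * count(child) - 1. Whether fgumi uses exactly this threshold is an assumption; the function names are hypothetical and UMIs are assumed to be of equal length.

```python
from collections import Counter, deque
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def group_adjacency(umis: List[str], edits: int = 1) -> List[List[str]]:
    """Directed adjacency: abundant UMIs absorb rarer neighbours, never peers."""
    counts = Counter(umis)
    # Most abundant first; ties broken alphabetically for determinism.
    ordered = sorted(counts, key=lambda u: (-counts[u], u))
    assigned: Dict[str, int] = {}
    groups: List[List[str]] = []
    for root in ordered:
        if root in assigned:
            continue
        group_id = len(groups)
        assigned[root] = group_id
        group = [root]
        queue = deque([root])
        while queue:
            parent = queue.popleft()
            for child in ordered:
                if child in assigned:
                    continue
                close = hamming(parent, child) <= edits
                # Count gradient taken from UMI-tools (assumed for fgumi).
                gradient = counts[parent] >= 2 * counts[child] - 1
                if close and gradient:
                    assigned[child] = group_id
                    group.append(child)
                    queue.append(child)
        groups.append(group)
    return groups
```

In this sketch, ten AAAA reads absorb two AAAT reads, but two UMIs one mismatch apart with eight reads each stay separate because there is no count gradient between them.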
paired
Similar to adjacency but for duplex sequencing where each template has two UMIs (one from each strand). Expects UMI sequences stored in a single tag separated by a hyphen (e.g. ACGT-CCGG). Allows one UMI to be absent (e.g. ACGT- or -ACGT).
The molecular IDs produced have structure: {base}/{A|B}. For example, UMI pairs AAAA-GGGG and GGGG-AAAA map to 1/A and 1/B respectively. See Tracking Reads for details. Recommended for duplex workflows.
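The {base}/{A|B} structure can be illustrated with a small Python sketch. It makes two simplifying assumptions that are not taken from fgumi: UMI pairs are matched exactly (no --edits tolerance), and the strand label is decided by lexical order of the two UMIs rather than by read orientation. The function name is hypothetical.

```python
from typing import Dict, List, Tuple


def paired_molecule_ids(umis: List[str]) -> List[str]:
    """Assign {base}/{A|B} IDs so that X-Y and Y-X share one base ID."""
    base_ids: Dict[Tuple[str, str], int] = {}
    result = []
    for umi in umis:
        left, right = umi.split("-")
        # Both orientations of a pair collapse to the same canonical key.
        canonical = tuple(sorted((left, right)))
        base = base_ids.setdefault(canonical, len(base_ids) + 1)
        # Assumption: lexical order stands in for the real strand rule.
        strand = "A" if (left, right) == canonical else "B"
        result.append(f"{base}/{strand}")
    return result


print(paired_molecule_ids(["AAAA-GGGG", "GGGG-AAAA", "CCCC-TTTT"]))
```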
The edit, adjacency, and paired strategies use the --edits parameter to control matching of non-identical UMIs.
Cell Barcode Support
When processing data with cell barcodes (e.g. single-cell sequencing), reads at the same genomic position are partitioned by cell barcode before UMI assignment. This ensures that reads from different cells are never grouped together, even if they share a UMI and mapping position.
The cell barcode is read from the standard CB tag. No correction or
error-handling is performed on cell barcodes — they must be corrected upstream before grouping.
Cell barcodes are detected automatically across the entire pipeline — no additional flags are needed. The consensus callers validate that all source reads in a group share the same cell barcode and propagate it to the output consensus read.
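A Python sketch of the partitioning step, assuming each read is reduced to a hypothetical (position key, cell barcode, UMI) tuple. UMI assignment would then run independently inside each bucket:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape: (position_key, cell_barcode, umi).
ReadKey = Tuple[str, str, str]


def partition_by_cell(reads: Iterable[ReadKey]) -> Dict[Tuple[str, str], List[str]]:
    """Bucket UMIs by (position, cell barcode) before any UMI grouping."""
    buckets: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for position, cell_barcode, umi in reads:
        buckets[(position, cell_barcode)].append(umi)
    return dict(buckets)
```

Two reads with the same position and UMI but different CB values fall into different buckets, so no grouping strategy can merge them.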
Metrics Output
fgumi group can emit three types of metrics files. They can be specified individually or all at
once with the --metrics prefix flag.
Using --metrics (recommended)
The -M/--metrics flag writes all three metrics files under a single prefix in one step:
fgumi group \
--input sorted.bam \
--output grouped.bam \
--strategy adjacency \
--metrics my_sample
This produces:
- my_sample.family_sizes.txt: histogram of UMI family sizes
- my_sample.grouping_metrics.txt: overall grouping statistics
- my_sample.position_group_sizes.txt: histogram of UMI families per genomic position
Using individual flags
The three metrics can also be written to explicit paths:
fgumi group \
--input sorted.bam \
--output grouped.bam \
--strategy adjacency \
--family-size-histogram family_sizes.txt \
--grouping-metrics grouping_metrics.txt
Note: position_group_sizes.txt is only available via --metrics. The individual flags
--family-size-histogram and --grouping-metrics can be used alongside --metrics.
Family sizes
The family_sizes.txt file is a histogram of how many reads belong to each UMI family. A large
fraction of singleton families may indicate low sequencing depth relative to library
complexity, or UMI sequencing or extraction errors that split true families.
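To make the histogram concrete, the following Python sketch derives it from a list of MI values, one per read. The function name is hypothetical and the column layout of the real file is not reproduced here.

```python
from collections import Counter
from typing import Dict, List


def family_size_histogram(mi_values: List[str]) -> Dict[int, int]:
    """Map family size -> number of families with that many reads."""
    reads_per_family = Counter(mi_values)
    size_counts = Counter(reads_per_family.values())
    return dict(sorted(size_counts.items()))


# Families 0..3 have 3, 1, 2 and 1 reads respectively.
print(family_size_histogram(["0", "0", "0", "1", "2", "2", "3"]))
```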
Grouping metrics
The grouping_metrics.txt file contains summary statistics about the grouping run, including
total reads, accepted reads, discarded reads by reason, and UMI assignment counts.
Position group sizes
The position_group_sizes.txt file is a histogram of how many distinct UMI families were
observed at each unique genomic position (coordinate + strand). A distribution skewed toward
large position groups may indicate high on-target duplication or UMI exhaustion.
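The same idea can be sketched for position groups, assuming each read is reduced to a hypothetical (position key, MI) pair in which the position key already encodes coordinate and strand:

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple


def position_group_size_histogram(records: Iterable[Tuple[str, str]]) -> Dict[int, int]:
    """Map families-per-position -> number of positions with that many families."""
    families: Dict[str, Set[str]] = defaultdict(set)
    for position, mi in records:
        # A set keeps each family once, however many reads it contains.
        families[position].add(mi)
    size_counts = Counter(len(mis) for mis in families.values())
    return dict(sorted(size_counts.items()))
```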
Template-Coordinate Sort Order
fgumi group requires its input to be template-coordinate sorted. The header must advertise
SO:unsorted, GO:query, and SS:template-coordinate; without SS:template-coordinate the
input is treated as queryname-grouped (e.g. FASTQ-order output from fgumi extract) and
rejected with an actionable error pointing back here. fgumi group does not sort
internally — pre-sort with:
fgumi sort --order template-coordinate --input aligned.bam --output sorted.bam
The streaming grouper relies on records that share a position key being consecutive in the
input, which is what template-coordinate sort guarantees. Any other ordering (queryname,
coordinate, FASTQ-order) would split each true molecule across many small groups and assign
distinct MI values to reads that should share one.
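Before running fgumi group, the header requirement described above can be checked with a few lines of Python. This sketch parses SAM header text (for example the output of `samtools view -H sorted.bam`) and tests for the three @HD values named on this page; the function name is hypothetical, and the exact-match comparison is an assumption about how strictly fgumi validates the header.

```python
def is_template_coordinate(header_text: str) -> bool:
    """True if the @HD line carries SO:unsorted, GO:query, SS:template-coordinate."""
    for line in header_text.splitlines():
        if not line.startswith("@HD"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        fields = dict(
            field.split(":", 1) for field in line.split("\t")[1:] if ":" in field
        )
        return (
            fields.get("SO") == "unsorted"
            and fields.get("GO") == "query"
            and fields.get("SS") == "template-coordinate"
        )
    # No @HD line at all means no sort order is advertised.
    return False
```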
For single-cell data, the CB cell barcode tag is automatically incorporated in the sort key,
keeping templates from different cells at the same locus separate. No extra flag is needed, so
the command is unchanged:
fgumi sort --order template-coordinate --input aligned.bam --output sorted.bam
Template-coordinate order sorts reads by:
- The lower unclipped 5’ coordinate of the read pair
- The higher unclipped 5’ coordinate of the read pair
- Strand orientation
- The cellular barcode (CB tag, if present)
- The molecular identifier (MI tag, if present)
- Read name
- Library (from read group)
- Whether R1 has the lower coordinates of the pair
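The listed fields can be read as a composite sort key. The Python sketch below builds a tuple in the order given above, assuming a hypothetical `Template` record whose fields are already computed; a missing CB or MI tag is represented as an empty string, and the reference sequence is omitted for brevity. Both choices are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Template:
    # Hypothetical precomputed fields; names are illustrative only.
    lower_5p: int        # lower unclipped 5' coordinate of the pair
    higher_5p: int       # higher unclipped 5' coordinate of the pair
    strand: str          # strand orientation, e.g. "+" or "-"
    name: str            # read (query) name
    library: str         # library from the read group
    r1_is_lower: bool    # whether R1 holds the lower coordinate
    cb: str = ""         # cell barcode, empty if absent (assumption)
    mi: str = ""         # molecular identifier, empty if absent (assumption)


def sort_key(t: Template) -> Tuple:
    """Composite key mirroring the field order listed in this section."""
    return (t.lower_5p, t.higher_5p, t.strand, t.cb, t.mi,
            t.name, t.library, t.r1_is_lower)
```

Sorting with `sorted(templates, key=sort_key)` places templates that share coordinates, strand, and cell barcode next to each other, which is the property the streaming grouper depends on.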
Reads grouped by fgumi group with the same MI will share the same outer start/stop coordinates.
Because 5’ coordinates are strand-aware, reads from opposite strands with the same UMI and
position will not be grouped together (they belong to different strands of the same duplex
molecule).
See also: Consensus Calling, Duplex Consensus Calling, Best Practices