group

Category: GROUP

Group reads by UMI to identify reads from the same original molecule

Description

Groups reads together that appear to have come from the same original molecule. Reads are grouped by template, and then templates are sorted by the 5’ mapping positions of their reads, taken in order from earliest mapping position to latest. Reads that have the same end positions are then sub-grouped by UMI sequence.

Requires input to be template-coordinate sorted (header must advertise SO:unsorted, GO:query, and SS:template-coordinate). Sort upstream sources (fgumi extract, samtools sort -n, fgumi merge --order queryname, etc.) with fgumi sort -i input.bam -o sorted.bam --order template-coordinate before piping into this tool. Output is always written in template-coordinate order, sorted by:

  1. The lower genome coordinate of the two outer ends of the templates (strand-aware)
  2. The sequencing library
  3. The cell barcode (CB tag, if present)
  4. The assigned UMI tag
  5. Read Name

During grouping, reads and templates are filtered out as follows:

  1. Templates are filtered if all reads for the template are unmapped
  2. Templates are filtered if any non-secondary, non-supplementary read has mapping quality < min-map-q
  3. Templates are filtered if any UMI sequence contains one or more N bases
  4. Templates are filtered if --min-umi-length is specified and the UMI does not meet the length requirement
  5. Records are filtered out if flagged as either secondary or supplementary
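
The filter rules above can be read as a single predicate over a template. The following is a minimal sketch, not fgumi's implementation: the `Read` class and `keep_template` function are hypothetical names, and the choice to exempt unmapped mates from the mapping-quality check is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Read:
    """Hypothetical stand-in for a SAM record (not an fgumi type)."""
    mapped: bool
    mapq: int
    secondary: bool = False
    supplementary: bool = False


def keep_template(reads: List[Read], umi: str, min_map_q: int = 1,
                  min_umi_length: Optional[int] = None) -> bool:
    # Rule 5: secondary and supplementary records are dropped first.
    primary = [r for r in reads if not (r.secondary or r.supplementary)]
    # Rule 1: drop the template when every remaining read is unmapped.
    if all(not r.mapped for r in primary):
        return False
    # Rule 2: any mapped primary read below the threshold drops the template.
    # Assumption: unmapped mates are exempt from the MAPQ check.
    if any(r.mapped and r.mapq < min_map_q for r in primary):
        return False
    # Rule 3: UMIs containing N are rejected.
    if "N" in umi:
        return False
    # Rule 4: optional minimum length, counting only A/C/G/T bases.
    if min_umi_length is not None:
        if sum(base in "ACGT" for base in umi) < min_umi_length:
            return False
    return True
```

For example, `keep_template([Read(True, 30), Read(True, 0)], "ACGT")` rejects the template because one primary read falls below the default threshold of 1.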

Grouping of UMIs is performed by one of four strategies:

  1. identity: only reads with identical UMI sequences are grouped together. This strategy may be useful for evaluating data, but should generally be avoided as it will generate multiple UMI groups per original molecule in the presence of errors.
  2. edit: reads are clustered into groups such that each read within a group has at least one other read in the group with <= edits differences and there are no inter-group pairings with <= edits differences. Effective when there are small numbers of reads per UMI, but breaks down at very high coverage of UMIs.
  3. adjacency: a version of the directed adjacency method described in umi_tools (http://dx.doi.org/10.1101/051755) that allows for errors between UMIs but only when there is a count gradient.
  4. paired: similar to adjacency but for methods that produce templates such that a read with A-B is related to but not identical to a read with B-A. Expects the UMI sequences to be stored in a single SAM tag separated by a hyphen (e.g. ACGT-CCGG) and allows for one of the two UMIs to be absent (e.g. ACGT- or -ACGT). The molecular IDs produced have more structure than for single UMI strategies and are of the form {base}/{A|B}. E.g. two UMI pairs would be mapped as follows: AAAA-GGGG -> 1/A, GGGG-AAAA -> 1/B.
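
The `{base}/{A|B}` form used by the paired strategy can be illustrated with exact-match pairs only. This is a sketch under stated assumptions: `paired_molecule_ids` is a hypothetical helper, it ignores the --edits tolerance and absent UMIs, and it picks the `A` orientation by lexicographic order purely for illustration, which may not be how fgumi decides.

```python
from typing import Dict, Iterable, Tuple


def paired_molecule_ids(umis: Iterable[str]) -> Dict[str, str]:
    base_ids: Dict[Tuple[str, str], int] = {}
    result: Dict[str, str] = {}
    for umi in umis:
        a, b = umi.split("-", 1)
        # A-B and B-A share one canonical key, so they share one base ID.
        canonical = tuple(sorted((a, b)))
        base = base_ids.setdefault(canonical, len(base_ids) + 1)
        # Assumption: the orientation matching the canonical order is "A".
        strand = "A" if (a, b) == canonical else "B"
        result[umi] = f"{base}/{strand}"
    return result
```

With the input `["AAAA-GGGG", "GGGG-AAAA"]` this reproduces the mapping given above: `AAAA-GGGG -> 1/A` and `GGGG-AAAA -> 1/B`.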

Strategies edit, adjacency, and paired make use of the --edits parameter to control the matching of non-identical UMIs.
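
To make the role of the edits threshold and the count gradient concrete, here is a minimal sketch of a directed adjacency grouping. It assumes equal-length UMIs, Hamming distance, and the umi_tools gradient rule (a parent absorbs a child when `count(parent) >= 2 * count(child) - 1`); fgumi's exact threshold and data structures may differ, and `adjacency_groups` is a hypothetical name.

```python
from collections import Counter, deque
from typing import Dict, Iterable


def hamming(a: str, b: str) -> int:
    # Number of mismatching positions; assumes len(a) == len(b).
    return sum(x != y for x, y in zip(a, b))


def adjacency_groups(umis: Iterable[str], edits: int = 1) -> Dict[str, int]:
    counts = Counter(umis)
    # Visit the most abundant UMIs first; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda u: (-counts[u], u))
    assigned: Dict[str, int] = {}
    next_id = 0
    for root in ordered:
        if root in assigned:
            continue
        assigned[root] = next_id
        queue = deque([root])
        while queue:
            parent = queue.popleft()
            for child in ordered:
                if child in assigned:
                    continue
                close = hamming(parent, child) <= edits
                gradient = counts[parent] >= 2 * counts[child] - 1
                if close and gradient:
                    # The child joins the parent's group and may absorb
                    # its own lower-count neighbours in turn.
                    assigned[child] = next_id
                    queue.append(child)
        next_id += 1
    return assigned
```

Two UMIs that differ by one base but have equal counts (for example five reads each) are kept apart, because neither satisfies the gradient condition against the other.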

By default, all UMIs must be the same length. If --min-umi-length=len is specified, reads that have a UMI shorter than len are discarded. When two UMIs of different lengths are compared, only the first n bases are compared, where n is the length of the shorter UMI. The UMI length is the number of [ACGT] bases in the UMI (i.e. does not count dashes and other non-ACGT characters). This option is not implemented for reads with UMI pairs (i.e. using the paired assigner).
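
The length rule and the prefix comparison can be sketched as follows. The helper names are hypothetical, and the comparison shown is a plain mismatch count over the shared prefix, which is an assumption about how the truncated UMIs are compared.

```python
def umi_length(umi: str) -> int:
    # Only A, C, G and T count; dashes and other characters are ignored.
    return sum(base in "ACGT" for base in umi)


def passes_min_length(umi: str, min_umi_length: int) -> bool:
    return umi_length(umi) >= min_umi_length


def mismatches_on_shared_prefix(a: str, b: str) -> int:
    # Compare only as many leading bases as the shorter UMI provides.
    n = min(len(a), len(b))
    return sum(x != y for x, y in zip(a[:n], b[:n]))
```

For instance, `umi_length("ACGT-CCGG")` is 8 because the dash is not counted, and `mismatches_on_shared_prefix("ACGTAC", "ACGA")` compares only the first four bases.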

Note: the --min-map-q parameter defaults to 0 in duplicate marking mode and 1 otherwise, and can be set directly on the command line.

Cell Barcodes

If the input data contains cell barcodes (e.g. from single-cell sequencing), reads at the same genomic position are partitioned by cell barcode before UMI grouping. This ensures that reads from different cells are never grouped together, even if they share a UMI sequence and mapping position. The cell barcode is read from the standard CB tag. No correction or error-handling is performed on cell barcodes; they must be corrected upstream.
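
As an illustration of the partitioning step, the sketch below keys each position group by cell barcode before any UMI grouping happens. The tuple layout and the `partition_by_cell_barcode` name are assumptions made for the example; templates without a CB tag are represented with `None`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

Template = Tuple[int, Optional[str], str]  # (position, cell barcode, UMI)


def partition_by_cell_barcode(
    templates: Iterable[Template],
) -> Dict[Tuple[int, Optional[str]], List[str]]:
    groups: Dict[Tuple[int, Optional[str]], List[str]] = defaultdict(list)
    for position, cell_barcode, umi in templates:
        # Barcodes are used exactly as given; no correction is attempted.
        groups[(position, cell_barcode)].append(umi)
    return dict(groups)
```

Two templates at the same position with the same UMI but different barcodes land in different partitions, so a later UMI grouping step never sees them together.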

Multi-threaded operation is supported via --threads N, which spawns N pipeline threads allocated based on the command’s workload profile to optimize performance.

Example: --threads 8 spawns 8 pipeline threads (2 readers, 4 workers, 2 writers)

Note: when --parallel-group-min-templates (or --allow-unmapped) engages the parallel UMI assigner, each parallel assigner constructs its own rayon thread pool of size --threads, independent of the pipeline threads above. As an example, one pipeline worker overlapping a single parallel assigner briefly runs ~2 * --threads OS threads; this is not an upper bound, because multiple pipeline workers can each spawn a --threads-sized pool concurrently and push the live thread count higher still. See --parallel-group-min-templates for details.
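
The thread arithmetic in the note can be written out as a rough estimate. This is back-of-the-envelope reasoning based only on the description above, not a measurement of fgumi; `estimated_live_threads` is a hypothetical helper.

```python
def estimated_live_threads(threads: int, concurrent_parallel_assigners: int) -> int:
    # The pipeline threads plus one --threads-sized pool per parallel
    # assigner that happens to be active at the same moment.
    return threads + concurrent_parallel_assigners * threads
```

With --threads 8, one active parallel assigner gives roughly 16 live threads, while three assigners running concurrently give roughly 32.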

Arguments

| Flag | Description | Default |
| --- | --- | --- |
| -i, --input <INPUT> | Input BAM file | required |
| -o, --output <OUTPUT> | Output BAM file | required |
| --async-reader <ASYNC_READER> | Enable async userspace prefetch on the input BAM | false |
| -f, --family-size-histogram <FAMILY_SIZE_HISTOGRAM> | Optional output of tag family size counts | |
| -g, --grouping-metrics <GROUPING_METRICS> | Optional output of UMI grouping metrics | |
| -M, --metrics <METRICS> | Output prefix for all group metrics files | |
| -m, --min-map-q <MIN_MAP_Q> | Minimum mapping quality for mapped reads | |
| -n, --include-non-pf-reads <INCLUDE_NON_PF_READS> | Include non-PF reads | false |
| --allow-unmapped <ALLOW_UNMAPPED> | Allow fully unmapped templates (both reads unmapped). Input must be template-coordinate sorted (fgumi sort --order template-coordinate) | false |
| `--parallel-group-min-templates <N\|auto>` | Enable the parallel UMI assigner for position groups with at least this many templates. Useful for amplicon and other workflows where individual mapped position groups are very large; the default for normal whole-genome data is to stay sequential. Has an effect only when --threads is greater than 1: with --threads 1 the assigner always falls back to the sequential implementation | |
| -s, --strategy <STRATEGY> | The UMI assignment strategy | required |
| -e, --edits <EDITS> | The allowable number of edits between UMIs | 1 |
| -l, --min-umi-length <MIN_UMI_LENGTH> | The minimum UMI length | |
| --threads <THREADS> | Number of threads for the multi-threaded pipeline | |
| --compression-level <COMPRESSION_LEVEL> | Compression level for output BAM (0-12) | 1 |
| --index-threshold <INDEX_THRESHOLD> | Minimum UMIs per position to use N-gram/BK-tree index for faster grouping. Set to 0 to always use linear scan. Only affects Adjacency/Paired strategies | 100 |
| --no-umi <NO_UMI> | Skip UMI-based grouping; group by position only. Forces identity strategy and ignores any existing UMI tags | false |
| --scheduler <SCHEDULER> | Scheduler strategy for thread work assignment | balanced-chase-drain |
| --pipeline-stats <PIPELINE_STATS> | Print detailed pipeline statistics at completion | false |
| --deadlock-timeout <DEADLOCK_TIMEOUT> | Timeout in seconds for deadlock detection (0 = disabled) | 10 |
| --deadlock-recover <DEADLOCK_RECOVER> | Enable automatic deadlock recovery (when false, detection only) | false |
| --queue-memory <QUEUE_MEMORY> | Pipeline queue memory limit per thread (default) or total | 768 |
| --queue-memory-per-thread <QUEUE_MEMORY_PER_THREAD> | Interpret --queue-memory as per-thread (true, default) or total (false) | true |
| --queue-memory-limit-mb <QUEUE_MEMORY_LIMIT_MB> | DEPRECATED: Use --queue-memory instead. Memory limit for pipeline queues in megabytes | |