dedup

Category: DEDUP

Mark or remove PCR duplicates using UMI information

Description

Marks or removes PCR duplicates from a BAM file using UMI information. Requires template-coordinate sorted input with tc tags on secondary/supplementary reads (added by fgumi zipper).

Within each UMI family, the template with the highest sum of base qualities is selected as the representative; all others are marked as duplicates.
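The selection rule above can be illustrated with a minimal Python sketch. It is not fgumi's implementation: the `Template` class and `mark_family` function are hypothetical names, and the tie-breaking behavior (first template wins) is an assumption, since the text does not say how ties are resolved.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Template:
    """Hypothetical stand-in for one template (all reads from one fragment)."""
    name: str
    base_qualities: List[int]  # Phred scores pooled across the template's reads
    is_duplicate: bool = False


def mark_family(family: List[Template]) -> Optional[Template]:
    """Pick the template with the highest summed base quality; flag the rest."""
    if not family:
        return None
    # Assumption: on a tie, max() keeps the first template encountered.
    representative = max(family, key=lambda t: sum(t.base_qualities))
    for template in family:
        template.is_duplicate = template is not representative
    return representative


family = [
    Template("t1", [30, 30, 20]),  # sum 80
    Template("t2", [35, 35, 35]),  # sum 105
    Template("t3", [10, 40, 40]),  # sum 90
]
rep = mark_family(family)
print(rep.name, [t.is_duplicate for t in family])
```

Here `t2` has the largest quality sum, so it is kept as the representative and the other two templates are flagged.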

Input Requirements

  • Must be processed with fgumi zipper (adds tc tag for secondary/supplementary reads)
  • Must be sorted with fgumi sort --order template-coordinate
  • UMI tags on reads (RX tag), unless --no-umi is specified

Note: Using samtools sort will NOT work correctly because it doesn’t use the tc tag for template-coordinate ordering of secondary/supplementary reads.

Output Modes

  • Mark only (default): Set duplicate flag (0x400) on non-representative reads
  • Remove (--remove-duplicates): Exclude duplicate reads from output entirely
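
The difference between the two modes can be sketched in Python as follows. This is only an illustration, not fgumi's code: the record layout (name, flag, duplicate status) and the function name `apply_output_mode` are hypothetical. The only value taken from the text is the duplicate flag bit, 0x400.

```python
from typing import List, Tuple

DUPLICATE_FLAG = 0x400  # SAM flag bit for "PCR or optical duplicate"


def apply_output_mode(
    records: List[Tuple[str, int, bool]],
    remove_duplicates: bool = False,
) -> List[Tuple[str, int]]:
    """Return (name, flag) pairs after marking or removing duplicates.

    Each input record is (name, flag, is_duplicate); this layout is a
    simplification chosen for the example.
    """
    output = []
    for name, flag, is_duplicate in records:
        if is_duplicate:
            if remove_duplicates:
                continue  # remove mode: drop the read entirely
            flag |= DUPLICATE_FLAG  # mark mode: set the duplicate bit
        output.append((name, flag))
    return output


records = [("r1", 99, False), ("r2", 99, True), ("r3", 147, True)]
marked = apply_output_mode(records)
removed = apply_output_mode(records, remove_duplicates=True)
print(marked)
print(removed)
```

In mark mode all three reads are written and the two duplicates gain the 0x400 bit; in remove mode only `r1` is written.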

Cell Barcodes

If the input data contains cell barcodes (e.g. from single-cell sequencing), reads at the same genomic position are partitioned by cell barcode before deduplication. This ensures that reads from different cells are never marked as duplicates of each other, even if they share a UMI sequence and mapping position. The cell barcode is read from the standard CB tag. No correction or error-handling is performed on cell barcodes; they must be corrected upstream.
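
As a rough illustration of the partitioning step, the Python sketch below splits reads at one genomic position by their CB value before any UMI grouping would take place. The dictionary-based read representation and the function name are hypothetical. Placing reads without a CB tag into a shared `None` partition is an assumption, not documented behavior.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def partition_by_cell_barcode(
    reads: List[dict],
) -> Dict[Optional[str], List[dict]]:
    """Group reads from one genomic position by their CB tag value."""
    partitions = defaultdict(list)
    for read in reads:
        # Assumption: reads lacking a CB tag share a single partition (None).
        partitions[read.get("cb")].append(read)
    return dict(partitions)


# Three reads at the same position with the same UMI, from two cells.
reads_at_position = [
    {"name": "r1", "umi": "ACGT", "cb": "AAAC"},
    {"name": "r2", "umi": "ACGT", "cb": "TTTG"},
    {"name": "r3", "umi": "ACGT", "cb": "AAAC"},
]
parts = partition_by_cell_barcode(reads_at_position)
for barcode, members in parts.items():
    print(barcode, [r["name"] for r in members])
```

Only `r1` and `r3` end up in the same partition, so `r2` can never be marked as a duplicate of them even though all three share a UMI and position.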

Arguments

| Flag | Description | Default |
|------|-------------|---------|
| `-i, --input <INPUT>` | Input BAM file | required |
| `-o, --output <OUTPUT>` | Output BAM file | required |
| `--async-reader <ASYNC_READER>` | Enable async userspace prefetch on the input BAM | false |
| `-m, --metrics <METRICS>` | Path to write deduplication metrics | |
| `-H, --family-size-histogram <FAMILY_SIZE_HISTOGRAM>` | Path to write family size histogram | |
| `-r, --remove-duplicates <REMOVE_DUPLICATES>` | Remove duplicates instead of just marking them | false |
| `-q, --min-map-q <MIN_MAP_Q>` | Minimum mapping quality for a read to be included | |
| `-n, --include-non-pf-reads <INCLUDE_NON_PF_READS>` | Include reads flagged as not passing QC | false |
| `-s, --strategy <STRATEGY>` | UMI grouping strategy | adjacency |
| `-e, --edits <EDITS>` | Maximum edit distance for UMI grouping | 1 |
| `-l, --min-umi-length <MIN_UMI_LENGTH>` | Minimum UMI length (UMIs shorter than this are discarded) | |
| `--threads <THREADS>` | Number of threads for the multi-threaded pipeline | |
| `--compression-level <COMPRESSION_LEVEL>` | Compression level for output BAM (0-12) | 1 |
| `--index-threshold <INDEX_THRESHOLD>` | Minimum UMIs per position to use index for faster grouping | 100 |
| `--no-umi <NO_UMI>` | Skip UMI-based grouping; group by position only. Forces identity strategy and ignores any existing UMI tags | false |
| `--scheduler <SCHEDULER>` | Scheduler strategy for thread work assignment | balanced-chase-drain |
| `--pipeline-stats <PIPELINE_STATS>` | Print detailed pipeline statistics at completion | false |
| `--deadlock-timeout <DEADLOCK_TIMEOUT>` | Timeout in seconds for deadlock detection (0 = disabled) | 10 |
| `--deadlock-recover <DEADLOCK_RECOVER>` | Enable automatic deadlock recovery (when false, detection only) | false |
| `--queue-memory <QUEUE_MEMORY>` | Pipeline queue memory limit, per thread (default) or total | 768 |
| `--queue-memory-per-thread <QUEUE_MEMORY_PER_THREAD>` | Interpret `--queue-memory` as per-thread (true) or total (false) | true |
| `--queue-memory-limit-mb <QUEUE_MEMORY_LIMIT_MB>` | DEPRECATED: Use `--queue-memory` instead. Memory limit for pipeline queues in megabytes | |
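
For scripted use, the Python sketch below assembles a `fgumi dedup` command line from a few of the flags listed above. The helper name `build_dedup_command` and all file names are hypothetical. It only builds and prints the argument list and does not run the tool. It sticks to flags that clearly take a value (`--input`, `--output`, `--metrics`, `--strategy`, `--edits`), because this page does not show how boolean options are passed.

```python
import shlex
from typing import List, Optional


def build_dedup_command(
    input_bam: str,
    output_bam: str,
    metrics: Optional[str] = None,
    strategy: Optional[str] = None,
    edits: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for `fgumi dedup` from documented flags."""
    cmd = ["fgumi", "dedup", "--input", input_bam, "--output", output_bam]
    if metrics is not None:
        cmd += ["--metrics", metrics]
    if strategy is not None:
        cmd += ["--strategy", strategy]
    if edits is not None:
        cmd += ["--edits", str(edits)]
    return cmd


# File names are placeholders; the input is assumed to be already
# processed with fgumi zipper and sorted in template-coordinate order.
cmd = build_dedup_command(
    "sorted.bam",
    "dedup.bam",
    metrics="dedup_metrics.txt",
    strategy="adjacency",
    edits=1,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once fgumi is installed and the input meets the requirements listed under Input Requirements.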