dedup
Category: DEDUP
Mark or remove PCR duplicates using UMI information
Description
Marks or removes PCR duplicates from a BAM file using UMI information.
Requires template-coordinate sorted input with tc tags on secondary/supplementary
reads (added by fgumi zipper).
Within each UMI family, the template with the highest sum of base qualities is selected as the representative; all others are marked as duplicates.
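The selection rule above can be sketched in a few lines of Python. This is an illustration only, not fgumi's implementation: the `Template` type, its fields, and the function names are hypothetical, and the tie-breaking behaviour (the earliest template wins) is an assumption that the description does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Template:
    """Hypothetical stand-in for one template (all reads from one fragment)."""
    name: str
    base_qualities: List[int]  # Phred scores pooled across the template's reads


def quality_sum(template: Template) -> int:
    """Score a template by the sum of its base qualities."""
    return sum(template.base_qualities)


def pick_representative(family: List[Template]) -> Template:
    """Return the template with the highest base-quality sum.

    max() returns the first maximal element, so ties go to the earliest
    template in the family (an assumption, not documented behaviour).
    """
    return max(family, key=quality_sum)


def duplicates_of(family: List[Template]) -> List[Template]:
    """Every template in the family other than the representative."""
    representative = pick_representative(family)
    return [t for t in family if t is not representative]


family = [
    Template("t1", [30, 30, 20]),  # sum 80
    Template("t2", [35, 35, 30]),  # sum 100
    Template("t3", [10, 40, 40]),  # sum 90
]
representative = pick_representative(family)
duplicate_names = [t.name for t in duplicates_of(family)]
```

Here `t2` would be kept as the representative and `t1` and `t3` would be treated as duplicates.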
Input Requirements
- Must be processed with fgumi zipper (adds tc tag for secondary/supplementary reads)
- Must be sorted with fgumi sort --order template-coordinate
- UMI tags on reads (RX tag), unless --no-umi is specified
Note: Using samtools sort will NOT work correctly because it doesn’t use the
tc tag for template-coordinate ordering of secondary/supplementary reads.
Output Modes
- Mark only (default): Set duplicate flag (0x400) on non-representative reads
- Remove (--remove-duplicates): Exclude duplicate reads from output entirely
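The difference between the two modes can be illustrated with a small sketch that operates on (name, flag) pairs. It is a simplified model under stated assumptions, not the tool's code: the record representation and the function names are hypothetical, and only the 0x400 duplicate bit comes from the description above.

```python
DUPLICATE_FLAG = 0x400  # SAM flag bit marking a read as a duplicate


def mark_duplicates(records, duplicate_names):
    """Mark-only mode: keep every record, set 0x400 on the duplicates."""
    marked = []
    for name, flag in records:
        if name in duplicate_names:
            flag = flag | DUPLICATE_FLAG
        marked.append((name, flag))
    return marked


def remove_duplicates(records, duplicate_names):
    """Remove mode: drop the duplicates, leave the other records unchanged."""
    return [(name, flag) for name, flag in records if name not in duplicate_names]


# Hypothetical records: (read name, SAM flag) pairs
records = [("t1", 99), ("t2", 99), ("t3", 163)]
duplicates = {"t1", "t3"}

marked = mark_duplicates(records, duplicates)
removed = remove_duplicates(records, duplicates)
```

In mark-only mode the output has the same number of records as the input, so downstream tools can still choose whether to honour the duplicate flag. In remove mode that choice is made once, at deduplication time.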
Cell Barcodes
If the input data contains cell barcodes (e.g. from single-cell sequencing), reads at the same
genomic position are partitioned by cell barcode before deduplication. This ensures that reads from
different cells are never marked as duplicates of each other, even if they share a UMI sequence and
mapping position. The cell barcode is read from the standard CB tag. No
correction or error-handling is performed on cell barcodes; they must be corrected upstream.
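A minimal sketch of this partitioning, assuming reads are represented as plain dictionaries keyed by tag name. The function names are hypothetical, UMIs are matched exactly for simplicity (the tool's default strategy is adjacency), and placing reads without a CB tag in a shared `None` partition is an assumption rather than documented behaviour.

```python
from collections import defaultdict


def partition_by_cell_barcode(reads):
    """Split reads from one genomic position into per-cell partitions.

    Reads lacking a CB tag share the None partition (assumption).
    """
    partitions = defaultdict(list)
    for read in reads:
        partitions[read.get("CB")].append(read)
    return dict(partitions)


def umi_families(reads):
    """Group read names by (cell barcode, UMI), using exact UMI matches."""
    families = defaultdict(list)
    for cell_barcode, partition in partition_by_cell_barcode(reads).items():
        for read in partition:
            families[(cell_barcode, read["RX"])].append(read["name"])
    return dict(families)


# Three hypothetical reads at the same position with the same UMI
reads = [
    {"name": "r1", "RX": "ACGT", "CB": "CELL1"},
    {"name": "r2", "RX": "ACGT", "CB": "CELL2"},
    {"name": "r3", "RX": "ACGT", "CB": "CELL1"},
]
families = umi_families(reads)
```

Although all three reads share a UMI and a position, `r2` ends up in its own family because it comes from a different cell. Only `r1` and `r3` are candidates to be duplicates of each other.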
Arguments
| Flag | Description | Default |
|---|---|---|
| -i, --input <INPUT> | Input BAM file | required |
| -o, --output <OUTPUT> | Output BAM file | required |
| --async-reader <ASYNC_READER> | Enable async userspace prefetch on the input BAM | false |
| -m, --metrics <METRICS> | Path to write deduplication metrics | |
| -H, --family-size-histogram <FAMILY_SIZE_HISTOGRAM> | Path to write family size histogram | |
| -r, --remove-duplicates <REMOVE_DUPLICATES> | Remove duplicates instead of just marking them | false |
| -q, --min-map-q <MIN_MAP_Q> | Minimum mapping quality for a read to be included | |
| -n, --include-non-pf-reads <INCLUDE_NON_PF_READS> | Include reads flagged as not passing QC | false |
| -s, --strategy <STRATEGY> | UMI grouping strategy | adjacency |
| -e, --edits <EDITS> | Maximum edit distance for UMI grouping | 1 |
| -l, --min-umi-length <MIN_UMI_LENGTH> | Minimum UMI length (UMIs shorter than this are discarded) | |
| --threads <THREADS> | Number of threads for the multi-threaded pipeline | |
| --compression-level <COMPRESSION_LEVEL> | Compression level for output BAM (0-12) | 1 |
| --index-threshold <INDEX_THRESHOLD> | Minimum UMIs per position to use index for faster grouping | 100 |
| --no-umi <NO_UMI> | Skip UMI-based grouping; group by position only. Forces identity strategy and ignores any existing UMI tags | false |
| --scheduler <SCHEDULER> | Scheduler strategy for thread work assignment | balanced-chase-drain |
| --pipeline-stats <PIPELINE_STATS> | Print detailed pipeline statistics at completion | false |
| --deadlock-timeout <DEADLOCK_TIMEOUT> | Timeout in seconds for deadlock detection (default: 10, 0 = disabled) | 10 |
| --deadlock-recover <DEADLOCK_RECOVER> | Enable automatic deadlock recovery (default: false, detection only) | false |
| --queue-memory <QUEUE_MEMORY> | Pipeline queue memory limit per thread (default) or total | 768 |
| --queue-memory-per-thread <QUEUE_MEMORY_PER_THREAD> | Interpret --queue-memory as per-thread (true, default) or total (false) | true |
| --queue-memory-limit-mb <QUEUE_MEMORY_LIMIT_MB> | DEPRECATED: Use --queue-memory instead. Memory limit for pipeline queues in megabytes | |