dedup

Category: DEDUP

Mark or remove PCR duplicates using UMI information

Description

Marks or removes PCR duplicates from a BAM file using UMI information. Requires template-coordinate sorted input with tc tags on secondary/supplementary reads (added by fgumi zipper).

Within each UMI family, the template with the highest sum of base qualities is selected as the representative; all others are marked as duplicates.
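
The selection rule above can be illustrated with a minimal Python sketch. It is not the tool's implementation: the `Template` class, its field names, and the tie-breaking behaviour (first template wins) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Template:
    # Hypothetical stand-in for all reads of one template in a UMI family.
    name: str
    base_qualities: List[int]  # Phred scores pooled across the template's reads


def pick_representative(family: List[Template]) -> Template:
    # Highest sum of base qualities wins; max() keeps the first template on
    # ties, which is an assumption and not documented behaviour.
    return max(family, key=lambda t: sum(t.base_qualities))


def mark_duplicates(family: List[Template]) -> Dict[str, bool]:
    # Every template other than the representative is treated as a duplicate.
    rep = pick_representative(family)
    return {t.name: (t is not rep) for t in family}


family = [
    Template("t1", [30, 30, 20]),  # sum 80
    Template("t2", [35, 35, 30]),  # sum 100
    Template("t3", [10, 10, 10]),  # sum 30
]
duplicate_status = mark_duplicates(family)
```

Here `t2` has the largest quality sum, so it is kept as the representative and `t1` and `t3` are reported as duplicates.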

Input Requirements

Must be processed with fgumi zipper (adds tc tag for secondary/supplementary reads)
Must be sorted with fgumi sort --order template-coordinate
Must carry UMI tags on reads (RX tag), unless --no-umi is specified

Note: Using samtools sort will NOT work correctly because it doesn’t use the tc tag for template-coordinate ordering of secondary/supplementary reads.

Output Modes

Mark only (default): Set duplicate flag (0x400) on non-representative reads
Remove (--remove-duplicates): Exclude duplicate reads from output entirely
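
As a rough illustration of the two modes, the sketch below applies them to a toy list of records. The tuple layout and the `apply_mode` helper are hypothetical; only the 0x400 flag bit comes from the text above.

```python
from typing import List, Tuple

DUPLICATE_FLAG = 0x400  # SAM flag bit marking a read as a duplicate


def apply_mode(
    records: List[Tuple[str, int, bool]], remove_duplicates: bool = False
) -> List[Tuple[str, int]]:
    # Each record is (read name, SAM flag, is_duplicate); this layout is an
    # assumption made for the example.
    output = []
    for name, flag, is_duplicate in records:
        if is_duplicate:
            if remove_duplicates:
                continue  # remove mode: the duplicate is left out of the output
            flag |= DUPLICATE_FLAG  # mark mode: set the duplicate bit
        output.append((name, flag))
    return output


records = [("r1", 99, False), ("r2", 99, True)]
marked = apply_mode(records)
removed = apply_mode(records, remove_duplicates=True)
```

In mark mode both reads are written and `r2` gains the 0x400 bit; in remove mode only `r1` is written.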

Cell Barcodes

If the input data contains cell barcodes (e.g. from single-cell sequencing), reads at the same genomic position are partitioned by cell barcode before deduplication. This ensures that reads from different cells are never marked as duplicates of each other, even if they share a UMI sequence and mapping position. The cell barcode is read from the standard CB tag. No correction or error-handling is performed on cell barcodes; they must be corrected upstream.
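
The partitioning step can be sketched as follows. The dictionary-based read representation and the handling of reads without a CB tag (grouped under `None`) are assumptions for illustration, not documented behaviour.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def partition_by_cell_barcode(
    reads: List[dict],
) -> Dict[Optional[str], List[dict]]:
    # Group reads at one genomic position by their CB tag. Barcodes are
    # compared verbatim, mirroring the "no correction" note above.
    groups: Dict[Optional[str], List[dict]] = defaultdict(list)
    for read in reads:
        groups[read.get("CB")].append(read)
    return dict(groups)


# Three reads at the same position with the same UMI (RX) but two cells.
reads = [
    {"name": "a", "CB": "AAAC", "RX": "ACGT"},
    {"name": "b", "CB": "AAAC", "RX": "ACGT"},
    {"name": "c", "CB": "TTTG", "RX": "ACGT"},
]
groups = partition_by_cell_barcode(reads)
```

Deduplication would then run separately within each group, so read `c` is never compared against `a` or `b` even though all three share a UMI.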

Arguments

Flag	Description	Default
`-i, --input <INPUT>`	Input BAM file	required
`-o, --output <OUTPUT>`	Output BAM file	required
`--async-reader <ASYNC_READER>`	Enable async userspace prefetch on the input BAM	`false`
`-m, --metrics <METRICS>`	Path to write deduplication metrics
`-H, --family-size-histogram <FAMILY_SIZE_HISTOGRAM>`	Path to write family size histogram
`-r, --remove-duplicates <REMOVE_DUPLICATES>`	Remove duplicates instead of just marking them	`false`
`-q, --min-map-q <MIN_MAP_Q>`	Minimum mapping quality for a read to be included
`-n, --include-non-pf-reads <INCLUDE_NON_PF_READS>`	Include reads flagged as not passing QC	`false`
`-s, --strategy <STRATEGY>`	UMI grouping strategy	`adjacency`
`-e, --edits <EDITS>`	Maximum edit distance for UMI grouping	`1`
`-l, --min-umi-length <MIN_UMI_LENGTH>`	Minimum UMI length (UMIs shorter than this are discarded)
`--threads <THREADS>`	Number of threads for the multi-threaded pipeline
`--compression-level <COMPRESSION_LEVEL>`	Compression level for output BAM (0-12)	`1`
`--index-threshold <INDEX_THRESHOLD>`	Minimum UMIs per position to use index for faster grouping	`100`
`--no-umi <NO_UMI>`	Skip UMI-based grouping; group by position only. Forces identity strategy and ignores any existing UMI tags	`false`
`--scheduler <SCHEDULER>`	Scheduler strategy for thread work assignment	`balanced-chase-drain`
`--pipeline-stats <PIPELINE_STATS>`	Print detailed pipeline statistics at completion	`false`
`--deadlock-timeout <DEADLOCK_TIMEOUT>`	Timeout in seconds for deadlock detection (default: 10, 0 = disabled)	`10`
`--deadlock-recover <DEADLOCK_RECOVER>`	Enable automatic deadlock recovery (default: false, detection only)	`false`
`--queue-memory <QUEUE_MEMORY>`	Pipeline queue memory limit, interpreted per thread (default) or as a total depending on --queue-memory-per-thread	`768`
`--queue-memory-per-thread <QUEUE_MEMORY_PER_THREAD>`	Interpret --queue-memory as per-thread (true, default) or total (false)	`true`
`--queue-memory-limit-mb <QUEUE_MEMORY_LIMIT_MB>`	DEPRECATED: Use --queue-memory instead. Memory limit for pipeline queues in megabytes
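
To make the interaction between --queue-memory, --queue-memory-per-thread, and --threads concrete, here is a small arithmetic sketch. It assumes the value is in megabytes (inferred from the deprecated --queue-memory-limit-mb flag) and that a per-thread limit scales linearly with the thread count; neither is confirmed above, and the function name is hypothetical.

```python
def total_queue_memory_mb(
    queue_memory: int, threads: int, per_thread: bool = True
) -> int:
    # Assumption: a per-thread limit is multiplied by the number of threads,
    # while a total limit is used unchanged.
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return queue_memory * threads if per_thread else queue_memory


per_thread_total = total_queue_memory_mb(768, threads=8)
fixed_total = total_queue_memory_mb(768, threads=8, per_thread=False)
```

Under these assumptions, the default of 768 with 8 threads allows up to 6144 MB of queue memory overall, whereas setting --queue-memory-per-thread to false caps the whole pipeline at 768 MB.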