correct
Category: UMI EXTRACTION
Correct UMIs in a BAM file to a fixed set of UMIs
Description
Corrects UMIs stored in BAM files when a set of fixed UMIs is in use.
If the set of UMIs used in an experiment is known and is a subset of the possible randomers
of the same length, it is possible to error-correct UMIs prior to grouping reads by UMI. This
tool takes an input BAM with UMIs in the RX tag and a set of known UMIs (either on
the command line or in a file) and produces:
- A new BAM with corrected UMIs written to the RX tag
- Optionally a set of metrics about the representation of each UMI in the set
- Optionally a second BAM file of reads whose UMIs could not be corrected within the specified parameters
All of the fixed UMIs must be of the same length, and all UMIs in the BAM file must also have
the same length. Multiple UMIs that are concatenated with hyphens (e.g. AACCAGT-AGGTAGA) are
split apart, corrected individually and then re-assembled. A read is accepted only if all the
UMIs can be corrected.
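The split, correct, and re-assemble behaviour for hyphenated UMIs can be sketched as follows. This is an illustration only, not the tool's implementation: `correct_hyphenated` is a hypothetical name, and `toy_corrections` is a hand-written lookup standing in for whatever per-UMI correction is in use.

```python
from typing import Callable, Optional

def correct_hyphenated(
    rx: str, correct_one: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Correct each hyphen-separated UMI; reject the read if any part fails."""
    parts = rx.split("-")
    corrected = [correct_one(part) for part in parts]
    # A read is accepted only if every individual UMI could be corrected.
    if any(c is None for c in corrected):
        return None
    return "-".join(corrected)

# Hypothetical stand-in for a real matcher: maps an observed UMI to a fixed UMI.
# Anything missing from the mapping is treated as uncorrectable.
toy_corrections = {
    "AACCAGT": "AACCAGT",  # already an expected UMI
    "AGGTAGC": "AGGTAGA",  # one mismatch away from an expected UMI
}

print(correct_hyphenated("AACCAGT-AGGTAGC", toy_corrections.get))
print(correct_hyphenated("AACCAGT-TTTTTTT", toy_corrections.get))
```

The first call yields the re-assembled tag value; the second yields `None` because one of the two parts has no acceptable match.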
Correction Parameters
Correction is controlled by two parameters that are applied per-UMI:
- --max-mismatches controls how many mismatches (no-calls are counted as mismatches) are tolerated between a UMI as read and a fixed UMI
- --min-distance controls how many more mismatches the next best hit must have
For example, with two fixed UMIs AAAAA and CCCCC and --max-mismatches=3 and --min-distance=2:
- AAAAA would match to AAAAA
- AAGTG would match to AAAAA with three mismatches because CCCCC has five mismatches and 5 >= 3 + 2
- AACCA would be rejected because it is 2 mismatches to AAAAA and 3 to CCCCC and 3 < 2 + 2
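The rule illustrated by these examples can be written out as a short sketch. It assumes a plain Hamming-style comparison and is not taken from the tool's source; the names `mismatches` and `correct_umi` are hypothetical.

```python
from typing import Optional, Sequence

def mismatches(observed: str, fixed: str) -> int:
    """Count differing positions between two equal-length UMIs.

    A no-call (N) in the observed UMI never equals A/C/G/T, so it is
    counted as a mismatch.
    """
    if len(observed) != len(fixed):
        raise ValueError("UMIs must have the same length")
    return sum(1 for o, f in zip(observed, fixed) if o != f)

def correct_umi(
    observed: str,
    fixed_umis: Sequence[str],
    max_mismatches: int,
    min_distance: int,
) -> Optional[str]:
    """Return the matching fixed UMI, or None if the UMI is rejected."""
    scored = sorted((mismatches(observed, f), f) for f in fixed_umis)
    best_mm, best = scored[0]
    # The best hit must be within the mismatch budget.
    if best_mm > max_mismatches:
        return None
    # The next best hit must be at least min_distance mismatches worse.
    if len(scored) > 1 and scored[1][0] < best_mm + min_distance:
        return None
    return best

fixed = ["AAAAA", "CCCCC"]
for umi in ["AAAAA", "AAGTG", "AACCA"]:
    print(umi, "->", correct_umi(umi, fixed, max_mismatches=3, min_distance=2))
```

Run against the two fixed UMIs above, the sketch accepts AAAAA and AAGTG as AAAAA and rejects AACCA.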
Specifying UMIs
The set of fixed UMIs may be specified on the command line using --umis umi1 umi2 ... or via
one or more files of UMIs with a single sequence per line using --umi-files umis.txt more_umis.txt.
If there are multiple UMIs per template, leading to hyphenated UMI tags, the values for the fixed
UMIs should be single, non-hyphenated UMIs (e.g. if a record has RX:Z:ACGT-GGCA, you would use
--umis ACGT GGCA).
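As a rough illustration of how the two sources of fixed UMIs combine, the sketch below merges sequences given directly with sequences read from files (one per line) and checks the constraints described earlier. The function name `load_fixed_umis` is hypothetical, and upper-casing and skipping blank lines are assumptions of this sketch rather than documented behaviour.

```python
from pathlib import Path
from typing import Iterable, Set

def load_fixed_umis(
    umis: Iterable[str] = (), umi_files: Iterable[str] = ()
) -> Set[str]:
    """Merge UMIs given directly with UMIs listed one per line in files."""
    fixed = {u.strip().upper() for u in umis if u.strip()}
    for path in umi_files:
        for line in Path(path).read_text().splitlines():
            if line.strip():  # assumption: blank lines are ignored
                fixed.add(line.strip().upper())
    if not fixed:
        raise ValueError("no fixed UMIs were provided")
    # Fixed UMIs are single sequences even when the RX tag is hyphenated.
    if any("-" in u for u in fixed):
        raise ValueError("fixed UMIs must not contain hyphens")
    # All fixed UMIs must share one length.
    if len({len(u) for u in fixed}) != 1:
        raise ValueError("all fixed UMIs must be the same length")
    return fixed

# For a record with RX:Z:ACGT-GGCA, the fixed set holds the two single UMIs.
print(sorted(load_fixed_umis(umis=["ACGT", "GGCA"])))
```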
Original UMI Storage
Records which have their UMIs corrected (i.e. the UMI is not identical to one of the expected
UMIs but is close enough to be corrected) will by default have their original UMI stored in the
OX tag. This can be disabled with the --dont-store-original-umis option.
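The tag handling described here can be pictured with a small sketch in which a plain dictionary stands in for a BAM record's tags. The function `apply_correction` and its `store_original` argument are hypothetical; `store_original=False` corresponds to passing --dont-store-original-umis.

```python
from typing import Dict

def apply_correction(
    tags: Dict[str, str], corrected: str, store_original: bool = True
) -> Dict[str, str]:
    """Return a copy of the tags with RX corrected and, optionally, OX set."""
    original = tags["RX"]
    updated = dict(tags)
    # Only records whose UMI actually changed get the original stored.
    if corrected != original:
        updated["RX"] = corrected
        if store_original:
            updated["OX"] = original
    return updated

print(apply_correction({"RX": "AAGTG"}, "AAAAA"))
print(apply_correction({"RX": "AAAAA"}, "AAAAA"))
print(apply_correction({"RX": "AAGTG"}, "AAAAA", store_original=False))
```

A record whose UMI already matches an expected UMI is left unchanged and receives no OX tag.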
Arguments
| Flag | Description | Default |
|---|---|---|
| -i, --input <INPUT> | Input BAM file | required |
| -o, --output <OUTPUT> | Output BAM file | required |
| --async-reader <ASYNC_READER> | Enable async userspace prefetch on the input BAM | false |
| -r, --rejects <REJECTS> | Optional output BAM file for rejected reads | |
| -M, --metrics <METRICS> | Optional output path for metrics TSV file | |
| --max-mismatches <MAX_MISMATCHES> | Maximum number of mismatches allowed | 2 |
| -d, --min-distance <MIN_DISTANCE_DIFF> | Minimum difference between best and second-best match | required |
| -u, --umis <UMIS> | Fixed UMI sequences (can be specified multiple times) | |
| -U, --umi-files <UMI_FILES> | Files containing UMI sequences, one per line | |
| --dont-store-original-umis <DONT_STORE_ORIGINAL_UMIS> | Don’t store original UMIs in a separate tag | false |
| --cache-size <CACHE_SIZE> | Size of the LRU cache for UMI matching | 100000 |
| --min-corrected <MIN_CORRECTED> | Minimum fraction of reads that must pass correction | |
| --revcomp <REVCOMP> | Reverse complement UMIs before matching | false |
| --threads <THREADS> | Number of threads for the multi-threaded pipeline | |
| --compression-level <COMPRESSION_LEVEL> | Compression level for output BAM (1-12) | 1 |
| --scheduler <SCHEDULER> | Scheduler strategy for thread work assignment | balanced-chase-drain |
| --pipeline-stats <PIPELINE_STATS> | Print detailed pipeline statistics at completion | false |
| --deadlock-timeout <DEADLOCK_TIMEOUT> | Timeout in seconds for deadlock detection (default: 10, 0 = disabled) | 10 |
| --deadlock-recover <DEADLOCK_RECOVER> | Enable automatic deadlock recovery (default: false, detection only) | false |
| --queue-memory <QUEUE_MEMORY> | Pipeline queue memory limit per thread (default) or total | 768 |
| --queue-memory-per-thread <QUEUE_MEMORY_PER_THREAD> | Interpret --queue-memory as per-thread (true, default) or total (false) | true |
| --queue-memory-limit-mb <QUEUE_MEMORY_LIMIT_MB> | DEPRECATED: Use --queue-memory instead. Memory limit for pipeline queues in megabytes | |