correct

Category: UMI EXTRACTION

Correct UMIs in a BAM file to a fixed set of UMIs

Description

Corrects UMIs stored in BAM files when a set of fixed UMIs is in use.

If the set of UMIs used in an experiment is known and is a subset of the possible randomers of the same length, it is possible to error-correct UMIs prior to grouping reads by UMI. This tool takes an input BAM with UMIs in the RX tag and a set of known UMIs (either on the command line or in a file) and produces:

A new BAM with corrected UMIs written to the RX tag
Optionally a set of metrics about the representation of each UMI in the set
Optionally a second BAM file of reads whose UMIs could not be corrected within the specified parameters

All of the fixed UMIs must be of the same length, and all UMIs in the BAM file must also have the same length. Multiple UMIs that are concatenated with hyphens (e.g. AACCAGT-AGGTAGA) are split apart, corrected individually and then re-assembled. A read is accepted only if all the UMIs can be corrected.
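
The split, correct, and re-assemble behaviour described above can be sketched roughly as follows. This is an illustrative sketch only, not the tool's implementation: `correct_tag` and the `correct_one` callable are hypothetical names, and `correct_one` is assumed to return the corrected UMI, or `None` when a UMI cannot be corrected.

```python
from typing import Callable, Optional


def correct_tag(rx_value: str, correct_one: Callable[[str], Optional[str]]) -> Optional[str]:
    """Correct every UMI in a (possibly hyphenated) RX value.

    `correct_one` is a hypothetical per-UMI corrector that returns the
    corrected UMI, or None if the UMI cannot be corrected.
    """
    # Split "AACCAGT-AGGTAGA" into its individual UMIs.
    parts = rx_value.split("-")
    corrected = [correct_one(part) for part in parts]
    # The read is accepted only if every UMI was corrected.
    if any(umi is None for umi in corrected):
        return None
    # Re-assemble in the original order.
    return "-".join(corrected)


# Usage with a trivial exact-match corrector (an assumption made for brevity).
known = {"AACCAGT", "AGGTAGA"}
exact = lambda umi: umi if umi in known else None
print(correct_tag("AACCAGT-AGGTAGA", exact))
print(correct_tag("AACCAGT-TTTTTTT", exact))
```

The second call returns `None` because one of the two UMIs cannot be corrected, so the whole read would be rejected.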

Correction Parameters

Correction is controlled by two parameters that are applied per-UMI:

--max-mismatches controls how many mismatches (no-calls are counted as mismatches) are tolerated between a UMI as read and a fixed UMI
--min-distance controls how many more mismatches the next best hit must have

For example, with two fixed UMIs AAAAA and CCCCC and --max-mismatches=3 and --min-distance=2:

AAAAA would match to AAAAA
AAGTG would match to AAAAA with three mismatches because CCCCC has five mismatches and 5 >= 3 + 2
AACCA would be rejected because it is 2 mismatches to AAAAA and 3 to CCCCC and 3 < 2 + 2
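
The worked example can be reproduced with a short sketch of the matching rule. The function names here are hypothetical and the code is only one reading of the two parameters described above; in particular, the handling of a single fixed UMI (where there is no next best hit) is an assumption.

```python
from typing import Optional, Sequence


def count_mismatches(observed: str, fixed: str) -> int:
    """Count differing positions; a no-call (N) in the observed UMI is a mismatch."""
    if len(observed) != len(fixed):
        raise ValueError("UMIs must all have the same length")
    return sum(1 for o, f in zip(observed, fixed) if o != f or o == "N")


def find_best_match(
    observed: str,
    fixed_umis: Sequence[str],
    max_mismatches: int,
    min_distance: int,
) -> Optional[str]:
    """Return the fixed UMI that `observed` corrects to, or None if rejected."""
    scored = sorted((count_mismatches(observed, umi), umi) for umi in fixed_umis)
    best_mismatches, best_umi = scored[0]
    # The best hit must be within --max-mismatches.
    if best_mismatches > max_mismatches:
        return None
    # The next best hit must be at least --min-distance mismatches worse.
    # Assumption: with a single fixed UMI this check is skipped.
    if len(scored) > 1 and scored[1][0] < best_mismatches + min_distance:
        return None
    return best_umi


fixed = ["AAAAA", "CCCCC"]
for umi in ["AAAAA", "AAGTG", "AACCA"]:
    print(umi, find_best_match(umi, fixed, max_mismatches=3, min_distance=2))
```

With these inputs, AAAAA and AAGTG resolve to AAAAA, while AACCA is rejected and the function returns `None`.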

Specifying UMIs

The set of fixed UMIs may be specified on the command line using --umis umi1 umi2 ... or via one or more files of UMIs with a single sequence per line using --umi-files umis.txt more_umis.txt. If there are multiple UMIs per template, leading to hyphenated UMI tags, the values for the fixed UMIs should be single, non-hyphenated UMIs (e.g. if a record has RX:Z:ACGT-GGCA, you would use --umis ACGT GGCA).
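
As a rough illustration of how the two sources of fixed UMIs could be combined and validated, consider the sketch below. `load_fixed_umis` is a hypothetical helper, not part of the tool; skipping blank lines, removing duplicates, and raising on hyphenated entries are assumptions made for the example.

```python
from pathlib import Path
from typing import Iterable, List


def load_fixed_umis(umis: Iterable[str] = (), umi_files: Iterable[str] = ()) -> List[str]:
    """Merge UMIs given directly with UMIs read from files (one per line)."""
    collected = [umi.strip() for umi in umis]
    for path in umi_files:
        for line in Path(path).read_text().splitlines():
            line = line.strip()
            if line:  # assumption: blank lines are ignored
                collected.append(line)

    # Drop duplicates while preserving the order in which UMIs were given.
    unique = list(dict.fromkeys(collected))
    if not unique:
        raise ValueError("at least one fixed UMI is required")
    # Fixed UMIs are single sequences: ACGT and GGCA, never ACGT-GGCA.
    if any("-" in umi for umi in unique):
        raise ValueError("fixed UMIs must not be hyphenated")
    if len({len(umi) for umi in unique}) != 1:
        raise ValueError("all fixed UMIs must have the same length")
    return unique


print(load_fixed_umis(umis=["ACGT", "GGCA", "ACGT"]))
```

For a record with `RX:Z:ACGT-GGCA`, the list passed in would contain `ACGT` and `GGCA` as separate entries.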

Original UMI Storage

Records which have their UMIs corrected (i.e. the UMI is not identical to one of the expected UMIs but is close enough to be corrected) will by default have their original UMI stored in the OX tag. This can be disabled with the --dont-store-original-umis option.
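
The tag bookkeeping described above can be pictured with a small sketch that models a record's tags as a plain dictionary. `apply_correction` and its `store_original` parameter are hypothetical stand-ins for the tool's behaviour and for the effect of `--dont-store-original-umis`; real BAM records would be handled through a BAM library rather than a dict.

```python
from typing import Dict


def apply_correction(tags: Dict[str, str], corrected_umi: str, store_original: bool = True) -> Dict[str, str]:
    """Return a copy of `tags` with RX corrected and, optionally, OX set."""
    original = tags["RX"]
    updated = dict(tags)
    # Only records whose UMI actually changed get an OX tag.
    if corrected_umi != original:
        updated["RX"] = corrected_umi
        if store_original:
            updated["OX"] = original
    return updated


print(apply_correction({"RX": "AAGTG"}, "AAAAA"))
print(apply_correction({"RX": "AAAAA"}, "AAAAA"))
print(apply_correction({"RX": "AAGTG"}, "AAAAA", store_original=False))
```

Only the first call adds an OX entry: the second record already carried an expected UMI, and the third mimics running with original-UMI storage disabled.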

Arguments

Flag	Description	Default
`-i, --input <INPUT>`	Input BAM file	required
`-o, --output <OUTPUT>`	Output BAM file	required
`--async-reader <ASYNC_READER>`	Enable async userspace prefetch on the input BAM	`false`
`-r, --rejects <REJECTS>`	Optional output BAM file for rejected reads
`-M, --metrics <METRICS>`	Optional output path for metrics TSV file
`--max-mismatches <MAX_MISMATCHES>`	Maximum number of mismatches allowed	`2`
`-d, --min-distance <MIN_DISTANCE_DIFF>`	Minimum difference between best and second-best match	required
`-u, --umis <UMIS>`	Fixed UMI sequences (can be specified multiple times)
`-U, --umi-files <UMI_FILES>`	Files containing UMI sequences, one per line
`--dont-store-original-umis <DONT_STORE_ORIGINAL_UMIS>`	Don’t store original UMIs in a separate tag	`false`
`--cache-size <CACHE_SIZE>`	Size of the LRU cache for UMI matching	`100000`
`--min-corrected <MIN_CORRECTED>`	Minimum fraction of reads that must pass correction
`--revcomp <REVCOMP>`	Reverse complement UMIs before matching	`false`
`--threads <THREADS>`	Number of threads for the multi-threaded pipeline
`--compression-level <COMPRESSION_LEVEL>`	Compression level for output BAM (1-12)	`1`
`--scheduler <SCHEDULER>`	Scheduler strategy for thread work assignment	`balanced-chase-drain`
`--pipeline-stats <PIPELINE_STATS>`	Print detailed pipeline statistics at completion	`false`
`--deadlock-timeout <DEADLOCK_TIMEOUT>`	Timeout in seconds for deadlock detection (default: 10, 0 = disabled)	`10`
`--deadlock-recover <DEADLOCK_RECOVER>`	Enable automatic deadlock recovery (default: false, detection only)	`false`
`--queue-memory <QUEUE_MEMORY>`	Pipeline queue memory limit per thread (default) or total	`768`
`--queue-memory-per-thread <QUEUE_MEMORY_PER_THREAD>`	Interpret --queue-memory as per-thread (true, default) or total (false)	`true`
`--queue-memory-limit-mb <QUEUE_MEMORY_LIMIT_MB>`	DEPRECATED: Use --queue-memory instead. Memory limit for pipeline queues in megabytes