Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

simplex

Category: CONSENSUS

Call simplex consensus sequences from UMI-grouped reads

Description

Calls consensus sequences from reads with the same unique molecular tag.

Reads with the same unique molecular tag are examined base-by-base to assess the likelihood of each base in the source molecule. The likelihood model is as follows:

  1. First, the base qualities are adjusted. The base qualities are assumed to represent the probability of a sequencing error (i.e. the sequencer observed the wrong base present on the cluster/flowcell/well). The base quality scores are converted to probabilities incorporating a probability representing the chance of an error from the time the unique molecular tags were integrated to just prior to sequencing. The resulting probability is the error rate of all processes from right after integrating the molecular tag through to the end of sequencing.
  2. Next, a consensus sequence is called for all reads with the same unique molecular tag base-by-base. For a given base position in the reads, the likelihoods that an A, C, G, or T is the base for the underlying source molecule respectively are computed by multiplying the likelihood of each read observing the base position being considered. The probability of error (from 1.) is used when the observed base does not match the hypothesized base for the underlying source molecule, while one minus that probability is used otherwise. The computed likelihoods are normalized by dividing them by the sum of all four likelihoods to produce a posterior probability, namely the probability that the source molecule was an A, C, G, or T from just after integrating molecular tag through to sequencing, given the observations. The base with the maximum posterior probability as the consensus call, and the posterior probability is used as its raw base quality.
  3. Finally, the consensus raw base quality is modified by incorporating the probability of an error prior to integrating the unique molecular tags. Therefore, the probability used for the final consensus base quality is the posterior probability of the source molecule having the consensus base given the observed reads with the same molecular tag, all the way from sample extraction and through sample and library preparation, through preparing the library for sequencing (e.g. amplification, target selection), and finally, through sequencing.

This tool assumes that reads with the same tag are grouped together (consecutive in the file). Also, this tool calls each end of a pair independently, and does not jointly call bases that overlap within a pair. Insertion or deletion errors in the reads are not considered in the consensus model.

The consensus reads produced are unaligned, due to the difficulty and error-prone nature of inferring the consensus alignment. Consensus reads should therefore be aligned after, which should not be too expensive as likely there are far fewer consensus reads than input raw reads.

Particular attention should be paid to setting the –min-reads parameter as this can have a dramatic effect on both results and runtime. For libraries with low duplication rates (e.g. 100-300X exomes libraries) in which it is desirable to retain singleton reads while making consensus reads from sets of duplicates, –min-reads=1 is appropriate. For libraries with high duplication rates where it is desirable to only produce consensus reads supported by 2+ reads to allow error correction, –min-reads=2 or higher is appropriate. After generation, consensus reads can be further filtered using the filter tool. As such it is always safe to run with –min-reads=1 and filter later, but filtering at this step can improve performance significantly.

Consensus reads have a number of additional optional tags set in the resulting BAM file. The tags break down into those that are single-valued per read:

consensus depth [cD] (int) : the maximum depth of raw reads at any point in the consensus read consensus min depth [cM] (int) : the minimum depth of raw reads at any point in the consensus read consensus error rate [cE] (float): the fraction of bases in raw reads disagreeing with the final consensus calls

And those that have a value per base:

consensus depth [cd] (short[]): the count of bases contributing to the consensus read at each position consensus errors [ce] (short[]): the number of bases from raw reads disagreeing with the final consensus base

The per base depths and errors are both capped at 32,767. In all cases no-calls (Ns) and bases below the –min-input-base-quality are not counted in tag value calculations.

Arguments

FlagDescriptionDefault
-i, --input <INPUT>Input BAM filerequired
-o, --output <OUTPUT>Output BAM filerequired
--async-reader <ASYNC_READER>Enable async userspace prefetch on the input BAMfalse
-r, --rejects <REJECTS>Optional output BAM file for rejected reads
-s, --stats <STATS>Optional output file for statistics
-p, --read-name-prefix <READ_NAME_PREFIX>Prefix for consensus read names
-R, --read-group-id <READ_GROUP_ID>Read group ID for consensus readsA
-1, --error-rate-pre-umi <ERROR_RATE_PRE_UMI>Phred-scaled error rate prior to UMI integration45
-2, --error-rate-post-umi <ERROR_RATE_POST_UMI>Phred-scaled error rate post UMI integration40
-m, --min-input-base-quality <MIN_INPUT_BASE_QUALITY>Minimum base quality in raw reads to use for consensus10
-B, --output-per-base-tags <OUTPUT_PER_BASE_TAGS>Produce per-base tags (cd, ce) in addition to per-read tagstrue
--trim <TRIM>Quality-trim reads before consensus calling (removes low-quality bases from ends)false
--min-consensus-base-quality <MIN_CONSENSUS_BASE_QUALITY>Minimum consensus base quality (output consensus bases below this are masked to N)2
--consensus-call-overlapping-bases <CONSENSUS_CALL_OVERLAPPING_BASES>Consensus call overlapping bases in read pairs before UMI consensus callingtrue
--threads <THREADS>Number of threads for the multi-threaded pipeline
--compression-level <COMPRESSION_LEVEL>Compression level for output BAM (1-12)1
-M, --min-reads <MIN_READS>Minimum number of reads to produce a consensus (required, no default) Matches fgbio’s CallMolecularConsensusReads which requires this argumentrequired
--max-reads <MAX_READS>Maximum reads to use per tag family (downsample if exceeded)
--scheduler <SCHEDULER>Scheduler strategy for thread work assignmentbalanced-chase-drain
--pipeline-stats <PIPELINE_STATS>Print detailed pipeline statistics at completionfalse
--deadlock-timeout <DEADLOCK_TIMEOUT>Timeout in seconds for deadlock detection (default: 10, 0 = disabled)10
--deadlock-recover <DEADLOCK_RECOVER>Enable automatic deadlock recovery (default: false, detection only)false
--queue-memory <QUEUE_MEMORY>Pipeline queue memory limit per thread (default) or total768
--queue-memory-per-thread <QUEUE_MEMORY_PER_THREAD>Interpret –queue-memory as per-thread (true, default) or total (false)true
--queue-memory-limit-mb <QUEUE_MEMORY_LIMIT_MB>DEPRECATED: Use –queue-memory instead. Memory limit for pipeline queues in megabytes
--methylation-mode <METHYLATION_MODE>Methylation-aware consensus calling mode. EM-Seq: C→T at ref-C = unmethylated (enzymatic conversion); TAPs: C→T at ref-C = methylated. Emits MM/ML methylation tags and cu/ct per-base count tags on consensus reads. Requires –ref
--ref <REFERENCE>Path to the reference FASTA file (required when –methylation-mode is set)