Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

filter

Category: POST-CONSENSUS

Filter consensus reads based on quality metrics

Description

Filters consensus reads generated by simplex or duplex commands. Two kinds of filtering are performed:

  1. Masking/filtering of individual bases in reads
  2. Filtering out of reads (i.e. not writing them to the output file)

Base-level filtering/masking is only applied if per-base tags are present (see duplex and simplex for descriptions of these tags). Read-level filtering is always applied. When filtering reads, secondary alignments and supplementary records may be removed independently if they fail one or more filters; if either R1 or R2 primary alignments fail a filter then all records for the template will be filtered out.

The filters applied are as follows:

  1. Reads with fewer than min-reads contributing reads are filtered out
  2. Reads with an average consensus error rate higher than max-read-error-rate are filtered out
  3. Reads with mean base quality of the consensus read, prior to any masking, less than min-mean-base-quality are filtered out (if specified)
  4. Bases with quality scores below min-base-quality are masked to Ns
  5. Bases with fewer than min-reads contributing raw reads are masked to Ns
  6. Bases with a consensus error rate (defined as the fraction of contributing reads that voted for a different base than the consensus call) higher than max-base-error-rate are masked to Ns
  7. Reads with a fraction or count of Ns higher than max-no-call-fraction after per-base filtering are filtered out.

When filtering single-umi consensus reads generated by simplex, a single value each should be supplied for –min-reads, –max-read-error-rate, and –max-base-error-rate.

When filtering duplex consensus reads generated by duplex, each of the three parameters may independently take 1-3 values. For example:

fgumi filter … –min-reads 10,5,3 –max-base-error-rate 0.1

In each case if fewer than three values are supplied, the last value is repeated (i.e. 80,40 -> 80 40 40 and 0.1 -> 0.1 0.1 0.1). The first value applies to the final consensus read, the second value to one single-strand consensus, and the last value to the other single-strand consensus. It is required that if values two and three differ, the more stringent value comes earlier.

In order to correctly filter reads in or out by template, the input BAM must be either queryname sorted or query grouped. If your BAM is not already in an appropriate order, this can be done in streaming fashion with:

fgumi sort -i in.bam –order queryname | fgumi filter -i /dev/stdin …

The output sort order may be specified with –sort-order. If not given, then the output will be in the same order as input.

The –reverse-per-base-tags option controls whether per-base tags should be reversed before being used on reads marked as being mapped to the negative strand. This is necessary if the reads have been mapped and the bases/quals reversed but the consensus tags have not. If true, the tags written to the output BAM will be reversed where necessary in order to line up with the bases and quals.

Arguments

FlagDescriptionDefault
-i, --input <INPUT>Input BAM filerequired
-o, --output <OUTPUT>Output BAM filerequired
--async-reader <ASYNC_READER>Enable async userspace prefetch on the input BAMfalse
-r, --ref <REFERENCE>Reference FASTA file for NM/UQ/MD tag regeneration. If not provided, alignment tag regeneration (NM/UQ/MD) is skipped
-M, --min-reads <MIN_READS>Minimum number of raw reads to support a single-strand consensus base/read. For duplex: provide 1-3 values for [duplex, single-strand consensus, single-strand consensus]
-E, --max-read-error-rate <MAX_READ_ERROR_RATE>Maximum raw read error rate for a single-strand consensus base/read (0.0-1.0). For duplex: provide 1-3 values for [duplex, single-strand consensus, single-strand consensus]0.025
-e, --max-base-error-rate <MAX_BASE_ERROR_RATE>Maximum base error rate across raw reads (0.0-1.0). For duplex: provide 1-3 values for [duplex, AB consensus, BA consensus]0.1
-N, --min-base-quality <MIN_BASE_QUALITY>Minimum base quality score (after masking)
-q, --min-mean-base-quality <MIN_MEAN_BASE_QUALITY>Minimum mean base quality across the read (after masking)
-n, --max-no-call-fraction <MAX_NO_CALL_FRACTION>Maximum no-calls (N bases) allowed in a read0.2
-R, --reverse-per-base-tags <REVERSE_PER_BASE_TAGS>Reverse per-base tags for negative strand readsfalse
--threads <THREADS>Number of threads for the multi-threaded pipeline
--filter-by-template <FILTER_BY_TEMPLATE>Filter templates together (all primary reads must pass)true
--rejects <REJECTS>Optional output BAM file for rejected reads
--stats <STATS>Optional output file for filtering statistics
-s, --require-single-strand-agreement <REQUIRE_SINGLE_STRAND_AGREEMENT>Require single-strand agreement for duplex consensus (mask bases where AB and BA disagree)false
--min-methylation-depth <MIN_METHYLATION_DEPTH>Minimum methylation depth (cu+ct) to keep a base call (EM-Seq/TAPs). For duplex: provide 1-3 values for [duplex, AB consensus, BA consensus]
--require-strand-methylation-agreement <REQUIRE_STRAND_METHYLATION_AGREEMENT>Require strand methylation agreement at CpG sites for duplex consensus (EM-Seq/TAPs). Masks both positions of a CpG dinucleotide when top and bottom strands disagree on methylation status. Requires –reffalse
--min-conversion-fraction <MIN_CONVERSION_FRACTION>Minimum bisulfite/enzymatic conversion fraction at non-CpG cytosines. For EM-Seq: checks converted/total >= threshold (high conversion = good). For TAPs: checks unconverted/total >= threshold (low conversion = good). Requires –ref and –methylation-mode. Uses cu/ct tags
--methylation-mode <METHYLATION_MODE>Methylation mode for conversion fraction filtering. Required when using –min-conversion-fraction. Controls whether the conversion fraction check uses converted (em-seq) or unconverted (taps) counts as the numerator
--compression-level <COMPRESSION_LEVEL>Compression level for output BAM (0-12)1
--scheduler <SCHEDULER>Scheduler strategy for thread work assignmentbalanced-chase-drain
--pipeline-stats <PIPELINE_STATS>Print detailed pipeline statistics at completionfalse
--deadlock-timeout <DEADLOCK_TIMEOUT>Timeout in seconds for deadlock detection (default: 10, 0 = disabled)10
--deadlock-recover <DEADLOCK_RECOVER>Enable automatic deadlock recovery (default: false, detection only)false
--queue-memory <QUEUE_MEMORY>Pipeline queue memory limit per thread (default) or total768
--queue-memory-per-thread <QUEUE_MEMORY_PER_THREAD>Interpret –queue-memory as per-thread (true, default) or total (false)true
--queue-memory-limit-mb <QUEUE_MEMORY_LIMIT_MB>DEPRECATED: Use –queue-memory instead. Memory limit for pipeline queues in megabytes