duplex-metrics

Category: POST-CONSENSUS

Collect QC metrics for duplex consensus reads

Description

Collects a suite of metrics to QC duplex sequencing data.

Inputs

The input to this tool must be a BAM file that is either:

The exact BAM output by the group tool (in the sort-order it was produced in)
A BAM file that has MI tags present on all reads (usually set by group) and has been sorted into template-coordinate order

Calculation of metrics may be restricted to a set of regions using the --intervals parameter. This can significantly affect results as off-target reads in duplex sequencing experiments often have very different properties than on-target reads due to the lack of enrichment.

Several metrics are calculated related to the fraction of tag families that have duplex coverage. The definition of “duplex” is controlled by the --min-ab-reads and --min-ba-reads parameters. The default is to treat any tag family with at least one observation of each strand as a duplex, but this could be made more stringent, e.g. by setting --min-ab-reads=3 --min-ba-reads=3.
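The duplex threshold described above can be sketched as a simple predicate. This is a minimal illustration, not the tool's implementation: the function names and the example family counts are hypothetical, and the sketch compares the two strand counts directly against their thresholds. Whether the tool first orders the counts (for example, so that AB is always the larger sub-family) is not stated here and is not modeled.

```python
def is_duplex(ab_reads: int, ba_reads: int, min_ab: int = 1, min_ba: int = 1) -> bool:
    """Return True if a tag family meets both per-strand minimums.

    min_ab and min_ba mirror --min-ab-reads and --min-ba-reads (default 1 each).
    """
    return ab_reads >= min_ab and ba_reads >= min_ba


def duplex_fraction(families, min_ab: int = 1, min_ba: int = 1) -> float:
    """Fraction of (ab_reads, ba_reads) families that qualify as duplexes."""
    if not families:
        return 0.0
    n_duplex = sum(1 for ab, ba in families if is_duplex(ab, ba, min_ab, min_ba))
    return n_duplex / len(families)


# Hypothetical (AB count, BA count) pairs for four tag families.
families = [(5, 3), (4, 0), (1, 1), (3, 2)]

print(duplex_fraction(families))        # 0.75 with the defaults
print(duplex_fraction(families, 3, 3))  # 0.25 with --min-ab-reads=3 --min-ba-reads=3
```

Raising the thresholds only ever reclassifies families from duplex to non-duplex, so the duplex fraction can stay the same or fall, never rise.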

Outputs

The following output files are produced:

<output>.family_sizes.txt: metrics on the frequency of different types of families of different sizes
<output>.duplex_family_sizes.txt: metrics on the frequency of duplex tag families by the number of observations from each strand
<output>.duplex_yield_metrics.txt: summary QC metrics produced using 5%, 10%, 15%…100% of the data
<output>.umi_counts.txt: metrics on the frequency of observations of UMIs within reads and tag families
<output>.duplex_umi_counts.txt: (optional) metrics on the frequency of observations of duplex UMIs within reads and tag families. This file is only produced if the --duplex-umi-counts option is used, because tracking all pairs of UMIs seen requires significantly more memory when a large number of UMI sequences are present.
<output>.duplex_qc.pdf: (optional) a series of plots generated from the preceding metrics files for visualization. This file is only produced if R is available with the required packages (ggplot2 and scales). Use --description to customize plot titles.
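The yield metrics file reports results at 5%, 10%, 15%…100% of the data. One common way to produce such nested subsamples is to hash each read name to a fixed position in [0, 1) and keep the reads whose position falls below the target fraction. The sketch below illustrates that idea only; it is an assumption about how such subsampling could work, not a description of what the tool does internally, and all names in it are hypothetical.

```python
import hashlib


def fractions(step_percent: int = 5):
    """Sampling fractions 0.05, 0.10, ..., 1.00."""
    return [p / 100 for p in range(step_percent, 101, step_percent)]


def sample_position(read_name: str) -> float:
    """Map a read name to a deterministic position in [0, 1)."""
    digest = hashlib.md5(read_name.encode("utf-8")).digest()
    # 32 bits divided by 2**32 is exactly representable and always below 1.0.
    return int.from_bytes(digest[:4], "big") / 2**32


def subsample(read_names, fraction: float):
    """Keep the reads whose hashed position is below the target fraction."""
    return [name for name in read_names if sample_position(name) < fraction]


# Hypothetical read names; each larger fraction is a superset of the smaller ones.
names = [f"read{i}" for i in range(1000)]
for fraction in fractions():
    kept = subsample(names, fraction)
    print(f"{fraction:.2f}\t{len(kept)}")
```

Because each read's position is fixed, the 10% sample always contains the 5% sample, which makes the per-fraction metrics comparable as a yield curve.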

Within the metrics files the prefixes CS, SS and DS are used to mean:

CS: tag families where membership is defined solely on matching genome coordinates and strand
SS: single-stranded tag families where membership is defined by genome coordinates, strand and UMI; i.e. 50/A and 50/B are considered different tag families
DS: double-stranded tag families where membership is collapsed across single-stranded tag families from the same double-stranded source molecule; i.e. 50/A and 50/B become one family
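The three family definitions above can be illustrated by counting the same reads under three different keys. This is a simplified sketch under stated assumptions: each read is reduced to a hypothetical coordinate-and-strand key plus its MI tag value, MI identifiers are assumed to be unique across positions, and the /A and /B suffixes are assumed to mark the two strands of one source molecule, as in the 50/A and 50/B example.

```python
from collections import Counter

# Hypothetical reads: (coordinate-and-strand key, MI tag value).
reads = [
    ("chr1:100-250:+", "50/A"),
    ("chr1:100-250:+", "50/A"),
    ("chr1:100-250:+", "50/B"),
    ("chr1:100-250:+", "51/A"),
    ("chr1:900-1040:-", "52/A"),
]


def family_sizes(reads):
    """Count reads per family under the CS, SS and DS definitions."""
    # CS: coordinates and strand only.
    cs = Counter(coord for coord, _ in reads)
    # SS: the full MI value, so 50/A and 50/B are different families.
    ss = Counter(mi for _, mi in reads)
    # DS: the MI value without its strand suffix, so 50/A and 50/B merge.
    ds = Counter(mi.rsplit("/", 1)[0] for _, mi in reads)
    return cs, ss, ds


cs, ss, ds = family_sizes(reads)
print(len(cs), len(ss), len(ds))  # 2 CS families, 4 SS families, 3 DS families
```

The same five reads therefore form two CS families, four SS families and three DS families, which is why the metrics files report sizes separately for each prefix.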

Arguments

Flag	Description	Default
`-i, --input <INPUT>`	Input BAM file (UMI-grouped, from `group`)	required
`-o, --output <OUTPUT>`	Output prefix for metrics files	required
`--min-ab-reads <MIN_AB_READS>`	Minimum AB reads to call a duplex	`1`
`--min-ba-reads <MIN_BA_READS>`	Minimum BA reads to call a duplex	`1`
`--duplex-umi-counts <DUPLEX_UMI_COUNTS>`	Collect duplex UMI counts (memory intensive)	`false`
`-l, --intervals <INTERVALS>`	Optional intervals file to restrict analysis (BED or Picard interval list format)
`--description <DESCRIPTION>`	Optional sample name or description for PDF plot titles
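
The arguments above can be assembled into a command line as follows. This is a sketch only: it assumes the tool is invoked as `fgumi duplex-metrics`, the helper name and the file names are hypothetical, and the command is built and printed rather than executed.

```python
import shlex


def build_command(input_bam, output_prefix, min_ab_reads=1, min_ba_reads=1,
                  intervals=None, description=None):
    """Build an argument list using only the flags listed in the table above."""
    cmd = [
        "fgumi", "duplex-metrics",  # assumed executable and subcommand
        "-i", input_bam,
        "-o", output_prefix,
        f"--min-ab-reads={min_ab_reads}",
        f"--min-ba-reads={min_ba_reads}",
    ]
    if intervals is not None:
        cmd += ["--intervals", intervals]
    if description is not None:
        cmd += ["--description", description]
    return cmd


# Hypothetical file names; stricter duplex definition restricted to target regions.
cmd = build_command("grouped.bam", "sample1", min_ab_reads=3, min_ba_reads=3,
                    intervals="targets.bed")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the executable name and paths have been confirmed for the local installation.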
