duplex-metrics
Category: POST-CONSENSUS
Collect QC metrics for duplex consensus reads
Description
Collects a suite of metrics to QC duplex sequencing data.
Inputs
The input to this tool must be a BAM file that is either:
- The exact BAM output by the group tool (in the sort order it was produced in)
- A BAM file that has MI tags present on all reads (usually set by group) and has been sorted into template-coordinate order
Calculation of metrics may be restricted to a set of regions using the --intervals parameter.
This can significantly affect results as off-target reads in duplex sequencing experiments often
have very different properties than on-target reads due to the lack of enrichment.
Several metrics are calculated related to the fraction of tag families that have duplex coverage.
The definition of “duplex” is controlled by the --min-ab-reads and --min-ba-reads parameters.
The default is to treat any tag family with at least one observation of each strand as a duplex,
but this could be made more stringent, e.g. by setting --min-ab-reads=3 --min-ba-reads=3.
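As a rough illustration of how these thresholds classify a tag family, the sketch below applies them to per-strand read counts. The function name `is_duplex` and its arguments are hypothetical and not part of the tool. The sketch also assumes that the AB and BA counts are each compared directly against their own threshold, which the description above does not spell out.

```python
def is_duplex(ab_reads: int, ba_reads: int,
              min_ab_reads: int = 1, min_ba_reads: int = 1) -> bool:
    """Return True if a tag family counts as a duplex under the thresholds.

    Hypothetical helper: it mirrors the documented defaults of
    --min-ab-reads and --min-ba-reads (1 and 1), not the tool's actual code.
    """
    return ab_reads >= min_ab_reads and ba_reads >= min_ba_reads


# (AB reads, BA reads) for three made-up tag families
families = [(5, 0), (2, 1), (4, 3)]

# Default definition: at least one observation of each strand
default_calls = [is_duplex(ab, ba) for ab, ba in families]

# More stringent definition, as with --min-ab-reads=3 --min-ba-reads=3
strict_calls = [is_duplex(ab, ba, min_ab_reads=3, min_ba_reads=3)
                for ab, ba in families]
```

With the defaults, the second and third families are duplexes; with both thresholds set to 3, only the third one is.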
Outputs
The following output files are produced:
- <output>.family_sizes.txt: metrics on the frequency of different types of families of different sizes
- <output>.duplex_family_sizes.txt: metrics on the frequency of duplex tag families by the number of observations from each strand
- <output>.duplex_yield_metrics.txt: summary QC metrics produced using 5%, 10%, 15%…100% of the data
- <output>.umi_counts.txt: metrics on the frequency of observations of UMIs within reads and tag families
- <output>.duplex_umi_counts.txt: (optional) metrics on the frequency of observations of duplex UMIs within reads and tag families. This file is only produced if the --duplex-umi-counts option is used, as it requires significantly more memory to track all pairs of UMIs seen when a large number of UMI sequences are present.
- <output>.duplex_qc.pdf: (optional) a series of plots generated from the preceding metrics files for visualization. This file is only produced if R is available with the required packages (ggplot2 and scales). Use --description to customize plot titles.
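A pipeline script may want to know which files to expect for a given output prefix. The following is a minimal sketch based only on the file names listed above; `expected_outputs` is a hypothetical helper, and its `pdf` argument is the caller's own guess as to whether R and the required packages are available.

```python
from typing import List


def expected_outputs(prefix: str, duplex_umi_counts: bool = False,
                     pdf: bool = False) -> List[str]:
    """List the output paths documented for a given output prefix.

    Hypothetical helper: it only joins the prefix with the documented
    suffixes and does not check that the files exist.
    """
    suffixes = [
        "family_sizes.txt",
        "duplex_family_sizes.txt",
        "duplex_yield_metrics.txt",
        "umi_counts.txt",
    ]
    if duplex_umi_counts:
        # Only written when --duplex-umi-counts is used
        suffixes.append("duplex_umi_counts.txt")
    if pdf:
        # Only written when R with ggplot2 and scales is available
        suffixes.append("duplex_qc.pdf")
    return [f"{prefix}.{suffix}" for suffix in suffixes]
```

For example, `expected_outputs("sample")` lists the four files that are always produced, starting with `sample.family_sizes.txt`.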
Within the metrics files the prefixes CS, SS and DS are used to mean:
- CS: tag families where membership is defined solely on matching genome coordinates and strand
- SS: single-stranded tag families where membership is defined by genome coordinates, strand and UMI; i.e. 50/A and 50/B are considered different tag families
- DS: double-stranded tag families where membership is collapsed across single-stranded tag families from the same double-stranded source molecule; i.e. 50/A and 50/B become one family
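The three family definitions can be sketched with a toy grouping. The reads, the coordinate keys and the helper `source_molecule` below are made up for illustration. The sketch assumes that reads from both strands of one source molecule share the same placeholder coordinate key, which simplifies how the tool handles coordinates and strand.

```python
from collections import Counter

# Made-up reads: (placeholder coordinate-and-strand key, MI tag value)
reads = [
    ("chr1:100-250", "50/A"),
    ("chr1:100-250", "50/A"),
    ("chr1:100-250", "50/B"),
    ("chr1:100-250", "51/A"),
    ("chr1:900-1040", "52/A"),
]


def source_molecule(mi: str) -> str:
    """Strip the /A or /B strand suffix, so 50/A and 50/B both map to 50."""
    return mi.split("/")[0]


# CS: coordinates and strand only
cs = Counter(pos for pos, _ in reads)

# SS: coordinates, strand and UMI; 50/A and 50/B stay separate
ss = Counter((pos, mi) for pos, mi in reads)

# DS: 50/A and 50/B collapse into one family
ds = Counter((pos, source_molecule(mi)) for pos, mi in reads)
```

In this toy input the five reads form 2 CS families, 4 SS families and 3 DS families, and each counter value is the size of the corresponding family.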
Arguments
| Flag | Description | Default |
|---|---|---|
| -i, --input <INPUT> | Input BAM file (UMI-grouped, from group) | required |
| -o, --output <OUTPUT> | Output prefix for metrics files | required |
| --min-ab-reads <MIN_AB_READS> | Minimum AB reads to call a duplex | 1 |
| --min-ba-reads <MIN_BA_READS> | Minimum BA reads to call a duplex | 1 |
| --duplex-umi-counts <DUPLEX_UMI_COUNTS> | Collect duplex UMI counts (memory intensive) | false |
| -l, --intervals <INTERVALS> | Optional intervals file to restrict analysis (BED or Picard interval list format) | |
| --description <DESCRIPTION> | Optional sample name or description for PDF plot titles | |