downsample
Category: UTILITIES
Downsample BAM by UMI family using streaming
Description
Downsample a BAM file by UMI family using a single-pass streaming algorithm.
This tool reads a BAM file whose reads carry MI tags, as produced by fgumi group (or fgbio GroupReadsByUmi). It uniformly samples UMI families and writes the kept reads directly to the output BAM file.
The input BAM must be in template-coordinate order, as indicated by the following @HD header fields:
- SO:unsorted (or not set)
- GO:query
- SS:unsorted:template-coordinate or SS:template-coordinate
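The header requirements above can be checked before running the tool. The following is a minimal sketch, not the tool's own validation logic: `is_template_coordinate` is a hypothetical helper that takes the raw `@HD` header line as text (for example, as printed by `samtools view -H`) and tests only the three fields listed above.

```python
def is_template_coordinate(hd_line: str) -> bool:
    """Return True if an @HD header line matches the fields listed above."""
    # Turn "@HD\tVN:1.6\tSO:unsorted" into {"VN": "1.6", "SO": "unsorted"}.
    # Splitting on the first ":" only keeps "unsorted:template-coordinate" whole.
    fields = dict(
        part.split(":", 1)
        for part in hd_line.rstrip("\n").split("\t")[1:]
        if ":" in part
    )
    sort_ok = fields.get("SO", "unsorted") == "unsorted"  # unsorted or not set
    group_ok = fields.get("GO") == "query"
    subsort_ok = fields.get("SS") in (
        "template-coordinate",
        "unsorted:template-coordinate",
    )
    return sort_ok and group_ok and subsort_ok


print(is_template_coordinate(
    "@HD\tVN:1.6\tSO:unsorted\tGO:query\tSS:unsorted:template-coordinate"
))  # True
print(is_template_coordinate("@HD\tVN:1.6\tSO:coordinate"))  # False
```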
The tool processes families in streaming fashion by grouping consecutive reads with the same MI tag value. For each family, a random decision is made based on the fraction parameter to either keep or reject all reads in that family.
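The per-family decision described above can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's implementation: reads are modeled as `(MI value, read name)` tuples rather than BAM records, `downsample_families` is a hypothetical name, and the random number generator and the exact comparison used by fgumi may differ.

```python
import random
from itertools import groupby
from typing import Iterable, Iterator, Optional, Tuple

Read = Tuple[str, str]  # (MI tag value, read name); a stand-in for a BAM record


def downsample_families(
    reads: Iterable[Read], fraction: float, seed: Optional[int] = None
) -> Iterator[Tuple[bool, Read]]:
    """Yield (kept, read) pairs, deciding once per run of equal MI values."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0.0, 1.0]")
    rng = random.Random(seed)
    # groupby merges only consecutive equal keys, so the input is read once
    # and at most one family is held at a time.
    for _mi, family in groupby(reads, key=lambda read: read[0]):
        keep = rng.random() < fraction  # one draw per family, not per read
        for read in family:
            yield keep, read


reads = [("1", "r1"), ("1", "r2"), ("2", "r3"), ("3", "r4"), ("3", "r5")]
for kept, read in downsample_families(reads, fraction=0.5, seed=42):
    print("keep" if kept else "reject", read)
```

Because the decision is drawn once per family, every read in a family shares the same outcome, and a fixed seed reproduces the same selection for the same input order.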
Example usage:

    fgumi downsample -i grouped.bam -o downsampled.bam -f 0.1 --seed 42
    fgumi downsample -i grouped.bam -o kept.bam -f 0.5 --rejects rejected.bam
    fgumi downsample -i grouped.bam -o kept.bam -f 0.1 --histogram-kept kept_hist.txt
Arguments
| Flag | Description | Default |
|---|---|---|
| `-i, --input <INPUT>` | Input BAM file | required |
| `-o, --output <OUTPUT>` | Output BAM file | required |
| `--async-reader <ASYNC_READER>` | Enable async userspace prefetch on the input BAM | false |
| `-f, --fraction <FRACTION>` | Fraction of UMI families to keep (0.0 exclusive to 1.0 inclusive) | required |
| `--rejects <REJECTS>` | Optional output BAM file for rejected reads | |
| `--seed <SEED>` | Random seed for reproducibility | |
| `--validate-mi-order <VALIDATE_MI_ORDER>` | Validate that MI tags appear in consecutive groups (error if seen non-consecutively) | false |
| `--histogram-kept <HISTOGRAM_KEPT>` | Output file for kept family size histogram | |
| `--histogram-rejected <HISTOGRAM_REJECTED>` | Output file for rejected family size histogram | |
| `--compression-level <COMPRESSION_LEVEL>` | Compression level for output BAM (0-12) | 1 |
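The kind of check that `--validate-mi-order` describes can be illustrated as follows. This is a sketch of the idea only, assuming the check amounts to detecting an MI value that reappears after its run of consecutive reads has ended; `first_out_of_order_mi` is a hypothetical helper and does not reflect how fgumi implements or reports the error.

```python
from typing import Iterable, Optional


def first_out_of_order_mi(mi_values: Iterable[str]) -> Optional[str]:
    """Return the first MI value that reappears after its run ended, else None."""
    finished = set()  # MI values whose consecutive run has already ended
    current = None    # MI value of the run being read
    for mi in mi_values:
        if mi == current:
            continue  # still inside the same family
        if mi in finished:
            return mi  # seen before, then interrupted: non-consecutive
        if current is not None:
            finished.add(current)
        current = mi
    return None


print(first_out_of_order_mi(["1", "1", "2", "3", "3"]))  # None
print(first_out_of_order_mi(["1", "2", "1"]))  # 1
```

In this sketch the `finished` set grows with the number of distinct families, so the check trades memory for the guarantee that no family was split.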