Getting Started
This guide walks through a basic fgumi workflow from FASTQ files to filtered consensus reads.
Prerequisites
- fgumi installed (see Installation)
- A reference genome FASTA (with BWA index)
- Paired-end FASTQ files with UMI sequences
Basic Workflow
1. Extract UMIs from FASTQ
Extract UMIs from FASTQ reads and create an unmapped BAM. The --read-structures argument tells fgumi where UMI bases are located in each read. See Read Structures for details.
fgumi extract \
--inputs R1.fastq.gz R2.fastq.gz \
--read-structures +T +M \
--output unaligned.bam \
--sample MySample \
--library MyLibrary
2. (Optional) Correct UMIs
If using a fixed set of known UMIs, correct sequencing errors:
fgumi correct \
--input unaligned.bam \
--output corrected.bam \
--umi-files umis.txt \
--min-distance 1
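How many errors a fixed UMI set can tolerate depends on how far apart its sequences are. The sketch below reports the smallest pairwise Hamming distance in the set. It assumes umis.txt holds one UMI per line, which this guide does not state, so check the fgumi correct documentation for the actual format. min_umi_distance is a hypothetical helper.

```shell
# Hypothetical helper: min_umi_distance <umi-file>
# Assumes one UMI per line (assumed format). Prints the smallest pairwise
# Hamming distance, or -1 when the file has fewer than two UMIs.
min_umi_distance() {
  awk '
    NF { umi[n++] = toupper($1) }          # skip blank lines, normalize case
    END {
      best = -1
      for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++) {
          if (length(umi[i]) != length(umi[j])) {
            print "UMIs differ in length" > "/dev/stderr"
            exit 1
          }
          d = 0
          for (k = 1; k <= length(umi[i]); k++)
            if (substr(umi[i], k, 1) != substr(umi[j], k, 1)) d++
          if (best < 0 || d < best) best = d
        }
      print best
    }' "$1"
}
```

A set whose minimum distance is d can have up to (d - 1) / 2 mismatches (rounded down) corrected without ambiguity.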
3. Align and Sort
Use fgumi’s streaming pipeline to align with BWA and sort into template-coordinate order in a single pass:
fgumi fastq --input unaligned.bam \
| bwa mem -p ref.fa - \
| fgumi zipper --unmapped unaligned.bam \
| fgumi sort --output sorted.bam --order template-coordinate
This pipes reads through:
- fastq: converts unmapped BAM to interleaved FASTQ
- bwa mem: aligns reads to the reference
- zipper: merges aligned reads with original unmapped BAM to restore UMI tags
- sort: sorts into template-coordinate order for grouping
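The pipeline above can be wrapped in a small shell function so the input, reference, and output paths are passed once. This is a sketch that uses only the commands shown in this step. align_and_sort is a hypothetical name, and the pipefail line is optional hardening that is skipped on shells without it.

```shell
# Enable pipefail where the shell supports it, so a failure in any stage
# (not just the final sort) makes the pipeline return non-zero.
if (set -o pipefail) 2>/dev/null; then set -o pipefail; fi

# Hypothetical wrapper: align_and_sort <unmapped.bam> <ref.fa> <sorted.bam>
align_and_sort() {
  unmapped="$1"; ref="$2"; sorted="$3"
  fgumi fastq --input "$unmapped" \
    | bwa mem -p "$ref" - \
    | fgumi zipper --unmapped "$unmapped" \
    | fgumi sort --output "$sorted" --order template-coordinate
}
```

For example, align_and_sort unaligned.bam ref.fa sorted.bam runs the same four stages as the command above.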
Note:
fgumi zipper accepts SAM or BAM input, on stdin or via --input. For best performance, pipe uncompressed BAM from the aligner (e.g. bwa-mem3 mem --bam=0): this skips both the SAM text formatting on the aligner side and the SAM parsing on the zipper side. SAM is fine for aligners that can’t emit BAM. Compressed BAM on a pipe is not recommended because it wastes CPU on both ends.
For single-cell data, the CB cell barcode tag is automatically included in the
template-coordinate sort key, keeping templates from different cells at the same locus separate. The command is unchanged from the one above:
fgumi fastq --input unaligned.bam \
| bwa mem -p ref.fa - \
| fgumi zipper --unmapped unaligned.bam \
| fgumi sort --output sorted.bam --order template-coordinate
3b. (Optional) Merge Multiple BAMs
If processing multiple lanes or flowcells separately, merge the sorted BAMs before grouping:
fgumi merge \
--order template-coordinate \
--output merged.bam \
lane1_sorted.bam lane2_sorted.bam lane3_sorted.bam
All inputs must be sorted in the same order. For large numbers of files, use --input-list:
fgumi merge \
--order template-coordinate \
--input-list bam_paths.txt \
--output merged.bam
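One way to build the list file is to write the sorted BAM paths into it with printf. The sketch assumes --input-list expects one path per line, which this guide does not state. write_bam_list is a hypothetical helper.

```shell
# Hypothetical helper: write_bam_list <list-file> <bam>...
# Writes one path per line (assumed format for --input-list).
write_bam_list() {
  list="$1"; shift
  [ "$#" -gt 0 ] || { echo "no BAM files given" >&2; return 1; }
  for bam in "$@"; do
    # Refuse to write a list that points at a file that does not exist.
    [ -f "$bam" ] || { echo "missing: $bam" >&2; return 1; }
  done
  printf '%s\n' "$@" > "$list"
}
```

For example, write_bam_list bam_paths.txt lane*_sorted.bam lets the shell expand the glob and fails early if a listed file is absent.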
For single-cell data, the CB cell barcode tag is automatically included in the merge key.
4. Group Reads by UMI
Group reads from the same original molecule together.
For duplex workflows, use the paired strategy:
fgumi group \
--input sorted.bam \
--output grouped.bam \
--strategy paired
For simplex/codec workflows, use the adjacency strategy:
fgumi group \
--input sorted.bam \
--output grouped.bam \
--strategy adjacency
To collect all grouping QC metrics under a single prefix:
fgumi group \
--input sorted.bam \
--output grouped.bam \
--strategy adjacency \
--metrics group_metrics
This writes group_metrics.family_sizes.txt, group_metrics.grouping_metrics.txt, and
group_metrics.position_group_sizes.txt in one step.
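In a scripted pipeline, a quick check that all three files were written can catch a failed grouping step early. The sketch below uses the file names listed above. check_group_metrics is a hypothetical helper, not an fgumi command.

```shell
# Hypothetical helper: check_group_metrics <prefix>
# Returns non-zero and names the first metrics file that is missing or empty.
check_group_metrics() {
  prefix="$1"
  for name in family_sizes grouping_metrics position_group_sizes; do
    f="$prefix.$name.txt"
    [ -s "$f" ] || { echo "missing or empty: $f" >&2; return 1; }
  done
}
```

For example, check_group_metrics group_metrics succeeds only when all three files exist and are non-empty.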
See UMI Grouping for details on grouping strategies.
5. Call Consensus Reads
Choose the consensus calling method based on your library preparation:
Simplex consensus (single-strand):
fgumi simplex \
--input grouped.bam \
--output consensus.bam
Duplex consensus (double-strand):
fgumi duplex \
--input grouped.bam \
--output duplex.bam
CODEC consensus:
fgumi codec \
--input grouped.bam \
--output codec_consensus.bam
See Consensus Calling and Duplex Consensus Calling for details.
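The pairing between library type, grouping strategy (step 4), and consensus subcommand (this step) can be captured in one lookup so that a script cannot mix them up. This is a sketch of the combinations shown in this guide. consensus_plan is a hypothetical helper.

```shell
# Hypothetical helper: consensus_plan <duplex|simplex|codec>
# Prints "<group strategy> <consensus subcommand>" as paired in this guide.
consensus_plan() {
  case "$1" in
    duplex)  echo "paired duplex" ;;
    simplex) echo "adjacency simplex" ;;
    codec)   echo "adjacency codec" ;;
    *) echo "unknown library type: $1" >&2; return 1 ;;
  esac
}
```

For example, consensus_plan duplex prints "paired duplex", meaning group with --strategy paired and then run fgumi duplex.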
6. (Optional) Collect QC Metrics
Collect QC metrics before filtering to understand your library.
For simplex libraries, use simplex-metrics on the grouped BAM:
fgumi simplex-metrics \
--input grouped.bam \
--output simplex_metrics
For duplex libraries, use duplex-metrics on the grouped BAM:
fgumi duplex-metrics \
--input grouped.bam \
--output duplex_metrics
Both commands write a set of metrics files under the given output prefix. See Working with Metrics for details on interpreting the output.
7. Filter Consensus Reads
Filter consensus reads based on quality metrics. The --min-reads format depends on the
consensus type:
For simplex consensus (single integer):
fgumi filter \
--input consensus.bam \
--output filtered.bam \
--ref ref.fa \
--min-reads 1
For duplex consensus (three comma-separated values: duplex,AB,BA):
fgumi filter \
--input duplex.bam \
--output filtered.bam \
--ref ref.fa \
--min-reads 1,1,1
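Because the --min-reads format differs by consensus type, a wrapper script may want to validate the value before calling fgumi filter. The sketch below accepts only the two formats shown in this guide. fgumi itself may accept other forms, and valid_min_reads is a hypothetical helper.

```shell
# Hypothetical helper: valid_min_reads <simplex|duplex> <value>
# Succeeds when the value matches the format shown in this guide:
#   simplex: a single integer; duplex: three comma-separated integers.
valid_min_reads() {
  case "$1" in
    simplex) printf '%s\n' "$2" | grep -Eq '^[0-9]+$' ;;
    duplex)  printf '%s\n' "$2" | grep -Eq '^[0-9]+,[0-9]+,[0-9]+$' ;;
    *) return 2 ;;
  esac
}
```

For example, valid_min_reads duplex 1,1,1 succeeds, while valid_min_reads simplex 1,1,1 fails.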
8. (Optional) Clip Overlapping Reads
Clip overlapping bases in read pairs to avoid double-counting evidence:
fgumi clip \
--input filtered.bam \
--output clipped.bam \
--ref ref.fa
What’s Next
- Best Practices — recommended parameter settings and pipeline configuration
- Performance Tuning — threading, memory, and compression optimization
- Working with Metrics — understanding fgumi’s output metrics