Getting Started

This guide walks through a basic fgumi workflow from FASTQ files to filtered consensus reads.

Prerequisites

fgumi installed (see Installation)
A reference genome FASTA (with BWA index)
Paired-end FASTQ files with UMI sequences
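
A quick pre-flight check can catch a missing BWA index before the alignment step. The sketch below assumes the index was built with classic bwa index, which writes .amb, .ann, .bwt, .pac, and .sa files next to the FASTA; other aligners may use different suffixes. The function name check_bwa_index is illustrative and not part of fgumi.

```shell
# check_bwa_index REFERENCE_FASTA
# Succeeds only if the FASTA and every classic `bwa index` side file exist.
# ASSUMPTION: index built with `bwa index` (suffixes amb, ann, bwt, pac, sa).
check_bwa_index() {
  ref=$1
  status=0
  if [ ! -f "$ref" ]; then
    echo "missing reference FASTA: $ref" >&2
    return 1
  fi
  for suffix in amb ann bwt pac sa; do
    if [ ! -f "$ref.$suffix" ]; then
      echo "missing BWA index file: $ref.$suffix" >&2
      status=1
    fi
  done
  return "$status"
}
```

Calling check_bwa_index ref.fa before step 3 reports every missing file at once rather than stopping at the first.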

Basic Workflow

1. Extract UMIs from FASTQ

Extract UMIs from FASTQ reads and create an unmapped BAM. The --read-structures argument tells fgumi where UMI bases are located in each read. See Read Structures for details.

fgumi extract \
  --inputs R1.fastq.gz R2.fastq.gz \
  --read-structures +T +M \
  --output unaligned.bam \
  --sample MySample \
  --library MyLibrary
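
To make the read-structure idea concrete, here is a small sketch that splits a read's bases according to a structure string. It assumes the fgbio-style convention of <length><operator> segments, where + means "all remaining bases", T marks template bases, and M marks UMI bases; the Read Structures page remains the authoritative definition. The function split_by_read_structure is hypothetical and is not how fgumi itself parses reads.

```shell
# split_by_read_structure STRUCTURE BASES
# Prints one "<operator> <bases>" line per segment of the read structure.
# ASSUMPTION: fgbio-style segments (<length><operator>, "+" = all remaining bases).
split_by_read_structure() {
  rs=$1
  bases=$2
  while [ -n "$rs" ]; do
    case $rs in
      +[A-Z]*)
        # "+<op>": the segment takes every base that is left.
        op=$(printf '%s' "$rs" | cut -c2)
        segment=$bases
        bases=""
        rs=${rs#+?}
        ;;
      [0-9]*[A-Z]*)
        # "<n><op>": the segment takes the next n bases.
        len=${rs%%[!0-9]*}
        rest=${rs#"$len"}
        op=$(printf '%s' "$rest" | cut -c1)
        case $op in
          [A-Z]) ;;
          *) echo "invalid read structure segment: $rs" >&2; return 1 ;;
        esac
        segment=$(printf '%s' "$bases" | cut -c1-"$len")
        bases=$(printf '%s' "$bases" | cut -c"$((len + 1))"-)
        rs=${rest#?}
        ;;
      *)
        echo "invalid read structure segment: $rs" >&2
        return 1
        ;;
    esac
    printf '%s %s\n' "$op" "$segment"
  done
}
```

Under these assumptions, the structures in the command above treat all of R1 as template (+T) and all of R2 as UMI (+M). A hypothetical inline-UMI layout such as 8M+T would instead take the first 8 bases of a read as the UMI and the rest as template.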

2. (Optional) Correct UMIs

If your library uses a fixed set of known UMIs, correct sequencing errors in the observed UMIs against that set:

fgumi correct \
  --input unaligned.bam \
  --output corrected.bam \
  --umi-files umis.txt \
  --min-distance 1
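
One way to picture correction against a fixed UMI set is nearest-neighbor matching by Hamming distance (the number of mismatching positions). The sketch below is only an illustration of that idea: it is not fgumi's implementation, it makes no claim about the exact semantics of --min-distance, and it assumes umis.txt holds one UMI per line. The names hamming and nearest_umi are hypothetical.

```shell
# hamming A B
# Prints the number of positions at which two equal-length strings differ.
hamming() {
  a=$1
  b=$2
  d=0
  [ "${#a}" -eq "${#b}" ] || return 1
  while [ -n "$a" ]; do
    ra=${a#?}   # a without its first character
    rb=${b#?}   # b without its first character
    # Compare the first characters of a and b.
    [ "${a%"$ra"}" = "${b%"$rb"}" ] || d=$((d + 1))
    a=$ra
    b=$rb
  done
  echo "$d"
}

# nearest_umi OBSERVED UMI_FILE
# Prints "<closest known UMI><TAB><distance>"; the first UMI wins ties.
# ASSUMPTION: UMI_FILE lists one known UMI per line.
nearest_umi() {
  observed=$1
  umi_file=$2
  best=""
  best_d=""
  while IFS= read -r known; do
    [ -n "$known" ] || continue
    # Skip known UMIs whose length differs from the observed UMI.
    dist=$(hamming "$observed" "$known") || continue
    if [ -z "$best_d" ] || [ "$dist" -lt "$best_d" ]; then
      best=$known
      best_d=$dist
    fi
  done < "$umi_file"
  if [ -z "$best" ]; then
    return 1
  fi
  printf '%s\t%s\n' "$best" "$best_d"
}
```

A real corrector also has to decide what to do when two known UMIs are equally close to the observed one. This sketch simply keeps the first.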

3. Align and Sort

Use fgumi’s streaming pipeline to align with BWA and sort into template-coordinate order in a single pass:

fgumi fastq --input unaligned.bam \
  | bwa mem -p ref.fa - \
  | fgumi zipper --unmapped unaligned.bam \
  | fgumi sort --output sorted.bam --order template-coordinate

This pipes reads through:

fastq — converts unmapped BAM to interleaved FASTQ
bwa mem — aligns reads to the reference
zipper — merges aligned reads with original unmapped BAM to restore UMI tags
sort — sorts into template-coordinate order for grouping
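
Because this step is a four-stage pipe, a failure in an early stage (for example, bwa mem running out of memory) can go unnoticed: by default the shell reports only the exit status of the last stage. The sketch below wraps the same pipeline in a function that enables pipefail. It assumes bash (or another shell that supports set -o pipefail); align_and_sort is an illustrative name, and the commands and flags are exactly those shown above.

```shell
#!/usr/bin/env bash
# align_and_sort UNMAPPED_BAM REFERENCE_FASTA SORTED_BAM
# The parentheses make the function body a subshell, so pipefail
# does not leak into the caller's shell settings.
align_and_sort() (
  set -o pipefail
  unmapped=$1
  ref=$2
  sorted=$3
  fgumi fastq --input "$unmapped" \
    | bwa mem -p "$ref" - \
    | fgumi zipper --unmapped "$unmapped" \
    | fgumi sort --output "$sorted" --order template-coordinate
)
```

With pipefail on, align_and_sort unaligned.bam ref.fa sorted.bam returns a non-zero status if any stage fails, so a calling script can stop before it tries to group an incomplete BAM.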

Note: fgumi zipper accepts SAM or BAM input, on stdin or via --input. For best performance, pipe uncompressed BAM from the aligner (e.g. bwa-mem3 mem --bam=0) — this skips both the SAM text formatting on the aligner side and the SAM parsing on the zipper side. SAM is fine for aligners that can’t emit BAM; compressed BAM on a pipe is not recommended (wasted CPU on both ends).

For single-cell data, no extra options are needed. The CB cell barcode tag is automatically included in the template-coordinate sort key, which keeps templates from different cells at the same locus separate, so the pipeline is the same as above:

fgumi fastq --input unaligned.bam \
  | bwa mem -p ref.fa - \
  | fgumi zipper --unmapped unaligned.bam \
  | fgumi sort --output sorted.bam --order template-coordinate

3b. (Optional) Merge Multiple BAMs

If processing multiple lanes or flowcells separately, merge the sorted BAMs before grouping:

fgumi merge \
  --order template-coordinate \
  --output merged.bam \
  lane1_sorted.bam lane2_sorted.bam lane3_sorted.bam

All inputs must be sorted in the same order. For large numbers of files, use --input-list:

fgumi merge \
  --order template-coordinate \
  --input-list bam_paths.txt \
  --output merged.bam
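
The list file can be generated rather than written by hand. The sketch below assumes --input-list expects one BAM path per line (the guide does not spell out the format, so check the merge documentation); write_input_list is a hypothetical helper that also refuses to list a file that does not exist.

```shell
# write_input_list LIST_FILE BAM...
# Writes each existing BAM path to LIST_FILE, one per line.
# ASSUMPTION: --input-list reads one path per line.
write_input_list() {
  list=$1
  shift
  : > "$list"   # create or truncate the list file
  for bam in "$@"; do
    if [ ! -f "$bam" ]; then
      echo "missing input: $bam" >&2
      return 1
    fi
    printf '%s\n' "$bam" >> "$list"
  done
}
```

For example, write_input_list bam_paths.txt lane*_sorted.bam would list every sorted lane BAM in the current directory.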

For single-cell data, the CB cell barcode tag is automatically included in the merge key.

4. Group Reads by UMI

Group reads from the same original molecule together.

For duplex workflows, use the paired strategy:

fgumi group \
  --input sorted.bam \
  --output grouped.bam \
  --strategy paired

For simplex/codec workflows, use the adjacency strategy:

fgumi group \
  --input sorted.bam \
  --output grouped.bam \
  --strategy adjacency

To collect all grouping QC metrics under a single prefix:

fgumi group \
  --input sorted.bam \
  --output grouped.bam \
  --strategy adjacency \
  --metrics group_metrics

This writes group_metrics.family_sizes.txt, group_metrics.grouping_metrics.txt, and group_metrics.position_group_sizes.txt in one step.
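
In a scripted pipeline it can be useful to confirm that all three files were written before moving on. A minimal sketch, using only the file names listed above; check_group_metrics is an illustrative helper, not an fgumi command.

```shell
# check_group_metrics PREFIX
# Succeeds only if all three grouping metrics files exist under PREFIX.
check_group_metrics() {
  prefix=$1
  status=0
  for name in family_sizes grouping_metrics position_group_sizes; do
    file="$prefix.$name.txt"
    if [ ! -f "$file" ]; then
      echo "missing metrics file: $file" >&2
      status=1
    fi
  done
  return "$status"
}
```

For the command above, the call would be check_group_metrics group_metrics.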

See UMI Grouping for details on grouping strategies.

5. Call Consensus Reads

Choose the consensus calling method based on your library preparation:

Simplex consensus (single-strand):

fgumi simplex \
  --input grouped.bam \
  --output consensus.bam

Duplex consensus (double-strand):

fgumi duplex \
  --input grouped.bam \
  --output duplex.bam

CODEC consensus:

fgumi codec \
  --input grouped.bam \
  --output codec_consensus.bam

See Consensus Calling and Duplex Consensus Calling for details.
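
Steps 4 and 5 have to agree: the grouping strategy and the consensus command both follow from the library type. The sketch below records the pairing used in this guide so a script can look it up in one place; plan_for_library is a hypothetical helper, and the pairings are only the ones shown above.

```shell
# plan_for_library LIBRARY_TYPE
# Prints "<grouping strategy> <consensus subcommand>" as paired in this guide.
plan_for_library() {
  case $1 in
    duplex)  echo "paired duplex" ;;
    simplex) echo "adjacency simplex" ;;
    codec)   echo "adjacency codec" ;;
    *)
      echo "unknown library type: $1" >&2
      return 1
      ;;
  esac
}

# Example use (commented out so that sourcing this file runs nothing):
#   set -- $(plan_for_library duplex)
#   fgumi group --input sorted.bam --output grouped.bam --strategy "$1"
#   fgumi "$2" --input grouped.bam --output consensus.bam
```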

6. (Optional) Collect QC Metrics

Collect QC metrics before filtering to understand your library.

For simplex libraries, use simplex-metrics on the grouped BAM:

fgumi simplex-metrics \
  --input grouped.bam \
  --output simplex_metrics

For duplex libraries, use duplex-metrics on the grouped BAM:

fgumi duplex-metrics \
  --input grouped.bam \
  --output duplex_metrics

Both commands write a set of metrics files under the given output prefix. See Working with Metrics for details on interpreting the output.

7. Filter Consensus Reads

Filter consensus reads based on quality metrics. The --min-reads format depends on the consensus type:

For simplex consensus (single integer):

fgumi filter \
  --input consensus.bam \
  --output filtered.bam \
  --ref ref.fa \
  --min-reads 1

For duplex consensus (three comma-separated values: duplex,AB,BA):

fgumi filter \
  --input duplex.bam \
  --output filtered.bam \
  --ref ref.fa \
  --min-reads 1,1,1
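
Passing the wrong --min-reads shape for the consensus type is an easy mistake in a parameterized script. The sketch below checks a value against the two shapes described in this section before fgumi filter is invoked. It is deliberately strict: it accepts only the forms shown here, and fgumi itself may accept others. check_min_reads is an illustrative name.

```shell
# check_min_reads CONSENSUS_TYPE VALUE
# Succeeds if VALUE has the --min-reads shape this guide shows for the type:
#   simplex: a single integer          (e.g. 1)
#   duplex:  three integers, comma-separated, duplex,AB,BA (e.g. 1,1,1)
check_min_reads() {
  case $1 in
    simplex) printf '%s\n' "$2" | grep -Eq '^[0-9]+$' ;;
    duplex)  printf '%s\n' "$2" | grep -Eq '^[0-9]+,[0-9]+,[0-9]+$' ;;
    *)
      echo "unknown consensus type: $1" >&2
      return 2
      ;;
  esac
}
```

For instance, check_min_reads duplex 1,1,1 succeeds, while check_min_reads duplex 1 fails.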

8. (Optional) Clip Overlapping Reads

Clip overlapping bases in read pairs to avoid double-counting evidence:

fgumi clip \
  --input filtered.bam \
  --output clipped.bam \
  --ref ref.fa

What’s Next

Best Practices — recommended parameter settings and pipeline configuration
Performance Tuning — threading, memory, and compression optimization
Working with Metrics — understanding fgumi’s output metrics
