Performance Tuning Guide
fgumi provides three key options to optimize performance for your system: threading, memory management, and compression. This guide explains how to configure these options for different scenarios.
Coming from fgbio?
If you’re used to fgbio’s JVM-based memory model (java -Xmx4g), there are important differences in how fgumi manages memory:
| | fgbio (JVM) | fgumi |
|---|---|---|
| Memory control | -Xmx sets a hard ceiling on the entire process | --queue-memory controls pipeline queue backpressure |
| Enforcement | Hard limit — JVM throws OutOfMemoryError at the ceiling | Soft limit — triggers backpressure to slow producers |
| Scope | Total process memory (heap + off-heap) | Queue memory only; does not cover UMI data structures, decompressors, thread stacks, or working buffers |
| Scaling | Fixed regardless of threads | Per-thread by default (--queue-memory 768 --threads 8 = ~6 GB) |
| Recommendation | Set once and forget | Monitor RSS and adjust; use --queue-memory-per-thread false for a fixed total budget |
Key takeaway: fgumi’s actual process memory (RSS) will be higher than the --queue-memory value. When estimating memory needs, account for:
- Queue memory (controlled by --queue-memory)
- UMI grouping data structures (scales with UMI diversity and position depth)
- Per-thread decompressor and compressor instances
- Thread stacks and I/O buffers
For memory-constrained environments, start with --queue-memory-per-thread false and a conservative total budget, then increase if throughput is too low.
Threading Options
No-flag Fast Path (default)
- Usage: Omit --threads entirely
- Behavior: Uses optimized single-threaded fast path with minimal overhead
- Best for: Small files, memory-constrained systems, debugging
Explicit Single-threaded Mode
- Usage: --threads 1
- Behavior: Uses the unified pipeline with a single worker thread (the same pipeline as --threads N, but with N=1); does not use the no-flag fast path
- Best for: Isolating pipeline behavior in a single-threaded context
Multi-threaded Mode
- Usage: --threads N where N > 1
- Behavior: Uses unified 7-step pipeline with work-stealing scheduler
- Best for: Large files, high-performance systems, production workloads
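The three modes above depend only on whether --threads is passed and on its value. As a minimal sketch, the hypothetical helper below (fgumi_thread_mode is not part of fgumi, and its labels are made up for illustration) names the path this guide describes for a given value:

```shell
# Hypothetical helper, not part of fgumi: reports which execution path this
# guide describes for a given --threads value.
# Pass an empty string to mean "--threads was omitted".
fgumi_thread_mode() {
  threads="$1"
  if [ -z "$threads" ]; then
    echo "fast-path"          # flag omitted: single-threaded fast path
  elif [ "$threads" -eq 1 ]; then
    echo "pipeline-single"    # --threads 1: unified pipeline, one worker
  elif [ "$threads" -gt 1 ]; then
    echo "pipeline-multi"     # --threads N with N > 1
  else
    echo "invalid thread count: $threads" >&2
    return 1
  fi
}

fgumi_thread_mode ""   # prints: fast-path
fgumi_thread_mode 1    # prints: pipeline-single
fgumi_thread_mode 8    # prints: pipeline-multi
```

The point to remember is that omitting the flag and passing --threads 1 are not equivalent, even though both run on a single thread.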
Memory Management
fgumi’s unified memory management controls pipeline queue memory to prevent out-of-memory conditions while maintaining throughput.
Queue Memory Options
# Basic usage (768MB per thread - default)
fgumi filter --queue-memory 768 --queue-memory-per-thread true
# Human-readable formats
fgumi filter --queue-memory 2GB
fgumi filter --queue-memory 1024MiB
# Fixed total memory (no per-thread scaling)
fgumi filter --queue-memory 4096 --queue-memory-per-thread false
Memory Scaling Behavior
| Threads | Per-thread Mode | Fixed Mode |
|---|---|---|
| 1 | 768MB | 768MB |
| 4 | 3GB | 768MB |
| 8 | 6GB | 768MB |
| 16 | 12GB | 768MB |
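The table is plain multiplication. The sketch below reproduces it with a hypothetical helper (queue_budget_mb is illustrative arithmetic, not fgumi's implementation), which is handy for checking a job's queue budget before submitting it:

```shell
# Illustrative arithmetic only; mirrors the table above, not fgumi's source.
# Usage: queue_budget_mb <queue_memory_mb> <threads> <per_thread: true|false>
queue_budget_mb() {
  if [ "$3" = "true" ]; then
    echo $(( $1 * $2 ))   # per-thread mode: value is multiplied by thread count
  else
    echo "$1"             # fixed mode: value is the total queue budget
  fi
}

queue_budget_mb 768 8 true    # prints: 6144  (6 GB)
queue_budget_mb 768 8 false   # prints: 768
```

Remember that this is queue memory only; actual process RSS will be higher.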
Memory Validation
- System check: Warns if requesting >90% of available system memory
- Overflow protection: Prevents integer overflow with checked arithmetic
- Decimal support: Accepts formats like 1.5GB in addition to integers
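To anticipate the system check before launching a job, the 90% comparison can be approximated in a wrapper script. This is a sketch under assumptions: exceeds_90_percent is a hypothetical helper, and how fgumi itself measures system memory is not specified here.

```shell
# Sketch of the kind of check described above; an assumption, not fgumi's code.
# Usage: exceeds_90_percent <requested_mb> <system_mb>
# Exit status 0 means the request is above 90% of system memory.
exceeds_90_percent() {
  # Integer arithmetic: requested / system > 0.9  <=>  requested*100 > system*90
  [ $(( $1 * 100 )) -gt $(( $2 * 90 )) ]
}

# 16 threads x 1024 MB per thread on a 16 GB (16384 MB) machine
if exceeds_90_percent 16384 16384; then echo "warn"; else echo "ok"; fi   # prints: warn

# 16 threads x 768 MB per thread on the same machine
if exceeds_90_percent 12288 16384; then echo "warn"; else echo "ok"; fi   # prints: ok
```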
Compression Options
Compression Level
- Range: 1 (fastest) to 12 (best compression)
- Default: 1 (fastest) for most commands; fgumi merge defaults to 6
- Usage: --compression-level N
Compression Threading
- Default: Matches --threads setting
- Override: --compression-threads N
- Best practice: Usually leave at default
I/O and Storage Tuning
For sequential workloads like BAM and FASTQ processing, I/O throughput is often the bottleneck — not CPU. Two areas to check: OS readahead and volume throughput.
OS Readahead
The Linux kernel prefetches file data into the page cache ahead of the application. The default readahead window is typically 128 KB, which fgumi’s decompression threads can easily outpace. When that happens the processing thread stalls waiting on disk.
Check the current readahead (in 512-byte sectors):
blockdev --getra /dev/nvme1n1 # e.g. 256 = 128 KB
For sequential BAM/FASTQ workloads, increasing to 4 MB eliminates most I/O stalls:
# 4 MB = 8192 sectors (requires root)
sudo blockdev --setra 8192 /dev/nvme1n1
This setting does not persist across reboots. Add it to a startup script or udev rule if needed.
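Because blockdev reports readahead in 512-byte sectors, it helps to convert explicitly when choosing a value. A small sketch (the function names are hypothetical):

```shell
# blockdev --getra / --setra work in 512-byte sectors.
# Usage: ra_kb_to_sectors <kilobytes>
ra_kb_to_sectors() { echo $(( $1 * 1024 / 512 )); }
# Usage: ra_sectors_to_kb <sectors>
ra_sectors_to_kb() { echo $(( $1 * 512 / 1024 )); }

ra_sectors_to_kb 256    # prints: 128   (the typical default, 128 KB)
ra_kb_to_sectors 4096   # prints: 8192  (4 MB, the value suggested above)
```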
--async-reader (Experimental)
When you cannot tune OS readahead — containers, managed cloud instances, network
mounts — --async-reader provides a similar benefit from userspace. It spawns a
dedicated I/O thread that reads raw bytes into a bounded queue ahead of the
decompression step, so processing threads do not block on disk.
fgumi group \
--async-reader \
--threads 8 \
--input reads.bam \
--output grouped.bam
--async-reader works with all input types: BAM files, BGZF/gzip/plain FASTQs,
and piped stdin. It is supported by all commands that read BAM/FASTQ input,
including sort. It is most effective when I/O latency is high (network storage,
cold page cache, small OS readahead). On systems where you can already set 4 MB+
readahead, the additional benefit is modest.
AWS EBS Volume Throughput
On AWS, gp3 volumes default to 125 MB/s throughput regardless of size. For BAM
processing this is often the binding constraint. Increasing to 300-500 MB/s is
inexpensive and has a large impact:
# Increase throughput on an existing volume (takes effect within minutes)
aws ec2 modify-volume \
--volume-id vol-0123456789abcdef0 \
--throughput 500
For sustained sequential I/O, also consider increasing IOPS (default 3000) if your
reads are small. Monitor with iostat -x 1 to confirm the volume is the bottleneck
before spending on higher provisioned throughput.
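A rough lower bound on read time makes the throughput trade-off concrete. The sketch below assumes purely sequential reads at the provisioned rate, treats 1 GB as 1000 MB, and ignores caching, IOPS limits, and instance-level bandwidth caps; read_seconds is a hypothetical helper.

```shell
# Lower-bound estimate of the time to stream a file once from the volume.
# Usage: read_seconds <file_size_gb> <throughput_mb_per_s>
read_seconds() { echo $(( $1 * 1000 / $2 )); }

read_seconds 100 125   # prints: 800  (100 GB BAM at the gp3 default)
read_seconds 100 500   # prints: 200  (same file at 500 MB/s)
```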
Scenario-Based Configurations
High-Throughput Server
Goal: Maximum processing speed for large datasets
fgumi filter \
--threads 16 \
--queue-memory 1GB \
--compression-level 3 \
--input large_dataset.bam \
--output filtered.bam
Rationale:
- High thread count for parallel processing
- Generous memory for pipeline buffers
- Lower compression for speed
Memory-Constrained Node
Goal: Minimize memory usage while maintaining reasonable performance
fgumi filter \
--threads 8 \
--queue-memory 512 \
--queue-memory-per-thread false \
--compression-level 6 \
--input dataset.bam \
--output filtered.bam
Rationale:
- Moderate thread count
- Fixed memory limit (512MB total)
- Moderate compression (level 6) to balance speed and output size
Fast Local SSD
Goal: Optimize for fast I/O with minimal compression overhead
fgumi filter \
--threads 8 \
--queue-memory 2GB \
--compression-level 1 \
--input dataset.bam \
--output filtered.bam
Rationale:
- High memory for large pipeline buffers
- Minimal compression (I/O not bottleneck)
Network Storage
Goal: Minimize network I/O with high compression
fgumi filter \
--async-reader \
--threads 4 \
--queue-memory 512 \
--compression-level 9 \
--input dataset.bam \
--output filtered.bam
Rationale:
- --async-reader hides network I/O latency (see I/O and Storage Tuning)
- Moderate threading to avoid overwhelming the network
- Conservative memory usage
- High compression (level 9) to reduce network transfer
Development/Testing
Goal: Fast iteration with minimal resource usage
fgumi filter \
--queue-memory 256 \
--compression-level 1 \
--input small_test.bam \
--output test_output.bam
Rationale:
- Single-threaded fast path (--threads omitted) for simplicity
- Minimal memory footprint
- Fast compression for quick turnaround
Verbose Logging
Use --verbose (or -v) to enable debug-level logging for any command:
fgumi group --verbose --input reads.bam --output grouped.bam
This is equivalent to setting RUST_LOG=debug. If RUST_LOG is explicitly set, it takes precedence over --verbose.
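The precedence rule can be sketched as follows. effective_log_filter is a hypothetical helper; the "default" label stands in for whatever level fgumi uses when neither option is given, and treating an empty RUST_LOG as unset is an assumption of this sketch.

```shell
# Sketch of the precedence described above; not fgumi's actual code.
# Usage: effective_log_filter <verbose: true|false>
effective_log_filter() {
  if [ -n "${RUST_LOG:-}" ]; then
    echo "$RUST_LOG"   # an explicit RUST_LOG wins over --verbose
  elif [ "$1" = "true" ]; then
    echo "debug"       # --verbose behaves like RUST_LOG=debug
  else
    echo "default"     # placeholder for fgumi's built-in level
  fi
}

( unset RUST_LOG; effective_log_filter true )   # prints: debug
( RUST_LOG=warn;  effective_log_filter true )   # prints: warn
```

In practice this means a RUST_LOG value exported in your shell profile will silently override --verbose.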
Advanced Pipeline Options
The following options are available on all multi-threaded pipeline commands. They are hidden from the default help text but can be useful for debugging and performance analysis.
Pipeline Statistics
fgumi group --pipeline-stats --input reads.bam --output grouped.bam
Prints detailed per-step timing, throughput, contention metrics, and per-thread work distribution at completion.
Scheduler Strategy
fgumi group --scheduler balanced-chase-drain --input reads.bam --output grouped.bam
Controls which scheduling strategy threads use for work assignment. The default (balanced-chase-drain) is recommended for most workloads. Available strategies:
| Strategy | Description |
|---|---|
| balanced-chase-drain | Default. Balanced work distribution with output drain mode. |
| fixed-priority | Static thread roles (reader, writer, workers). Simple baseline. |
| chase-bottleneck | Threads dynamically follow work through the pipeline. |
Other experimental strategies are available (thompson-sampling, ucb, epsilon-greedy, etc.) but are not recommended for production use.
Deadlock Detection
# Adjust timeout (default: 10 seconds, 0 to disable)
fgumi group --deadlock-timeout 30 --input reads.bam --output grouped.bam
# Enable automatic recovery (default: detection only)
fgumi group --deadlock-recover --input reads.bam --output grouped.bam
The pipeline monitors for progress stalls. When no queue operations succeed for the timeout duration, diagnostic information is logged (queue depths, memory usage, per-queue timestamps).
With --deadlock-recover, the pipeline progressively doubles queue memory limits (2x, 4x, up to 8x) to resolve backpressure deadlocks, then restores original limits after 30 seconds of sustained progress.
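To see how far recovery can raise the limit, the doubling schedule can be written out. This is only an illustration of the behavior described above (recovery_limits_mb is a hypothetical helper), useful for checking that the 8x ceiling still fits in RAM:

```shell
# Prints the queue limits reached at each recovery step (2x, 4x, 8x).
# Usage: recovery_limits_mb <base_limit_mb>
recovery_limits_mb() {
  factor=2
  while [ "$factor" -le 8 ]; do
    echo $(( $1 * factor ))
    factor=$(( factor * 2 ))
  done
}

recovery_limits_mb 768   # prints 1536, 3072, 6144 on separate lines
```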
Performance Monitoring
Memory Usage
- Monitor system memory usage during execution
- Watch for “exceeds available memory” warnings
- Adjust --queue-memory if you see swap activity
Thread Utilization
- Use htop or similar to monitor CPU usage
- All threads should show activity during processing
- Consider reducing threads if not fully utilized
I/O Patterns
- Monitor disk I/O with iotop or iostat -x 1
- If threads are idle waiting on I/O, increase OS readahead or try --async-reader (see I/O and Storage Tuning)
- Network storage may benefit from lower thread counts
- SSD storage can handle higher thread counts
Troubleshooting
Out of Memory Errors
- Reduce --queue-memory
- Set --queue-memory-per-thread false for fixed limits
- Reduce --threads
Poor Performance
- Increase --threads if CPU usage is low
- Increase --queue-memory if I/O bound
- Reduce --compression-level if CPU bound
- Check OS readahead and EBS throughput if disk I/O is the bottleneck (see I/O and Storage Tuning)
Pipeline Appears Stuck
If a command hangs without producing output:
- Check if a deadlock warning appears in the log (default timeout: 10 seconds)
- Run with --verbose to see detailed pipeline activity
- Run with --pipeline-stats to see per-step metrics at completion
- Try --deadlock-recover to allow automatic recovery from backpressure deadlocks
- Reduce --threads: fewer threads means simpler scheduling and less contention
System Memory Warnings
Requested memory 16GB exceeds 90% of system memory (14.4GB)
- Reduce memory allocation or add more RAM
- Consider using --queue-memory-per-thread false
Command-Specific Considerations
Extract
- Benefits from high memory (large FASTQ processing)
- Compression level affects output size significantly
Zipper
- For best throughput, pipe uncompressed BAM from the aligner (e.g. bwa-mem3 mem --bam=0). Uncompressed BAM skips SAM text formatting on the aligner side and SAM parsing on the zipper side, and adds only ~26 bytes of BGZF framing per ~64 KiB block
- SAM input is fine for aligners that can’t emit BAM; compressed BAM on a pipe wastes CPU on both ends for data the sort step will re-compress anyway
- The zipper pipeline uses raw-byte merging internally: aligned records are not fully decoded and re-encoded unless the record actually needs modification, which eliminates a significant CPU bottleneck on high-throughput runs
Sort
- Uses an internal LoserTree (tournament tree) for k-way merging, which performs significantly better than a simple heap merge when the number of sorted runs is large
- --max-memory controls how much RAM is used for sort buffers; increase for large files to reduce the number of intermediate merge passes
- For template-coordinate sort with single-cell data, the CB tag is included automatically
- --async-reader is supported and can improve Phase 1 (input reading) throughput when disk latency is high or the OS page cache readahead is small
Merge
- fgumi merge performs a k-way merge using a LoserTree for efficient multi-file merging
- Thread count (--threads) controls compression parallelism, not merge concurrency
- For template-coordinate merges with single-cell data, the CB tag is included automatically
Group/Dedup
- Memory usage scales with UMI diversity and the number of reads at any given position
- Higher thread counts improve UMI processing
- The --metrics PREFIX flag writes all grouping metrics in one step with minimal overhead
Simplex/Duplex Metrics
- Both simplex-metrics and duplex-metrics are single-threaded; they do not benefit from --threads
- Memory usage is proportional to the number of unique genomic positions in the input
Consensus (Simplex/Duplex/CODEC)
- Memory proportional to family sizes
- Benefits from balanced threading and memory
Filter
- Streaming operation benefits from pipeline memory
- Compression affects final output size
Migration from Legacy Parameters
If you are still using the deprecated --queue-memory-limit-mb flag:
# Old (deprecated)
fgumi group --queue-memory-limit-mb 4096
# New (recommended)
fgumi group --queue-memory 4096 --queue-memory-per-thread false
The new parameters provide better control and human-readable formats while maintaining backward compatibility.
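For scripts that still pass the old flag, the rewrite shown above can be applied mechanically. The helper below is a hypothetical sketch (migrate_queue_flags is not shipped with fgumi); it handles only the space-separated "--flag value" form and assumes arguments contain no spaces.

```shell
# Rewrites the deprecated flag into the recommended pair of flags.
# Usage: migrate_queue_flags <command and arguments...>
migrate_queue_flags() {
  out=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --queue-memory-limit-mb)
        # The old flag needs a value; bail out rather than loop forever.
        [ "$#" -ge 2 ] || { echo "missing value for $1" >&2; return 1; }
        out="$out --queue-memory $2 --queue-memory-per-thread false"
        shift 2
        ;;
      *)
        out="$out $1"   # pass every other argument through unchanged
        shift
        ;;
    esac
  done
  echo "${out# }"       # drop the leading space added by the first append
}

migrate_queue_flags fgumi group --queue-memory-limit-mb 4096
# prints: fgumi group --queue-memory 4096 --queue-memory-per-thread false
```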