extract

Category: UMI EXTRACTION

Extract UMIs from FASTQ and create unmapped BAM

Description

Generates an unmapped BAM file from FASTQ files with UMI extraction.

Takes one or more FASTQ files (optionally gzipped), each representing a different sequencing read (e.g. R1, R2, I1 or I2). A set of read structures can be used to allocate the bases in those reads to template reads, sample indices, unique molecular indices, or cell barcodes, or to designate bases to be skipped.

Only template bases will be retained as read bases (stored in the SEQ field) as specified by the read structure.

Read Structures

Read structures are made up of <number><operator> pairs much like the CIGAR string in BAM files. Five kinds of operators are recognized:

  1. T identifies a template read
  2. B identifies a sample barcode read
  3. M identifies a unique molecular index read
  4. C identifies a cell barcode read
  5. S identifies a set of bases that should be skipped or ignored

The last <number><operator> pair may be specified using a + sign instead of a number to denote “all remaining bases”. This is useful if, e.g., FASTQs have been trimmed and contain reads of varying length.

For example, to convert a paired-end run with an index read, where the first 5 bases of R1 are a UMI and the next 5 bases are monotemplate (and are therefore skipped):

fgumi extract --inputs r1.fq r2.fq i1.fq --read-structures 5M5S+T +T +B

Alternatively, if reads are fixed length:

fgumi extract --inputs r1.fq r2.fq i1.fq --read-structures 5M5S65T 75T 8B
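
The read structure rules above can be sketched in a few lines of Python. This is an illustrative sketch only, not fgumi's implementation: the names `parse_read_structure` and `split_read` are hypothetical, and raising an error when a read is shorter than its structure is an assumption rather than documented behaviour.

```python
import re
from typing import List, Optional, Tuple

# One (length, operator) pair; a length of None stands for "+" (all remaining bases).
Segment = Tuple[Optional[int], str]

# A number or "+", followed by one of the five documented operators.
_PAIR = re.compile(r"(\d+|\+)([TBMCS])")


def parse_read_structure(text: str) -> List[Segment]:
    """Parse e.g. '5M5S+T' into [(5, 'M'), (5, 'S'), (None, 'T')]."""
    segments: List[Segment] = []
    pos = 0
    while pos < len(text):
        match = _PAIR.match(text, pos)
        if match is None:
            raise ValueError(f"invalid read structure {text!r} at offset {pos}")
        length, op = match.groups()
        segments.append((None if length == "+" else int(length), op))
        pos = match.end()
    if not segments:
        raise ValueError("empty read structure")
    # "+" may only be used in the last pair.
    if any(length is None for length, _ in segments[:-1]):
        raise ValueError("'+' may only appear in the last pair")
    return segments


def split_read(bases: str, structure: str) -> List[Tuple[str, str]]:
    """Return (operator, bases) for each segment of the read, in order."""
    pieces: List[Tuple[str, str]] = []
    offset = 0
    for length, op in parse_read_structure(structure):
        end = len(bases) if length is None else offset + length
        if end > len(bases):
            # Assumption: treat a too-short read as an error.
            raise ValueError("read is shorter than its read structure")
        pieces.append((op, bases[offset:end]))
        offset = end
    return pieces


# 5 UMI bases, 5 skipped bases, and the remainder as template.
print(split_read("ACGTAGGGGGTTTTCCCC", "5M5S+T"))
```

Under these assumptions, only the `T` piece would end up in the SEQ field; the `M` piece would become the UMI and the `S` piece would be discarded.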

UMI Extraction

A read structure should be provided for each read of a template. For paired-end reads, two read structures should be specified. The tags to store the molecular indices will be associated with the molecular index segment(s) in the read structure based on the order specified. If only one molecular index tag is given, then the molecular indices will be concatenated and stored in that tag. In the resulting BAM file each end of a pair will contain the same molecular index tags and values.

UMIs may be extracted from the read sequences, the read names, or both. If --extract-umis-from-read-names is specified, any UMIs present in the read names are extracted; read names are expected to be colon-separated, and the UMI is taken from the last field. At least 8 fields must be present, matching the standard Illumina format @<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y>:<UMI>. Names with 9 or more fields (e.g. produced by demultiplexers that fold the sample index into the colon-separated portion) are also handled, with the UMI still coming from the last field. Any + characters in the extracted UMI are normalized to -. If UMI segments are present in the read structures, those are also extracted. If UMIs are present in both, the final UMI is constructed by taking the UMI from the read name, then adding a hyphen, then the UMI extracted from the read bases.
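
The read-name rules described above can be illustrated with a short Python sketch. The function names `umi_from_read_name` and `combine_umis` are hypothetical, and stripping the leading `@` and any trailing whitespace-separated comment before splitting is an assumption about how a FASTQ header would be pre-processed; it is not a description of fgumi's internals.

```python
from typing import Optional

# Per the text: instrument:run:flowcell:lane:tile:x:y:UMI
MIN_FIELDS = 8


def umi_from_read_name(name: str) -> str:
    """Take the UMI from the last colon-separated field of a read name."""
    # Assumption: drop the FASTQ '@' prefix and any comment after the first space.
    core = name[1:] if name.startswith("@") else name
    core = core.split(" ", 1)[0]
    fields = core.split(":")
    if len(fields) < MIN_FIELDS:
        raise ValueError(
            f"expected at least {MIN_FIELDS} ':'-separated fields, got {len(fields)}"
        )
    # Any '+' in the extracted UMI is normalized to '-'.
    return fields[-1].replace("+", "-")


def combine_umis(name_umi: Optional[str], read_umi: Optional[str]) -> Optional[str]:
    """Read-name UMI first, then a hyphen, then the UMI taken from the read bases."""
    parts = [umi for umi in (name_umi, read_umi) if umi]
    return "-".join(parts) if parts else None


name = "@M01:42:FC1:1:1101:1000:2000:ACGT+TTGA"  # made-up example read name
from_name = umi_from_read_name(name)
print(from_name)
print(combine_umis(from_name, "GGCCA"))
```

With the made-up read name shown, the last field is `ACGT+TTGA`, so the sketch yields `ACGT-TTGA` from the name and `ACGT-TTGA-GGCCA` once a read-derived UMI of `GGCCA` is appended.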

Arguments

| Flag | Description | Default |
|------|-------------|---------|
| -i, --inputs <INPUTS> | Input FASTQ files corresponding to each sequencing read (e.g. R1, I1, etc.) | required |
| -o, --output <OUTPUT> | Output BAM file to be written | required |
| -r, --read-structures <READ_STRUCTURES> | Read structures, one for each of the FASTQs (optional if 1-2 template-only FASTQs) | |
| -q, --store-umi-quals <STORE_UMI_QUALS> | Store UMI base quality scores in the QX SAM tag | |
| -C, --store-cell-quals <STORE_CELL_QUALS> | Store cell barcode base quality scores in the CY SAM tag | |
| -Q, --store-sample-barcode-qualities <STORE_SAMPLE_BARCODE_QUALITIES> | Store the sample barcode qualities in the QT Tag | |
| -n, --extract-umis-from-read-names <EXTRACT_UMIS_FROM_READ_NAMES> | Extract UMI(s) from read names and prepend to UMIs from reads | |
| -a, --annotate-read-names <ANNOTATE_READ_NAMES> | Annotate read names with UMIs (appends “+UMIs” to read names) | |
| -s, --single-tag <SINGLE_TAG> | Single tag to store all concatenated UMIs (in addition to per-segment tags) | |
| --clipping-attribute <CLIPPING_ATTRIBUTE> | Tag containing adapter clipping position to adjust (e.g. ‘XT’ from MarkIlluminaAdapters) | |
| --read-group-id <READ_GROUP_ID> | Read group ID to use in the file header | A |
| --sample <SAMPLE> | The name of the sequenced sample | required |
| --library <LIBRARY> | The name/ID of the sequenced library | required |
| -b, --barcode <BARCODE> | Library or Sample barcode sequence | |
| --platform <PLATFORM> | Sequencing Platform | illumina |
| --platform-unit <PLATFORM_UNIT> | Platform unit (e.g. ‘flowcell-barcode.lane.sample-barcode’) | |
| --platform-model <PLATFORM_MODEL> | Platform model to insert into the group header (ex. miseq, hiseq2500, hiseqX) | |
| --sequencing-center <SEQUENCING_CENTER> | The sequencing center from which the data originated | |
| --predicted-insert-size <PREDICTED_INSERT_SIZE> | Predicted median insert size, to insert into the read group header | |
| --description <DESCRIPTION> | Description of the read group | |
| --comment <COMMENT> | Comment(s) to include in the output file’s header | |
| --run-date <RUN_DATE> | Date the run was produced, to insert into the read group header | |
| --threads <THREADS> | Number of threads for the multi-threaded pipeline | |
| --compression-level <COMPRESSION_LEVEL> | Compression level for output BAM (0-12) | 1 |
| --scheduler <SCHEDULER> | Scheduler strategy for thread work assignment | balanced-chase-drain |
| --pipeline-stats <PIPELINE_STATS> | Print detailed pipeline statistics at completion | false |
| --deadlock-timeout <DEADLOCK_TIMEOUT> | Timeout in seconds for deadlock detection (default: 10, 0 = disabled) | 10 |
| --deadlock-recover <DEADLOCK_RECOVER> | Enable automatic deadlock recovery (default: false, detection only) | false |
| --queue-memory <QUEUE_MEMORY> | Pipeline queue memory limit per thread (default) or total | 768 |
| --queue-memory-per-thread <QUEUE_MEMORY_PER_THREAD> | Interpret --queue-memory as per-thread (true, default) or total (false) | true |
| --queue-memory-limit-mb <QUEUE_MEMORY_LIMIT_MB> | DEPRECATED: Use --queue-memory instead. Memory limit for pipeline queues in megabytes | |
| --async-reader <ASYNC_READER> | Wrap FASTQ inputs in a userspace async prefetch reader. Dedicates one OS thread per input stream to issue reads ahead of decompression/parsing. Hidden experimental flag | false |
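
The usage examples earlier on this page omit the arguments listed as required (--output, --sample and --library). As a rough sketch of how a full invocation might be assembled from a script, the Python below builds the argument list from the flags documented here. The helper `build_extract_command` and all file, sample and library names are hypothetical; the sketch only prints the command and does not run fgumi.

```python
import shlex
from typing import List, Optional, Sequence


def build_extract_command(
    inputs: Sequence[str],
    output: str,
    sample: str,
    library: str,
    read_structures: Optional[Sequence[str]] = None,
) -> List[str]:
    """Assemble an argument list for `fgumi extract` from the documented flags."""
    if not inputs:
        raise ValueError("at least one input FASTQ is required")
    cmd = ["fgumi", "extract", "--inputs", *inputs]
    if read_structures is not None:
        # One read structure per FASTQ, in the same order as the inputs.
        if len(read_structures) != len(inputs):
            raise ValueError("need exactly one read structure per input FASTQ")
        cmd += ["--read-structures", *read_structures]
    cmd += ["--output", output, "--sample", sample, "--library", library]
    return cmd


# Hypothetical file, sample and library names, for illustration only.
cmd = build_extract_command(
    inputs=["r1.fq", "r2.fq", "i1.fq"],
    read_structures=["5M5S+T", "+T", "+B"],
    output="unmapped.bam",
    sample="sample1",
    library="library1",
)
print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string means it can be passed directly to `subprocess.run` without shell quoting concerns.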