Manual Reference Pages  - minigraph (1)

NAME

minigraph - sequence-to-graph mapping and incremental sequence graph generation

CONTENTS

Synopsis
Description
Options
     Indexing options
     Mapping options
     Graph generation options
     Input/output options
     Preset options
     Miscellaneous options
Output Format
Limitations
See Also

SYNOPSIS

* Sequence-to-graph mapping:
minigraph [-x preset] [-t nThreads] graph.gfa query1.fa [...] > out.gaf

* Incremental graph generation:
minigraph -x ggs [-t nThreads] initGraph.gfa sample1Asm.fa [...] > finalGraph.gfa

DESCRIPTION

Minigraph is a proof-of-concept sequence-to-graph mapper and graph constructor. It finds approximate locations of a query sequence in a sequence graph and incrementally augments an existing graph with long query subsequences.

OPTIONS

    Indexing options

-k INT Minimizer k-mer length [15]
-w INT Minimizer window size [10]. A minimizer is the smallest k-mer in a window of w consecutive k-mers.
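
The minimizer definition above can be illustrated with a short Python sketch. It is not minigraph's implementation: this page does not say how k-mers are ordered or whether both strands are considered, so plain lexicographic order on the forward strand is assumed here purely for illustration, and the function name is hypothetical.

```python
def minimizers(seq, k=15, w=10):
    """Return sorted (position, k-mer) pairs selected as window minimizers.

    Assumption: k-mers are compared lexicographically on the forward strand.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    # Slide a window over w consecutive k-mers and keep the smallest one.
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Ties are broken in favor of the leftmost position.
        best = min(window, key=lambda i: (kmers[i], i))
        picked.add(best)
    return [(i, kmers[i]) for i in sorted(picked)]


if __name__ == "__main__":
    for pos, kmer in minimizers("ACGTACGT", k=3, w=2):
        print(pos, kmer)
```

Adjacent windows often share the same minimizer, which is why the selected set is much smaller than the full list of k-mers.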

    Mapping options

-U INT1[,INT2]
  Choose the minimizer occurrence threshold within this interval [50,250]
-f FLOAT Ignore top FLOAT fraction of repetitive minimizers [0.0002]. If this threshold falls within the interval set by -U, it will be the final threshold; otherwise the lower or the upper bound of -U will be applied.
-j FLOAT Expected query-graph sequence divergence [0.1]
-g NUM Stop chain elongation if there are no minimizers within NUM-bp [10k]. K/k/M/m suffixes are recognized.
-r NUM Bandwidth used in chaining [2k]. This option approximately controls the maximum gap size.
-n INT1[,INT2]
  Drop graph chains consisting of <INT1 minimizers and drop linear chains consisting of <INT2 minimizers [5,3]
-m INT1[,INT2]
  Drop graph chains with graph chaining score <INT1 and drop linear chains with linear chaining score <INT2 [50,30]. Linear chaining score equals the approximate number of matching bases minus a weak concave gap penalty. Graph chaining score uses a linear gap penalty.
-p FLOAT Minimal secondary-to-primary score ratio to output secondary mappings [0.8]. Between two chains overlapping over half of the shorter chain (controlled by -M), the chain with a lower score is secondary to the chain with a higher score.
-N INT Output at most INT secondary mappings [5]. This option has no effect when -P is applied.
-P Retain all chains and don’t attempt to set primary chains. Options -p and -N have no effect when this option is in use.
-M FLOAT Mark as secondary a chain that overlaps with a better chain by FLOAT or more of the shorter chain [0.5]
--max-gap-pre NUM
  Similar to -g but used for prefiltering [1000]
--max-lc-iter NUM
  Maximum number of iterations for linear chaining [10000]
--max-rmq-size NUM
  Maximum size of the RMQ tree [100000]
--max-lc-skip INT
  A heuristic that stops linear chaining early [25]
--max-gc-skip INT
  Similar to --max-lc-skip but applied to graph chaining [25]
--ref-bonus INT
  Bonus for a reference subwalk [0]
--min-cov-blen NUM
  Minimum alignment block length to count [1k]
--min-cov-mapq INT
  Minimum mapping quality to count [20]
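
The interaction between -U and -f described above amounts to clamping one value into an interval. The sketch below assumes the occurrence count implied by -f has already been computed; the function and parameter names are hypothetical and are not taken from the minigraph source.

```python
def occurrence_threshold(f_threshold, u_low=50, u_high=250):
    """Pick the final minimizer occurrence threshold.

    f_threshold: occurrence count implied by -f (assumed precomputed).
    u_low, u_high: the interval given by -U INT1,INT2.
    """
    # Inside the interval the -f value wins; otherwise the nearer bound applies.
    return max(u_low, min(f_threshold, u_high))


if __name__ == "__main__":
    for value in (10, 120, 900):
        print(value, "->", occurrence_threshold(value))
```

With the default -U interval, a -f threshold of 120 is used unchanged, while 10 is raised to 50 and 900 is lowered to 250.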

    Graph generation options

--ggen=[simple]
  Graph generation algorithm. So far only a simple algorithm is implemented [simple]. With this option, all query sequences are loaded into memory.
--cov Remap and generate segment and link use frequencies. This option triggers GFA output. When used with --ggen, minigraph writes the frequency of link uses and the average breadth of coverage of each segment to the cf tag. When used without --ggen, minigraph writes the count of link uses and the average depth of coverage of each segment to the dc tag.
-q INT Minimum mapping quality [5]
-l NUM Minimum chain length to consider [50k]
-d NUM Minimum chain length for depth calculation [10k]
-L INT Minimum insertion length [250]
--gg-match-pen INT
  Penalty for a pair of matching anchors [5]. Larger value for more fragmented inserts.
--ins-qovlp=yes|no
  Forcefully resolve query overlaps [no]
--inv=yes|no
  Generate graphs with inversions or not [yes]

    Input/output options

-o FILE Output alignments to FILE [stdout].
-t INT Number of threads [4]. Minigraph uses at most three threads when indexing target sequences, and uses up to INT+1 threads when mapping (the extra thread is for I/O, which is frequently idle and takes little CPU time).
-K NUM Number of bases loaded into memory to process in a mini-batch [500M]. K/M/G/k/m/g suffix is accepted. A large NUM helps load balancing in the multi-threading mode, at the cost of increased memory. This option has no effect if --ggen is applied.
--vc In output GAF, show mapping paths in the unstable segment coordinate.
-S Output linear chains in the format of: ‘*’ segName segLen nMinimizer seqDiv segStart segEnd qStart qEnd
--write-mz Output linear chains in the format of: ‘*’ segName segLen nMinimizer seqDiv segStart segEnd qStart qEnd k-mer segOffsets qOffsets. segOffsets and qOffsets are comma-separated lists with each consisting of nMinimizer-1 integers which give the distance from the previous minimizer on segments and query, respectively.
--secondary=yes|no
  Whether to output secondary alignments [no]
--show-unmap=yes|no
  Print unmapped query sequences in GAF [no]
--version Print version number to stdout
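
Options of type NUM, such as -K 500M or -g 10k, accept size suffixes. A wrapper script that needs the same values as integers could expand them as sketched below. The decimal multipliers (k = 1,000, m = 1,000,000, g = 1,000,000,000) are an assumption, since this page does not state them, and parse_num is a hypothetical helper name.

```python
def parse_num(text):
    """Expand a NUM string such as '500M' or '10k' into an integer.

    Assumption: suffixes are decimal multipliers and case-insensitive.
    """
    text = text.strip()
    if not text:
        raise ValueError("empty NUM value")
    multipliers = {"k": 1e3, "m": 1e6, "g": 1e9}
    suffix = text[-1].lower()
    if suffix in multipliers:
        return int(float(text[:-1]) * multipliers[suffix])
    return int(float(text))


if __name__ == "__main__":
    for value in ("500M", "10k", "4g", "1000"):
        print(value, "->", parse_num(value))
```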

    Preset options

-x STR Preset []. This option applies multiple options at the same time. Other options on the command line will always override values set by -x. Available STR are:
lr Mapping noisy long reads (-k15 -w10 -j.1 -g5k -r2k --min-cov-blen=1000). This is the same as the default setting.
sr Mapping short single-end or paired-end reads (-k21 -w10 -U1000,2500 -g100 -r100 -p.5 -n3,2 -m40,25 --heap-sort=yes -K50m --frag --ref-bonus=1 --min-cov-blen=50). Paired-end mapping is not supported.
asm Mapping long contigs or high-quality CCS reads (-k19 -w10 -j.01 -g100k -r100k --max-gap-pre=10k -n5,3 -m1000,40 -K4g --max-lc-skip=50 --max-gc-skip=50 --min-cov-mapq=5 --min-cov-blen=100k).
ggs Simple algorithm for incremental graph generation (-xasm --ggen=simple).
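
The rule that other command-line options always override values set by -x can be pictured as a two-step dictionary merge. The sketch below copies the lr and asm values listed above and derives ggs from asm; it only illustrates the precedence rule and is not how minigraph parses its command line. The resolve function is hypothetical.

```python
# Preset values as listed on this page, kept as strings.
PRESETS = {
    "lr": {"-k": "15", "-w": "10", "-j": ".1", "-g": "5k", "-r": "2k",
           "--min-cov-blen": "1000"},
    "asm": {"-k": "19", "-w": "10", "-j": ".01", "-g": "100k", "-r": "100k",
            "--max-gap-pre": "10k", "-n": "5,3", "-m": "1000,40", "-K": "4g",
            "--max-lc-skip": "50", "--max-gc-skip": "50",
            "--min-cov-mapq": "5", "--min-cov-blen": "100k"},
}
# ggs is defined as -xasm plus --ggen=simple.
PRESETS["ggs"] = {**PRESETS["asm"], "--ggen": "simple"}


def resolve(preset, explicit):
    """Apply the preset first, then let explicit options override it."""
    settings = dict(PRESETS.get(preset, {}))
    settings.update(explicit)
    return settings


if __name__ == "__main__":
    print(resolve("ggs", {"-k": "17"}))
```

In this model, `minigraph -x asm -k 17` ends up with -k set to 17 while every other asm value is kept.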

    Miscellaneous options

--no-kalloc
  Use the libc default allocator instead of the kalloc thread-local allocator. This debugging option is mostly used with Valgrind to detect invalid memory accesses. Minigraph runs slower with this option, especially in the multi-threading mode.

OUTPUT FORMAT

Minigraph outputs mapping positions in the Graph mApping Format (GAF) by default. GAF is a TAB-delimited text format with each line consisting of at least 12 fields as are described in the following table:

Col  Type    Description
1    string  Query sequence name
2    int     Query sequence length
3    int     Query start coordinate (0-based; closed)
4    int     Query end coordinate (0-based; open)
5    char    ‘+’ if query/path on the same strand; ‘-’ if opposite
6    string  Path matching /([><][^\s><]+(:\d+-\d+)?)+|([^\s><]+)/
7    int     Path sequence length
8    int     Path start coordinate
9    int     Path end coordinate
10   int     Number of matching bases in the mapping
11   int     Number of bases, including gaps, in the mapping
12   int     Mapping quality (0-255 with 255 for missing)
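
The path pattern in column 6 can be split into individual steps with a few regular expressions. The sketch below validates a path against the pattern given in the table and then separates orientation, name, and the optional start-end range. Reading ‘>’ as forward and ‘<’ as reverse follows the usual GAF convention and is not spelled out on this page; parse_path is a hypothetical helper.

```python
import re

# Pattern for column 6, copied from the table above.
PATH_RE = re.compile(r"([><][^\s><]+(:\d+-\d+)?)+|([^\s><]+)")
STEP_RE = re.compile(r"([><])([^\s><]+)")
RANGE_RE = re.compile(r"^(.+):(\d+)-(\d+)$")


def parse_path(path):
    """Split a GAF path into (orientation, name, start, end) tuples."""
    if PATH_RE.fullmatch(path) is None:
        raise ValueError("not a valid GAF path: %r" % path)
    if path[0] not in "><":
        # A bare name without an orientation character.
        return [(None, path, None, None)]
    steps = []
    for orient, name in STEP_RE.findall(path):
        ranged = RANGE_RE.match(name)
        if ranged:
            steps.append((orient, ranged.group(1),
                          int(ranged.group(2)), int(ranged.group(3))))
        else:
            steps.append((orient, name, None, None))
    return steps


if __name__ == "__main__":
    print(parse_path(">s1<s2"))
    print(parse_path(">chr1:100-200"))
```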

When alignment is available, column 11 gives the total number of sequence matches, mismatches and gaps in the alignment; column 10 divided by column 11 gives the BLAST-like alignment identity. When alignment is unavailable, these two columns are approximate. GAF may optionally have additional fields in the SAM-like typed key-value format. Minigraph may output the following tags:

Tag  Type  Description
tp   A     Type of aln: P/primary and S/secondary
cm   i     Number of minimizers on the chain
s1   i     Chaining score
s2   i     Chaining score of the best secondary chain
dv   f     Approximate per-base sequence divergence
cf   f     Avg. segment breadth of coverage and link use freq
dc   f     Avg. segment depth of coverage and link use counts
ql   B,i   Lengths of single-end reads
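
A GAF line can be read with a few lines of Python: split on TAB, convert the 12 mandatory columns, decode the typed tags, and divide column 10 by column 11 to get the BLAST-like identity. The dictionary keys and the function name below are hypothetical, and the sample line in the usage part is made up for illustration rather than taken from real minigraph output.

```python
def parse_gaf_line(line):
    """Parse one GAF line into a dict with typed tags and an identity value."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 12:
        raise ValueError("a GAF line needs at least 12 fields")
    rec = {
        "qname": cols[0], "qlen": int(cols[1]),
        "qstart": int(cols[2]), "qend": int(cols[3]),
        "strand": cols[4], "path": cols[5], "plen": int(cols[6]),
        "pstart": int(cols[7]), "pend": int(cols[8]),
        "matches": int(cols[9]), "block_len": int(cols[10]),
        "mapq": int(cols[11]),
    }
    tags = {}
    for field in cols[12:]:
        key, typ, value = field.split(":", 2)
        if typ == "i":
            value = int(value)
        elif typ == "f":
            value = float(value)
        elif typ == "B":
            # Array tag: first item is the element type, the rest are values.
            subtype, *items = value.split(",")
            cast = float if subtype == "f" else int
            value = [cast(x) for x in items]
        tags[key] = value
    rec["tags"] = tags
    # Column 10 / column 11; approximate when no base-level alignment exists.
    rec["identity"] = (rec["matches"] / rec["block_len"]
                       if rec["block_len"] else None)
    return rec


if __name__ == "__main__":
    sample = ("read1\t1000\t10\t990\t+\t>s1>s2\t5000\t100\t1080\t900\t980\t60"
              "\ttp:A:P\tcm:i:85\tdv:f:0.0123")
    rec = parse_gaf_line(sample)
    print(rec["qname"], rec["path"], rec["tags"], rec["identity"])
```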

LIMITATIONS

* Minigraph needs to find strong colinear chains first. For a graph consisting of many short segments (e.g. one generated from rare SNPs in large populations), minigraph will fail to map query sequences.
* When connecting colinear chains on graphs, minigraph doesn’t take full advantage of base sequences and may miss the optimal alignments.
* Minigraph doesn’t give base-level alignment.
* Minigraph only inserts segments contained in long graph chains. This conservative strategy helps to build a relatively accurate graph, but may miss more complex events. Other strategies may be explored in the future.

SEE ALSO

minimap2(1), gfatools(1).


minigraph-0.13 (r397) minigraph (1) 3 December 2020