Manual Reference Pages  - miniprot (1)


miniprot - protein-to-genome alignment with splicing and frameshifts


     Indexing options
     Chaining options
     Alignment options
     Input/Output options
Output Format
     The GFF3 Format
     The PAF Format


* Indexing a genome (recommended as indexing can be slow and memory hungry):
miniprot [-t nThreads] -d ref.mpi ref.fna

* Aligning proteins to a genome:

miniprot [-t nThreads] ref.mpi protein.faa > output.paf
miniprot [-t nThreads] ref.fna protein.faa > output.paf


Miniprot aligns protein sequences to a genome allowing potential frameshifts and splicing.


    Indexing options

-k INT K-mer size for genome-wide indexing [6]
-s INT Syncmer submer size [4]. On average, miniprot selects a k-mer every 2*(k-s)+1 residues.
-b INT Number of bits per bin [8]. Miniprot splits the genome into non-overlapping bins of 2^INT bp in size (2^8 = 256 bp by default).
-d FILE Write the index to FILE [].
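
As a quick check of the arithmetic behind -k, -s and -b, the sketch below evaluates the two formulas quoted above with the default values. The function names are illustrative only and are not part of miniprot.

```python
def syncmer_spacing(k: int = 6, s: int = 4) -> int:
    """Average distance between selected k-mers, per the formula 2*(k-s)+1."""
    return 2 * (k - s) + 1


def bin_size(b: int = 8) -> int:
    """Size in bp of one genome bin when -b is set to b bits."""
    return 1 << b


spacing = syncmer_spacing()  # defaults: -k 6, -s 4
size = bin_size()            # default: -b 8
print(spacing, size)
```

With the defaults this evaluates to one selected k-mer every 5 residues and bins of 256 bp.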

    Chaining options

-S Disable splicing. It applies ‘-G1k -J1k -e1k’ at the same time.
-c NUM Ignore k-mers occurring NUM times or more [50k]
-G NUM Max intron size [200k]
-n NUM Min number of syncmers in a chain [10]
-m NUM Min chaining score [0]
-l INT K-mer size for the second round of chaining [5]
-e NUM Max extension from chain ends for alignment or the second round of chaining [10k]
-p FLOAT Filter out a secondary chain/alignment if its score is below FLOAT times the score of the best chain [0.5]
-N NUM Retain at most NUM secondary chains/alignments [30]

    Alignment options

-O INT Gap open penalty [11]
-E INT Gap extension penalty [1]. A gap of size g costs {-O}+{-E}*g.
-J INT Intron open penalty [29]
-F INT Penalty for frameshifts or in-frame stop codons [17]
-C FLOAT Weight of splicing penalty [1]. Set to 0 to ignore splicing signals.
-B INT Bonus score for alignment reaching ends of proteins [5]
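
The gap cost formula given under -E can be written out as a one-line function. This is only a sketch of the stated formula with the default penalties; the function name is hypothetical, and it does not model intron (-J) or frameshift (-F) penalties.

```python
def gap_cost(g: int, gap_open: int = 11, gap_ext: int = 1) -> int:
    """Cost of a gap of g residues: {-O} + {-E} * g."""
    if g < 1:
        raise ValueError("gap size must be at least 1")
    return gap_open + gap_ext * g


# Cost of gaps of size 1, 3 and 10 with the default -O 11 -E 1.
costs = [gap_cost(g) for g in (1, 3, 10)]
print(costs)
```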

    Input/Output options

-t INT Number of threads [4]
--gff Output in the GFF3 format. ‘##PAF’ lines in the output provide detailed alignments.
--gff-only Output in the GFF3 format without ‘##PAF’ lines.
-P STR Prefix for IDs in GFF3 [MP]. --gff-delim overrides this option.
--gff-delim CHAR
  Change the ID field in GFF3 to QueryNameCHARHitIndex []. If not specified, the default ID looks like ‘MP000012’.
-u Print unmapped query proteins
--outn NUM Output up to min{NUM, -N} alignments per query [1000].
-K NUM Query batch size [2M]


    The GFF3 Format

Miniprot outputs alignments in the extended Pairwise mApping Format (PAF) by default (see the next subsection). It can also output GFF3 with option --gff. Miniprot may output three feature types: ‘mRNA’, ‘CDS’ and ‘stop_codon’. A stop_codon is reported only if the alignment reaches the C-terminus of the protein and the next codon is a stop codon. Per the GENCODE convention, stop_codon is not part of CDS but is part of mRNA.

Miniprot may output the following attributes in GFF3:

ID          str   mRNA identifier
Parent      str   Identifier of the parent feature
Rank        int   Rank among all hits of the query
Identity    real  Fraction of exact amino acid matches
Positive    real  Fraction of positive amino acid matches
Donor       str   2bp at the donor site if not GT
Acceptor    str   2bp at the acceptor site if not AG
Frameshift  int   Number of frameshift events in alignment
StopCodon   int   Number of in-frame stop codons
Target      str   Protein coordinate in alignment
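
The attributes above appear in the ninth GFF3 column as ‘;’-separated key=value pairs. The following is a minimal sketch of reading that column into typed values. The function name and the sample attribute string are made up for illustration and are not real miniprot output.

```python
from urllib.parse import unquote

# Attribute types taken from the table above; anything else stays a string.
INT_KEYS = {"Rank", "Frameshift", "StopCodon"}
REAL_KEYS = {"Identity", "Positive"}


def parse_gff3_attributes(column9: str) -> dict:
    """Split a GFF3 attribute column into a dict of typed values."""
    attrs = {}
    for item in column9.strip().split(";"):
        if not item:
            continue  # tolerate a trailing ';'
        key, _, value = item.partition("=")
        value = unquote(value)  # GFF3 percent-encodes reserved characters
        if key in INT_KEYS:
            attrs[key] = int(value)
        elif key in REAL_KEYS:
            attrs[key] = float(value)
        else:
            attrs[key] = value
    return attrs


# Hypothetical attribute string for an mRNA feature.
example = "ID=MP000012;Rank=1;Identity=0.9820;Positive=0.9910;Target=prot1 1 250"
attrs = parse_gff3_attributes(example)
print(attrs["ID"], attrs["Rank"], attrs["Identity"])
```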

    The PAF Format

PAF gives detailed alignments. It is a TAB-delimited text format in which each line consists of at least 12 fields, as described in the following table:

1   string  Protein sequence name
2   int     Protein sequence length
3   int     Protein start coordinate (0-based)
4   int     Protein end coordinate (0-based)
5   char    ‘+’ for forward strand; ‘-’ for reverse
6   string  Contig sequence name
7   int     Contig sequence length
8   int     Contig start coordinate on the original strand
9   int     Contig end coordinate on the original strand
10  int     Number of matching nucleotides
11  int     Number of nucleotides in alignment excl. introns
12  int     Mapping quality (0-255 with 255 for missing)

PAF may optionally have additional fields in the SAM-like typed key-value format. Miniprot may output the following tags:

AS  i  Alignment score from dynamic programming
ms  i  Alignment score excluding introns
np  i  Number of amino acid matches with positive scores
da  i  Distance to the nearest start codon
do  i  Distance to the nearest stop codon
cg  Z  Protein CIGAR
cs  Z  Difference string
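
To illustrate the layout described above, here is a minimal sketch of reading one PAF line into the 12 mandatory fields plus a dictionary of optional tags. It assumes tags follow the SAM convention KEY:TYPE:VALUE, with ‘i’ for integers, ‘f’ for floats and anything else kept as a string. The record type, function name and sample line are invented for illustration.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    qname: str    # 1: protein sequence name
    qlen: int     # 2: protein sequence length
    qstart: int   # 3: protein start
    qend: int     # 4: protein end
    strand: str   # 5: '+' or '-'
    tname: str    # 6: contig sequence name
    tlen: int     # 7: contig sequence length
    tstart: int   # 8: contig start
    tend: int     # 9: contig end
    n_match: int  # 10: number of matching nucleotides
    aln_len: int  # 11: nucleotides in alignment excl. introns
    mapq: int     # 12: mapping quality
    tags: dict    # optional typed key-value fields


def parse_paf_line(line: str) -> PafRecord:
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("a PAF line needs at least 12 TAB-separated fields")
    tags = {}
    for item in f[12:]:
        key, typ, val = item.split(":", 2)  # the value may itself contain ':'
        tags[key] = int(val) if typ == "i" else float(val) if typ == "f" else val
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4], f[5],
                     int(f[6]), int(f[7]), int(f[8]), int(f[9]), int(f[10]),
                     int(f[11]), tags)


# Made-up line: a 100-aa protein aligned over two exons split by a 200-bp intron.
example = "\t".join(["prot1", "100", "0", "100", "+", "chr1", "50000", "1000",
                     "1500", "270", "300", "0", "AS:i:480", "ms:i:500",
                     "cg:Z:60M200N40M"])
rec = parse_paf_line(example)
print(rec.qname, rec.tname, rec.tstart, rec.tend, rec.tags["cg"])
```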

A protein CIGAR consists of the following operators:

nM  Alignment match. Consuming n*3 nucleotides and n amino acids
nI  Insertion. Consuming n amino acids
nD  Deletion. Consuming n*3 nucleotides
nF  Frameshift deletion. Consuming n nucleotides
nG  Frameshift match. Consuming n nucleotides and 1 amino acid
nN  Phase-0 intron. Consuming n nucleotides
nU  Phase-1 intron. Consuming n nucleotides and 1 amino acid
nV  Phase-2 intron. Consuming n nucleotides and 1 amino acid
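
The consumption rules above are enough to work out how many nucleotides and amino acids a protein CIGAR spans. The sketch below applies them operator by operator. The function name and the sample CIGAR strings are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDFGNUV])")
CIGAR_FULL = re.compile(r"(?:\d+[MIDFGNUV])+")


def cigar_span(cigar: str) -> tuple:
    """Return (nucleotides consumed, amino acids consumed) by a protein CIGAR."""
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"malformed protein CIGAR: {cigar!r}")
    nt = aa = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "M":         # match: n*3 nucleotides, n amino acids
            nt += 3 * n
            aa += n
        elif op == "I":       # insertion: n amino acids only
            aa += n
        elif op == "D":       # deletion: n*3 nucleotides only
            nt += 3 * n
        elif op in "FN":      # frameshift deletion or phase-0 intron: n nucleotides
            nt += n
        else:                 # G, U, V: n nucleotides and 1 amino acid
            nt += n
            aa += 1
    return nt, aa


span_a = cigar_span("60M200N40M")  # two exons around a 200-bp phase-0 intron
span_b = cigar_span("10M1G5M")     # a frameshift match inside an exon
print(span_a, span_b)
```

For an alignment on the forward strand, the first number should equal the difference between PAF fields 9 and 8, and the second the difference between fields 4 and 3, which makes this a convenient sanity check when post-processing output.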

The cs tag encodes difference sequences. It consists of a series of operations:

:[0-9]+                      Number of identical amino acids
*[acgtn]+[A-Z*]              Substitution: ref to query
+[A-Z]+                      # aa inserted to the reference
-[acgtn]+                    # nt deleted from the reference
~[acgtn]{2}[0-9]+[acgtn]{2}  Intron length and splice signal
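
The operation patterns above translate directly into a regular expression. The following sketch splits a cs string into (operator, payload) pairs using exactly those five patterns. The function name and the sample string are invented for illustration, and real miniprot output may contain details this sketch does not cover.

```python
import re

# One alternative per operation listed in the table above.
CS_OP = re.compile(
    r":[0-9]+"                        # run of identical amino acids
    r"|\*[acgtn]+[A-Z*]"              # substitution: ref codon to query residue
    r"|\+[A-Z]+"                      # amino acids inserted to the reference
    r"|-[acgtn]+"                     # nucleotides deleted from the reference
    r"|~[acgtn]{2}[0-9]+[acgtn]{2}"   # intron: donor, length, acceptor
)


def tokenize_cs(cs: str) -> list:
    """Split a cs string into (operator, payload) tuples, left to right."""
    tokens = []
    pos = 0
    while pos < len(cs):
        m = CS_OP.match(cs, pos)
        if m is None:
            raise ValueError(f"unrecognized cs operation at offset {pos}")
        op = m.group(0)
        tokens.append((op[0], op[1:]))
        pos = m.end()
    return tokens


# Made-up cs string exercising every operation once.
tokens = tokenize_cs(":20*gcaT+KL-aaa~gt150ag:5")
print(tokens)
```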


* The DP alignment score (the AS tag) is not accurate.
* Need to introduce more heuristics for improved accuracy.

miniprot-0.4 (r165) miniprot (1) 5 October 2022