Manual Reference Pages - miniprot (1)

NAME

miniprot - protein-to-genome alignment with splicing and frameshifts

CONTENTS

Synopsis
Description
Options
Indexing options
Chaining options
Alignment options
Input/Output options
Output Format
The GFF3 Format
The PAF Format
Limitations

SYNOPSIS

* Indexing a genome (recommended as indexing can be slow and memory hungry):
miniprot [-t nThreads] -d ref.mpi ref.fna

* Aligning proteins to a genome:
miniprot [-t nThreads] ref.mpi protein.faa > output.paf
miniprot [-t nThreads] ref.fna protein.faa > output.paf

DESCRIPTION

Miniprot aligns protein sequences to a genome allowing potential frameshifts and splicing.

OPTIONS

Indexing options

-k INT K-mer size for genome-wide indexing [6]

-M INT Sample k-mers at a rate 1/2**INT [1]. Increasing this option reduces peak memory but decreases sensitivity.

-L INT Minimum ORF length to index [30]

-T INT NCBI translation table (1 through 33 except 7-8 and 17-20) [1]

-b INT Number of bits per bin [8]. Miniprot splits the genome into non-overlapping bins of 2**INT bp in size (256 bp with the default of 8).

-d FILE Write the index to FILE [].
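
As a rough illustration of how -M and -b scale, the sketch below computes the sampling rate and bin size implied by the descriptions above. The function names are hypothetical and not part of miniprot, and the bin count assumes that bins simply tile a sequence from its first base.

```python
def kmer_sampling_rate(m: int = 1) -> float:
    """Fraction of k-mers sampled under -M: 1/2**m."""
    return 1.0 / (1 << m)


def bin_size(bits: int = 8) -> int:
    """Bin size in bp implied by -b: 2**bits."""
    return 1 << bits


def n_bins(seq_len: int, bits: int = 8) -> int:
    """Bins needed to cover seq_len bp (assumption: bins tile from position 0)."""
    size = bin_size(bits)
    return (seq_len + size - 1) // size  # integer ceiling division


if __name__ == "__main__":
    print(kmer_sampling_rate(), bin_size(), n_bins(1000))
```

Under these assumptions, raising -M by one halves the number of sampled k-mers, which is where the memory saving and the sensitivity loss both come from.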

Chaining options

-S Disable splicing. It applies ‘-G1k -J1k -e1k’ at the same time.

-c NUM Ignore k-mers occurring NUM times or more [50k]

-G NUM Max intron size [200k]. This option overrides -I.

-I Set max intron size to min(max(3.6*sqrt(refLen),10000),300000) where refLen is the total length of the input genome.

-n NUM Min number of syncmers in a chain [10]

-m NUM Min chaining score [0]

-l INT K-mer size for the second round of chaining [5]

-e NUM Max extension from chain ends for alignment or the second round of chaining [10k]

-p FLOAT Filter out a secondary chain/alignment if its score is less than FLOAT times the score of the best chain [0.5]

-N NUM Retain at most NUM secondary chains/alignments [30]
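
The intron size rule used by -I can be written out directly. The following is a minimal sketch of the formula as stated above; the function name is hypothetical, and whether miniprot rounds or truncates the result is not specified here, so the value is returned as a float.

```python
import math


def auto_max_intron(ref_len: int) -> float:
    """Max intron size under -I: min(max(3.6*sqrt(refLen), 10000), 300000)."""
    return min(max(3.6 * math.sqrt(ref_len), 10000.0), 300000.0)


if __name__ == "__main__":
    # Small genomes hit the 10 kb floor, very large ones the 300 kb ceiling.
    for ref_len in (1_000_000, 100_000_000, 2_500_000_000, 10_000_000_000):
        print(ref_len, round(auto_max_intron(ref_len)))
```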

Alignment options

-O INT Gap open penalty [11]

-E INT Gap extension penalty [1]. A gap of size g costs {-O}+{-E}*g.

-J INT Intron open penalty [29]

-F INT Penalty for frameshifts or in-frame stop codons [23]

-C FLOAT Weight of splicing penalty [1]. Set to 0 to ignore splicing signals.

-B INT Bonus score for alignment reaching ends of proteins [5]

-j INT Splice model for the target genome: 2=vertebrate/insect, 1=general, 0=none [1]. The vertebrate/insect model considers ‘G|GTR...YYYNYAG|’ as the optimal splicing sequence and penalizes other sequences based on profiles in Sibley et al (2016). According to Irimia and Roy (2008) and Sheth et al (2006), the first ‘G’ in the donor exon and the poly-Y close to the acceptor may not be conserved in some species. The general model takes ‘|GTR...YAG|’ as the optimal sequence. Both models also slightly prefer less frequent splice sites including ‘G|GC...YAG|’ and ‘|AT...AC|’.

--spsc FILE Splice score file []. Each line is TAB-delimited, consisting of contig name, offset of the splice junction, strand (‘+’ or ‘-’), donor or acceptor (‘D’ or ‘A’) and an integer score. The score is added to the donor/acceptor score function. It can be positive or negative and needs to be compatible with the scoring system. This option additionally increases -J and --J2 by 10 unless they are specified on the command line.

--spsc0 INT Splice score for positions not in the --spsc file [-7]. This option has no effect if --spsc is not specified.

--spsc-max INT
Cap splice scores to INT [14].

--io-coef FLOAT
Logarithmic intron length penalty (EXPERIMENTAL) [0.5]
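
The five-column layout of the --spsc file can be checked before it is passed to miniprot. The sketch below is one possible reader, not miniprot's own parser: the record type, the function name and the example rows are hypothetical, blank lines are skipped as an assumption, and the coordinate convention of the offset column is left uninterpreted because it is not stated above.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class SpliceScore(NamedTuple):
    contig: str
    offset: int   # offset of the splice junction, kept as given in the file
    strand: str   # '+' or '-'
    kind: str     # 'D' for donor, 'A' for acceptor
    score: int    # may be positive or negative


def read_spsc(handle: TextIO) -> Iterator[SpliceScore]:
    """Yield validated records from a TAB-delimited splice score file."""
    for lineno, line in enumerate(handle, 1):
        line = line.rstrip("\n")
        if not line:
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        if len(fields) < 5:
            raise ValueError(f"line {lineno}: expected 5 TAB-delimited fields")
        contig, offset, strand, kind, score = fields[:5]
        if strand not in ("+", "-"):
            raise ValueError(f"line {lineno}: bad strand {strand!r}")
        if kind not in ("D", "A"):
            raise ValueError(f"line {lineno}: bad site type {kind!r}")
        yield SpliceScore(contig, int(offset), strand, kind, int(score))


# Hypothetical two-line file: one donor and one acceptor on contig chr1.
example = io.StringIO("chr1\t1000\t+\tD\t5\nchr1\t1532\t+\tA\t-3\n")
records = list(read_spsc(example))
```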

Input/Output options

-t INT Number of threads [4]

--gff Output in the GFF3 format. ‘##PAF’ lines in the output provide detailed alignments.

--gff-only Output in the GFF3 format without ‘##PAF’ lines.

--aln Output the residue alignment in three lines: the ‘##ATN’ line gives the target nucleotide sequence, the ‘##ATA’ line the translated amino acid sequence and the ‘##AQA’ line the query protein sequence. On a ‘##ATA’ line, ‘!’ denotes a frameshift insertion corresponding to the ‘F’ CIGAR operator and ‘$’ denotes a frameshift substitution corresponding to the ‘G’ operator.

--trans Output translated protein sequences on ‘##STA’ lines.

--no-cs Do not output the cs tag

--max-intron-out NUM
In the --aln format, if an intron is longer than NUM, only output ceil(NUM/2) basepairs at the donor or the acceptor sites and write the full intron length LEN as ~LEN~ in the middle [200].

-P STR Prefix for IDs in GFF3 or GTF [MP]. --gff-delim overrides this option.

--gff-delim CHAR
Change the ID field in GFF3 to QueryNameCHARHitIndex []. If not specified, the default ID looks like ‘MP000012’. This option is only applicable to the GFF3 output format.

--gtf Output in the GTF format

-u Print unmapped query proteins

--outn NUM Output up to min{NUM, -N} alignments per query [1000].

--outs FLOAT
Output an alignment only if its score is at least FLOAT*bestScore, where bestScore is the best alignment score of the protein [0.99]

--outc FLOAT
Output an alignment only if at least FLOAT fraction of the query protein is aligned [0.1]

-K NUM Query batch size [2M]
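
The interaction of --outs, --outc and --outn can be pictured as a filter over the hits of one query. The following is a simplified sketch under stated assumptions, not miniprot's implementation: the Hit type and function name are hypothetical, bestScore is taken as the highest score among the supplied hits, and the score and coverage filters are applied before the count limit.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    score: int               # alignment score
    aligned_fraction: float  # fraction of the query protein that is aligned


def select_hits(hits: List[Hit], outn: int = 1000, outs: float = 0.99,
                outc: float = 0.1, n_secondary: int = 30) -> List[Hit]:
    """Apply --outs, --outc and the min{--outn, -N} limit to one query's hits."""
    if not hits:
        return []
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)
    best = ranked[0].score
    kept = [h for h in ranked
            if h.score >= outs * best and h.aligned_fraction >= outc]
    return kept[:min(outn, n_secondary)]


# Hypothetical hits for a single protein.
hits = [Hit(1000, 0.9), Hit(995, 0.8), Hit(900, 0.95), Hit(999, 0.05)]
selected = select_hits(hits)
```

With the defaults, only hits scoring within 1% of the best one survive, so lowering --outs is the usual way to see more alternative alignments in this model.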

OUTPUT FORMAT

The GFF3 Format

Miniprot outputs alignments in the extended Pairwise mApping Format (PAF) by default (see the next subsection). It can also output GFF3 with option --gff. Miniprot may output three feature types: ‘mRNA’, ‘CDS’ and ‘stop_codon’. A stop_codon is only reported if the alignment reaches the C-terminus of the protein and the next codon is a stop codon. Per the GenCode rule, stop_codon is not part of CDS but it is part of mRNA or exon.

Miniprot may output the following attributes in GFF3:

Attribute   Type  Description
ID          str   mRNA identifier
Parent      str   Identifier of the parent feature
Rank        int   Rank among all hits of the query
Identity    real  Fraction of exact amino acid matches
Positive    real  Fraction of positive amino acid matches
Donor       str   2bp at the donor site if not GT
Acceptor    str   2bp at the acceptor site if not AG
Frameshift  int   Number of frameshift events in alignment
StopCodon   int   Number of in-frame stop codons
Target      str   Protein coordinate in alignment
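
The attributes listed above sit in the ninth GFF3 column as ‘key=value’ pairs separated by ‘;’, following the general GFF3 convention. The sketch below splits that column and converts values according to the Type column; the function name and the example attribute string, including the layout of the Target value, are hypothetical.

```python
from typing import Any, Dict
from urllib.parse import unquote

# Types taken from the attribute table; everything else stays a string.
INT_ATTRS = {"Rank", "Frameshift", "StopCodon"}
REAL_ATTRS = {"Identity", "Positive"}


def parse_gff3_attributes(column9: str) -> Dict[str, Any]:
    """Parse the GFF3 attribute column into a dict with typed values."""
    attrs: Dict[str, Any] = {}
    for item in column9.strip().split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        value = unquote(value)  # GFF3 percent-encodes reserved characters
        if key in INT_ATTRS:
            attrs[key] = int(value)
        elif key in REAL_ATTRS:
            attrs[key] = float(value)
        else:
            attrs[key] = value
    return attrs


# Hypothetical attribute column of an mRNA line.
attrs = parse_gff3_attributes(
    "ID=MP000012;Rank=1;Identity=0.9876;Positive=0.9912;Target=prot1 1 350")
```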

The PAF Format

PAF gives detailed alignments. It is a TAB-delimited text format with each line consisting of at least 12 fields, as described in the following table:

Col  Type    Description
1    string  Protein sequence name
2    int     Protein sequence length
3    int     Protein start coordinate (0-based)
4    int     Protein end coordinate (0-based)
5    char    ‘+’ for forward strand; ‘-’ for reverse
6    string  Contig sequence name
7    int     Contig sequence length
8    int     Contig start coordinate on the original strand
9    int     Contig end coordinate on the original strand
10   int     Number of matching nucleotides
11   int     Number of nucleotides in alignment excl. introns
12   int     Mapping quality (0-255 with 255 for missing)

PAF may optionally have additional fields in the SAM-like typed key-value format. Miniprot may output the following tags:

Tag  Type  Description
AS   i     Alignment score from dynamic programming
ms   i     Alignment score excluding introns
np   i     Number of amino acid matches with positive scores
fs   i     Number of frameshifts
st   i     Number of in-frame stop codons
da   i     Distance to the nearest start codon
do   i     Distance to the nearest stop codon
cg   Z     Protein CIGAR
cs   Z     Difference string
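
A PAF line can be split into the 12 mandatory columns plus the typed tags described above. The following is a minimal sketch of such a reader; the record type, field names and the example line are hypothetical, and only the ‘i’ (integer) and ‘f’ (float) tag types are converted, with all other types kept as strings.

```python
from typing import Any, Dict, NamedTuple


class PafRecord(NamedTuple):
    query_name: str
    query_len: int
    query_start: int
    query_end: int
    strand: str
    target_name: str
    target_len: int
    target_start: int
    target_end: int
    n_match: int       # column 10: matching nucleotides
    aln_len: int       # column 11: aligned nucleotides excluding introns
    mapq: int
    tags: Dict[str, Any]


def parse_paf_line(line: str) -> PafRecord:
    """Parse one PAF line into its 12 fixed columns and a dict of tags."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("PAF line has fewer than 12 fields")
    tags: Dict[str, Any] = {}
    for item in f[12:]:
        key, typ, value = item.split(":", 2)  # e.g. AS:i:1800
        tags[key] = int(value) if typ == "i" else float(value) if typ == "f" else value
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4], f[5],
                     int(f[6]), int(f[7]), int(f[8]), int(f[9]), int(f[10]),
                     int(f[11]), tags)


# Hypothetical alignment of a 350-aa protein spanning one intron.
example = ("prot1\t350\t0\t350\t+\tchr1\t248956422\t10000\t12500\t1020\t1050\t0"
           "\tAS:i:1800\tms:i:1750\tnp:i:345\tcg:Z:100M1450N250M")
rec = parse_paf_line(example)
```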

A protein CIGAR consists of the following operators:

Op  Description
nM  Alignment match. Consuming n*3 nucleotides and n amino acids
nI  Insertion. Consuming n amino acids
nD  Deletion. Consuming n*3 nucleotides
nF  Frameshift deletion. Consuming n nucleotides
nG  Frameshift match. Consuming n nucleotides and 1 amino acid
nN  Phase-0 intron. Consuming n nucleotides
nU  Phase-1 intron. Consuming n nucleotides and 1 amino acid
nV  Phase-2 intron. Consuming n nucleotides and 1 amino acid
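
The consumption rules in the operator table can be turned into a small tally of how many nucleotides and amino acids a protein CIGAR covers. The sketch below applies the table literally; the function name and example strings are hypothetical.

```python
import re
from typing import Tuple

CIGAR_RE = re.compile(r"(\d+)([MIDFGNUV])")


def cigar_consumption(cigar: str) -> Tuple[int, int]:
    """Return (nucleotides, amino acids) consumed, per the operator table."""
    nt = aa = 0
    pos = 0
    for m in CIGAR_RE.finditer(cigar):
        if m.start() != pos:
            raise ValueError(f"unexpected text at offset {pos} in {cigar!r}")
        pos = m.end()
        n, op = int(m.group(1)), m.group(2)
        if op == "M":            # match: n*3 nt and n aa
            nt += 3 * n
            aa += n
        elif op == "I":          # insertion: n aa
            aa += n
        elif op == "D":          # deletion: n*3 nt
            nt += 3 * n
        elif op in ("F", "N"):   # frameshift deletion or phase-0 intron: n nt
            nt += n
        else:                    # G, U, V: n nt and 1 aa
            nt += n
            aa += 1
    if pos != len(cigar):
        raise ValueError(f"trailing text at offset {pos} in {cigar!r}")
    return nt, aa


if __name__ == "__main__":
    print(cigar_consumption("100M1450N250M"))
```

For a CIGAR made only of M and N operators, such as the one above, the nucleotide total is the length of the aligned contig interval and the amino acid total is the length of the aligned protein interval.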

The cs tag encodes difference sequences. It consists of a series of operations:

Op  Regex                       Description
:   [0-9]+                      Number of identical amino acids
*   [acgtn]+[A-Z*]              Substitution: ref to query
+   [A-Z]+                      # aa inserted to the reference
-   [acgtn]+                    # nt deleted from the reference
~   [acgtn]{2}[0-9]+[acgtn]{2}  Intron length and splice signal
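
The regular expressions in the table are enough to split a cs string into its operations. The following is a minimal tokenizer built directly from those patterns; the function name and the example string are hypothetical, and the payload of each operation is returned as text without further interpretation.

```python
import re
from typing import List, Tuple

# One pattern per operation, copied from the table above.
CS_OPS = [
    (":", re.compile(r":([0-9]+)")),
    ("*", re.compile(r"\*([acgtn]+[A-Z*])")),
    ("+", re.compile(r"\+([A-Z]+)")),
    ("-", re.compile(r"-([acgtn]+)")),
    ("~", re.compile(r"~([acgtn]{2}[0-9]+[acgtn]{2})")),
]


def tokenize_cs(cs: str) -> List[Tuple[str, str]]:
    """Split a cs string into (operator, payload) pairs."""
    tokens: List[Tuple[str, str]] = []
    pos = 0
    while pos < len(cs):
        for op, pattern in CS_OPS:
            m = pattern.match(cs, pos)
            if m:
                tokens.append((op, m.group(1)))
                pos = m.end()
                break
        else:
            raise ValueError(f"cannot parse cs string at offset {pos}: {cs[pos:]!r}")
    return tokens


# Hypothetical cs string: matches, a substitution, an insertion,
# a deletion and an intron of length 153 with gt...ag splice signal.
tokens = tokenize_cs(":12*gatE+KL-cagtta~gt153ag:7")
```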

LIMITATIONS

* The initial conditions of dynamic programming are not technically correct, which may result in suboptimal residue alignment in rare cases.

* Support for non-splicing alignment needs to be improved.

miniprot-0.17 (r279)

miniprot (1)

15 June 2025
