Manual Reference Pages  - miniprot (1)

NAME

miniprot - protein-to-genome alignment with splicing and frameshifts

CONTENTS

Synopsis
Description
Options
     Indexing options
     Chaining options
     Alignment options
     Input/Output options
Output Format
     The GFF3 Format
     The PAF Format
Limitations

SYNOPSIS

* Indexing a genome (recommended as indexing can be slow and memory hungry):
miniprot [-t nThreads] -d ref.mpi ref.fna

* Aligning proteins to a genome:

miniprot [-t nThreads] ref.mpi protein.faa > output.paf
miniprot [-t nThreads] ref.fna protein.faa > output.paf

DESCRIPTION

Miniprot aligns protein sequences to a genome allowing potential frameshifts and splicing.

OPTIONS

    Indexing options

-k INT K-mer size for genome-wide indexing [6]
-M INT Sample k-mers at a rate 1/2**INT [1]. Increasing this option reduces peak memory but decreases sensitivity.
-L INT Minimum ORF length to index [30]
-b INT Number of bits per bin [8]. Miniprot splits the genome into non-overlapping bins of 2^8 bp in size.
-d FILE Write the index to FILE [].
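
The arithmetic behind -M and -b can be sketched as follows. This is only an illustration of the numbers quoted above (a sampling rate of 1/2**INT and bins of 2^8 bp); the function names are hypothetical, and how miniprot assigns positions to bins internally is not described on this page.

```python
def sampling_rate(m: int = 1) -> float:
    """Fraction of k-mers sampled for a given -M value (1/2**INT)."""
    return 1.0 / (2 ** m)


def bin_index(pos: int, bits: int = 8) -> int:
    """Index of the non-overlapping bin of 2**bits bp holding 0-based pos.

    Assumption: bins start at position 0 and are numbered from 0.
    """
    return pos >> bits


def bin_size(bits: int = 8) -> int:
    """Bin size in bp for a given -b value."""
    return 1 << bits
```

With the defaults, half of the k-mers are sampled and each bin covers 256 bp.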

    Chaining options

-S Disable splicing. It applies ‘-G1k -J1k -e1k’ at the same time.
-c NUM Ignore k-mers occurring NUM times or more [50k]
-G NUM Max intron size [200k]. This option overrides -I.
-I Set max intron size to min(max(3.6*sqrt(refLen),10000),300000) where refLen is the total length of the input genome.
-n NUM Min number of syncmers in a chain [10]
-m NUM Min chaining score [0]
-l INT K-mer size for the second round of chaining [5]
-e NUM Max extension from chain ends for alignment or the second round of chaining [10k]
-p FLOAT Filter out a secondary chain/alignment if its score is less than FLOAT fraction of the best chain score [0.5]
-N NUM Retain at most NUM secondary chains/alignments [30]
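
The formula given for -I can be evaluated directly. The sketch below is a plain transcription of min(max(3.6*sqrt(refLen),10000),300000); the function name is hypothetical, and truncating the result to an integer is an assumption, since this page does not say how the value is rounded.

```python
import math


def max_intron_size(ref_len: int) -> int:
    """Max intron size implied by -I for a genome of ref_len bp.

    Assumption: the result is truncated to an integer.
    """
    return int(min(max(3.6 * math.sqrt(ref_len), 10000), 300000))
```

Under this formula, small genomes are clamped to the 10000 bp floor and very large genomes to the 300000 bp ceiling.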

    Alignment options

-O INT Gap open penalty [11]
-E INT Gap extension penalty [1]. A gap of size g costs {-O}+{-E}*g.
-J INT Intron open penalty [29]
-F INT Penalty for frameshifts or in-frame stop codons [23]
-C FLOAT Weight of splicing penalty [1]. Set to 0 to ignore splicing signals.
-B INT Bonus score for alignment reaching ends of proteins [5]
-j INT Splice model for the target genome: 2=mammal, 1=general, 0=none [1]. The mammal model considers ‘G|GTR...YYYNYAG|’ as the optimal splicing sequence and penalizes other sequences based on profiles in Sibley et al (2016). According to Irimia and Roy (2008) and Sheth et al (2006), the first ‘G’ in the donor exon and the poly-Y close to the acceptor may not be conserved in some species. The general model takes ‘|GTR...YAG|’ as the optimal sequence. Both models also consider less frequent splice sites including ‘G|GC...YAG|’ and ‘|AT...AC|’.
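
As a worked example of the gap cost stated under -E, a gap of size g costs {-O}+{-E}*g. The helper below is a hypothetical illustration using the default penalties listed above; it is not taken from the miniprot source.

```python
def gap_cost(g: int, gap_open: int = 11, gap_ext: int = 1) -> int:
    """Cost of a gap of size g: {-O} + {-E} * g."""
    return gap_open + gap_ext * g
```

With the defaults, a gap of size 3 costs 11 + 1*3 = 14.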

    Input/Output options

-t INT Number of threads [4]
--gff Output in the GFF3 format. ‘##PAF’ lines in the output provide detailed alignments.
--gff-only Output in the GFF3 format without ‘##PAF’ lines.
--aln Output the residue alignment in three lines: the ‘##ATN’ line gives the target nucleotide sequence, ‘##ATA’ the translated amino acid sequence and ‘##AQA’ the query protein sequence. On a ‘##ATA’ line, ‘!’ denotes a frameshift insertion corresponding to the ‘F’ CIGAR operator and ‘$’ denotes a frameshift substitution corresponding to the ‘G’ operator.
--max-intron-out NUM
  In the --aln format, if an intron is longer than NUM, only output ceil(NUM/2) basepairs at the donor or the acceptor sites and write the full intron length LEN as ~LEN~ in the middle [200].
-P STR Prefix for IDs in GFF3 or GTF [MP]. --gff-delim overrides this option.
--gff-delim CHAR
  Change the ID field in GFF3 to QueryNameCHARHitIndex []. If not specified, the default ID looks like ‘MP000012’. This option is only applicable to the GFF3 output format.
--gtf Output in the GTF format
-u Print unmapped query proteins
--outn NUM Output up to min{NUM, -N} alignments per query [1000].
--outs FLOAT
  Output an alignment only if its score is at least FLOAT*bestScore, where bestScore is the best alignment score of the protein [0.99]
--outc FLOAT
  Output an alignment only if FLOAT fraction of the query protein is aligned [0.1]
-K NUM Query batch size [2M]
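
The combined effect of --outn, --outs and --outc can be sketched as a filter over the alignments of one query. Everything below is an assumption made for illustration: the Hit record, the function name, the order in which the filters are applied, and the use of (q_end - q_start) / q_len as the aligned fraction are not specified on this page.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one alignment of a query protein."""
    score: int
    q_start: int
    q_end: int
    q_len: int


def select_hits(hits, outn=1000, outs=0.99, outc=0.1, max_secondary=30):
    """Keep hits passing --outs and --outc, capped at min{--outn, -N}."""
    if not hits:
        return []
    best = max(h.score for h in hits)
    kept = [
        h for h in hits
        if h.score >= outs * best                      # --outs
        and (h.q_end - h.q_start) / h.q_len >= outc    # --outc
    ]
    kept.sort(key=lambda h: h.score, reverse=True)
    return kept[:min(outn, max_secondary)]             # --outn
```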

OUTPUT FORMAT

    The GFF3 Format

Miniprot outputs alignments in the extended Pairwise mApping Format (PAF) by default (see the next subsection). It can also output GFF3 with option --gff. Miniprot may output three features: ‘mRNA’, ‘CDS’ or ‘stop_codon’. Here, a stop_codon is only reported if the alignment reaches the C-terminus of the protein and the next codon is a stop codon. Per the GENCODE rule, stop_codon is not part of CDS but it is part of mRNA or exon.

Miniprot may output the following attributes in GFF3:

Attribute   Type   Description
ID          str    mRNA identifier
Parent      str    Identifier of the parent feature
Rank        int    Rank among all hits of the query
Identity    real   Fraction of exact amino acid matches
Positive    real   Fraction of positive amino acid matches
Donor       str    2bp at the donor site if not GT
Acceptor    str    2bp at the acceptor site if not AG
Frameshift  int    Number of frameshift events in alignment
StopCodon   int    Number of in-frame stop codons
Target      str    Protein coordinate in alignment
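
A minimal sketch of reading these attributes from the ninth column of a GFF3 line, assuming the standard GFF3 ‘key=value;key=value’ encoding. The function name and the type map are hypothetical; the types follow the table above, with ‘real’ read as a Python float.

```python
# Types taken from the attribute table; anything else stays a string.
ATTR_TYPES = {
    "Rank": int,
    "Identity": float,
    "Positive": float,
    "Frameshift": int,
    "StopCodon": int,
}


def parse_gff3_attributes(column9: str) -> dict:
    """Split a GFF3 attribute column into a dict with typed values."""
    attrs = {}
    for field in column9.strip().split(";"):
        if not field:
            continue
        key, _, value = field.partition("=")
        attrs[key] = ATTR_TYPES.get(key, str)(value)
    return attrs
```

For example, a made-up attribute string such as ‘ID=MP000012;Rank=1;Identity=0.9876’ yields a string ID, an integer Rank and a float Identity.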

    The PAF Format

PAF gives detailed alignments. It is a TAB-delimited text format with each line consisting of at least 12 fields, as described in the following table:

Col  Type    Description
1    string  Protein sequence name
2    int     Protein sequence length
3    int     Protein start coordinate (0-based)
4    int     Protein end coordinate (0-based)
5    char    ‘+’ for forward strand; ‘-’ for reverse
6    string  Contig sequence name
7    int     Contig sequence length
8    int     Contig start coordinate on the original strand
9    int     Contig end coordinate on the original strand
10   int     Number of matching nucleotides
11   int     Number of nucleotides in alignment excl. introns
12   int     Mapping quality (0-255 with 255 for missing)

PAF may optionally have additional fields in the SAM-like typed key-value format. Miniprot may output the following tags:

Tag  Type  Description
AS   i     Alignment score from dynamic programming
ms   i     Alignment score excluding introns
np   i     Number of amino acid matches with positive scores
da   i     Distance to the nearest start codon
do   i     Distance to the nearest stop codon
cg   Z     Protein CIGAR
cs   Z     Difference string
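
A minimal sketch of parsing one PAF line into the 12 mandatory fields and the optional tags, assuming the SAM-like ‘TAG:TYPE:VALUE’ layout. The function name and dictionary keys are hypothetical labels chosen for this example.

```python
PAF_FIELDS = [
    ("query_name", str), ("query_len", int),
    ("query_start", int), ("query_end", int),
    ("strand", str),
    ("target_name", str), ("target_len", int),
    ("target_start", int), ("target_end", int),
    ("n_match", int), ("aln_len", int), ("mapq", int),
]


def parse_paf_line(line: str) -> dict:
    """Parse a TAB-delimited PAF line into a dict of typed fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 12:
        raise ValueError("a PAF line needs at least 12 fields")
    rec = {name: conv(col) for (name, conv), col in zip(PAF_FIELDS, cols)}
    tags = {}
    for item in cols[12:]:
        # Split at most twice: a cs value may itself contain ':'.
        tag, typ, value = item.split(":", 2)
        tags[tag] = int(value) if typ == "i" else value
    rec["tags"] = tags
    return rec
```

Integer tags (type ‘i’) are converted to int; string tags (type ‘Z’) such as cg and cs are kept as text for further parsing.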

A protein CIGAR consists of the following operators:

Op  Description
nM  Alignment match. Consuming n*3 nucleotides and n amino acids
nI  Insertion. Consuming n amino acids
nD  Deletion. Consuming n*3 nucleotides
nF  Frameshift deletion. Consuming n nucleotides
nG  Frameshift match. Consuming n nucleotides and 1 amino acid
nN  Phase-0 intron. Consuming n nucleotides
nU  Phase-1 intron. Consuming n nucleotides and 1 amino acid
nV  Phase-2 intron. Consuming n nucleotides and 1 amino acid
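
The consumption rules in the table can be turned into a small tally of how many nucleotides and amino acids a protein CIGAR spans. This is a sketch based only on the table; the function name is hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDFGNUV])")


def cigar_consumption(cigar: str):
    """Return (nucleotides, amino_acids) consumed by a protein CIGAR."""
    if not re.fullmatch(r"(?:\d+[MIDFGNUV])+", cigar):
        raise ValueError("not a valid protein CIGAR: %r" % cigar)
    nt = aa = 0
    for count, op in CIGAR_OP.findall(cigar):
        n = int(count)
        if op == "M":        # n*3 nucleotides and n amino acids
            nt += 3 * n
            aa += n
        elif op == "I":      # n amino acids
            aa += n
        elif op == "D":      # n*3 nucleotides
            nt += 3 * n
        elif op in "FN":     # n nucleotides only
            nt += n
        else:                # G, U, V: n nucleotides and 1 amino acid
            nt += n
            aa += 1
    return nt, aa
```

For instance, ‘10M100N5M2I3D’ consumes 30+100+15+0+9 = 154 nucleotides and 10+5+2 = 17 amino acids.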

The cs tag encodes difference sequences. It consists of a series of operations:

Op  Regex                        Description
:   [0-9]+                       Number of identical amino acids
*   [acgtn]+[A-Z*]               Substitution: ref to query
+   [A-Z]+                       # aa inserted to the reference
-   [acgtn]+                     # nt deleted from the reference
~   [acgtn]{2}[0-9]+[acgtn]{2}   Intron length and splice signal
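
The regular expressions in the table are enough to tokenize a cs string. The sketch below splits a cs value into (operator, argument) pairs; the function name is hypothetical and the sample string in the usage note is made up for illustration.

```python
import re

# One alternative per row of the cs operation table.
CS_TOKEN = re.compile(
    r":[0-9]+"
    r"|\*[acgtn]+[A-Z*]"
    r"|\+[A-Z]+"
    r"|-[acgtn]+"
    r"|~[acgtn]{2}[0-9]+[acgtn]{2}"
)


def parse_cs(cs: str):
    """Split a cs string into a list of (op, argument) tuples."""
    ops = []
    pos = 0
    while pos < len(cs):
        m = CS_TOKEN.match(cs, pos)
        if m is None:
            raise ValueError("unexpected cs content at offset %d" % pos)
        token = m.group(0)
        ops.append((token[0], token[1:]))
        pos = m.end()
    return ops
```

Calling parse_cs on ‘:10*gctA~gt120ag:5’ gives four operations: a run of identical residues, a substitution, an intron and another run of identical residues.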

LIMITATIONS

* The DP alignment score (the AS tag) is not accurate.
* Need to introduce more heuristics for improved accuracy.


miniprot-0.9 (r223) miniprot (1) 9 March 2023