Prokka: rapid prokaryotic genome annotation

Abstract

Summary: The multiplex capability and high yield of current-day DNA-sequencing instruments have made bacterial whole-genome sequencing routine. The subsequent de novo assembly of reads into contigs has been well addressed. The final step of annotating all relevant genomic features on those contigs can be achieved slowly using existing web- and email-based systems, but these are not suitable for sensitive data or for integration into computational pipelines. Here we introduce Prokka, a command line software tool to fully annotate a draft bacterial genome in about 10 min on a typical desktop computer. It produces standards-compliant output files for further analysis or viewing in genome browsers.

Availability and implementation: Prokka is implemented in Perl and is freely available under an open source GPLv2 license from http://vicbioinformatics.com/.

Contact: torsten.seemann@monash.edu

1 INTRODUCTION

Genome annotation is the process of identifying and labeling all the relevant features on a genome sequence (Richardson and Watson, 2012). At minimum, this should include coordinates of predicted coding regions and their putative products, but it is desirable to go beyond this to non-coding RNAs, signal peptides and so on.

There are various online annotation servers (Stewart et al., 2009). The NCBI provides a Prokaryotic Genomes Automatic Annotation Pipeline service via email, with a turn-around time measured in days. RAST is a web server for annotating bacterial and archaeal genomes that provides annotation results in under a day (Aziz et al., 2008), and xBASE2 does the same in a few hours (Chaudhuri et al., 2008). Tools of this kind are valuable, but they are poorly suited to settings where throughput or privacy is critical.

Here we present Prokka, a command line software tool that can be installed on any Unix system. Prokka coordinates a suite of existing software tools to achieve a rich and reliable annotation of bacterial genomic sequences. Where possible, it exploits multiple processing cores, and a typical bacterial genome can be annotated in ∼10 min on a quad-core desktop computer. It is well suited to iterative models of sequence analysis and to integration into genomic software pipelines.

2 DESCRIPTION

2.1 Input

Prokka expects preassembled genomic DNA sequences in FASTA format. Finished sequences without gaps are the ideal input, but it is expected that the typical input will be a set of scaffold sequences produced by de novo assembly software. This sequence file is the only mandatory parameter to the software.
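
As an illustration of how small the required input is, the following Python sketch assembles a Prokka command line in which the FASTA file is the only compulsory argument. The option names `--outdir`, `--prefix` and `--cpus` are taken from Prokka's command-line help rather than from this text, and the file and directory names are hypothetical; check `prokka --help` for the installed version before relying on them.

```python
import shlex
from typing import List, Optional


def build_prokka_command(
    contigs: str,
    outdir: Optional[str] = None,
    prefix: Optional[str] = None,
    cpus: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for a Prokka run.

    Only the FASTA file is required; each option is omitted unless set.
    The flag names are assumptions based on Prokka's help output.
    """
    command = ["prokka"]
    if outdir is not None:
        command += ["--outdir", outdir]
    if prefix is not None:
        command += ["--prefix", prefix]
    if cpus is not None:
        command += ["--cpus", str(cpus)]
    command.append(contigs)  # the one mandatory parameter
    return command


# Hypothetical file and directory names, for illustration only.
minimal = build_prokka_command("contigs.fasta")
full = build_prokka_command(
    "contigs.fasta", outdir="annotation", prefix="sample1", cpus=4
)
print(shlex.join(minimal))
print(shlex.join(full))
```

The list form can be passed directly to `subprocess.run(full, check=True)` on a machine where Prokka is installed, which avoids shell quoting problems with unusual file names.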

2.2 Annotation

Prokka relies on external feature prediction tools to identify the coordinates of genomic features within contigs. These tools are listed in Table 1, and all of them, except for Prodigal, provide coordinates and appropriate labels to describe the feature.

Table 1. Feature prediction tools used by Prokka

Tool (reference)	Features predicted
Prodigal (Hyatt 2010)	Coding sequence (CDS)
RNAmmer (Lagesen et al., 2007)	Ribosomal RNA genes (rRNA)
Aragorn (Laslett and Canback, 2004)	Transfer RNA genes
SignalP (Petersen et al., 2011)	Signal leader peptides
Infernal (Kolbe and Eddy, 2011)	Non-coding RNA

Protein-coding genes are annotated in two stages. Prodigal identifies the coordinates of candidate genes, but does not describe the putative gene product. The traditional way to predict what a gene codes for is to compare it with a large database of known sequences, usually at the protein sequence level, and transfer the annotation of the best significant match.

Prokka uses this method, but in a hierarchical manner, starting with a smaller trustworthy database, moving to medium-sized but domain-specific databases, and finally to curated models of protein families. By default, an e-value threshold of 10⁻⁶ is used with the following series of included databases:

1. An optional user-provided set of annotated proteins. These are expected to be trustworthy curated datasets and will be used as the primary source of annotation. They are searched using BLAST+ blastp (Camacho et al., 2009).
2. All bacterial proteins in UniProt (Apweiler et al., 2004) that have real protein or transcript evidence and are not a fragment. This is ∼16 000 proteins, and typically covers >50% of the core genes in most genomes. BLAST+ is used for the search.
3. All proteins from finished bacterial genomes in RefSeq for a specified genus. This captures domain-specific naming, and the databases vary in size and quality, depending on the popularity of the genus. This step is optional and also uses BLAST+.
4. A series of hidden Markov model profile databases, including Pfam (Punta et al., 2012) and TIGRFAMs (Haft et al., 2013). This is performed using hmmscan from the HMMER 3.1 package (Eddy, 2011).
5. If no matches are found, the protein is labeled as ‘hypothetical protein’.
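
The hierarchical search described above can be sketched in Python as a loop over ordered search stages that stops at the first significant hit. This is a minimal illustration and not Prokka's Perl implementation: the names `annotate_protein` and `lookup_stage`, the toy sequences and product names, and the choice to accept hits with an e-value equal to the threshold are all assumptions, and each stage is a dictionary lookup standing in for a real BLAST+ or hmmscan search.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A search stage takes a protein sequence and returns (product, e_value)
# for its best hit, or None when it finds nothing.
Hit = Tuple[str, float]
SearchStage = Callable[[str], Optional[Hit]]

E_VALUE_THRESHOLD = 1e-6  # default threshold stated in the text


def annotate_protein(
    sequence: str,
    stages: List[SearchStage],
    threshold: float = E_VALUE_THRESHOLD,
) -> str:
    """Return the product of the first stage with a significant best hit."""
    for stage in stages:
        hit = stage(sequence)
        # Assumption: a hit exactly at the threshold counts as significant.
        if hit is not None and hit[1] <= threshold:
            return hit[0]
    return "hypothetical protein"


def lookup_stage(table: Dict[str, Hit]) -> SearchStage:
    """Toy stand-in for a database search: an exact-match lookup."""
    return lambda sequence: table.get(sequence)


# Invented sequences and hits, ordered from most to least trusted source.
curated = lookup_stage({"MKT": ("DNA gyrase subunit A", 1e-50)})
uniprot = lookup_stage({
    "MKT": ("gyrase", 1e-20),
    "MSA": ("weak match", 1e-3),
    "MLL": ("ATP synthase subunit alpha", 1e-30),
})
profiles = lookup_stage({"MSA": ("ABC transporter family protein", 1e-9)})
stages = [curated, uniprot, profiles]

for protein in ["MKT", "MLL", "MSA", "MQQ"]:
    print(protein, "->", annotate_protein(protein, stages))
```

Because the loop returns at the first significant hit, a protein found in the trusted set never reaches the larger databases, which is what gives the hierarchy both its naming priority and its speed benefit.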

2.3 Output

Prokka produces 10 files in the specified output directory, all with a common prefix. These are described in Table 2.

Table 2. Description of Prokka output files

Suffix	Description of file contents
.fna	FASTA file of original input contigs (nucleotide)
.faa	FASTA file of translated coding genes (protein)
.ffn	FASTA file of all genomic features (nucleotide)
.fsa	Contig sequences for submission (nucleotide)
.tbl	Feature table for submission
.sqn	Sequin editable file for submission
.gbk	GenBank file containing sequences and annotations
.gff	GFF v3 file containing sequences and annotations
.log	Log file of Prokka processing output
.txt	Annotation summary statistics
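
As one example of downstream use, the sketch below tallies feature types from a GFF v3 file such as the .gff output. It relies only on the GFF3 conventions of nine tab-separated columns, `#` comment lines and a `##FASTA` directive that introduces embedded sequences; the function name `count_feature_types` and the example records are invented for illustration and are not real Prokka output.

```python
from collections import Counter
from typing import Iterable


def count_feature_types(gff_lines: Iterable[str]) -> Counter:
    """Count the values in column 3 (feature type) of GFF3 feature lines."""
    counts: Counter = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequences follow; no more feature lines
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = line.split("\t")
        if len(columns) == 9:
            counts[columns[2]] += 1
    return counts


# Invented records in GFF3 layout, for illustration only.
example = (
    "##gff-version 3\n"
    "##sequence-region contig1 1 3000\n"
    "contig1\texample\tCDS\t100\t900\t.\t+\t0\tID=demo_00001\n"
    "contig1\texample\tCDS\t1200\t1800\t.\t-\t0\tID=demo_00002\n"
    "contig1\texample\ttRNA\t2000\t2075\t.\t+\t.\tID=demo_00003\n"
    "##FASTA\n"
    ">contig1\n"
    "ACGTACGT\n"
)
feature_counts = count_feature_types(example.splitlines())
print(dict(feature_counts))
```

For a real run, the same function can be applied to an open file handle, for example `count_feature_types(open("sample1.gff"))`, where the file name is hypothetical.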

3 RESULTS

Prokka was designed to be both accurate and fast. To assess accuracy, we compared the annotations of Prokka, RAST and xBASE2 for the highly curated Escherichia coli K-12 genome. All methods were told it was an E. coli genome. Table 3 shows that Prokka produced an overall better annotation than both RAST and xBASE2. This result could vary for less well-studied or draft genomes.

Table 3. Comparison of annotation of E. coli K-12 accession U00096.2

Feature	Reference	Prokka	RAST	xBASE2
Total CDS	4321	4305	4512	4444
Matching start	–	3828	3571	3025
Different start	–	318	533	1052
Missing CDS	–	172	214	241
Extra CDS	–	159	405	367
Hypothetical protein	18	276	638	156
With EC number	1114	1050	1118	0
Total tRNA	89	88	86	88
Total rRNA	22	22	22	22

The bold denotes the best performing tool (column) for that attribute (row). The italics are “subsets” of the “Total CDS” section.
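
To put the CDS rows of Table 3 on a common scale, the following sketch expresses them as fractions of the 4321 reference CDS. It assumes that 'Matching start' and 'Different start' together count the reference genes each tool recovered; that reading, the dictionary keys and the function name `summarize` are not taken from the paper.

```python
REFERENCE_CDS = 4321  # "Total CDS" in the Reference column of Table 3

# CDS rows of Table 3, copied per tool.
TABLE3 = {
    "Prokka": {"matching_start": 3828, "different_start": 318, "missing": 172, "extra": 159},
    "RAST": {"matching_start": 3571, "different_start": 533, "missing": 214, "extra": 405},
    "xBASE2": {"matching_start": 3025, "different_start": 1052, "missing": 241, "extra": 367},
}


def summarize(counts: dict, reference_total: int = REFERENCE_CDS) -> dict:
    """Express recovered CDS as fractions of the reference CDS count."""
    # Assumption: a reference gene is "found" if it was predicted with
    # either the same or a different start coordinate.
    found = counts["matching_start"] + counts["different_start"]
    return {
        "exact_start_fraction": counts["matching_start"] / reference_total,
        "found_fraction": found / reference_total,
    }


for tool, counts in TABLE3.items():
    summary = summarize(counts)
    print(
        f"{tool}: exact start {summary['exact_start_fraction']:.1%}, "
        f"found {summary['found_fraction']:.1%}"
    )
```

On this reading the three tools recover a similar share of the reference genes, and the larger difference lies in how often the predicted start coordinate agrees with the reference.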

Prokka uses parallel processing to decrease running time on multicore computers. The most time-consuming steps are BLAST+ and hmmscan, which both support multiple CPUs natively. However, Prokka is more efficient if it runs multiple single-CPU threads on subsets of the data, which it achieves using GNU parallel (Tange, 2011). Experiments on our 64-core AMD Opteron server on single genomes show linear speedup with up to eight cores and sublinear gain thereafter. However, for much larger bacterial metagenome datasets, linear speedup is observed for many more CPUs. Annotating the E. coli K-12 genome on a typical quad-core desktop computer takes about 6 min.
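
The strategy of running several single-CPU jobs on subsets of the data can be illustrated with the Python sketch below. It uses the standard-library `concurrent.futures` module in place of GNU parallel, and the worker `fake_search` is a placeholder that returns sequence lengths where a real pipeline would launch a single-threaded BLAST+ or hmmscan process; all function names and sequences are invented.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def split_into_chunks(items: Sequence[T], n_chunks: int) -> List[List[T]]:
    """Split items into at most n_chunks contiguous, near-equal parts."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, remainder = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for index in range(n_chunks):
        # The first `remainder` chunks take one extra item each.
        end = start + size + (1 if index < remainder else 0)
        chunks.append(list(items[start:end]))
        start = end
    return chunks


def run_in_parallel(
    items: Sequence[T],
    worker: Callable[[List[T]], List[R]],
    n_workers: int,
) -> List[R]:
    """Run one single-threaded worker per chunk and merge results in order."""
    chunks = split_into_chunks(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results_per_chunk = list(pool.map(worker, chunks))  # keeps chunk order
    return [result for chunk in results_per_chunk for result in chunk]


def fake_search(proteins: List[str]) -> List[int]:
    """Placeholder for a single-threaded database search on one chunk."""
    return [len(protein) for protein in proteins]


proteins = ["MKT", "MSAA", "ML", "MQQQQ", "M"]
print(split_into_chunks(proteins, 2))
print(run_in_parallel(proteins, fake_search, n_workers=2))
```

Threads are adequate here only under the assumption that each worker spends its time waiting on an external program; for CPU-bound work done in Python itself, a process pool would be the more usual choice.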

ACKNOWLEDGEMENTS

The author thanks Dieter Bulach, Simon Gladman, Tim Stinear, Connor Skennerton, Scott Chandry, David Powell, Adam Caldwell, Roderick Felsheim, John Nash, Nick Loman, Heikki Lehvaslaiho, Bastien Chevreux, Nicola Soranzo, Jon Graf, Harald Gruber-Vodicka, Haruo Suzuki, Geoff Winsor, Lionel Guy, Andrew Page and Ole Tange for suggestions and bug reports.

Funding: This research was supported in part by the Victorian Life Sciences Computation Initiative, an initiative of the Victorian Government hosted by the University of Melbourne, Australia.

Conflicts of Interest: none declared.

REFERENCES

Aziz et al. The RAST Server: rapid annotations using subsystems technology. BMC Genomics, 2008.
