MyPro: A seamless pipeline for automated prokaryotic genome assembly and annotation
MyPro is a software pipeline for high-quality prokaryotic genome assembly and annotation. It was validated on 18 oral streptococcal strains to produce submission-ready, annotated draft genomes. Installed as a virtual machine and supported by updated databases, MyPro will enable biologists to perform high-quality prokaryotic genome assembly and annotation with ease.
The recent decrease in the cost of whole genome sequencing (WGS) technology has resulted in an increase in sequencing of various prokaryotic microbes. A typical genomics project requires processing of genome reads prior to data mining (Hasman et al., 2014; Rhoads et al., 2014). This process may include quality control checks and pre-processing measures, de novo assembly and/or reference-based assembly, automated annotation with or without manual improvement, and improvement of genome quality by gap filling or similar methods. A variety of software is currently available for accomplishing genome assembly, annotation and improvement (Koren et al., 2014; Magoc et al., 2013; Seemann, 2014; Swain et al., 2012). However, individual scientists and smaller laboratories without adequate bioinformatics training and support may be unable to run genome informatics tools well enough to achieve meaningful results (Nocq et al., 2013).
Here, we introduce MyPro, a user-friendly genomics software pipeline for prokaryotic genomes that requires minimal programming skills. The pipeline consists of quality control and pre-processing tools, de novo assemblers, contig integration software, and tools for annotation and reference-based assembly.
To facilitate ease of use, MyPro is designed for automated de novo assembly, assembly integration and genome annotation (AutoRun.py). It consists of three main modules: Assemble, Integrate and Annotate (Fig. 1). Five commonly used assemblers including VelvetOptimiser 2.2.5 (Zerbino and Birney, 2008), Edena V3.131028 (Hernandez et al., 2008), Abyss 1.5.2 (Simpson et al., 2009), SOAPdenovo 2.01 (Luo et al., 2012) and SPAdes 3.1.1 (Bankevich et al., 2012) have been installed and configured on BioLinux 8 (Field et al., 2006), which enables the production of multiple assemblies within the Assemble module in MyPro (Assemble.py).
Most of the five assemblers use de Bruijn graph-based algorithms, which are highly dependent on the k-mer parameter. After sequencing data are input in Fastq format, VelvetOptimiser is used first to perform a parameter sweep of k-mers ranging from 0.6 to 0.9 of the read length. The k-mer giving the highest N50 contig length is chosen automatically, and this k-mer is used to produce the Abyss and SOAPdenovo assemblies. For paired reads, VelvetOptimiser estimates the insert size and its standard deviation, which are saved in MyPro for later use when desired. The SPAdes assembly depends on the read length of the input reads: k-mer lengths of 21, 33 and 55 are used for read lengths <150 bp, k-mer lengths of 21, 33, 55 and 77 for read lengths between 150 and 250 bp, and k-mer lengths of 21, 33, 55, 77, 99 and 127 for read lengths ≥250 bp. Edena uses an alternative assembly approach based on overlap layout, which requires a minimum overlap size (m) instead of a k-mer. The value of m is initially set to half of the read length and is gradually increased to the read length. The m giving the highest N50 contig length is chosen automatically, and the corresponding assembly is kept as the Edena assembly.
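The parameter choices described above can be sketched in a few lines of Python. This is an illustrative sketch only, not MyPro's actual code: the function names are hypothetical, the handling of the 150 bp boundary, the snapping of the sweep bounds to odd values (Velvet works with odd k-mers) and the step size for the Edena overlap sweep are all assumptions that the text does not specify.

```python
from typing import List, Tuple


def velvet_sweep_bounds(read_length: int) -> Tuple[int, int]:
    """Return (start, end) k-mer bounds covering 0.6 to 0.9 of the read length."""
    # Integer arithmetic avoids floating-point rounding surprises.
    start = read_length * 6 // 10
    end = read_length * 9 // 10
    # Assumption: bounds are snapped inward to odd values, as Velvet uses odd k-mers.
    if start % 2 == 0:
        start += 1
    if end % 2 == 0:
        end -= 1
    return start, end


def spades_kmers(read_length: int) -> List[int]:
    """Pick the SPAdes k-mer list from the read length, as described in the text."""
    if read_length < 150:
        return [21, 33, 55]
    # Assumption: a read length of exactly 150 bp falls in the middle tier.
    if read_length < 250:
        return [21, 33, 55, 77]
    return [21, 33, 55, 77, 99, 127]


def edena_overlaps(read_length: int, step: int = 10) -> List[int]:
    """Candidate minimum overlap sizes m, from half the read length upward."""
    # Assumption: the step size is hypothetical; the text only says "gradually".
    return list(range(read_length // 2, read_length + 1, step))


if __name__ == "__main__":
    print(velvet_sweep_bounds(100))   # (61, 89)
    print(spades_kmers(200))          # [21, 33, 55, 77]
    print(edena_overlaps(100, 25))    # [50, 75, 100]
```

In such a scheme, each candidate value would be passed to the corresponding assembler and the assembly with the highest N50 retained.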
Once multiple assemblies have been obtained, input reads are aligned using SOAP2 (Li et al., 2009) with the pre-calculated read information if desired (e.g., minimal and maximal insert size for paired reads). The alignment rate is used to evaluate the assemblies. To provide high-quality contigs and remove possible contaminants, contigs with fewer than 100 aligned reads are removed, and basic assembly statistics, such as the number of contigs, length of the longest contig, N50 value and whole genome size, are reported at the end of the Assemble process. MyPro (Integrate.py) then improves on the individual assemblies by integrating the assemblies produced by the Assemble module using CISA 1.3 (Lin and Liao, 2013). In the automated pipeline, the worst assembly based on assembly statistics is removed prior to executing the Integrate module. Users are able to select among the five assemblies or add other assemblies to the integration.
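As a rough illustration of the filtering and reporting step, the following Python sketch drops contigs with fewer than 100 aligned reads and computes the statistics named above. It is not taken from MyPro: the function names and the dictionary-based inputs (contig name mapped to length, contig name mapped to aligned read count) are hypothetical, and N50 is computed with its standard definition.

```python
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Length of the contig at which the running sum reaches half the total size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def filter_contigs(
    lengths: Dict[str, int],
    aligned_reads: Dict[str, int],
    min_reads: int = 100,
) -> Dict[str, int]:
    """Keep contigs supported by at least min_reads aligned reads."""
    return {
        name: length
        for name, length in lengths.items()
        if aligned_reads.get(name, 0) >= min_reads
    }


def assembly_stats(lengths: Dict[str, int]) -> Dict[str, int]:
    """Basic statistics reported for an assembly."""
    sizes = list(lengths.values())
    return {
        "contigs": len(sizes),
        "longest": max(sizes, default=0),
        "n50": n50(sizes),
        "total_size": sum(sizes),
    }


if __name__ == "__main__":
    # Toy values for illustration only.
    contig_lengths = {"c1": 100, "c2": 80, "c3": 60, "c4": 40, "c5": 20}
    read_counts = {"c1": 500, "c2": 300, "c3": 150, "c4": 100, "c5": 12}
    kept = filter_contigs(contig_lengths, read_counts)
    print(assembly_stats(kept))
```

With the toy values above, contig c5 is removed because only 12 reads align to it, and the statistics are computed on the remaining four contigs.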
In the annotation module, we have significantly enhanced the limited core databases provided in Prokka (Seemann, 2014) to implement rapid annotation using the high-quality reference genome database (Tatusova et al., 2013) and an up-to-date Swiss-Prot database. MyPro (Annotate.py) annotates the CISA assembly automatically, but users are able to annotate any other assembly placed in the specified folders (Assemble or Integrate).
In addition to the three main modules, MyPro provides functionality for pre-processing, assembly exploration and post-assembly. In pre-processing, given a specified genome size, MyPro (Preprocess.py) performs quality trimming and sub-samples input reads to a desired depth of coverage (default 100×). For assembly exploration, MyPro outputs aligned reads in BAM format, enabling visual inspection of the assembly with Tablet (Milne et al., 2012) for quality assurance. If a closely related reference genome is available, r2cat (Husemann and Stoye, 2010) is employed to arrange contigs against the related reference, producing ordered and unmatched contigs. MyPro (Postassemble.py) utilizes both sets of contigs along with the reads for post-assembly. Contigs that are adjacent in the reference-guided order and overlap by ≥15 bp are merged. Reads that SOAP2 cannot align to the merged contigs are assembled by Edena using a range of m values, and those local assemblies are then used to bridge pairs of contigs.
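The sub-sampling step rests on simple arithmetic: the depth of coverage is the total number of sequenced bases divided by the genome size. The Python sketch below shows one way to derive the fraction of reads to keep for a target depth. It is an assumption-laden illustration rather than the code in Preprocess.py: the function names are hypothetical, reads are assumed to have a uniform length, and random sampling of read indices is only one possible sampling strategy.

```python
import random
from typing import List, Sequence


def estimated_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Depth of coverage = total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


def subsample_fraction(
    n_reads: int,
    read_length: int,
    genome_size: int,
    target_depth: float = 100.0,
) -> float:
    """Fraction of reads to keep so that coverage does not exceed target_depth."""
    coverage = estimated_coverage(n_reads, read_length, genome_size)
    if coverage <= target_depth:
        return 1.0  # already at or below the target; keep everything
    return target_depth / coverage


def subsample_reads(reads: Sequence[str], fraction: float, seed: int = 0) -> List[str]:
    """Randomly keep a fraction of the reads, preserving their original order."""
    rng = random.Random(seed)
    keep = round(len(reads) * fraction)
    chosen = sorted(rng.sample(range(len(reads)), keep))
    return [reads[i] for i in chosen]


if __name__ == "__main__":
    # 4,000,000 reads of 100 bp on a 2 Mb genome give 200x coverage,
    # so half of the reads are kept to reach the default 100x.
    print(estimated_coverage(4_000_000, 100, 2_000_000))  # 200.0
    print(subsample_fraction(4_000_000, 100, 2_000_000))  # 0.5
```

For paired reads, the same indices would need to be applied to both files so that mates stay together.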
To validate MyPro, we performed genome assembly on three bacterial species chosen to span a wide range of GC content: GC-poor Staphylococcus aureus (33% GC), GC-rich Rhodobacter sphaeroides 2.4.1 (69% GC) and Escherichia coli MG1655 (51% GC). The publicly available sequencing reads were downloaded and analyzed by MyPro (see Supplementary data for details). The assembly results evaluated by QUAST 2.3 (Gurevich et al., 2013) demonstrate that MyPro produced high-quality annotated assemblies in terms of contiguity of assembly and annotation precision for coding sequences (Supplementary data). QUAST was also used to evaluate MyPro-produced assemblies for eight recently sequenced oral streptococcal strains that were post-assembled with available reference genomes (Table 1). For these strains, contigs shorter than 500 bp were discarded to maintain data quality. Assemblies were also evaluated for ten other oral streptococcal strains that were de novo assembled by MyPro (AutoRun.py) (Table 2). For both the oral streptococcal strains with available reference genomes and the strains with no reference genome (Tables 1 and 2), MyPro consistently performed better than de novo assemblers and contig integration software (CISA) in terms of N50 value or number of contigs. We believe MyPro will be a useful bioinformatics tool for assembly, annotation and improvement of prokaryotic genomes for biomedical researchers. The software pipeline, components and user instructions are available for download at http://sourceforge.net/projects/sb2nhri/files/MyPro/.
This research was supported by grants from the National Health Research Institutes (PH-104-PP-05 and PH-104-SP-03), Ministry of Science and Technology, Taiwan (103-2320-B-400-001) and the National Institute of Dental and Craniofacial Research, U.S.A. (R01DE022673).
Appendix A. Supplementary data
- Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19:455–477.
- Field D, Tiwari B, Booth T, Houten S, Swan D, Bertrand N, Thurston M. Open software for biologists: from famine to feast. Nat Biotechnol. 2006;24:801–804.
- Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29:1072–1075.
- Hasman H, Saputra D, Sicheritz-Ponten T, Lund O, Svendsen CA, Frimodt-Møller N, Aarestrup FM. Rapid whole genome sequencing for the detection and characterization of microorganisms directly from clinical samples. J Clin Microbiol. 2014;52:139–146. http://dx.doi.org/10.1128/JCM.02452-13.
- Hernandez D, François P, Farinelli L, Østerås M, Schrenzel J. De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer. Genome Res. 2008;18:802–809.
- Husemann P, Stoye J. r2cat: synteny plots and comparative assembly. Bioinformatics. 2010;26:570–571.
- Koren S, Treangen TJ, Hill CM, Pop M, Phillippy AM. Automated ensemble assembly and validation of microbial genomes. BMC Bioinf. 2014;15:126.
- Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics. 2009;25:1966–1967.
- Lin SH, Liao YC. CISA: contig integrator for sequence assembly of bacterial genomes. PLoS One. 2013;8:e60843.
- Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience. 2012;1:18.
- Magoc T, Pabinger S, Canzar S, Liu X, Su Q, Puiu D, Tallon LJ, Salzberg SL. GAGE-B: an evaluation of genome assemblers for bacterial organisms. Bioinformatics. 2013;29:1718–1725.
- Milne I, Stephen G, Bayer M, Cock PJ, Pritchard L, Cardle L, Shaw PD, Marshall D. Using Tablet for visual exploration of second-generation sequencing data. Brief Bioinform. 2012 bbs012.
- Nocq J, Celton M, Gendron P, Lemieux S, Wilhelm BT. Harnessing virtual machines to simplify next-generation DNA sequencing analysis. Bioinformatics. 2013;29:2075–2083.
- Rhoads DD, Sintchenko V, Rauch CA, Pantanowitz L. Clinical microbiology informatics. Clin Microbiol Rev. 2014;27:1025–1047.
- Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014;30:2068–2069.
- Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19:1117–1123.
- Swain MT, Tsai IJ, Assefa SA, Newbold C, Berriman M, Otto TD. A post-assembly genome-improvement toolkit (PAGIT) to obtain annotated genomes from contigs. Nat Protoc. 2012;7:1260–1284.
- Tatusova T, Ciufo S, Fedorov B, O’Neill K, Tolstoy I. RefSeq microbial genomes database: new representation and annotation strategy. Nucleic Acids Res. 2013 gkt1274.
- Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18:821–829.