In the mid-1970s, a geographic clustering of an unusual rheumatoid arthritis-like condition was reported in Connecticut1. That cluster of cases focused attention on the syndrome that is now called Lyme disease. It was subsequently realized that a similar disorder had been known in Europe since the beginning of this century. Lyme disease is characterized by some or all of the following manifestations: an initial erythematous annular rash, 'flu-like symptoms, neurological complications, and arthritis in about 50% of untreated patients2. In the United States, the disease occurs primarily in northeastern and midwestern states, and in western parts of California and Oregon. These regions coincide with the ranges of various species of Ixodes ticks, the primary vector of Lyme disease. Lyme disease is now the most common tick-transmitted illness in the United States, and has been reported in many temperate parts of the Northern Hemisphere.
It was not until the early 1980s that a new spirochaete, Borrelia burgdorferi3, was isolated and cultured from the midgut of Ixodes ticks, and subsequently from patients with Lyme disease4,5. Analysis of genetic diversity among individual Borrelia isolates has defined a closely related cluster containing at least 10 tick-borne species of Lyme disease agents, called 'B. burgdorferi (sensu lato)'. B. burgdorferi resembles most other spirochaetes in that it is a highly specialized, motile, two-membrane, spiral-shaped bacterium that lives primarily as an extracellular pathogen. Borrelia is fastidious and difficult to culture in vitro, requiring specially enriched media and low oxygen tension6.
One of the most striking features of B. burgdorferi is its unusual genome, which includes a linear chromosome approximately one megabase in size7, 8, 9, 10 and numerous linear and circular plasmids11, 12, 13 with some isolates containing up to 20 different plasmids. The plasmids have a copy number of approximately one per chromosome10,14, and different plasmids often appear to share regions of homologous DNA13, 15, 16. Long-term culture of B. burgdorferi results in the loss of some plasmids, changes in protein expression profiles, and a loss in the ability of the organism to infect laboratory animals, suggesting that the plasmids encode important proteins involved in virulence17, 18, 19.
Because of its importance as a pathogen of humans and animals, and the value of complete genome sequence information for understanding its life cycle and advancing drug and vaccine development, we sequenced the genome of B. burgdorferi type strain (B31), using the random sequencing method previously described20, 21, 22, 23, 24. Here we summarize the results from sequencing, assembly and analysis of the linear chromosome and 11 plasmids.
The linear chromosome of B. burgdorferi has 910,725 base pairs (bp) and an average G+C content of 28.6%. Base pair one represents the first double-stranded base pair that we observed at the left telomere. Previous genome characterizations agree with the nucleotide sequence of the large chromosome10,25, 26, 27, 28. The 853 predicted coding sequences (open reading frames; ORFs) have an average size of 992 bp, similar to that observed in other prokaryotic genomes, with 93% of the B. burgdorferi genome representing coding sequence. Biological roles were assigned to 59% of the 853 ORFs using the classification scheme adapted from Riley29 (Fig. 1), 12% of ORFs matched hypothetical coding sequences of unknown function from other organisms, and 29% were new genes. The average relative molecular mass (Mr) of the chromosome-encoded proteins in B. burgdorferi is 37,529 ranging from 3,369 to 254,242, values similar to those observed in other bacteria including Haemophilus influenzae20 and Mycoplasma genitalium21. The median isoelectric point (pI) for all predicted proteins is 9.7.
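The coding-density figure quoted above can be approximated from the ORF count and mean ORF length alone. The following is a minimal sketch that assumes ORFs do not overlap appreciably (overlaps would lower the true value slightly); the function name is illustrative and not part of the annotation pipeline.

```python
def coding_fraction(n_orfs: int, mean_orf_bp: float, replicon_bp: int) -> float:
    """Approximate fraction of a replicon covered by ORFs, ignoring any overlaps."""
    return n_orfs * mean_orf_bp / replicon_bp


# Values quoted in the text for the linear chromosome.
chromosome_fraction = coding_fraction(853, 992, 910_725)
print(f"Estimated coding fraction: {chromosome_fraction:.1%}")
```

The estimate agrees with the reported 93% to within rounding, which suggests that the three quoted numbers are mutually consistent.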
Analysis of codon usage in B. burgdorferi reveals that all 61 triplet codons are used. When both AU- and GC-containing codons specify a single amino acid, there is a marked bias (from 2-fold to more than 20-fold, depending on the amino acid) in the use of AU-rich codons. The most frequently used codons are AAA (Lys, 8.1%), AAU (Asn, 5.9%), AUU (Ile, 5.9%), UUU (Phe, 5.7%), GAA (Glu, 5.0%), GAU (Asp, 4.2%) and UUA (Leu, 4.2%). The most common amino acids are Ile (10.6%), Leu (10.3%), Lys (10.2%), Ser (7.8%) and Asn (7.2%). The high value for Lys is in agreement with the median calculated isoelectric point of 9.7.
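A codon-usage tabulation of this kind can be sketched as follows. This is an illustrative outline, not the procedure used for the analysis above: it assumes each input is a single in-frame coding sequence, and the function names and the toy sequence are hypothetical.

```python
from collections import Counter

STOP_CODONS = {"UAA", "UAG", "UGA"}


def codon_usage(cds: str) -> dict[str, float]:
    """Relative frequency of each sense codon in one in-frame coding sequence."""
    rna = cds.upper().replace("T", "U")
    usable = len(rna) - len(rna) % 3  # drop any trailing partial codon
    codons = (rna[i:i + 3] for i in range(0, usable, 3))
    counts = Counter(c for c in codons if c not in STOP_CODONS)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}


def au_fraction(codon: str) -> float:
    """Fraction of positions in a codon occupied by A or U."""
    return sum(base in "AU" for base in codon) / 3


# Toy coding sequence (hypothetical, for illustration only).
toy_cds = "".join(["ATG", "AAA", "AAA", "AAT", "TTT", "GAA", "TAA"])
usage = codon_usage(toy_cds)
for codon, freq in sorted(usage.items(), key=lambda kv: -kv[1]):
    print(codon, f"{freq:.2f}", f"AU fraction {au_fraction(codon):.2f}")
```

Summing such tables over all predicted ORFs, and comparing synonymous codons for each amino acid, would expose the kind of AU-rich bias described above.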
Analysis of the nucleotide sequence and Southern analyses on B. burgdorferi DNA indicate that, in addition to the large linear chromosome, isolate B31 contains linear plasmids of the following approximate sizes: 56 kilobase pairs (kbp) (lp56), 54 kbp (lp54), four plasmids of 28 kbp (lp28-1, lp28-2, lp28-3 and lp28-4), 38 kbp (lp38), 36 kbp (lp36), 25 kbp (lp25) and 17 kbp (lp17); and circular plasmids of the following sizes: 9 kbp (cp9), 26 kbp (cp26) and five or six homologous plasmids of 32 kbp (cp32). These include all of the plasmids previously identified in this strain, but comparisons with other B31 cultures suggest that this isolate may have lost one 21 kbp linear and one or two 32 kbp circular plasmids during growth in culture since its original isolation11, 12, 13, 14,19,30. The sequences of all plasmids were assembled as part of this project. However, the assembled sequences of the cp32 and related lp56 plasmids could not be determined with a high degree of confidence because of DNA sequence similarity among them (99% in several regions of 3,000–5,000 bp per plasmid)13,16 (Table 1). Improved assembly strategies are being tested to achieve closure on these plasmids (G. Sutton, unpublished). Plasmid lp17 is identical to lp16.9 from Barbour et al.15.
The 11 plasmids we have described contain a total of 430 putative ORFs with an average size of 507 bp; plasmid G+C content ranges from 23.1% to 32.3%. Only 71% of plasmid DNA represents predicted coding sequences, a value significantly lower than that on the chromosome. This indicates that average intergenic distances are greater in the plasmids than in the chromosome, and that many potential ORFs contain authentic frameshifts or stops (see E29, for example), suggesting that they are decaying genes not encoding functional proteins. Of the 430 plasmid ORFs, only 70 (16%) could be identified and these include membrane proteins such as OspA-D, decorin-binding proteins, the VlsE lipoprotein recombination cassette, and the purine ribonucleotide biosynthetic enzymes GuaA and GuaB. We found that 100 ORFs (23%) match other hypothetical proteins from plasmids in this and related strains of B. burgdorferi15,16,31; 10 ORFs (2.3%) match hypothetical proteins from species other than Borrelia; and 250 ORFs (58%) have no database match.
We found that 47 paralogous gene families containing from 2 to 12 members account for 39% (169 ORFs) of the plasmid-encoded genes with no known biological role (Fig. 1). Paralogue families 32 and 50, typified by previously identified B. burgdorferi plasmid genes cp32 orfC and cp8.3 orf2, respectively, have some similarities to proteins involved in replication, segregation and control of copy number in other bacterial systems31. Previous studies have reported examples of plasmid gene duplication, but the extent of this redundancy has become even more apparent with the complete sequence of these 11 plasmids from isolate B31. Moreover, a preliminary search of 221 putative ORFs from the cp32s and lp56 indicates that at least 50% display 70% amino-acid similarity to ORFs from the other 11 plasmids presented here (data not shown). Although plasmid-encoded genes have been implicated in infectivity and virulence17, 18, 19, the biological roles of most of these genes are not known. The significance of the large number of paralogous plasmid-encoded genes is not understood. These proteins may be expressed differentially in tick and mammalian hosts, or may undergo homologous recombination to generate antigenic variation in surface proteins. This hypothesis is supported by the identification of 63 plasmid-encoded putative membrane lipoproteins (Fig. 1).
Several copies of a putative recombinase/transposase similar to IS891-like transposases were identified in the B. burgdorferi plasmids. Linear plasmid 28-2 contains one full-length copy of this gene. Although no inverted repeats were found on either side of the transposase, there is a putative ribosome-binding site several nucleotides upstream of the apparent start codon, and a stem–loop structure (−27 kcal mol⁻¹) 195 bp downstream of the stop codon in an area with no ORFs. This transposase might represent a functional gene important for the frequent DNA rearrangements that presumably occur in Borrelia plasmids. Elsewhere in the genome there are other partial or nearly complete copies of the transposase gene that contain frame-destroying mutations: two copies on lp17, one on lp36, one on lp38, one on lp28-3, two on lp28-1, and one near the right end of the large linear chromosome.
Origin of replication
The replication mechanism for the linear chromosome and plasmids in B. burgdorferi is not yet known. Replication possibly begins at the termini, as has been proposed for the poxvirus hairpin telomeres32, or may begin from a single origin somewhere along the length of the linear replicon. Of the genes on the linear chromosome, 66% are transcribed away from the centre of the chromosome (Fig. 1), similar to the transcriptional bias observed for the genomes of M. genitalium21 and M. pneumoniae33. It has been suggested that bacterial genes are optimally transcribed in the same direction as that in which replication forks pass over them, particularly for highly transcribed genes34,35.
Given the transcriptional bias observed in B. burgdorferi, it seems likely that the origin of replication is near the centre of the chromosome. Because bacterial chromosomal replication origins are usually near dnaA36, it is intriguing to note that this gene (BB437) lies almost exactly at the centre of the linear B. burgdorferi chromosome10,27. A centrally initiated, bi-directional replication fork would be equidistant from the two chromosome ends, and replication would traverse the rRNA genes in the same direction as transcription.
An analysis of GC skew, (G - C)/(G + C) calculated in 10-kilobase (kb) windows across the chromosome, shows a clear break at the putative origin of replication. The GC-skew values are uniformly negative from 0 to 450 kb (minus strand), and uniformly positive (plus strand) from 450 kb to the end of the chromosome (Fig. 2). Additional evidence for the location of the origin of replication comes from our discovery of an octamer, TTGTTTTT, whose skewed distribution in the plus versus the minus strand of the chromosome matches the GC skew (Fig. 2). The biological significance of this octamer has not yet been determined, although it may be analogous to the Chi site in Escherichia coli that is implicated in recBCD mediated recombination. No GC skew was observed in any of the plasmids, although the heptamer ATTTTTT displays a skewed distribution in the plus versus the minus strand of lp28-4 that changes at the approximate midpoint of the plasmid (not shown).
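The two analyses described above, windowed GC skew and the strand distribution of an oligomer, can be sketched in a few lines. This is a minimal illustration under stated assumptions: windows are consecutive and non-overlapping, the sequence is supplied as a plain string for the plus strand, and the function names are hypothetical rather than taken from the software used in this work.

```python
def gc_skew(seq: str, window: int = 10_000) -> list[float]:
    """(G - C)/(G + C) for consecutive, non-overlapping windows of the plus strand."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # A window with no G or C has undefined skew; report 0.0 by convention.
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def strand_positions(seq: str, motif: str = "TTGTTTTT") -> tuple[list[int], list[int]]:
    """0-based plus-strand coordinates of motif hits on the plus and minus strands."""
    seq = seq.upper()

    def find_all(pattern: str) -> list[int]:
        hits, i = [], seq.find(pattern)
        while i != -1:
            hits.append(i)
            i = seq.find(pattern, i + 1)  # step by one so overlapping hits are kept
        return hits

    # A minus-strand occurrence appears in the plus-strand text as the reverse complement.
    return find_all(motif.upper()), find_all(reverse_complement(motif))


# Toy input (hypothetical): a C-rich half followed by a G-rich half.
toy = "C" * 20 + "G" * 20
print(gc_skew(toy, window=20))
plus_hits, minus_hits = strand_positions("AAAAACAA" + "TTGTTTTT")
print(plus_hits, minus_hits)
```

Plotting the skew values against window position, alongside the plus-strand and minus-strand hit coordinates, would give a picture of the same kind as the panels of Fig. 2.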
Figure 2: Distribution of TTGTTTTT and GC skew in the B. burgdorferi chromosome. Top, distribution of the octamer TTGTTTTT.
The lines in the top panel represent the location of this octamer in the plus strand of the sequence, and those in the second panel represent the location of this oligomer in the minus strand of the sequence. Bottom, GC skew.
Transcription and translation
Genes encoding the three subunits (α, β, β′) of the core RNA polymerase were identified in B. burgdorferi along with σ70 and two alternative σ factors, σ54 and rpoS. The role and specificity of each of these factors in transcription regulation in B. burgdorferi are not known. The nusA, nusB and rho genes, which are involved in transcription elongation and termination, were also identified.
A region of the genome with a significantly higher G+C content (43%), located between nucleotides 434,000 and 447,000, contains the rRNA operon. As previously reported, the rRNA operon in B. burgdorferi contains a 16S rRNA–Ala-tRNA–Ile-tRNA–23S rRNA–5S rRNA–23S rRNA–5S rRNA37,38. All of the genes are present in the same orientation, except for that encoding Ile tRNA. Four unrelated genes, encoding 3-methyladenine glycosylase, hydrolyase and two with no database match, are also present in the rRNA operon. Three of these genes are transcribed in the same direction as the rRNAs.
We identified in the chromosome 31 tRNAs with specificity for all 20 amino acids (Fig. 1). These are organized into 7 clusters plus 13 single genes. All tRNA synthetases are present except glutaminyl-tRNA synthetase. A single glutamyl-tRNA synthetase probably aminoacylates both tRNAGlu and tRNAGln with glutamate followed by transamidation by Glu-tRNA amidotransferase, a heterotrimeric enzyme present in B. burgdorferi and several Gram-positive bacteria and archaea30. The lysyl-tRNA synthetase (LysS) in B. burgdorferi is a class I type that has no resemblance to any known bacterial or eukaryotic LysS, but is most similar to LysS from the archaea40.
Replication, repair and recombination
The complement of genes in B. burgdorferi involved in DNA replication is smaller than in E. coli, but similar to that in M. genitalium21. Three ORFs have been identified with high homology to four of the ten polypeptides in the E. coli DNA polymerase III: α, β, and γ and τ. In E. coli, the γ and τ proteins are produced by programmed ribosomal frameshifting. This observation suggests that DNA replication in B. burgdorferi, like that in M. genitalium, is accomplished with a restricted set of genes. B. burgdorferi has one type I topoisomerase (topA) and two type II topoisomerases (gyrase and topoisomerase IV) for DNA topology management and chromosome segregation, despite its linear chromosomal structure. This suggests that topoisomerase IV may be required for more than the separation of circular DNAs during segregation.
The DNA repair mechanisms in B. burgdorferi are similar to those in M. genitalium. DNA excision repair can presumably occur by a pathway involving endonuclease III, PolI and DNA ligase. The genes for two of the three DNA mismatch repair enzymes (mutS, mutL) are present. The apparent absence of mutH is consistent with the lack of GATC (dam) methylation in strain B31 (S. Casjens, unpublished). Also present are genes for the repair of ultraviolet-induced DNA damage (uvrA, uvrB, uvrC and uvrD) (Table 2).
B. burgdorferi has a complete set of genes to perform homologous recombination, including recA, recBCD, sbcC, sbcD, recG, ruvAB and recJ. 3'-Exonuclease activity associated with sbcB in E. coli may be encoded by exoA (exodeoxynuclease III). Although recA is present, we found no evidence for lexA, which encodes the repressor that regulates SOS genes in E. coli. No genes encoding DNA restriction or modification enzymes are present.
The small genome size of B. burgdorferi is associated with an apparent absence of genes for the synthesis of amino acids, fatty acids, enzyme cofactors, and nucleotides, similar to that observed with M. genitalium21 (Fig. 3, Table 2). The lack of biosynthetic pathways explains why growth of B. burgdorferi in vitro requires serum-supplemented mammalian tissue-culture medium. This is also consistent with previous biochemical data indicating that Borrelia lack the ability to elongate long-chain fatty acids, such that the fatty-acid composition of Borrelia cells reflects that present in the growth medium6.
Figure 3: A schematic diagram of a B. burgdorferi cell providing an integrated view of the transporters and the main components of the metabolism of this organism, as deduced from the genes identified in the genome. The ORF numbers correspond to those listed in Table 2 (red indicates chromosomal and blue indicates plasmid ORFs). Presumed transporter specificity is indicated. Yellow circles indicate places where particular uncertainties exist as to the substrate specificity, subcellular location or direction of catalysis, or expected activities that were not found.
The linear chromosome of B. burgdorferi contains 46 ORFs and the plasmids contain 6 ORFs that encode transport and binding proteins (Fig. 3, Table 2). These gene products contribute to 16 distinct membrane transporters for amino acids, carbohydrates, anions and cations. The distribution of transporters between the four categories of functions in this section is similar to that observed in other heterotrophs (such as Haemophilus influenzae, M. genitalium and H. pylori), with most being dedicated to the import of organic compounds.
There are marked similarities between the transport capacity of B. burgdorferi and M. genitalium. Both genomes have a limited number of recognizable transporters, so it is not clear how they can sustain diverse physiological reactions. Several of the identified transporters in both genomes exhibit broad substrate specificity, exemplified by the oligopeptide ABC transporter (opp operon) or the glycine, betaine, L-proline transport system (proVWX). Therefore, these organisms probably compensate for their restricted coding potential by producing proteins that can import a wide variety of solutes. This is important because B. burgdorferi is unable to synthesize any amino acids de novo. We were unable to identify any transport systems for nucleosides, nucleotides, NAD/NADH or fatty acids, although they are likely to be present.
Glucose, fructose, maltose and disaccharides seem to be acquired by the phosphoenolpyruvate:phosphotransferase system (PTS). The two nonspecific components, enzyme I (ptsI) and Hpr (ptsH), are associated in one operon with an apparently glucose-specific, phosphohistidine-sugar phosphotransferase enzyme IIA (crr). Separate from this operon are four permeases (enzyme IIBC), fruA in two copies (fructose), ptsG (glucose) and malX (glucose/maltose) (Fig. 3, Table 2). The fructose-specific enzyme IIA is included in the ORF with IIBC (fruA), as has been observed in M. genitalium41. Ribose may be imported by an ATP-binding cassette transporter (rbsAC). The rbsAC genes are transcribed in an operon with a methyl-accepting chemotaxis protein that may respond to β-galactosides, suggesting that movement of the organisms towards sugars may be coupled to the transport process.
The limited metabolic capacity of B. burgdorferi is similar to that found in M. genitalium (Fig. 3, Table 2). Genes encoding all of the enzymes of the glycolytic pathway were identified. Analysis of the metabolic pathway suggests that B. burgdorferi uses glucose as a primary energy source, although other carbohydrates, including glycerol, glucosamine, fructose and maltose, may be used in glycolysis. Pyruvate produced by glycolysis is converted to lactate, consistent with the microaerophilic nature of B. burgdorferi. Generation of reducing power occurs through the oxidative branch of the pentose pathway. None of the genes encoding proteins of the tricarboxylic acid cycle or oxidative phosphorylation were identified. The similarity in metabolic strategies of two distantly related, obligate parasites, M. genitalium and B. burgdorferi, suggests convergent evolutionary gene loss from more metabolically competent, distant progenitors.
Addition of N-acetylglucosamine (NAG) to culture medium is required for growth of B. burgdorferi6. NAG is incorporated into the cell wall, and may also serve as an energy source. The cp26 plasmid encodes a PTS cellobiose transporter homologue that could have specificity for the structurally similar compound chitobiose (di-N-acetyl-D-glucosamine). A gene product on the chromosome with sequence similarity to chitobiase (BB2) may convert chitobiose to NAG. B. burgdorferi can metabolize NAG to fructose-6-phosphate, which then can enter the glycolytic cycle through the action of N-acetylglucosamine-6-phosphate deacetylase and glucosamine-6-phosphate isomerase. NAG is the primary constituent of chitin, which makes up the tick cuticle6, and may be a source of carbohydrate for B. burgdorferi when it is associated with its tick host.
The parallels between B. burgdorferi and M. genitalium appear to extend to other aspects of their metabolism. Both organisms lack a respiratory electron transport chain, so ATP production must be accomplished by substrate-level phosphorylation. Consequently, membrane potential is established by the reverse reaction of the V1V0-type ATP synthase, here functioning as an ATPase to expel protons from the cytoplasm (Fig. 3, Table 2). The ATP synthase genes in B. burgdorferi appear to be transcribed as part of a seven-gene operon. They are not typical of those usually found in eubacteria, more closely resembling the eukaryotic vacuolar (V-type) and archaeal (A-type) H+-translocating ATPases42, both in size and sequence similarity, than the bacterial F1F0 ATPases. Genome analysis of Treponema pallidum, the pathogenic spirochaete that causes syphilis, has also revealed the presence of a V1V0-type ATP synthase (C. M. F. et al., manuscript in preparation), suggesting that this may be a feature of spirochaetes.
Although the expression of Borrelia genes varies according to the current host species, temperature, host body location and other local factors, control of gene expression appears to differ from that in better-studied eubacteria. A typical set of homologues of heat-shock response genes is present (groES, groEL, grpE, dnaJ, hslU, hslV, dnaK and htpG), and B. burgdorferi is known to have such a response; however, it lacks the σ-32 that controls their transcription in E. coli. Only a few homologues of other eubacterial regulatory proteins are present, including only two response-regulator two-component systems.
Motility and chemotaxis
Like other spirochaetes, B. burgdorferi has periplasmic flagella that are inserted at each end of the cell and extend towards the middle of the cell body. The unique flagella allow the organism to move through viscous solutions, an ability that is presumed to be important in its migration to distant tissues following deposition in the skin layers43. Proteins involved in motility and chemotaxis are encoded by 54 genes, more than 6% of the B. burgdorferi chromosome, most of which are arranged in eight operons containing between 2 and 25 genes.
B. burgdorferi contains several copies of the chemotaxis genes (cheR, cheW, cheA, cheY and cheB) downstream of the methyl-accepting chemotaxis proteins. Other eubacteria also have duplications of some che genes, but those genes in B. burgdorferi are the most redundant set yet found. B. burgdorferi lacks recognizable virulence factors; thus, its ability to migrate to distant sites in the tick and mammalian host is probably dependent on a robust chemotaxis response. Multiple chemotaxis genes may provide redundancy in this system in order to meet such challenges or, alternatively, these genes may be differentially expressed under varied physiological conditions. Another speculative possibility is that the flagellar motors at the two ends of the B. burgdorferi cell are different and require different che systems. In support of this idea is the observation that one of the motor switch genes, fliG, is also present in two copies.
Membrane protein analysis
Much of the previous work on B. burgdorferi has focused on outer-surface membrane genes because of their potential importance in bacterial detection and vaccination. Nearly all Borrelia membrane proteins have been found to be typical bacterial lipoproteins. A search of B. burgdorferi ORFs for a consensus lipobox in the first 30 amino acids identified 105 putative lipoproteins, representing more than 8% of coding sequences. This contrasts with a total of only 20 putative lipoproteins in the 1.67-million base pair H. pylori genome (1.3% of coding sequences)23. The periplasmic binding proteins involved in transport of amino acids/peptides and phosphate in B. burgdorferi are candidate lipoproteins, suggesting that they may be anchored to the outer surface of the cytoplasmic membrane as in Gram-positive bacteria, rather than localized in the periplasmic space.
In better-characterized eubacteria, prolipoprotein diacylglycerol transferase (lgt), prolipoprotein signal peptidase (lsp), and apolipoprotein:phospholipid N-acyl transferase (lnt) are required for post-translational processing and addition of lipids to the amino-terminal cysteine. Genes for the first two of the enzymes (lgt and lsp) are present in the B. burgdorferi genome, but the gene for lnt was not identified, although biochemical evidence argues for all three activities in B. burgdorferi44. The sequence similarity of an lnt homologue in B. burgdorferi may be too low to be identified using our search methods, or its activity may be present in a new enzyme. In E. coli the Sec protein export system moves lipoproteins through the inner membrane, and Borrelia carries a complete set of these protein-secretion gene homologues (secA/D/E/F/Y and ffh; only the non-essential secB is missing).
Analysis of telomeres
The two chromosomal telomeres of strain B31 have similar 26-bp inverted terminal sequences (Fig. 4). We found no other similarity between the two ends, and these 26-bp sequences are very similar to the previously characterized Borrelia telomeres. Terminal restriction fragments from both B31 chromosomal termini were shown to exhibit snapback kinetics (data not shown), strongly indicating that both terminate in covalently closed hairpins, like previously characterized Borrelia telomeres28,45,46.
Figure 4: Nucleotide sequences are shown for known Borrelia telomeres as indicated: 1, B. burgdorferi Sh-2-82 chromosome left end; 2, B. burgdorferi B31 chromosome left end; 3, B. afzelii R-IP3 chromosome right end; 4, B. burgdorferi B31 chromosome right end; 5, B. burgdorferi B31 plasmid lp17 left end; 6, B. burgdorferi B31 plasmid lp17 right end; 7, B. hermsii plasmids bp7E and pb21E right ends; 8, B. burgdorferi B31 plasmid lp28-1 right end. In each case the telomere is at the left. Question marks (?) indicate locations where S1 nuclease was used to open terminal hairpins during the sequence determinations. Stippled areas highlight regions that appear to have been most highly conserved among these telomeres; no strong sequence conservation has been found near the right of the terminal 26 bp among the different sequences listed, except between the chromosomal left ends from strains B31 and Sh-2-82 (see text). The telomeric sequences of the strain B31 chromosome were determined in this report; the others are from references 14, 28, 30, 45, 46.
The left chromosomal telomere of strain B31 is identical to the previously characterized left telomere of strain Sh-2-82 (ref. 28), except for a 31 bp insertion in B31 26 bp from the end. The right-most 7,454 bp contains surprisingly few ORFs, given the ORF density elsewhere on the chromosome. The function of this region is unknown, but it contains several unusual features. The right terminal 900 bp contains considerable homology to the left ends of lp17 and lp28-3. The region between 3,600 bp and 8,000 bp from the right end also contains several areas with similarity to plasmid sequences, including a portion of the transposase-like gene approximately 4,500 bp from the right end. The spacing between the two conserved motifs (ATATAAT and TAGTATA) in the right 26-bp terminal repeat is the same as most previously known plasmid telomeres but different from the previously known chromosomal telomeres. These findings support the idea that the right end of the Borrelia chromosome has historically exchanged telomeres with the linear plasmids28.
The B. burgdorferi genome sequence will provide a new starting point for the study of the pathogenesis, prevention and treatment of Lyme disease. With the exception of a small number of putative virulence genes (haemolysins and drug-efflux proteins), this organism contains few, if any, recognizable genes involved in virulence or host–parasite interactions, suggesting that B. burgdorferi differs from better-studied eubacteria in this regard. It will be interesting to determine the role of the multi-copy plasmid-encoded genes, as previous work has implicated plasmid genes in infectivity and virulence. The completion of the genome sequence from a second spirochaete, Treponema pallidum (C.M.F. et al., manuscript in preparation) will allow for the identification of genes specific to each species and to this bacterial phylum, and will provide further insight into prokaryotic diversity.
Cell lines. A portion of a low-passage subculture of the original Lyme-disease spirochaete tick isolate4 was obtained from A. Barbour. The type strain of B. burgdorferi (ATCC 35210)3, B31, was derived from this isolate by limiting dilution cloning5. Cells were grown in Barbour–Stoenner–Kelly medium II (BSKII)6, omitting the additions of antibiotics and gelatin, in tightly closed containers at 33–34 °C. Cells were subcultured three or fewer times in vitro between successive rounds of infection in C3H/HeJ mice to minimize loss of infectivity and plasmid content17,18. After four successive transfers of infection in mice, a primary culture of B31, established from infected ear tissue, was expanded to 2.5 l by four successive subcultures. All available evidence indicates that the B31 line used for preparation of genomic DNA was probably clonal, as genetic heterogeneity was undetectable by several criteria including macrorestriction analysis (S. Casjens, unpublished data) and plasmid analysis of clonal derivatives of the B31 line13.
Sequencing. The B. burgdorferi genome was sequenced by a whole-genome random sequencing method previously applied to other microbial genomes20, 21, 22, 23, 24. An approximately 7.5-fold genome coverage was achieved by generating 19,078 sequences from a small insert plasmid library with an average edited length of 505 bases. The ends of 69 large insert lambda clones were sequenced to obtain a genome scaffold; 50% of the genome was covered by at least one lambda clone. Sequences were assembled using TIGR Assembler as described20, 21, 22, 23, 24, resulting in a total of 524 assemblies containing at least two sequences, which were clustered into 85 groups based on linking information from forward and reverse sequence reads. All Borrelia sequences that had been mapped were searched against the assemblies in an attempt to delineate which were derived from the various elements of the B. burgdorferi genome. Some contigs were also located on the existing physical map by Southern analysis. Sequence and physical gaps for the chromosome were closed as described20, 21, 22, 23, 24. At the completion of the project, less than 3% of the chromosome had single-fold coverage. The linear chromosome of B. burgdorferi has covalently closed hairpin structures at its termini that are similar to those reported for linear plasmids in this organism11. The telomeric sequences (106 and 72 bp, respectively, from the left and right ends) were obtained after nicking the terminal loop with S1 nuclease and amplifying terminal sequences by ligation-mediated polymerase chain reaction (PCR) as described28. The unknown terminal sequence was determined in both directions on four independent plasmid clones of the amplified DNA from each telomere. A minimum amount of S1 nuclease was used and, because of their sequence similarity to other Borrelia telomeres, it is likely that few, if any, nucleotides were lost from the B31 chromosomal telomeres in this process.
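The fold-coverage figure relates read count, mean read length and genome size in the usual way (coverage = reads × mean length / genome size). The sketch below is a back-of-envelope check, not part of the sequencing pipeline; the genome size used for the 7.5-fold figure is not stated in this section, so the code back-calculates the size implied by the quoted numbers, and the function name is illustrative.

```python
def fold_coverage(n_reads: int, mean_read_len: float, genome_bp: float) -> float:
    """Average sequencing depth: total sequenced bases divided by genome size."""
    return n_reads * mean_read_len / genome_bp


# Values quoted in the text.
n_reads = 19_078
mean_read_len = 505
total_bases = n_reads * mean_read_len

# Genome size implied by the stated ~7.5-fold coverage (a derived value, not a reported one).
implied_genome_bp = total_bases / 7.5

print(f"Total sequenced bases: {total_bases:,}")
print(f"Implied genome size:   {implied_genome_bp:,.0f} bp")
print(f"Coverage check:        {fold_coverage(n_reads, mean_read_len, implied_genome_bp):.1f}x")
```

The implied size, roughly 1.28 Mb, is larger than the 910,725-bp chromosome alone, as expected if the coverage figure refers to the chromosome together with the plasmids.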
Identification of ORFs. Coding regions (ORFs) were identified by compositional analysis with an interpolated Markov model based on variable-length oligomers47. ORFs of >600 bp were used to train the Markov model, as well as B. burgdorferi ORFs from GenBank. Once trained, the model was applied to the complete B. burgdorferi genome sequence and identified 953 candidate ORFs. ORFs that overlapped were visually inspected, and in some cases removed. Non-overlapping ORFs that were found between predicted coding regions and were >30 amino acids in length were retained and included in the final annotation. All putative ORFs were searched against a non-redundant amino-acid database as described21, 22, 23, 24. ORFs were also analysed using 527 hidden Markov models constructed for several conserved protein families (PFAM v2.0) using HMMER48. Families of paralogous genes were constructed by pairwise searches of proteins using FASTA. Matches that spanned at least 60% of the smaller of the protein pair were retained and visually inspected. A total of 94 paralogous gene families containing 293 genes were identified (Fig. 1).
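The paralogue-family step can be illustrated with a small sketch. The 60% span criterion is taken from the text; everything else is an assumption made for illustration: the input format (pairwise hits already parsed from the search output), the use of single-linkage grouping via union-find, and all identifiers are hypothetical, and the visual inspection step is not modelled.

```python
from collections import defaultdict


def paralog_families(lengths: dict[str, int],
                     matches: list[tuple[str, str, int]],
                     min_span: float = 0.60) -> list[list[str]]:
    """Group proteins into families by single linkage over retained pairwise matches.

    lengths: protein id -> length in amino acids.
    matches: (id_a, id_b, aligned_length) for each pairwise hit.
    A match is retained if it spans at least min_span of the smaller protein.
    """
    parent = {p: p for p in lengths}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, aligned in matches:
        if aligned >= min_span * min(lengths[a], lengths[b]):
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for p in lengths:
        groups[find(p)].add(p)
    # Only groups with two or more members count as paralogous families.
    return [sorted(g) for g in groups.values() if len(g) > 1]


# Hypothetical example: A-B and B-C pass the 60% criterion, C-D does not.
toy_lengths = {"A": 100, "B": 200, "C": 150, "D": 300}
toy_matches = [("A", "B", 70), ("B", "C", 100), ("C", "D", 50)]
families = paralog_families(toy_lengths, toy_matches)
print(families)
```

Single linkage is the most permissive grouping rule, which is one reason a manual inspection step such as the one described above is useful before families are finalized.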
Identification of membrane-spanning domains (MSDs). TopPred49 was used to identify potential MSDs in proteins. A total of 526 proteins containing at least one putative MSD were identified, of which 183 were predicted to have more than one MSD. The presence of signal peptides and the probable position of a cleavage site in secreted proteins were detected using Signal-P as described23; 189 proteins were predicted to have a signal peptide. Lipoproteins were identified by scanning for a lipobox in the first 30 amino acids of every protein. A consensus sequence relaxed from that used for H. pylori23 was defined for the purpose of this search based on known or putative B. burgdorferi lipoprotein consensus sequences.
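The lipobox scan can be sketched as a pattern search restricted to the first 30 residues. The relaxed consensus actually used is not given in this section, so the pattern below is a commonly cited generic lipobox form and is an assumption for illustration only; the function name and the test sequences are likewise hypothetical.

```python
import re

# Illustrative lipobox pattern: NOT the relaxed consensus used in this work,
# which is not specified in the text.
LIPOBOX = re.compile(r"[LVI][ASTVI][GAS]C")


def has_lipobox(protein: str, n_terminal: int = 30) -> bool:
    """True if the pattern occurs entirely within the first n_terminal residues."""
    return LIPOBOX.search(protein.upper()[:n_terminal]) is not None


# Hypothetical sequences: one with a lipobox near the N terminus, one with it too far in.
early = "MKKYLLGIGLILALIACKQN" + "A" * 50
late = "M" + "A" * 40 + "LIAC" + "A" * 20
print(has_lipobox(early), has_lipobox(late))
```

Tightening or relaxing the character classes changes the number of candidates substantially, so counts obtained with a generic pattern should not be expected to match the totals reported here.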