Expanding Wolbachia genomics with 1,005 newly assembled genomes
We screened for the presence of Wolbachia sequences in more than 70TB of raw DNA sequencing data from 31,825 publicly available genome sequencing samples from more than 500 putative host species ranging from nematodes and insects to crustaceans, but also including various non-ecdysozoan species (Supplementary Data 4). Raw reads were mapped against a set of 43 decontaminated available Wolbachia draft genomes using a PanPhlAn24-based pipeline (see Methods).
We found a total of 1793 samples (5.63%) positive for Wolbachia. This general low infection frequency is due to many samples being from model organisms including Aedes aegypti, Caenorhabditis elegans and Anopheles gambiae which are supposed to be Wolbachia-free or only very rarely infected with Wolbachia25. We verified the exact identity of all host species by reconstructing their 18S rRNA small subunit (see Supplementary Information and Supplementary Data 5). We found large differences in the level of infection frequency (prevalence) in the various hosts (Fig. 1a and Supplementary Data 4)) with values ranging from 100% in Diachasma alloeum to <5% in Nasonia vitripennis. Large differences exist even within the same genus. Wolbachia prevalence in Drosophila simulans and D. ananassae is twice of that in D. melanogaster and almost ten times higher than in other Drosophila species, confirming observations of similar large variations reported in other genera26. Despite consistency in the analytical pipeline used here across samples, these quantitative prevalence results should be taken with care because of likely sampling biases not traceable in sequence archives such as the presence of antibiotic-treated host strains, differences in DNA extraction protocols, and natural versus laboratory environments. The number of Wolbachia cells per host cell (normalized by genome size to approximate the actual titer) was also quite dissimilar with most estimates ranging from 3:1 to 10:1 (Wolbachia cells: host cells); exceptions are D. simulans (infected by wRi, 25:1) and the coleopteran Callosobruchus chinensis (infected by our newly assembled wCch 17:1).
We then metagenomically assembled 1166 genomes—called metagenome-assembled genomes (MAGs)—from our 1793 positive samples (see “Methods”), by selecting those characterized by high Wolbachia coverage. We verified that our Wolbachia MAGs did not contain host genome integrated fragments (see Supplementary Information), and excluded contaminations from other bacteria. By mapping all MAGs against nearly 100,000 bacteria reference genomes, we found only one instance of contamination. Using a genome-wide maximum-likelihood approach with PhyloPhlAn27 (Fig. 1b) we revealed a highly heterogeneous distribution of various genetic characteristics (genome length, nucleotide composition) and biological traits (normalized abundance: see Methods and Supplementary Fig. 2). In some cases the phylogenetic structure of Wolbachia could be linked with certain traits as in the case of supergroup C characterized by low GC content and reduced genome length, or supergroup D characterized by high abundance.
We defined MAGs quality based on four main criteria (see “Methods” and Fig. 1b) as a strict control to retain a total of 1005 Wolbachia MAGs, which we have used to infer a whole-genome phylogeny based on the alignment of 316 core genes (Fig. 2a). Although MAGs are inherently less accurate than genomes obtained by isolate sequencing, the resulting maximum-likelihood tree is overall very robust with most nodes receiving full (100% bootstrap) support. We complemented this phylogeny with an analysis of the presence and absence of the 6376 gene families composing the pangenome of the 1793 Wolbachia-positive samples using PanPhlAn24 (Fig. 2b). The two trees show a similar topology as further indicated by comparison of branch lengths (Fig. 2c, ρ = 0.94, p < 1E–50, Spearman correlation). The position of supergroups C and D is however an exception. This is likely due to shared functional properties between F, A, and B, which may have promoted a Long Branch Attraction artifact28 of C and D with functionally distant outgroups. Overall, our analysis increased the diversity of available Wolbachia genomes by more than 60% reaching 1166 Wolbachia genome and increased by more than 50% the number of host species with an assembled Wolbachia genome (from 33 as of July 2018 to 55), providing an useful dataset for comparative genomic and phylogenetic studies (Supplementary Data 5).
The catalog of reconstructed strains expands the Wolbachia sampled diversity
Our Wolbachia genome resource contains new MAGs of previously described data in four host species (Drosophila melanogaster, D. simulans, D. ananassae, Onchocerca vulvus)16,17,18,29 and crucially expands Wolbachia diversity with newly assembled genomes from both known and unknown hosts (white circles in Fig. 2a). For example, we enlarged the genomic diversity in D. melanogaster and D. simulans by adding new genomes to reach more than 500 quality controlled genomes each, including the Wolbachia genomes for D. mauritiana30, D. yakuba and D. santomea31. We similarly provide additional assembled genomes for various nematodes of medical relevance such as Wuchereria bancrofti and Brugia pahangi, and genomes in insect of economic importance such as the corn rootworm Diabrotica virgifera and the bedbug Cimex lecturalis. These collections of Wolbachia genomes from the same host (black circles in Fig. 2a) provided us with data for Wolbachia intra-host population studies, as addressed below. We also assembled many genomes for unrepresented insect orders Lepidoptera (e.g: butterflies Polygonia c-album and Pararge aegeria), Hemiptera (e.g., the pest species Homalodisca vitripennis and Dactylopius coccus), and Hymenoptera (e.g: parasitoid wasps Diachasma alloeum and two bee species). We assembled the first Wolbachia genome from an arachnid host, the mite Tetranychus urticae. Overall, the majority of our newly assembled genomes are from insects and nematodes because hosts such as terrestrial crustaceans and arachnids are poorly represented in sequence archives. A detailed description of newly detected strains is given in Supplementary information.
Evolutionary scenarios highlighted by the new Wolbachia MAGs
We found evidence of high host intraspecific Wolbachia genetic diversity. In some cases, this variability was geographically linked as for the gall wasp Biorhiza pallida and the ant Cardiocondyla obscurior (in purple in supergroup A, Fig. 2b), or in the medical relevant tiger mosquito Aedes albopictus and tsetse fly Glossina morsitans (orange in, respectively, supergroup A and B, Fig. 2b). We also found new strains in D. simulans which are robustly separated (high bootstrap support) from known variants resident in this host such as the wAu-like strains in the box of Fig. 2a. This finding points toward the peculiar role of D. simulans as a reservoir of Wolbachia diversity in nature with at least seven distinct Wolbachia genome types observed in this species.
In some cases we identified Wolbachia strains very closely related to well-known reference Wolbachia genomes that are however present in unexpected hosts. We found, for example, a genome from the robber fly Holcocephala fusca having a 99.52% core genetic identity with wMel of D. melanogaster: Although 18S rRNA gene screening confirms this SRA as Holcocephala, we found some reads with high similarity to the cytochrome oxidase subunit I gene (COI) of hoverflies. We therefore cannot exclude that our reconstructed Wolbachia genome is not from the robber fly itself, but from a hoverfly prey. Because we can exclude contamination from a Drosophila prey (Supplementary Information), this strain likely indicates a recent horizontal transfer involving distant hosts and showcases the complexity of the Wolbachia–Drosophila symbiotic models. We further identified a dubious new strain in the nematode Caenorhabditis remanei with a 98.04% identity in gene content with wNo of D. mauritiana (Fig. 2b). While we could verify the host based on 100% similarity with the annotated COI gene of C. remanei, we also found some reads covering a portion of D. mauritiana COI. Since no Wolbachia has ever been found in the Caenorhabditis genus, it is possible that this Wolbachia is the result of some kind of contamination. Indeed, the genome from this sample did not pass our quality control and was considered only for our gene-content tree (Fig. 2b and not Fig. 2a). These two cases highlight the difficulty in determining the exact source of some samples when reconstructing endosymbiont genomes using a metagenomic approach.
As expected from a symbiont that can transfer horizontally between host species, we did not observe a strict or meaningful phylogenetic clustering of hosts in our Wolbachia phylogenies. We however identify some peculiar patterns, for example a clustering of Lepidoptera and a concentration of Hymenoptera in group A (respectively blue and purple in Fig. 2). The latter have a clear paraphyletic distribution on the tree consistent with a scenario where Hymenoptera in general (and not only parasitoids32) may have played a key role in the differentiation of Wolbachia supergroup A.
Lineage-specific acquisition of gene families in supergroups
Our functional pangenome analysis (Fig. 3a) showcased large gene content divergence even within supergroups (in particular A). We performed a more stringent host-specific functional enrichment analysis by selecting 989 genomes from 14 different hosts with multiple assemblies. Investigating the functional enrichments and depletions (enzyme groups) on the phylogeny (Fig. 3b and Supplementary Fig. 1) we found that some functionalities have been independently lost in different host-specific lineages including for example the UDP-N-acetylmuramate-alanine ligase and the Glycerol-3-phosphate acyltransferase. Conversely, the Holo-acyl-carrier-protein synthase (EC 220.127.116.11) was instead acquired in the ancestral lineage of supergroup A. Overall, we observe more functional acquisitions and depletions in supergroups C, D, and F, than in supergroups A and B. This does not seem to be a bias related to accelerated mutation rate in C, D, and F, as their actual phylogenetic branch lengths (therefore number of mutations assuming a similar mutation rate) are similar to that of A and B (Fig. 2a). More likely, it reflects that most genomes in C, D, and F belong to highly adapted strains with degraded genomes.
To further identify putative genes responsible for CI, we compared strains known to cause CI with strains that are assumed not to possess this phenotype (Fig. 3c). Because of the relatively high gene variability between strains, the number of Wolbachia genomes considered here is crucial to pinpoint CI-specific genes. We compared 11 non-CI wAu-like assemblies with hundreds of CI related assemblies from wRi-like and wMel, using nematode lineages (B. malayi and O. ochengi) to correct for false CI gene loss (see Methods). For this analysis we selected only very similar genomes and made the assumption that genomes that are very similar to the ones causing (or not causing) CI also cause (or not cause) CI: in the absence of phenotypic information from a large panel of Wolbachia, our assumption was necessary to perform a pangenome analysis which is statistically robust against random sampling. We found that among the 11 candidate genes significantly enriched in CI inducing genomes (Fisher’s exact test, p < 10−05, Bonferroni corrected, Supplementary Data 6), five of them, including cifA and cifB were previously identified by LePage et al.11. Our pangenome analysis reveals six additional genes with functional annotations related to, among others, riboflavin, benzoate, and a bacteriophage. While significantly enriched phage genes might not be surprising, as the CI loci are located in a prophage region, other candidates may play a role in CI biology and should be investigated further experimentally.
Correspondence of co-phylogenies and intraspecific horizontal transfers
Our extended genomes catalog allowed us to explore Wolbachia evolutionary dynamics within host populations, in particular, to assess the degree of intraspecific horizontal transmission. We used a reference-based mapping as in previous population studies16,18,32 to obtain longer Wolbachia genome alignments and assembled 1,149 mitochondrial genomes of the corresponding hosts to evaluate their co-phylogeny in eleven species (Fig. 4a).
Our co-phylogenies indicate a good correspondence between intraspecific host relationships and their corresponding Wolbachia: this is consistent with a prevalently vertical rather than horizontal inheritance as previously shown in D. melanogaster16. We found however numerous exceptions. All hosts species, except for the poorly sampled D. mauritiana, are characterized by one or more cases of phylogenetic incongruence which corresponds to likely cases of intraspecific horizontal Wolbachia transmission (dotted lines in Fig. 4a; the actual cases of horizontal transfer may be more than those apparent from our analyses because we considered only highly supported incongruences). For D. melanogaster and D. ananassae we observed some instances of horizontal transmission, in conflict with previous studies using less data16,29,32. We further found hints of peculiar inheritance mechanisms. Exceptional is the case of the bug D. citri which is characterized by clonal (identical) mitochondria, but whose Wolbachia population is well structured. This may reflect recent host infections from a pool of genetically diverse Wolbachia or a case of recent intraspecific transmission15. Of note is also the case of D. simulans whose Wolbachia are almost indistinguishable, while the host has some mitochondrial structure compatible with the recent rapid spread of wRi within this host33. To check for possible contaminants that may have misled our co-phylogenies, we further explored possible host-derived sequences using the procedure of Chung et al.34 (see Methods) and found on average about 0.1% positions of potential sequence variant contamination across all assemblies. We detected various instances of increased sequence variations at 1–6% length, but only eight assemblies with increased coverage regions at 1–3% length mostly in nematode hosts. (Supplementary Data 5). Because Wolbachia-host lateral gene transfers (LGTs) shall generate a significant increase in coverage variation, we conclude that only eight of our MAGs are potentially affected by misleading signal related to LGT, None of these MAGs are involved in any of the putative within host transfers (dotted line of Fig. 4a) or exhibit any peculiar rate of evolution in Fig. 2a. However, because we found a certain level of genetic heterogeneity, we cannot completely exclude confounding factors from multiple Wolbachia infections in some of the MAGs. For D. ananassae we further manually excluded bias from wAnaINTs (Wolbachia genomes integrated in host genomes) data accidentally included in our assemblies, but could not find fragments attributable to integrated Wolbachia (see Supplementary Information); furthermore, as integrated genomes evolve neutrally, or almost neutrally29, they cannot produce a phylogenetic signal incompatible with that of the host genome. Because of its peculiar integrating genome biology, we nonetheless advocate caution in the interpretation of our results for wAna.
Rates of Wolbachia evolution correlates with nuclear, but not with mitochondrial genome of hosts
An intriguing outcome of our co-phylogenies is the striking heterogeneity of the evolutionary rate of Wolbachia compared to that of the host’s mitochondrial DNA (mtDNA). While we could reproduce previous finding that Wolbachia evolves about one order of magnitude (circa 10 times) slower than mitochondria in D. melanogaster16, we could observe a similar pattern only in D. mauritiana, and to a lesser extent in D. simulans and D. ananassae. In all other hosts we observe different patterns: Wolbachia rates are approximately the same as mitochondria in Onchocerca nematodes, and two/three times faster than mitochondria in the wasp D. alloeum, the beetle D. virgifera and the bedbug C. lecturalis. These discrepancies seem to correlate with host biology traits such as generation time and population size: while Drosophila, for example, is characterized by many generations per year and large effective population size35, most other species are characterized by less generation per year and smaller population size36,37.
Our co-phylogenies reveal major desynchronization between the Wolbachia and the mtDNA clocks. We did not infer co-phylogenies using host nuclear data because, while mtDNA and Wolbachia are uniparentally inherited, nuclear genes follow a Mendelian inheritance and their coalescence impedes building a genealogical (tree-like) structure38. We have however estimated average genetic divergence for a typical universal animal nuclear marker, the 18S rRNA gene5 and compared the average genetic distances of each of the 11 Wolbachia populations with that of the mtDNA and 18S of the corresponding hosts. As anticipated by the different relative rates of the co-phylogenies (Fig. 4a), there is no correlation between Wolbachia and mtDNA divergences (Fig. 4b). Instead, we found a significant correlation between Wolbachia and 18S rRNA gene evolution (Fig. 4b). This indicates that Wolbachia is indeed following the molecular clock of the hosts, but only for what concerns its nuclear genome, not the mitochondrial one. By contrast, the mtDNA seems to be characterized by a rate that departs from both Wolbachia and, apart from a few cases, the nuclear data. One possible explanation is that the mitochondrial genomes of 4-5 species are characterized by an unusual within-species genetic homogeneity possibly due to specific inheritance genetics or a constrained selective regime. Indeed, when we exclude these species (circled in Fig. 4b), we recover a significant correlation, and notably with a similar slope of the Wolbachia−18S correlation (dotted line in Fig. 4b). This finding deserves future testing in particular by analyzing large chunks of the host nuclear genome and by expanding these comparisons to more populations. Our data is however clear in showing that the pattern and rate of Wolbachia evolution may be very dissimilar to that of their mtDNA host, pointing at a variably and species-specific Wolbachia-mtDNA evolution, in which the Drosophila-Wolbachia dynamics16 seems to be the exception rather than the norm.