© 2009 The Authors. This article is distributed under the terms of the Creative Commons Attribution License. The definitive version was published in BMC Biology 7 (2009): 72, doi:10.1186/1741-7007-7-72.
Recent advances in sequencing strategies make possible unprecedented depth and scale of sampling for molecular detection of microbial diversity. Two major paradigm-shifting discoveries include the detection of bacterial diversity that is one to two orders of magnitude greater than previous estimates, and the discovery of an exciting 'rare biosphere' of molecular signatures ('species') of poorly understood ecological significance. We applied a high-throughput parallel tag sequencing (454 sequencing) protocol adapted for eukaryotes to investigate protistan community complexity in two contrasting anoxic marine ecosystems (Framvaren Fjord, Norway; Cariaco deep-sea basin, Venezuela). Both sampling sites have previously been scrutinized for protistan diversity by traditional clone library construction and Sanger sequencing. By comparing these clone library data with 454 amplicon library data, we assess the efficiency of high-throughput tag sequencing strategies. We here present a novel, highly conservative bioinformatic analysis pipeline for the processing of large tag sequence data sets. The analyses of ca. 250,000 sequence reads revealed that the number of detected Operational Taxonomic Units (OTUs) far exceeded previous richness estimates from the same sites based on clone libraries and Sanger sequencing. More than 90% of this diversity was represented by OTUs with fewer than 10 sequence tags. We detected a substantial number of taxonomic groups, such as Apusozoa, Chrysomerophytes, Centroheliozoa, Eustigmatophytes, hyphochytriomycetes, Ichthyosporea, Oikomonads, Phaeothamniophytes, and rhodophytes, which remained undetected by previous clone library-based diversity surveys of the sampling sites.
The most important innovations in our newly developed bioinformatics pipeline are (i) the use of BLASTN with query parameters adjusted for highly variable domains and a complete database of public ribosomal RNA (rRNA) gene sequences for taxonomic assignments of tags; (ii) a clustering of tags at k differences (Levenshtein distance) with a newly developed algorithm enabling very fast OTU clustering for large tag sequence data sets; and (iii) a novel parsing procedure to combine the data from individual analyses. Our data highlight the magnitude of the under-sampled 'protistan gap' in the eukaryotic tree of life. This study illustrates that our current understanding of the ecological complexity of protist communities, and of the global species richness and genome diversity of protists, is severely limited. Even though 454 pyrosequencing is not a panacea, it allows for more comprehensive insights into the diversity of protistan communities and, combined with appropriate statistical tools, enables improved ecological interpretations of the data and projections of global diversity.
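The clustering of tags at k differences mentioned in point (ii) can be illustrated with a minimal Python sketch. This is not the fast algorithm developed for the pipeline, whose details are not given here; it is a naive all-pairs comparison that assumes single-linkage grouping (a tag joins an OTU if it lies within k edits of any member). The function names `levenshtein` and `cluster_tags` and the example tags are hypothetical.

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cluster_tags(tags: List[str], k: int) -> List[List[str]]:
    """Group tags into OTUs by single linkage at k differences.

    Assumption: naive O(n^2) pairwise comparison with union-find,
    suitable only for small illustrative inputs.
    """
    parent = list(range(len(tags)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(tags)):
        for j in range(i + 1, len(tags)):
            # A length gap larger than k already implies distance > k.
            if abs(len(tags[i]) - len(tags[j])) > k:
                continue
            if levenshtein(tags[i], tags[j]) <= k:
                parent[find(j)] = find(i)

    clusters: Dict[int, List[str]] = {}
    for i, tag in enumerate(tags):
        clusters.setdefault(find(i), []).append(tag)
    return list(clusters.values())


# Made-up tags: the first three chain together at k = 1.
example_tags = ["ACGTACGT", "ACGTACGA", "ACGTACCA", "TTTTGGGG"]
otus = cluster_tags(example_tags, k=1)
```

With single linkage, the first and third example tags end up in the same OTU even though they differ by two edits, because each is within one edit of the second tag. Whether that chaining is desirable depends on the clustering rule chosen.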
The International Census of Marine Microbes and the W.M. Keck Foundation award to the Marine Biological Laboratory at Woods Hole (MA) supported the pyrosequencing part of this study. Further financial support came from a grant from the Deutsche Forschungsgemeinschaft to TS (STO414/3-1). Support for the unpublished work on Cariaco Basin protists came from NSF MCB-0348407 to VE (collaborative project with S Epstein at Northeastern University, Boston, MA, USA). Financial support to AC was provided by NSF MCB-0348045. Financial support to RC was provided by the ANR-Biodiversité project Aquaparadox.
Woods Hole Open Access Server