Publication Date:
2014-02-05
Description:
Background: New technologies for analyzing biological samples, like next generation sequencing, are producing a growing amount of data together with quality scores. Moreover, software tools (e.g., for mapping sequence reads, calculating transcription factor binding probabilities, estimating epigenetic modification enriched regions, or determining single nucleotide polymorphisms) increase this amount of position-specific DNA-related data even further. Hence, requesting data becomes challenging and expensive and is often implemented using specialized hardware. In addition, retrieving specific data as quickly as possible becomes increasingly important in many fields of science. The general problem of handling big data sets was addressed by developing specialized databases like HBase, HyperTable or Cassandra. However, these database solutions also require specialized or distributed hardware, leading to expensive investments. To the best of our knowledge, there is no database capable of (i) storing billions of position-specific DNA-related records, (ii) performing fast and resource-saving requests, and (iii) running on a single standard computer.
Results: Here, we present DRUMS (Disk Repository with Update Management and Select option), which satisfies demands (i) to (iii). It tackles the weaknesses of traditional databases while handling position-specific DNA-related data in an efficient manner. DRUMS is capable of storing up to billions of records. Moreover, it focuses on optimizing related single lookups as range requests, which are needed constantly for computations in bioinformatics. To validate the power of DRUMS, we compare it to the widely used MySQL database. The test setting considers two biological data sets. We use standard desktop hardware as the test environment.
Conclusions: DRUMS outperforms MySQL in writing and reading records by a factor of two up to a factor of 10000. Furthermore, it can work with significantly larger data sets.
Our work focuses on mid-sized data sets up to several billion records without requiring cluster technology. Storing position-specific data is a general problem and the concept we present here is a generalized approach. Hence, it can be easily applied to other fields of bioinformatics.
Electronic ISSN:
1471-2105
Topics:
Biology, Computer Science