Publication Date:
2020-08-26
Description:
Motivation: Mash is a popular hash-based genome analysis toolkit with applications to important downstream analysis tasks such as clustering and assembly. However, Mash is currently unable to fully exploit the capabilities of modern multi-core architectures, which leads to long runtimes on large-scale genomic datasets.
Results: We present RabbitMash, an efficient, highly optimized implementation of Mash that takes full advantage of modern hardware through multi-threading, vectorization, and fast I/O. We show that our approach achieves speedups of at least 1.3×, 9.8×, 8.5×, and 4.4× over Mash for the operations sketch, dist, triangle, and screen, respectively. Furthermore, RabbitMash computes the all-vs-all distances of 100,321 genomes in less than 5 minutes on a 40-core workstation, while Mash requires over 40 minutes.
Availability: RabbitMash is available at https://github.com/ZekunYin/RabbitMash
Supplementary information: Supplementary data are available at Bioinformatics online.
Print ISSN:
1367-4803
Electronic ISSN:
1460-2059
Topics:
Biology, Computer Science, Medicine