Publication Date:
2012-08-28
Description:
Modern analytical methods in biology and chemistry use separation techniques coupled to sensitive detectors, such as gas chromatography-mass spectrometry (GC-MS) and liquid chromatography-mass spectrometry (LC-MS). These hyphenated methods provide high-dimensional data. Comparing such data manually to find corresponding signals is a laborious task, as each experiment usually consists of thousands of individual scans, each containing hundreds or even thousands of distinct signals. In order to allow for successful identification of metabolites or proteins within such data, especially in the context of metabolomics and proteomics, an accurate alignment and matching of corresponding features between two or more experiments is required. Such a matching algorithm should capture fluctuations in the chromatographic system which lead to non-linear distortions on the time axis, as well as systematic changes in recorded intensities. Many different algorithms for the retention time alignment of GC-MS and LC-MS data have been proposed and published, but all of them focus either on aligning previously extracted peak features or on aligning and comparing the complete raw data containing all available features.
Results: In this paper we introduce two algorithms for retention time alignment of multiple GC-MS datasets: multiple alignment by bidirectional best hits peak assignment and cluster extension (BiPACE) and center-star multiple alignment by pairwise partitioned dynamic time warping (CeMAPP-DTW). We show how the similarity-based peak group matching method BiPACE may be used for multiple alignment calculation individually and how it can be used as a preprocessing step for the pairwise alignments performed by CeMAPP-DTW. We evaluate the algorithms individually and in combination on a previously published small GC-MS dataset studying the Leishmania parasite and on a larger GC-MS dataset studying grains of wheat (Triticum aestivum).
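The bidirectional best hits idea named in BiPACE can be illustrated with a minimal Python sketch. It is not the Maltcms implementation: the peak layout (retention time plus intensity vector), the similarity function (cosine similarity damped by a Gaussian retention time penalty), and the parameters `rt_tolerance` and `min_similarity` are assumptions chosen for illustration only. The cluster extension step across multiple runs is not shown.

```python
import math

def similarity(p, q, rt_tolerance):
    """Hypothetical peak similarity (an assumption, not the published score):
    cosine similarity of the two spectra, damped by a Gaussian penalty on
    the retention time difference."""
    rt_p, spec_p = p
    rt_q, spec_q = q
    dot = sum(a * b for a, b in zip(spec_p, spec_q))
    norm = math.sqrt(sum(a * a for a in spec_p)) * math.sqrt(sum(b * b for b in spec_q))
    cosine = dot / norm if norm > 0 else 0.0
    rt_penalty = math.exp(-((rt_p - rt_q) ** 2) / (2 * rt_tolerance ** 2))
    return cosine * rt_penalty

def best_hit(peak, candidates, rt_tolerance):
    """Index of the candidate peak most similar to `peak`."""
    return max(range(len(candidates)),
               key=lambda j: similarity(peak, candidates[j], rt_tolerance))

def bidirectional_best_hits(peaks_a, peaks_b, rt_tolerance=5.0, min_similarity=0.5):
    """Pairs (i, j) where peak i of run A and peak j of run B are each
    other's best hit and their similarity reaches `min_similarity`."""
    pairs = []
    if not peaks_a or not peaks_b:
        return pairs
    for i, p in enumerate(peaks_a):
        j = best_hit(p, peaks_b, rt_tolerance)
        if best_hit(peaks_b[j], peaks_a, rt_tolerance) != i:
            continue  # the best hit is not reciprocal
        if similarity(p, peaks_b[j], rt_tolerance) >= min_similarity:
            pairs.append((i, j))
    return pairs

# Toy data: each peak is (retention time in seconds, intensity vector).
run_a = [(100.0, [1.0, 0.0, 0.0]), (200.0, [0.0, 1.0, 0.0]), (300.0, [0.0, 0.0, 1.0])]
run_b = [(102.0, [0.9, 0.1, 0.0]), (203.0, [0.1, 0.9, 0.0])]
print(bidirectional_best_hits(run_a, run_b))  # [(0, 0), (1, 1)]
```

The third peak of `run_a` has no counterpart in `run_b` and stays unassigned, which is how the reciprocity requirement keeps false positive assignments low in this toy setting.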
Conclusions: We have shown that BiPACE achieves very high precision and recall and a very low number of false positive peak assignments on both evaluation datasets. CeMAPP-DTW finds a high number of true positives when executed on its own, but achieves even better results when BiPACE is used to constrain its search space. The source code of both algorithms is included in the open-source software framework Maltcms, which is available from http://maltcms.sf.net. The evaluation scripts of the present study are available from the same source.
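How matched peaks can constrain the search space of dynamic time warping may be sketched as follows. This is a simplified illustration, not the CeMAPP-DTW implementation: it assumes one-dimensional intensity traces instead of full mass spectra per scan, an absolute-difference local cost, and anchor index pairs that increase strictly in both sequences. The function names `dtw_path` and `partitioned_dtw` are hypothetical.

```python
def dtw_path(x, y):
    """Classic dynamic time warping on two 1-D sequences.
    Returns the warping path as a list of (i, j) index pairs."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])  # assumed local cost
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end of both sequences to the start.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path

def partitioned_dtw(x, y, anchors):
    """Run DTW separately between consecutive anchors, so the combined
    path is forced through every anchor pair (i, j)."""
    bounds = [(0, 0)] + list(anchors) + [(len(x), len(y))]
    path = []
    for (i0, j0), (i1, j1) in zip(bounds, bounds[1:]):
        if i1 > i0 and j1 > j0:  # skip empty partitions, e.g. an anchor at (0, 0)
            segment = dtw_path(x[i0:i1], y[j0:j1])
            path.extend((i0 + i, j0 + j) for i, j in segment)
    return path

# Toy traces with two corresponding peaks used as anchors.
x = [0, 1, 5, 1, 0, 0, 3, 0]
y = [0, 0, 1, 5, 1, 0, 3, 3, 0]
anchors = [(2, 3), (6, 6)]
path = partitioned_dtw(x, y, anchors)
```

Each partition only fills a small cost matrix, so the work drops from the full product of both sequence lengths to the sum over the partitions, and the path cannot drift away from the anchored peaks.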
Electronic ISSN:
1471-2105
Topics:
Biology, Computer Science