Publication Date:
2013-10-19
Description:
Motivation: Recently, a range of new statistics have become available for the alignment-free comparison of two sequences based on k -tuple word content. Here, we extend these statistics to the simultaneous comparison of more than two sequences. Our suite of statistics contains, first, and , extensions of statistics for pairwise comparison of the joint k -tuple content of all the sequences, and second, , and , averages of sums of pairwise comparison statistics. The two tasks we consider are, first, to identify sequences that are similar to a set of target sequences, and, second, to measure the similarity within a set of sequences. Results: Our investigation uses both simulated data as well as cis -regulatory module data where the task is to identify cis -regulatory modules with similar transcription factor binding sites. We find that although for real data, all of our statistics show a similar performance, on simulated data the Shepp-type statistics are in some instances outperformed by star-type statistics. The multiple alignment-free statistics are more sensitive to contamination in the data than the pairwise average statistics. Availability: Our implementation of the five statistics is available as R package named ‘multiAlignFree’ at be http://www-rcf.usc.edu/~fsun/Programs/multiAlignFree/multiAlignFreemain.html . Contact: reinert@stats.ox.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
Print ISSN:
1367-4803
Electronic ISSN:
1460-2059
Topics:
Biology
,
Computer Science
,
Medicine
Permalink