ISSN:
1573-2886
Keywords:
multiple sequence alignment
;
alignment number
;
protein sequences
;
motif discovery
;
MAX SNP hard
;
approximate algorithm
;
set covering problem
Source:
Springer Online Journal Archives 1860-2000
Topics:
Mathematics
Notes:
Abstract Given a set of N sequence, the Multiple Sequence Alignment problem is to align these N sequences, possibly with gaps, that brings out the best commonality of the N sequences. The quality of the alignment is usually measured by penalizing the mis-matches and gaps, and rewarding the matches with appropriate weight functions. However for larger values of N, additional constraints are required to give meaningful alignments. We identify a user-controlled parameter, an alignment number K (2 ≤ K ≤ N): this additional requirement constrains the alignment to have at least K sequences agree on a character, whenever possible, in the alignment. We identify a natural optimization problem for this approach called the K-MSA problem. We show that the problem is MAX SNP hard. We give a natural extension of this problem that incorporates “biological relevance” by using motifs (common patterns in the sequences) and give an approximation algorithm for this problem in terms of the motifs in the data. MUSCA is an implementation of this approach and our experimental results indicate that this approach is efficient, particularly on large numbers of long sequences, and gives good alignments when tested on biological data such as DNA and protein sequences.
Type of Medium:
Electronic Resource
URL:
http://dx.doi.org/10.1023/A:1009841927822
Permalink