(
bioinformatics,
computational biology:)
Algorithm for calculating
global sequence similarity and
sequence alignment. The 2
sequences are considered
words over some small
alphabet (typically 4
letters for
DNA sequences, 20 for
protein sequences). The algorithm is allowed to
insert gaps into each sequence, in order to make as many identical letters line up. Each
configuration is
scored in the following way:
- An exact letter match gets MATCH.
- A mismatch (2 different letters) gets MISMATCH.
- Every gap gets GAPEXT
- Additionally, GAPOPEN is paid for every gap opened, i.e. for every sequence of consecutive gaps.
Thus, if MATCH=11, MISMATCH=-4, GAPOPEN=-3, and GAPEXT=-2, then this configuration:
A-CGTTTACGG-T
AAGG--TACG--C
would score
6*GAPEXT+4*GAPOPEN+2*MISMATCH+6*MATCH = 34.
The input of the algorithm is a pair of sequences; the output is the alignment with a highest score (an optimal alignment); both entire sequences are required to participate in it. In 1979, Needleman and Wunsch published a paper with their dynamic programming algorithm for calculating optimal alignments. Of course, Bellman had already solved similar problems (e.g. edit distance for words) in the 50s and early 60s. But those were considered in different fields.