BLAST, or basic local alignment search tool, is an approach to rapid sequence comparison introduced in 1990 by bioinformaticist Stephen Altschul. It is often used to compare two large sets of sequences with one another, quickly uncovering long and significant similarities between a member of one set and a member of another.
BLAST is a fundamental tool in the field of bioinformatics, an aspect of which revolves around the comparison of huge amounts of genetic sequence data in order to find similarities, called homologs. BLAST enables this to be done over a large number of sequences quite quickly.
The BLAST algorithm is quite clever, and its speediness makes it possible to locate matches to sequences of reasonable size (1,000 to 10,000 characters in length) to sets of sequences much, much larger (all of GenBank, for instance), and do it in reasonable time.
In a nutshell, the BLAST algorithm is a heuristic search method that seeks words of length W that score at least T when aligned with the query and scored with a substitution matrix. Pairs that score T or better are extended in both directions in an attempt to fina a locally optimal ungapped alignment or HSP (high scoring pair) with a score of at least S or an E value lower than the specified thresholds. HSPs that meet these criteria will be returned by BLAST, provided they do not exceed the cutoff value specified for number of descriptions and/or alignments to report.
Lost yet? Let's explain what's going on in more detail. Let's start off by saying that BLAST takes several inputs, and understanding each of these inputs in turn leads to a greater understanding of how the BLAST algorithm works.
The first important element is the database of sequences that we will be searching through. There are various ways to prepare these for searching; the most common method is to divide them up into small pieces and record the location of the pieces, then assemble them using some known string searching algorithm, such as a suffix tree.
The second important element is the substitution matrices. A substitution matrix is a table which indicates the likelihood that one small piece of information can replace another without any significant problems. In terms of genetics, the question usually revolves around whether or not the change of one base pair or one amino acid can alter the resulting protein, and if it does, whether or not the alteration is significant enough to change the resulting protein. The impact of each of these changes is given a score, with a higher score indicating that not much change is evident, and a lower score meaning that a major unacceptable change is evident. Obviously, a perfect match has the highest possible score.
There are various methods for indexing the scoring matrices against the database. One method is that for each possible piece, a small list of sequences is assembled in order of substitution score from the matrix. This means that when a sequence is received, it can be broken up and compared to its list, then the elements in this list can be searched using a standard speedy exact-match string algorithm, like a suffix tree.
Obviously, the third element of importance is the search sequence or sequences. Without something to compare against the database, there would be no purpose in having a database, would there?
Now things begin to get interesting. For simplicity and terminology sake, let's say that we are only going to BLAST one sequence against a large database of sequences. The BLAST algorithm works by breaking this sequence down into small words of length W (the fourth element, which is passed as an option much like the search sequence is). These words are then compared to the database itself, already broken down into words of length W. These comparisons are usually through some sort of intermediary data structure based on a substitution matrix based sort.
The fifth element is a quality threshold level T. When the search occurs, each word in the search sequence is matched to words in the database that pass the basic quality threshold test; these are noted. Once all of these matches are found, then the sequence is slowly stretched out, extending beyond the word size both earlier and later in the search sequence, and is compared to the matches. If the stretching continues to maintain an acceptable average threshold (the sixth element) and the exact matching of the sequence is above another threshold (the seventh element), then it is considered a good match and BLAST will report the match. Otherwise, BLAST discards it.
It is important to note that with any reasonable level of cutoff settings, BLAST discards virtually every match that it finds. You can search a large set of sequences against a huge database and still get only a small number of matches using BLAST. That, in itself, is part of why BLAST is so efficient; it discards unacceptable matches very quickly and gets on to the next one.
BLAST is a clever tool that fills a need in an emerging scientific field. It enables the geneticist and the bioinformaticist to quickly conduct large scale searches against databases with results that actually have biological meaning, thanks to the substitution matrices. This alone is revolutionizing the field of genetics.