Parallel Genomic Sequence-Searching on an Ad-Hoc Grid:
Experiences, Lessons Learned, and Implications
Heshan Lin
Abstract
The Basic Local Alignment Search Tool (BLAST) allows bioinformaticists
to characterize an unknown sequence by comparing it against a database
of known sequences. The similarity between sequences enables
biologists to detect evolutionary relationships and infer biological
properties of the unknown sequence. mpiBLAST, our parallel
implementation of BLAST, decreases the search time of a 300 KB query
on the current NT database from over two full days to under 10 minutes
on a 128-processor cluster and allows larger query files to be
compared. Consequently, we
propose to compare the largest query available, the entire NT
database, against the largest database available, the entire NT
database. The result of this comparison will provide critical
information to the biology community, including insightful
evolutionary, structural, and functional relationships between every
sequence and family in the NT database. Preliminary projections
indicated that completing this task in a reasonable length of time
would require more processors than were available to us at a single
site. Hence, we assembled GreenGene, an ad-hoc grid that was
constructed "on the fly" from donated computational, network, and
storage resources during last year's SC|05. GreenGene consisted of
3048 processors from machines that were distributed across the United
States. This paper presents a case study of mpiBLAST on GreenGene:
specifically, a pre-run characterization of the computation, the
hardware and software architectural design, experimental results, and
future directions.