Chapter 6

Biological sequence analysis on GPU

N. Moreano1; A.C.M.A. de Melo2
1 Federal University of Mato Grosso do Sul, Campo Grande, Brazil
2 University of Brasilia, Brasilia, Brazil

Abstract

High-throughput techniques for DNA sequencing have led to an exponential growth of biological databases. These biological sequences have to be analyzed and interpreted in order to determine their function and structure. Therefore biological sequence analysis tools need to deal with an ever-growing amount of data. Unfortunately, these databases grow faster than the performance of a single processor core. Biological sequence analysis tools can take advantage of parallel computing and the emergence of manycore architectures, such as graphics processing units (GPUs), which provide the opportunity to significantly reduce the execution time of these tools. This chapter discusses the recent advances in GPU solutions for two main sequence comparison problems: pairwise sequence comparison and sequence-profile comparison.

Keywords

Bioinformatics; Pairwise sequence comparison; Sequence-profile comparison; High-performance solution; GPU

1 Introduction

Bioinformatics represents an interdisciplinary and rapidly evolving area of science that applies mathematics, statistics, computer science, and biology to the understanding of living systems. Bioinformatics is driven by the advent of fast and reliable technology for sequencing nucleic acids and proteins that results in an ever-increasing volume of experimental data to be analyzed. Many of the recent developments in the field use algorithmic techniques in order to reach answers to key challenges in molecular biology research, including understanding the mechanisms of genome evolution, elucidating the structure of protein interaction networks, and determining the genetic basis for susceptibility to disease [1].

A major application of Bioinformatics is the analysis of the DNA and protein sequences of organisms that have been sequenced. Sequence comparison is one of the basic operations in Bioinformatics, serving as a basis for many other more complex manipulations. It provides important information for solving many key problems, such as determining the function of a newly discovered sequence, determining the evolutionary relationships among genes and proteins, and predicting the structure and function of proteins [2].

When a new biological sequence is discovered, its function and structure must be determined. A common approach is to compare the new sequence to known sequences belonging to biological databases, in search of similarities. We can compare a sequence to another sequence, performing a pairwise sequence comparison, which consists of deciding whether a pair of sequences are evolutionarily related, that is, whether they share a common evolutionary history. We can also compare a sequence to a profile that models a family of sequences, performing a sequence-profile comparison, which consists of deciding whether a sequence is evolutionarily related to a known sequence family.

When we recognize a significant similarity between a new sequence and a known sequence or sequence family, we can transfer information about structure and/or function to the new sequence. We say that the sequences are homologous and that we are transferring information by homology [3].

Comprehensive databases of DNA and protein sequences are now established as major tools in current molecular biology research. Given the advances in sequencing technologies, the significant amount of biological sequence data produced, and the effectiveness of sequence comparison, it is logical to systematically organize and store the biological sequences to be compared. As a consequence, sequence databases have grown exponentially in the last decade.

The most accurate algorithms for solving the problems of pairwise sequence comparison and sequence-profile comparison are usually based on the dynamic programming technique. Because of the quadratic time and memory complexity of these algorithms and usually the long length of biological sequences, the task of searching large databases can lead to very lengthy execution times with huge memory requirements.

High-performance computing resources and techniques can be used to accelerate these operations. Several solutions for parallel sequence comparison have been proposed, targeting different high-performance platforms, such as multicore architectures, clusters, and field-programmable gate arrays (FPGAs).

Graphics processing units (GPUs) have evolved into highly parallel platforms due to their vast number of simple, data-parallel, deeply multithreaded cores. Their impressive computational power, high memory bandwidth, and comparatively low cost make them an attractive platform to solve problems based on computationally intensive algorithms. Moreover, GPUs are becoming increasingly programmable, offering the potential of significant speedups for a wide range of applications compared to general-purpose processors (CPUs).

The goal of this chapter is to discuss in detail and compare the recent advances in GPU solutions for some biological sequence analysis applications. The problems discussed are two classes of biological sequence comparison: pairwise sequence comparison and sequence-profile comparison. The first one is widely used as a first step in the solution of complex problems such as the determination of the evolutionary history of the species. The second one is extremely important because it is used to decide whether a recently sequenced protein belongs to a particular protein family. For both problems, several GPU solutions have been proposed that obtained substantial speedups over the sequential implementation and over solutions in other parallel platforms.

The remainder of the chapter is organized as follows. In Section 2, we introduce the pairwise sequence comparison and sequence-profile comparison problems and present the most widely used algorithms to solve them. Section 3 presents the main design aspects that are considered in the development of GPU solutions for sequence comparison. In Section 4, several state-of-the-art GPU solutions to the pairwise sequence comparison problem are discussed and compared. In Section 5, we discuss and compare several state-of-the-art GPU solutions for the sequence-profile comparison problem. Finally, Section 6 concludes the chapter, presenting the future tendencies in this research area.

2 Pairwise Sequence Comparison and Sequence-Profile Comparison

A biological sequence comprises a single and continuous molecule of either nucleic acid or protein. A DNA sequence is a chain of simpler molecules called nucleotide bases. Only four different bases are used in DNA sequences, and they are represented by the residues A, C, T, and G. A protein is also a chain of smaller molecules called amino acids, and there are 20 different amino acids, represented by 20 different residues. Therefore we can model a biological sequence (DNA or protein sequence) as a linear list of residues.

One goal of sequence comparison is to enable the researcher to determine whether two sequences or a sequence and a sequence family display sufficient similarity such that an inference of homology can be made. Homology refers to a conclusion drawn from these data that the sequences are evolutionarily related. The changes that occur during the course of the evolutionary process can be categorized as residue substitutions, insertions, and deletions [4].

2.1 Pairwise Sequence Comparison

Pairwise biological sequence comparison is a very important operation in Bioinformatics projects, often being used to define the similarity between two sequences. In a broad sense, the sequences compared are (a) a protein sequence versus a protein sequence that belongs to a genomic database, (b) a small DNA sequence versus a reference genome, or (c) two long DNA sequences. Pairwise sequence comparison is also used as a building block to solve Bioinformatics problems such as DNA assembly and phylogenetic tree construction, among others. Nowadays, such comparisons are executed daily, thousands of times, all over the world.

The algorithm used to perform pairwise sequence comparisons can be exact, providing the optimal result as output, or heuristic, providing as output a result that is not guaranteed to be optimal but that is a good solution to the problem.

The pairwise sequence comparison produces a score as output, which is a measure of the similarity between the sequences and an alignment that highlights the parts of the sequences that are similar/distinct. An alignment can be (A) global, composed of all the characters of the sequences; (B) local, composed of a subset of the characters of the sequences; or (C) semiglobal, where the prefix or suffix of one of the sequences is discarded. Fig. 1 illustrates these types of alignments between sequences S1 = GCATTCGATC and S2 = ACGAT.

Fig. 1 Global, local, and semiglobal alignments. The scores for matches, mismatches, and gaps are +1, −1, and −2, respectively. (A) Global alignment. (B) Local alignment. (C) Semiglobal alignment.

In order to calculate the score of an alignment, scores are given to matches (identical characters), mismatches (different characters), and gaps. If DNA or RNA sequences are compared, matches and mismatches usually have a single score, regardless of the characters compared (i.e., the match A, A is scored the same as the match C, C). On the other hand, the scores of matches and mismatches between the amino acids that compose protein sequences are usually retrieved from substitution matrices, which contain a score for each individual pair. The most popular substitution matrices are PAM and BLOSUM [5].

In all pairwise sequence comparison algorithms, the addition of gaps to the alignment results in a penalty, which is calculated using one of two distinct gap models: linear or affine. In the linear model, every gap has the same penalty, which is a negative value. In the affine gap model, the penalties for gaps are also negative, but the penalty for the first gap is higher (in absolute value) than the penalty for the subsequent gaps. The affine gap model produces alignments that are more biologically significant than the linear model, since gaps tend to occur together in nature [6].

The most widely used exact algorithms that compare two sequences are (a) Needleman-Wunsch (NW) [7], which is used to obtain optimal global alignments with linear gap penalties; (b) Smith-Waterman (SW) [8], which obtains optimal local alignments with linear gap penalties; (c) Gotoh [9], which computes the affine gap model to obtain optimal global alignments, and (d) Myers-Miller (MM) [10], which computes optimal global alignments in linear memory space.

These four algorithms use the dynamic programming technique to calculate similarity matrices and have quadratic time complexity O(m × n), where m and n are the lengths of the sequences. They also share the same data dependency: the value of each cell (i, j) in the matrix depends on the values of three previously calculated cells, (i − 1, j − 1), (i − 1, j), and (i, j − 1). The algorithms in Refs. [7–9] execute in two phases, where Phase 1 computes the dynamic programming matrices and obtains the optimal score and Phase 2 obtains the optimal alignment. The algorithm in Ref. [10] executes in one phase, in a divide-and-conquer recursive manner.

The most widely used heuristic algorithms to compare two sequences are Basic Local Alignment Tool (BLAST) [11] and FASTA [12]. Both algorithms combine exact pattern matching with statistical analysis in order to obtain good alignments quickly.

In order to compare the performance of pairwise sequence comparison applications, the GCUPS metric (billions of cell updates per second) is often used. It is calculated as (m × n)/(t × 10⁹), where m and n are the lengths of the sequences and t is the total execution time of the comparison, in seconds.
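
As a worked example of this formula, the small host-side C++ sketch below computes GCUPS; the sequence lengths and execution time used in main are hypothetical values chosen only to illustrate the arithmetic.

#include <cstdio>

// GCUPS = (m * n) / (t * 1e9), where m and n are the sequence lengths and
// t is the total execution time of the comparison, in seconds.
double gcups(long long m, long long n, double seconds) {
    return (double)(m * n) / (seconds * 1e9);
}

int main() {
    // Hypothetical example: a 5000 x 400,000 comparison finishing in 2.5 s.
    printf("%.2f GCUPS\n", gcups(5000, 400000, 2.5));   // prints 0.80 GCUPS
    return 0;
}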

In the next sections, we detail the SW algorithm and BLAST.

2.1.1 Smith-Waterman algorithm

The SW algorithm [8] is an exact method based on dynamic programming to obtain the best local alignment between two sequences in quadratic time and space. It is divided into two phases: create the similarity matrix and obtain the best local alignment.

Phase 1: Create the similarity matrix

The first phase receives as input sequences s and t, with |s| = m and |t| = n, where |s| represents the length of sequence s. The notation seq[n] represents the nth character of a sequence seq, and seq[1…n] represents the prefix formed by the first n characters of the sequence. The similarity matrix A has (m + 1) × (n + 1) elements, where A[i, j] contains the similarity score between prefixes s[1…i] and t[1…j].

At the beginning, the first row and the first column are filled with zeros. The remaining elements of A are obtained from Eq. (1), where ma is the score for matches, mi is the score for mismatches, and g is the gap penalty (a negative value). The SW score between sequences s and t is the highest value contained in matrix A.

\[
A[i,j] = \max
\begin{cases}
A[i-1,j-1] + (\text{if } s[i] = t[j] \text{ then } ma \text{ else } mi) \\
A[i,j-1] + g \\
A[i-1,j] + g \\
0
\end{cases}
\tag{1}
\]
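
As a concrete illustration of Eq. (1), the sequential host-side C++ sketch below fills the similarity matrix and returns the SW score. The scores ma = +1, mi = −1, and g = −2 are the ones assumed in Fig. 1, and only the score (not the alignment of Phase 2) is computed.

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Sequential Smith-Waterman, Phase 1 (Eq. 1): fills the (m+1) x (n+1)
// similarity matrix and returns the best local alignment score.
int sw_score(const std::string& s, const std::string& t,
             int ma = 1, int mi = -1, int g = -2) {
    int m = s.size(), n = t.size();
    std::vector<std::vector<int>> A(m + 1, std::vector<int>(n + 1, 0));
    int best = 0;
    for (int i = 1; i <= m; ++i) {
        for (int j = 1; j <= n; ++j) {
            int diag = A[i - 1][j - 1] + (s[i - 1] == t[j - 1] ? ma : mi);
            int left = A[i][j - 1] + g;   // horizontal gap
            int up   = A[i - 1][j] + g;   // vertical gap
            A[i][j] = std::max({diag, left, up, 0});
            best = std::max(best, A[i][j]);
        }
    }
    return best;
}

int main() {
    // Sequences S1 and S2 from Fig. 1.
    printf("score = %d\n", sw_score("GCATTCGATC", "ACGAT"));
    return 0;
}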

Phase 2: Obtain the best local alignment

The second phase is executed to obtain the best local alignment. The algorithm starts from the cell that contains the highest value and follows the arrows until a zero-valued cell is reached (Fig. 2). A left arrow in A[i, j] indicates the alignment of s[i] with a gap in t. An up arrow represents the alignment of t[j] with a gap in s. Finally, a diagonal arrow indicates that s[i] is aligned with t[j].

Fig. 2 Smith-Waterman similarity matrix for pairwise comparison. In this example, the alignment has score = 5.

Smith-Waterman variations

Here we describe two variations of SW that are widely used: (a) affine-gap comparison and (b) linear space alignment retrieval.

The original SW algorithm assigns a single penalty for gaps, using the linear gap model. However, in nature gaps tend to occur together. Therefore, in order to better represent the biological phenomena, the penalty for the first gap (gap open) must be higher than the penalty for the remaining gaps (gap extend) in a sequence of consecutive gaps. This is called the affine gap model. Gotoh [9] proposed an algorithm that implements the affine gap model by using three dynamic programming matrices instead of one. Even so, time and space complexity remain quadratic.
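
A minimal sequential sketch of Gotoh's three-matrix idea, written here in its local-alignment form, is shown below. The gap-open and gap-extension penalties gopen and gext are assumed example values, the exact convention for charging the first gap varies between implementations, and only the best score is computed.

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Affine-gap local alignment score (Gotoh-style): H holds the best score
// ending at (i, j); E and F hold the best scores ending with a gap in each
// sequence, so extending an existing gap is cheaper than opening a new one.
int gotoh_score(const std::string& s, const std::string& t,
                int ma = 2, int mi = -1, int gopen = -3, int gext = -1) {
    int m = s.size(), n = t.size();
    const int NEG = -1000000;
    std::vector<std::vector<int>> H(m + 1, std::vector<int>(n + 1, 0));
    std::vector<std::vector<int>> E(m + 1, std::vector<int>(n + 1, NEG));
    std::vector<std::vector<int>> F(m + 1, std::vector<int>(n + 1, NEG));
    int best = 0;
    for (int i = 1; i <= m; ++i) {
        for (int j = 1; j <= n; ++j) {
            E[i][j] = std::max(E[i][j - 1] + gext, H[i][j - 1] + gopen);
            F[i][j] = std::max(F[i - 1][j] + gext, H[i - 1][j] + gopen);
            int diag = H[i - 1][j - 1] + (s[i - 1] == t[j - 1] ? ma : mi);
            H[i][j] = std::max({diag, E[i][j], F[i][j], 0});
            best = std::max(best, H[i][j]);
        }
    }
    return best;
}

int main() {
    printf("score = %d\n", gotoh_score("GCATTCGATC", "ACGAT"));
    return 0;
}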

The SW algorithm runs in quadratic space, which prevents it from being used to compare long sequences. Hirschberg [13] proposed an algorithm that computes the longest common subsequence in linear space, using a divide-and-conquer technique. MM [10] adapted Hirschberg's algorithm to execute the Gotoh algorithm in linear space.

2.1.2 Basic Local Alignment Tool

BLAST was proposed in Ref. [11], and it is based on a heuristic algorithm designed to run fast while still maintaining high sensitivity. The BLAST algorithm assumes that significant alignments have words of length w in common, and it is divided into three well-defined phases: seeding, extension, and evaluation.

Phase 1: Seeding

In the first phase, BLAST compares a query sequence s against all sequences t that compose a genomic database, using exact pattern matching of substrings of a given length w. For instance, if w is 3, the locations of all 3-letter words shared by sequences s and t are determined. These locations are known as identical words and are used as seed candidates. The list of locations is then evaluated using a substitution matrix and the concept of neighborhood: the neighborhood of a word includes the word itself and every other word whose score, computed with the substitution matrix, is at least T (the cut-off score). A list of seeds is produced as output.
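
As a simplified illustration of the seeding step, the sketch below enumerates only the exactly shared w-letter words between s and t, omitting the neighborhood expansion with the cut-off score T; the function and variable names are illustrative, not BLAST's actual data structures.

#include <cstdio>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Enumerate the locations of all w-letter words shared by s and t.
// Each result is a pair (position in s, position in t) of an identical word.
std::vector<std::pair<int, int>> find_seeds(const std::string& s,
                                            const std::string& t, int w) {
    std::unordered_map<std::string, std::vector<int>> words;  // word -> positions in s
    for (int i = 0; i + w <= (int)s.size(); ++i)
        words[s.substr(i, w)].push_back(i);
    std::vector<std::pair<int, int>> seeds;
    for (int j = 0; j + w <= (int)t.size(); ++j) {
        auto it = words.find(t.substr(j, w));
        if (it == words.end()) continue;
        for (int i : it->second) seeds.push_back({i, j});
    }
    return seeds;
}

int main() {
    auto seeds = find_seeds("GCATTCGATC", "ACGAT", 3);
    printf("%zu shared 3-letter words\n", seeds.size());
    return 0;
}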

Phase 2: Extension

The goal of this phase is to extend the seeds generated in Phase 1, including some mismatches. This is done by inspecting the characters near each seed in both directions and concatenating them to the seed until a drop-off score X is reached. The drop-off score defines how much the score can be reduced. At the end of this phase, a list of local alignments is produced.

Phase 3: Evaluation

The alignments generated in Phase 2 are evaluated in order to remove those that are not significant. The significant alignments, called high-scoring segment pairs (HSPs), are the ones whose scores are higher than or equal to a threshold S. In addition, consistent HSP groups, containing nonoverlapping HSPs, are generated. The consistent HSP groups are evaluated against a final significance threshold, the expectation value (E-value), and only the alignments considered significant are reported as output.

BLAST

Basically, BLAST has two main versions, which compare proteins (BLASTP) and nucleotides (BLASTN), respectively. The original BLAST algorithm searched for local alignments without considering gaps. In 1994 and 1997, improved gapped versions of the original BLAST, WU-BLAST2 [14] and NCBI-BLAST2 [15], respectively, were proposed.

Because of its success in obtaining significant alignments quickly, many BLAST variations and wrappers have been proposed. Mega-BLAST [16], BLASTZ [17], and MPBLAST [18] form a nonexhaustive list of these variations. The goal of Mega-BLAST is to align very similar DNA sequences faster than BLAST; to achieve this goal, Mega-BLAST uses a greedy algorithm to reduce the execution time of the extension phase. BLASTZ is a variation of BLAST that requires matching regions to occur in the same order in both sequences. Multiplexing BLAST (MPBLAST) concatenates small query sequences into one single query, thus reducing the BLAST search time.

2.2 Sequence-Profile Comparison

A sequence family is a set of sequences with a similar biological function, or a similar two- or three-dimensional structure, or a known common evolutionary history [5].

Many sequence analysis methods are based on determining the relationship of a newly discovered sequence to a known sequence family. If we recognize a significant similarity between the new sequence and the sequence family, we can infer that the new sequence belongs to the family and transfer information about the family structure and/or function to the new sequence.

In this analysis, we want to compare the sequence to the family using statistical features of the whole set of sequences in the family (and not features of each sequence in the family individually). Similarly, when the family membership is established, a more accurate alignment between the sequence and the family can be obtained by concentrating on features that are conserved on the whole family [3]. Therefore a representation that captures the features of the set of sequences in a family is necessary.

A simple representation of a family of related sequences is given by their multiple alignment and the corresponding profile. Hidden Markov models (HMMs) extend the profile representation of a multiple alignment of a family of sequences and provide a statistical representation of a family of sequences [19]. A profile HMM to model a given sequence family can be built from the multiple alignment of the sequences in the family.

In a typical application, we have a sequence family represented by a profile HMM and we want to identify new family members from a sequence database. Then we perform a database search, performing a sequence-profile comparison between each sequence from the database and the profile HMM.

The sequence-profile comparison produces as output a score and an alignment. The score is a measure of the similarity between the sequence and the family, while the alignment emphasizes the parts of the sequence that are similar/distinct to the family represented by the profile HMM. Most of the solutions to the sequence-profile comparison problem are based on an important algorithm called the Viterbi algorithm [20] and its variations.

The Pfam database [21] is a large collection of protein families, represented by multiple sequence alignments and profile HMMs. It enables the comparison of protein sequences of interest to protein families, in search of additional homologous sequences.

HMMER [22] is a set of tools for the analysis of biological sequences using profile HMMs. One of its main tools is hmmsearch [23], which compares the sequences in a protein sequence database to a profile HMM representing a protein sequence family, in order to identify new homologous sequences. HMMER is widely used and many optimizations have been proposed to improve its performance.

The hmmsearch tool uses the following strategy: each sequence passes through a chain of filters, where each filter computes a score for the sequence using a different algorithm for sequence-profile comparison. Depending on the score, the sequence is discarded (because we conclude that it is not homologous to the family) or forwarded to the next filter in the chain (because a more thorough analysis of the sequence is necessary). Each filter thus discards part of the sequences being compared, forwarding only a fraction of them to the next filters. The initial filters, which receive more sequences, use faster but less precise algorithms, while the final filters, which receive fewer sequences, use slower but more precise algorithms.

In HMMER2 (HMMER version 2), the hmmsearch tool uses the Viterbi algorithm as its main step. In HMMER3, the Multiple Segment Viterbi (MSV) algorithm is the first filter in the chain of filters of the hmmsearch tool, and the Viterbi algorithm is a later filter in the chain. The MSV algorithm is a simplified version of the Viterbi algorithm, in which the profile HMM structure is reduced.

2.2.1 Hidden Markov models

HMMs are a probabilistic model based on Markov processes and have been applied to other research areas, such as speech recognition [24]. They have been widely used in many Bioinformatics problems.

The profile HMM representing a sequence family and used for sequence-profile comparison represents statistically the similarities and differences among the sequences in the family, extending the multiple alignment of these sequences into a position-specific scoring system. It consists of a set of states, the states’ transition probabilities, and the residue emission probabilities at each state. These states correspond to evolutionary possibilities, such as residue insertions and deletions, or to matches between the sequences in the multiple alignment of the sequences in the family.

Krogh et al. [25] proposed a profile HMM with a regular topology for modeling protein families, consisting of Match (M), Insert (I), and Delete (D) states. It has one state M for every conserved column of the multiple alignment of the family. The states I and D model gapped alignments, that is, alignments including residue insertions and deletions in the multiple alignment, respectively. Each state I allows one or more residues to be inserted between two consecutive states M. Each state D allows a residue deletion by skipping over a state M.

The transition probabilities associated with the state transitions also represent specific information about each column in the multiple alignment of the family. The profile HMM can model position-specific gap penalties, with different probabilities for entering different states I and D, representing that certain regions of the multiple alignment allow more insertions and deletions than others. Affine gap penalties can also be modeled with different transition probabilities for entering a state I for the first time and for staying in it.

Residue emission probabilities are associated with states M and I. Each state M emits a single residue, with an emission probability determined by the frequency that the residues are observed in the corresponding column of the multiple alignment. Each state I also emits a residue. This way each state M and I can have different emission probabilities for each different residue (the 4 nucleotide bases in DNA sequences or the 20 amino acids in protein sequences). D states do not emit any residues.

Eddy [26] introduced the Plan7 profile HMM architecture extending this basic profile HMM topology with the addition of special states B (Begin), E (End), N, C, and J (Joining) in order to allow different alignment algorithms to be applied, such as global alignment, local alignment with respect to the profile HMM, local alignment with respect to the sequence being compared, and multiple hit alignment.

The group of states M, I, and D corresponding to the same column in the multiple alignment is called a node of the profile HMM, and the length of the profile HMM is measured as the number of nodes in it. Fig. 3 illustrates a Plan7 profile HMM with four nodes. The labels in the state transitions represent some transition probabilities.

Fig. 3 Plan7 profile HMM with four nodes.

A local alignment with respect to the profile HMM is possible through transitions from state B to internal state M and from these states to state E. A local alignment with respect to the sequence is possible using the loop transitions on the special states N and C, which can emit residues. Multiple hits of the sequence to the profile HMM are allowed by transitions through the special state J [26].

2.2.2 The Viterbi algorithm

The Viterbi algorithm [20] is an optimal algorithm for finding the most likely sequence of states that result in a sequence of observed events, in the context of HMM. It has been applied in a variety of areas, such as digital communications and speech recognition. The Viterbi algorithm is used for the comparison between a sequence S and a profile HMM H, aligning the residues of S to the states of H, beginning at the state Start (S) and finishing at the state Termination (T). It applies the dynamic programming technique in order to find the best path of states in H that emits the sequence S, that is, the one with the maximum probability.

The logarithm of the probabilities is used in order to obtain an additive scoring system (instead of a multiplicative one), producing log-odds scores [3]. The algorithm computes score matrices and vectors corresponding to the states of the profile HMM, as shown here:

 Matrices M, I, and D, for Match, Insert, and Delete states, respectively, where the element [i, j] of each matrix contains the score of the best alignment that emits the first i residues of S and reaches the corresponding state in node j.

 Vectors N, B, E, C, and J for the special states, where the ith element of each vector contains the score of the best alignment that emits the first i residues of S and reaches the state associated with the vector.

The recurrence equations in Eq. (2) describe the Viterbi algorithm for the comparison of a sequence S = s1…sL of length L to a Plan7 profile HMM H of length Q, where tr(q1, q2) is the (log) transition probability from state q1 to state q2 and em(q, s) is the (log) emission probability of residue s at state q. The score of the best alignment between S and H is given by C[L] + tr(C, T). The algorithm has time complexity O(L × Q).

\[
\begin{aligned}
M[i,j] &= em(M_j, s_i) + \max
\begin{cases}
M[i-1,j-1] + tr(M_{j-1}, M_j) \\
I[i-1,j-1] + tr(I_{j-1}, M_j) \\
D[i-1,j-1] + tr(D_{j-1}, M_j) \\
B[i-1] + tr(B, M_j)
\end{cases} \\
I[i,j] &= em(I_j, s_i) + \max
\begin{cases}
M[i-1,j] + tr(M_j, I_j) \\
I[i-1,j] + tr(I_j, I_j)
\end{cases} \\
D[i,j] &= \max
\begin{cases}
M[i,j-1] + tr(M_{j-1}, D_j) \\
D[i,j-1] + tr(D_{j-1}, D_j)
\end{cases} \\
N[i] &= N[i-1] + tr(N, N) \\
E[i] &= \max_{1 \le j \le Q} \left( M[i,j] + tr(M_j, E) \right) \\
J[i] &= \max
\begin{cases}
J[i-1] + tr(J, J) \\
E[i] + tr(E, J)
\end{cases} \\
B[i] &= \max
\begin{cases}
N[i] + tr(N, B) \\
J[i] + tr(J, B)
\end{cases} \\
C[i] &= \max
\begin{cases}
C[i-1] + tr(C, C) \\
E[i] + tr(E, C)
\end{cases}
\end{aligned}
\tag{2}
\]
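
A sequential, host-side C++ sketch of these recurrences is shown below. The Plan7 structure and its field names are illustrative assumptions, not the HMMER data layout; trMM[j] stands for tr(M_{j−1}, M_j), trME[j] for tr(M_j, E), and so on, all taken as log-odds scores.

#include <algorithm>
#include <limits>
#include <vector>

// Minimal Plan7 profile HMM in log-odds form: Q nodes, per-node emission
// scores emM/emI (indexed by residue), and transition scores (illustrative
// indexing: trXY[j] is the score of the transition entering node j).
struct Plan7 {
    int Q;
    std::vector<std::vector<float>> emM, emI;      // [node][residue]
    std::vector<float> trMM, trIM, trDM, trBM;     // into M_j
    std::vector<float> trMI, trII;                 // into I_j
    std::vector<float> trMD, trDD;                 // into D_j
    std::vector<float> trME;                       // tr(M_j, E)
    float trNN, trNB, trEJ, trJJ, trJB, trEC, trCC, trCT;
};

// Sequential Viterbi (Eq. 2): returns the score of the best alignment of the
// residue-encoded sequence seq against the profile HMM hmm.
float viterbi(const Plan7& hmm, const std::vector<int>& seq) {
    const float NEG = -std::numeric_limits<float>::infinity();
    int L = seq.size(), Q = hmm.Q;
    std::vector<std::vector<float>> M(L + 1, std::vector<float>(Q + 1, NEG));
    std::vector<std::vector<float>> I(L + 1, std::vector<float>(Q + 1, NEG));
    std::vector<std::vector<float>> D(L + 1, std::vector<float>(Q + 1, NEG));
    std::vector<float> N(L + 1, NEG), B(L + 1, NEG), E(L + 1, NEG),
                       J(L + 1, NEG), C(L + 1, NEG);
    N[0] = 0.0f;                       // the empty prefix starts at state N
    B[0] = N[0] + hmm.trNB;
    for (int i = 1; i <= L; ++i) {
        int s = seq[i - 1];
        for (int j = 1; j <= Q; ++j) {
            M[i][j] = hmm.emM[j][s] + std::max({M[i-1][j-1] + hmm.trMM[j],
                                                I[i-1][j-1] + hmm.trIM[j],
                                                D[i-1][j-1] + hmm.trDM[j],
                                                B[i-1]      + hmm.trBM[j]});
            I[i][j] = hmm.emI[j][s] + std::max(M[i-1][j] + hmm.trMI[j],
                                               I[i-1][j] + hmm.trII[j]);
            D[i][j] = std::max(M[i][j-1] + hmm.trMD[j],
                               D[i][j-1] + hmm.trDD[j]);
            E[i] = std::max(E[i], M[i][j] + hmm.trME[j]);
        }
        N[i] = N[i-1] + hmm.trNN;
        J[i] = std::max(J[i-1] + hmm.trJJ, E[i] + hmm.trEJ);
        B[i] = std::max(N[i] + hmm.trNB, J[i] + hmm.trJB);
        C[i] = std::max(C[i-1] + hmm.trCC, E[i] + hmm.trEC);
    }
    return C[L] + hmm.trCT;            // score of the best alignment
}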

After computing the similarity between the sequence and the profile HMM and concluding that the sequence belongs to the family modeled by the HMM, we can obtain the best alignment of the sequence to the profile HMM in order to add the sequence into the multiple alignment of the family (and into the profile HMM representation of the family). The alignment is obtained by tracing back on the Viterbi data structures.

Analyzing the recurrence equations of the Viterbi algorithm, we identify many data dependencies in the computation of the scores. For instance, the cells M[i − 1, j − 1], I[i − 1, j], and D[i, j − 1] are needed for computing the cells M[i, j], I[i, j], and D[i, j], respectively. These dependencies prevent the parallel computation of cells in the same row, column, or diagonal of the matrices.

The state J of the profile HMM links the end of the profile HMM core to its beginning, forming a feedback loop. This link creates the dependency chain M[i − 1, 1…Q] → E[i − 1] → J[i − 1] → B[i − 1] → M[i, j], which prevents the parallel computation of cells in the same antidiagonal of the matrices.

2.2.3 The MSV algorithm

The MSV algorithm is a simplification of the Viterbi algorithm, obtained by removing the states I and D from the Plan7 profile HMM and assuming a transition probability of 1.0 for all transitions M_{j−1} → M_j [23]. The resulting profile HMM is shown in Fig. 4.

Fig. 4 Profile HMM used by MSV algorithm.

The recurrence equations in Eq. (3) describe the MSV algorithm for the comparison of a sequence S of length L to a profile HMM H (with the structure shown in Fig. 4) of length Q. The score of the best alignment between S and H is given by C[L] + tr(C, T). The algorithm has time complexity O(L × Q).

\[
\begin{aligned}
M[i,j] &= em(M_j, s_i) + \max
\begin{cases}
M[i-1,j-1] \\
B[i-1] + tr(B, M_j)
\end{cases} \\
N[i] &= N[i-1] + tr(N, N) \\
E[i] &= \max_{1 \le j \le Q} M[i,j] \\
J[i] &= \max
\begin{cases}
J[i-1] + tr(J, J) \\
E[i] + tr(E, J)
\end{cases} \\
B[i] &= \max
\begin{cases}
N[i] + tr(N, B) \\
J[i] + tr(J, B)
\end{cases} \\
C[i] &= \max
\begin{cases}
C[i-1] + tr(C, C) \\
E[i] + tr(E, C)
\end{cases}
\end{aligned}
\tag{3}
\]
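
The corresponding sequential sketch of the MSV recurrences is shown below, again with an illustrative (non-HMMER) data layout. Because the states I and D are removed, only the previous row of M and a handful of scalars need to be kept, which also hints at the small memory footprint of the algorithm when only the score is required.

#include <algorithm>
#include <limits>
#include <vector>

// Reduced profile HMM used by MSV (Fig. 4): only match emissions, B->M_j
// transitions, and the special-state transitions are needed.
struct MsvModel {
    int Q;
    std::vector<std::vector<float>> emM;           // [node][residue]
    std::vector<float> trBM;                       // tr(B, M_j)
    float trNN, trNB, trEJ, trJJ, trJB, trEC, trCC, trCT;
};

// Sequential MSV (Eq. 3): returns the score of the best alignment.
float msv(const MsvModel& h, const std::vector<int>& seq) {
    const float NEG = -std::numeric_limits<float>::infinity();
    int L = seq.size(), Q = h.Q;
    std::vector<float> Mprev(Q + 1, NEG), Mcur(Q + 1, NEG);
    float N = 0.0f, B = h.trNB, E, J = NEG, C = NEG;   // values for row i-1
    for (int i = 1; i <= L; ++i) {
        int r = seq[i - 1];
        E = NEG;
        for (int j = 1; j <= Q; ++j) {
            // All M_{j-1} -> M_j transitions have probability 1 (score 0).
            Mcur[j] = h.emM[j][r] + std::max(Mprev[j - 1], B + h.trBM[j]);
            E = std::max(E, Mcur[j]);
        }
        N = N + h.trNN;
        J = std::max(J + h.trJJ, E + h.trEJ);
        B = std::max(N + h.trNB, J + h.trJB);
        C = std::max(C + h.trCC, E + h.trEC);
        std::swap(Mprev, Mcur);
    }
    return C + h.trCT;
}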

Although it has the same time complexity as the Viterbi algorithm, the MSV algorithm performs fewer computations and has fewer data dependencies because of the removal of states I and D. As a consequence, all cells in the same row of matrix M can be computed in parallel, while successive rows must still be computed sequentially.

3 Design Aspects of GPU Solutions for Biological Sequence Analysis

While each proposal brings different contributions for the solution of the pairwise sequence comparison or sequence-profile comparison using GPUs, some optimization techniques and approaches are used in several solutions. We enumerate these techniques here.

3.1 Task-Parallelism vs Data-Parallelism

The solutions for pairwise and sequence-profile comparisons adopt one or a combination of the two approaches to exploit parallelism: task-parallelism or data-parallelism. In general, if task-parallelism is used, a thread is associated with each sequence from the sequence database and is responsible for performing the comparison between that database sequence and the query sequence (pairwise sequence comparison) or query profile HMM (sequence-profile comparison). Since the comparison of distinct database sequences to the same query (i.e., the execution of the appropriate SW, BLAST, Viterbi, or MSV algorithm) involves independent tasks, the corresponding threads can execute in parallel (coarse-grained parallelism). In this case, we say that the solution exploits sequence parallelism. This strategy has the advantage of removing the need for interthread communication and is adopted by the solutions in Refs. [27–31] for pairwise sequence comparison and in Refs. [32–37] for sequence-profile comparison.

When data-parallelism is adopted, the comparison of a sequence from the sequence database with the query sequence or with the query profile HMM is assigned to multiple threads, each thread being responsible for computing one or more cells of the dynamic programming matrices (fine-grained parallelism). Since the cells computed by one thread are needed for computing cells assigned to different threads, there is an intense need for interthread communication in this strategy. For pairwise comparison, the solutions in Refs. [38–43] exclusively exploit data-parallelism. Among the sequence-profile solutions, only Ref. [44] exclusively exploits data-parallelism.

Finally, both forms of parallelism can be combined. In general, we have a set of threads associated with each sequence from the sequence database, each thread being responsible for computing one or more cells of the dynamic programming matrices, and different sequences of the database are compared in parallel, by different sets of threads, to the query sequence or query profile HMM. The solutions in Refs. [45–48] adopt this strategy to perform pairwise comparisons, while the solutions in Refs. [49, 50] also use both forms of parallelism to compute sequence-profile comparisons.
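
To contrast the two strategies, the CUDA sketch below illustrates the task-parallel (coarse-grained) case: each thread computes the complete SW score of one database sequence against the query, so no interthread communication is needed. The kernel name, the flat sequence layout with offsets/lengths, the linear gap model, and the fixed maximum subject length are all simplifying assumptions rather than the design of any specific cited work.

// Task-parallel SW score kernel sketch (compile with nvcc): thread t aligns
// the query against database sequence t, keeping only two DP rows.
// Launch example: sw_task_parallel<<<(numSeqs + 127) / 128, 128>>>(...);
__global__ void sw_task_parallel(const char* query, int qlen,
                                 const char* db, const int* offsets,
                                 const int* lengths, int numSeqs,
                                 int ma, int mi, int g, int* scores) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numSeqs) return;
    const char* subj = db + offsets[t];
    int n = lengths[t];
    if (n > 1024) { scores[t] = -1; return; }   // sketch-only length limit
    int prev[1025], cur[1025];                  // per-thread rolling rows
    for (int j = 0; j <= n; ++j) prev[j] = 0;
    int best = 0;
    for (int i = 1; i <= qlen; ++i) {
        cur[0] = 0;
        for (int j = 1; j <= n; ++j) {
            int diag = prev[j - 1] + (query[i - 1] == subj[j - 1] ? ma : mi);
            int v = max(max(diag, prev[j] + g), max(cur[j - 1] + g, 0));
            cur[j] = v;
            best = max(best, v);
        }
        for (int j = 0; j <= n; ++j) prev[j] = cur[j];
    }
    scores[t] = best;                           // only the score is kept
}

In a data-parallel kernel, by contrast, several threads would cooperate on a single dynamic programming matrix and would have to exchange border cells, typically through shared or global memory.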

3.2 Sorting Sequence Database to Achieve Load Balance

The Viterbi, MSV, SW, and BLAST algorithms are sensitive to the length of the target sequence, which, in conjunction with the profile HMM length or the subject sequence length, determines the execution time of the algorithms. A typical sequence database usually contains sequences with very different lengths and does not store the sequences ordered by length, that is, short sequences may be close to long sequences.

Therefore solutions that exploit task-parallelism (combined or not with data-parallelism) can have unbalanced workloads among threads, since threads processing shorter sequences finish first and stay idle, waiting while the longest sequence is processed.

In order to minimize this problem, the works in Refs. [29, 31–33, 35–37, 45–48, 50] sort the sequences of the database by sequence length, usually in decreasing order. This is accomplished in a preprocessing step executed on the CPU, prior to the sequence-profile or pairwise comparison. As a result, sequences with similar lengths are assigned to threads in the same warp or in the same block, thereby achieving a better workload balance among them.
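
A minimal host-side sketch of this preprocessing step could look as follows, sorting in decreasing order of length as most of the cited works do.

#include <algorithm>
#include <string>
#include <vector>

// Preprocessing on the CPU: sort database sequences by decreasing length so
// that threads in the same warp/block receive sequences of similar length.
void sort_by_length(std::vector<std::string>& db) {
    std::sort(db.begin(), db.end(),
              [](const std::string& a, const std::string& b) {
                  return a.size() > b.size();
              });
}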

3.3 Use of GPU Memory Hierarchy

GPU memory hierarchy includes several memories with very different features, such as latency, bandwidth, read-only or read-write access, and so on. For instance, in the CUDA architecture, there are registers, shared, global, constant, and texture memories [51].

Memory accesses usually have a great impact on the performance of GPU programs; therefore most proposals try to optimize the memory layout and usage patterns of the data structures used by the implemented algorithm. The SW, BLAST, Viterbi, and MSV algorithms include several large data structures, such as the dynamic programming score matrices and the sequences. The last two algorithms also include the transition and emission probabilities of the profile HMM.

The memory where each data structure is allocated is chosen based on the size of that structure and the types of access performed on it (few or many accesses, only reads or reads and writes, several threads reading from the same position, etc.). A particular data structure may also be reorganized in order to provide a better usage pattern for the memory in which it is allocated.

Several solutions for pairwise or sequence-profile comparison [33–37, 43–45, 49, 50] allocate their data structures in GPU global memory, and some of them rearrange the memory accesses of threads into favorable patterns in order to achieve memory coalescing. The idea is that, when the threads in a warp access consecutive global memory locations, the hardware is able to combine all accesses into a single memory request. Such coalesced accesses allow GPU global memory to deliver data at a rate close to its peak bandwidth [51].
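
One common way of producing such favorable patterns is to interleave the sequences handled by a group of threads, so that residue i of all sequences in the group occupies consecutive addresses. The host-side packing sketch below illustrates the idea; the group size, padding symbol, and layout are assumptions, not the scheme of any specific cited work.

#include <algorithm>
#include <string>
#include <vector>

// Pack a group of sequences column-major: buffer[i * groupSize + t] holds
// residue i of sequence t. When thread t reads residue i, the threads of a
// warp touch consecutive addresses, allowing the accesses to be coalesced.
std::vector<char> pack_interleaved(const std::vector<std::string>& group) {
    size_t groupSize = group.size(), maxLen = 0;
    for (const auto& s : group) maxLen = std::max(maxLen, s.size());
    std::vector<char> buffer(maxLen * groupSize, '#');   // '#' = padding
    for (size_t t = 0; t < groupSize; ++t)
        for (size_t i = 0; i < group[t].size(); ++i)
            buffer[i * groupSize + t] = group[t][i];
    return buffer;
}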

Registers, shared, constant, and texture memories can be highly effective in reducing the number of accesses to global memory. However, these memories have limited capacity, which may also restrict the number of threads executing simultaneously on a GPU streaming multiprocessor.

Accessing GPU shared memory is very fast and highly parallel; therefore several proposals [29, 31, 33, 34, 36, 38, 39, 43, 45, 48, 50, 52] use this memory to hold the portion of GPU global memory data that is heavily used in an execution phase of the algorithm. In addition, the algorithm may be reorganized in order to create execution phases that each focus heavily on a small portion of the global memory data.

With appropriate access patterns, accessing constant memory is very fast and parallel. This memory provides short-latency, high-bandwidth, and read-only access by the device when all threads simultaneously access the same location, and broadcasts the data accessed to all threads in a warp [51]. Constant memory is used by the works in Refs. [28, 29, 31, 33, 36, 45].

Texture memory can also be used to avoid global memory bandwidth limitations and to handle memory accesses with certain access patterns. Although texture memory was originally designed for traditional graphics applications, it can also be used quite effectively in some general-purpose GPU computing applications. The solutions in Refs. [27, 29, 32–35, 37, 39, 50] allocate data structures in this memory.

3.4 GPU Solution Used as a Filter

The recurrence equations executed in the SW, BLAST, Viterbi, and MSV algorithms present a dependency pattern in such a way that, in order to compute only the best alignment score, it is not necessary to store the whole dynamic programming matrices and vectors. Actually, only two rows of these data structures are needed, the current row (which is being computed) and the previous one. However, in order to execute the traceback operation and produce the best alignment, the whole structures must be stored, requiring a space of O(m × n), where m and n are the lengths of the sequences in a pairwise comparison, or O(L × Q) for a profile HMM with Q nodes and a sequence of length L, in a sequence-profile comparison, for each sequence from the database. Given the long length of the biological sequences and the large number of sequences in sequence databases, the amount of memory space required is usually substantial and can exhaust the memory of the GPU.

The traceback is needed only when the comparison results in a hit, that is, when the best alignment score exceeds a significance threshold and homology is detected. Reported experiments indicate that less than 2% of the database sequences result in hits for sequence-profile comparison [53], and similar results are obtained for pairwise comparisons with genomic databases.

Considering the high demand for memory space and the low homology hit ratio, the solutions in Refs. [27, 28, 30, 31, 38, 39, 42, 43, 45–47] for pairwise comparison and in Refs. [32–37, 44, 49, 50] for sequence-profile comparison work as a first-phase filter that computes only the best alignment score (through the SW, BLAST, Viterbi, or MSV algorithm). This filter, implemented on the GPU and taking advantage of its parallel computing capabilities, processes all the database sequences in order to discard the low-scoring ones. For the small fraction of sequences that produce significant best-alignment scores, the entire comparison (pairwise or sequence-profile) is reprocessed on the CPU in order to generate the corresponding alignment. This time the SW, BLAST, or Viterbi algorithm is executed keeping all the data structures necessary to produce the best alignment.
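
The resulting control flow can be sketched as follows; gpu_score_all, cpu_full_alignment, and report are hypothetical placeholders (declared but not implemented here) for the GPU filter, the CPU recomputation with traceback, and the result handling.

#include <string>
#include <vector>

// Hypothetical placeholders: a GPU wrapper that returns only the best score
// of each database sequence, and a CPU routine that recomputes the full
// alignment (with traceback) for a single pair.
std::vector<int> gpu_score_all(const std::vector<std::string>& db,
                               const std::string& query);
std::string cpu_full_alignment(const std::string& query,
                               const std::string& subject);
void report(const std::string& alignment);

// First-phase GPU filter: compute only the scores for all sequences, then
// reprocess on the CPU only the few sequences whose score indicates a hit.
void filtered_search(const std::vector<std::string>& db,
                     const std::string& query, int threshold) {
    std::vector<int> scores = gpu_score_all(db, query);
    for (size_t k = 0; k < db.size(); ++k)
        if (scores[k] >= threshold)
            report(cpu_full_alignment(query, db[k]));
}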

4 GPU Solutions for Pairwise Sequence Comparison

Recently, the use of GPUs to implement pairwise biological sequence comparison algorithms has become very popular. This happens mainly because GPUs are able to attain very good performance at low cost. In this section, we present several GPU solutions for the pairwise sequence comparison problem.

4.1 GPU Solutions Using Exact Algorithms

The SW algorithm and its variations have quadratic time complexity and, for this reason, are excellent candidates for parallelization. In this section, we first present the approaches that compare one protein sequence to all sequences in a sequence database (Sections 4.1.1–4.1.6). In this case, a protein query sequence q is compared to a set of y protein sequences (d1, d2, …, dy) that belong to a genomic database. This comparison is known as coarse-grained, since each thread calculates a different dynamic programming matrix and thus there is usually no communication among the threads. In Section 4.1.7, we present solutions that compare two long DNA sequences, computing huge dynamic programming matrices with multiple threads (fine-grained).

Most of the GPU approaches use two techniques that were initially proposed to run on CPUs, taking advantage of single instruction multiple data (SIMD) instruction sets such as AltiVec (PowerPC) and SSE (Intel). In this case, vector instructions are executed that calculate a set of matrix elements in parallel. The first technique is called query profile [54], and it targets the case where a single protein sequence is compared to a genomic database. Since the same protein sequence (query) will be compared to many different sequences that belong to the database, the algorithm computes a small query-specific vector Q based on the query sequence and a given substitution matrix. Therefore the match/mismatch scores are accessed sequentially, instead of randomly. The second technique is called striped processing [55]; instead of calculating consecutive cells of the matrix with the same vector instruction, it calculates at the same time cells that are separated by a stripe factor. With this technique, the dependencies of the innermost loop are broken, accelerating the computation.
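
The query profile idea can be sketched as follows: for a fixed query, a table indexed by (residue, query position) is precomputed from the substitution matrix, so that the inner loop reads match/mismatch scores sequentially instead of performing random accesses into the two-dimensional substitution matrix. The alphabet size and the layout below are assumptions, not the exact data structure of Ref. [54].

#include <string>
#include <vector>

// Build a query profile: profile[r][i] is the substitution score of residue
// r (0..alphabet-1) against query position i. During the comparison with any
// database sequence, the scores for residue r are then read sequentially.
std::vector<std::vector<int>> build_query_profile(
        const std::vector<int>& query,                 // residue-encoded query
        const std::vector<std::vector<int>>& subst,    // substitution matrix
        int alphabet = 20) {
    std::vector<std::vector<int>> profile(alphabet,
                                          std::vector<int>(query.size()));
    for (int r = 0; r < alphabet; ++r)
        for (size_t i = 0; i < query.size(); ++i)
            profile[r][i] = subst[r][query[i]];
    return profile;
}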

The performance of exact pairwise sequence comparison approaches is usually compared using the GCUPS metric, calculated by dividing the size of the dynamic programming matrices (m × n) by the execution time in seconds multiplied by 10⁹. In this section, we use GCUPS to express the performance of the solutions.

4.1.1 Manavski and Valle [27]

A multi-GPU-accelerated version of SW in CUDA was proposed in Ref. [27]. In order to optimize the access to the substitution matrix, the authors used query profile. Scores were restricted to 16-bit values. The database was sorted by size and blocks of 256 sequences were created.

The memory hierarchy of the NVIDIA GPU was carefully studied in order to decide where to place the most critical data. Since the query profile is highly accessed, the authors decided to place it in the GPU texture memory. Because of its size, the genomic database was placed into the global memory, and the dynamic programming cell values (matrices H and E) were also stored in global memory.

Each GPU thread computes the whole alignment between the query sequence and one target sequence (task-parallelism). Query sequences of up to 2050 amino acids were compared to the SwissProt genomic database, containing 250,296 protein sequences. Results of 1.889 GCUPS and 3.612 GCUPS were obtained for one and two GPUs NVIDIA GeForce 8800 GTX, respectively. When compared to the serial SW implementation called SSEARCH, this approach achieved a speedup of 15.81.

4.1.2 Ligowski and Rudnicki [38]

The approach proposed in Ref. [38] compares a query sequence against a database with SW, returning the best score obtained by each alignment. For a single comparison, the dynamic programming matrix is divided into groups of 12 rows, which are computed by a set of threads (data-parallelism). Each step processes 12 cells that are placed in shared memory and registers, with only two global memory transactions, giving additional speedup. This is an improvement over Manavski and Valle (see Section 4.1.1), who placed all cells in the slow global memory.

Results of 7.5 GCUPS and 14.5 GCUPS were obtained for one and two GPUs (NVIDIA 9800 GX2, dual-GPU), respectively, when comparing sequences of up to 1000 amino acids to the SwissProt genomic database containing 388,517 proteins.

4.1.3 Blazewicz et al. [29]

Blazewicz et al. [29] proposed a strategy to run SW (first and second phases) on multiple GPUs. Since the SW algorithm executes in quadratic space, only short sequences (≤2000 amino acids) were compared. In this proposal, coarse-grained parallelization is used at the top level. The algorithm executes as follows: each idle GPU retrieves one window (a set of tasks that can be executed in one kernel invocation) from the top of the queue until the queue is empty, on a self-scheduling (SS) basis. A task is defined as a single comparison between q and di.

The data structures are placed in the NVIDIA GPU memories as follows. Four additional boolean matrices, which are used in the second phase of the SW algorithm, are packed into 32-bit words and placed in global memory. The cells of matrices H and E are also placed in global memory. The cells of matrix F that are being calculated at a given moment are placed in shared memory. The substitution matrix is placed in constant memory, whereas the sequences are placed into texture memory.

In the tests, sets of randomly selected sequences from the Ensembl database were used. The maximum number of database sequences was 1600 and the maximum size of the query sequence was 1103 amino acids. With four homogeneous GPUs (NVIDIA Tesla S1070), a maximum of 11.5 GCUPS was attained. Note that this metric does not include the execution of the second phase of the SW algorithm. When compared to the Farrar SIMD CPU solution [55], this approach achieved speedups of 3.06 (one GPU) and 12.20 (four GPUs).

4.1.4 Li et al. [52]

Li et al. [52] proposed optimized strategies to compute SW scores and/or alignments of long sequences with the affine gap function. The first two strategies, called StripedScore and BlockedAntidiagonal, divide the DP matrix into strips of equal size and process the matrix by antidiagonals. The differences between these two strategies are mainly related to synchronization and block assignment.

The authors also proposed three strategies to retrieve the alignment. StripedAlignment retrieves the alignment in three phases. In the first phase, all DP matrices are calculated by antidiagonals, information is stored about the boundary columns of each strip in global memory, and the optimal score and its position are found. The values of the DP matrices that are being computed are stored in shared memory. The second phase executes serially, finding the cells that are the start and the end of the optimal alignment in each strip. Having these cells, each strip can recalculate its part of the DP matrices, producing the optimal alignment.

In the strategy ChunkedAlignment1, the DP matrices are divided into columns and rows, reducing the area reprocessed, as compared to StripedAlignment. This reduction in the reprocessed area is due to an additional data structure (horizontal buffer), which is stored in global memory. ChunkedAlignment2 stores more information in Phase 1, allowing Phase 3 to be executed in parallel.

The authors did an extensive evaluation of their strategies using the NVIDIA Tesla C2050, with query sequences of up to 51,030 amino acids. The proposed strategies were compared with three state-of-the-art tools, including CUDASW++ 2.0 (Section 4.1.6). The authors showed that their approach outperformed the state-of-the-art tools for long sequences. Also, the authors concluded that ChunkedAlignment1 provides the best performance for most cases. The results showed a maximum performance of 7.1 GCUPS.

4.1.5 Ino et al. [28, 30]

Ino et al. [28] proposed the use of a nondedicated grid composed of multiple heterogeneous GPUs to execute SW. Coarse-grained SW tasks were executed on a machine only while its screen saver was active. The authors proposed a master/slave architecture with three components: grid resources, a resource manager server, and clients, where the resource manager server is the master and the grid resources are the slaves.

Given x query sequences to be compared to a genomic database composed of y sequences, each task was responsible for the comparison of one query sequence to the whole database. Tasks were distributed using an SS policy. In the slaves, the GPU SW comparison was done by an OpenGL program. In the tests, 64 query sequences of size 367 were compared to the Swiss-Prot database release 51.0 (241,242 sequences). The platform was composed of eight heterogeneous GPUs (NVIDIA 8800 and 7900 series). In this platform, a maximum of 3.09 GCUPS was obtained.

The work of Ref. [28] was extended in Ref. [30] in order to take advantage of shorter idle periods; in this case, the screen saver was not used. As in Ref. [28], a master/slave architecture was used, and idle resources executed coarse-grained SW tasks. Tasks were divided into small subtasks of size 32K, and these subtasks were executed by the GPU. SW subtasks were distributed to the workers in a modified SS fashion, where resources with longer idle periods received tasks first. An SW subtask could be canceled if its resource changed state to busy; in this case, the subtask was reassigned to another idle resource. Because the subtasks were very small, the sequences being compared were placed in constant memory.

The tests compared iteratively a set of query sequences ranging from 63 to 511 amino acids to a genomic database. The results obtained in a platform composed of eight heterogeneous NVIDIA GPUs (five GTX 285, one GTX 295, one FX 5800, and one 8800 GTX), each one connected to a host machine, show that 64 GCUPS can be achieved.

4.1.6 Liu et al. [4547]

CUDASW++ 1.0 was proposed in Ref. [45], and it executed SW with affine gap in single and multiple GPUs, obtaining the optimal score. Two levels of parallelism (intertask and intratask) were proposed. Intertask parallelization is coarse-grained and assigns each query × database sequence comparison to a different thread. On the other hand, intratask parallelization uses multiple threads in a single query × database comparison (data-parallelism). Small query sequences were compared with intertask parallelization, whereas long query sequences used the intratask mode. In order to achieve load balancing, query sequences were ordered by their lengths. Also, great care was taken with the manipulation of the GPU memory hierarchy. Database sequences were placed in global memory and then accessed through a coalesced access pattern. Global memory was also used to store the DP matrices, which were organized in a way that allowed coalesced accesses. The DP cells that were being calculated in a given moment were stored in registers. Shared memory was used to store the substitution matrix, and the query sequence was placed in constant memory. Results of 16.08 GCUPS were obtained for the NVIDIA GeForce GTX 295 (dual-GPU), for a maximum query size of 5478 amino acids. Twenty-five query sequences were compared to the genomic database SwissProt, release 56.6, with 405,506 protein sequences. A speedup of 1.46 was obtained when comparing CUDASW++ 1.0 with the vectorized tool SWPS3 running on two Intel Xeon dual-core (four cores).

CUDASW++ 2.0 [46] integrated SIMD optimizations such as query profile and striped pattern into its design. Query profile was used in the intertask parallelization mode, and the striped pattern was used in both modes. Two implementations of SW were proposed. The first implementation was an optimized Single Instruction Multiple Thread (SIMT) version for the intertask parallelization, which used a sequential query profile and a packed data format. The second implementation was a variant of the striped SW described in Ref. [55] that divided the query sequences into small partitions that are aligned, one by one, with the subject sequence. Twenty query sequences were compared to SwissProt, release 56.6. Results with query sequences of up to 5478 amino acids were shown, achieving 16.9 GCUPS on NVIDIA GTX 280 and 28.8 GCUPS on GTX 295 (dual-GPU). With high gap penalties, CUDASW++ 2.0 achieved 17.8 GCUPS on GTX 280 and 29.7 GCUPS on GTX 295.

CUDASW++ 3.0 [47] proposed a hybrid approach that used both CPUs and GPUs to compare sequences. Load was distributed statically, considering the clock frequencies and number of cores (CPUs) and the number of symmetric multiprocessors and the number of cores (GPUs). The CPU code used query profile and the striped pattern and the GPU code used CUDA PTX SIMD instructions (low-level programming) combined with the idea of a striped pattern to accelerate the computation. Results were collected on an Intel i7 (quad-core) combined with three GPUs (one NVIDIA GTX 680 and a NVIDIA GTX 690, dual-GPU). Query sequences of up to 5478 amino acids were compared to the SwissProt genomic database (538,585 protein sequences), achieving a maximum GCUPS of 185.6 (4 cores + GTX 690).

4.1.7 Sandes and de Melo [3941] and Sandes et al. [42]

CUDAlign 1.0 [39] compares two huge DNA sequences with SW in a fine-grained way, assigning blocks of the matrix to threads in a parallelogram shape, and provides the optimal score as output. In order to increase performance, it divides the code that runs on the GPU into two kernels, optimized for specific situations. The input sequences are placed in texture memory, and global memory is used to store the border elements of the DP matrices. The DP cells being calculated are stored in shared memory and registers. Results were collected on the NVIDIA GTX 280 GPU, comparing sequences with lengths of up to 32 million base pairs (MBP) and achieving a maximum of 20.37 GCUPS.

CUDAlign 2.0 [40] retrieves the alignment between two huge DNA sequences with an adapted version of the MM algorithm, executing in linear space. It executes in six stages, where Stage 1 obtains the optimal score with CUDAlign 1.0, saving some rows to disk. Stages 2–5 execute a modified version of MM (see Section 2.1.1), retrieving the coordinates of the points that belong to the optimal alignment in a divide-and-conquer way. Stage 6 is optionally used to visualize the alignment. Results collected on the NVIDIA GTX 285, comparing sequences with sizes of up to 33 MBP, showed a maximum of 23.63 GCUPS. When compared to the nonvectorized z-align tool running on a cluster of 64 cores, CUDAlign achieved a speedup of 19.52.

CUDAlign 2.1 [41] calculates the alignment as in CUDAlign 2.0, adding the block pruning optimization, which is able to eliminate the computation of matrix cells that surely will not contribute to the optimal alignment. Results collected on the NVIDIA GTX 560 Ti, comparing sequences with sizes of up to 33 MBP, showed a maximum of 58.21 GCUPS.

CUDAlign 3.0 [42] executes SW with affine gap on multiple GPUs with fine-grained parallelism and outputs the optimal score. It uses circular buffers to overlap computation and communication, connecting the GPU nodes with TCP sockets. Results collected with up to 64 NVIDIA Tesla M2090 GPUs showed a maximum of 1726 GCUPS when comparing DNA sequences of 228 MBP × 249 MBP.

4.1.8 Comparative overview

Table 1 presents a comparative view of the solutions discussed in this section. Five solutions [27, 29, 30, 38, 42] use more than one GPU (2–64), four use one GPU [39–41, 52], three use one dual-GPU board [38, 45, 46], and one solution [47] combines GPU and CPU to compare biological sequences.

Table 1

Summary of GPU Solutions for SW Execution

Solution                     GPU                                       Variation        Filter   Parallelism      Maximum GCUPS
Manavski et al. [27]         2 × 8800 GTX                              Affine-gap       Yes      Task             3.61
Ino et al. [28]              8 × (8800, 9800)                          Affine-gap       Yes      Task             3.09
Ligowski et al. [38]         9800 GX2 (dual)                           Affine-gap       Yes      Data             14.50
Liu et al. [45]              GTX 295 (dual)                            Affine-gap       Yes      Task and data    16.08
Sandes and de Melo [39]      GTX 280                                   Affine-gap       Yes      Data             20.37
Liu et al. [46]              GTX 295 (dual)                            Affine-gap       Yes      Task and data    29.70
Blazewicz et al. [29]        4 × Tesla S1070                           Affine-gap       No       Task             11.50
Sandes and de Melo [40]      GTX 285                                   Affine-gap, MM   No       Data             23.63
Li et al. [52]               Tesla C2050                               Affine-gap       No       Data             7.10
Ino et al. [30]              8 × (GTX 285/295, FX 5800, 8800 GTX)      Affine-gap       Yes      Task             64.00
Sandes and de Melo [41]      GTX 560 Ti                                Affine-gap, MM   No       Data             58.21
Liu et al. [47]              GTX 690 (dual) + CPU (4 cores)            Affine-gap       Yes      Task and data    185.60
Sandes et al. [42]           64 × Tesla M2090                          Affine-gap       Yes      Data             1726.00


All solutions implement the affine-gap model, and most of them act as a filter, retrieving only the best score. Two solutions [40, 41] retrieve the best alignment in linear space on the GPU with the MM algorithm (see Section 2.1.1), and one solution [29] retrieves the optimal alignment of two small sequences on the GPU in quadratic space.

Most solutions either use data-parallelism, computing one comparison in a multithreaded way, or task-parallelism, assigning a different comparison to each thread. CUDASW++ 1.0 [45], 2.0 [46], and 3.0 [47] use both data- and task-parallelism to accelerate the computation.

Performance comparison of SW solutions is a critical issue because the results of each solution are collected on different hardware/software platforms. In order to address this problem, the GCUPS metric was proposed, and it is the metric currently used to compare GPU solutions for SW. In the last column of Table 1, we included the GCUPS achieved by each solution. We can observe an astonishing improvement in the performance of GPU solutions for SW, ranging from 3.09 GCUPS (2009) to 1726.00 GCUPS (2014). In our opinion, this was possible due to a combination of two factors: (a) the great advancement in GPU technology and (b) the advances in GPU-based parallelization techniques for the SW algorithm, which were able to generate highly optimized parallel versions.

4.2 GPU Solutions Using BLAST

As discussed in Section 2.1.2, BLAST is a heuristic algorithm that retrieves local alignments much faster than SW. Therefore the execution times of BLAST are usually low. However, in the last decade the genomic databases have experienced exponential growth, and many of these databases now have more than 20,000,000 sequences. In such scenarios, GPUs are a good alternative because they can be used to produce BLAST results faster. In the literature, we can find some solutions that use GPUs to compute BLAST alignments. These solutions are discussed in the next sections.

4.2.1 Vouzis and Sahinidis [31]

GPU-BLAST [31] targets BLASTP, and it is one of the most popular BLAST tools based on GPU. The authors determined that the first and second steps, seeding and extension (see Section 2.1.2), are the most computationally intensive and thus decided to parallelize these steps.

Both the GPU and the CPU compared a protein query sequence to protein sequences belonging to a database (subjects). Each thread received a different query × subject pair (task-parallelism), and sequences were sorted according to their lengths. The placement of data structures in the GPU memory was carefully designed: small bit vectors were stored in shared memory to accelerate the computation, large data tables and the database sequences were placed in global memory, and constant memory was used to store the query sequence.
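
This kind of memory placement can be illustrated with a short CUDA sketch. The kernel below is only schematic, with assumed names and a stand-in matching test; it does not reproduce the GPU-BLAST data structures.

#include <cuda_runtime.h>

#define MAX_QUERY_LEN 5000     // illustrative upper bound on the query length

// Query sequence in constant memory: read-only and broadcast to every thread.
__constant__ unsigned char d_query[MAX_QUERY_LEN];

__global__ void seed_count(const unsigned char *d_subjects, // database residues (global memory)
                           const int *d_offset,             // start of each subject sequence
                           const int *d_length,             // length of each subject sequence
                           int numPairs, int queryLen,
                           int *d_hits)                      // one hit counter per query x subject pair
{
    // Small lookup built from the query and kept in fast shared memory:
    // inQuery[a] tells whether amino acid a occurs in the query at all.
    __shared__ bool inQuery[32];
    for (int a = threadIdx.x; a < 32; a += blockDim.x) inQuery[a] = false;
    __syncthreads();
    for (int q = threadIdx.x; q < queryLen; q += blockDim.x) inQuery[d_query[q] % 32] = true;
    __syncthreads();

    // Task-parallelism: one thread handles one query x subject comparison.
    int pair = blockIdx.x * blockDim.x + threadIdx.x;
    if (pair >= numPairs) return;

    const unsigned char *subject = d_subjects + d_offset[pair];  // subjects stay in global memory
    int hits = 0;
    for (int i = 0; i < d_length[pair]; ++i)
        if (inQuery[subject[i] % 32]) ++hits;    // stand-in for the real seeding test
    d_hits[pair] = hits;
}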

The experimental results were collected in a machine with a NVIDIA Tesla C2050 GPU and a 6-core Intel Xeon. The sequences were distributed between the CPU and GPU proportional to their computing power. The genomic database env_nr (more than 6 million sequences) was compared to 51 query sequences whose lengths ranged from 2 to 4998 amino acids. GPU-BLAST (6-core and GPU) achieved a speedup of 1.6 when compared to the 6-core BLAST version and 4.0 when compared to the serial BLAST implementation.

4.2.2 Liu et al. [48]

In Ref. [48], the authors proposed mpiCUDA-BLASTP, which uses a distributed master-slave architecture that exploits data-parallelism in a GPU cluster. Sequences are sorted by their lengths, and the genomic database is split into batches of approximately the same size. Stages 1 and 2 (see Section 2.1.2) are processed independently by each thread (task-parallelism), whereas stage 3 is processed using data-parallelism, with each GPU thread computing a set of diagonals for each HSP. In order to improve performance, this set of diagonals is stored in shared memory.

In the tests, 14 sequences with sizes ranging from 505 to 2512 amino acids were compared to the nr database (more than 21 million sequences) in a GPU cluster composed of six nodes, each containing an AMD Opteron quad-core and one NVIDIA Tesla S1060. The tool mpiCUDA-BLASTP (six GPUs) was compared to GPU-BLAST (see Section 4.2.1) with one GPU, achieving a speedup of 6.6. When compared to the serial BLAST implementation, mpiCUDA-BLASTP achieved approximately a speedup of 26.

4.2.3 Zhang et al. [43]

A data-parallel approach called cuBLASTP was proposed in Ref. [43] to execute BLAST on GPU. In stage 1, each thread detected hit positions for a different word of size w in the current query × subject comparison, respecting the coalesced memory access pattern and using a data structure organized as bins over the columns. Since the bins index columns, an additional phase called hit reorganization maps the bins onto diagonals. The second stage is also executed in a fine-grained way, and the authors use a hierarchical buffer to place data structures in shared and global memory. Accesses to global memory are coalesced.

Experimental results were collected separately in two machines. The first machine had an Intel i5 connected to the NVIDIA K20c GPU, and the second machine had the same CPU connected to an NVIDIA Fermi board. Two genomic databases were used in the tests: nr, with 6 million sequences, and SwissProt, with 300,000 sequences. Three query sequences with 127, 517, and 1054 amino acids were compared to these databases. When compared to GPU-BLAST, cuBLASTP achieved a maximum speedup of 2.8. A speedup of 7.8 was obtained when compared to the serial FSA-BLAST.

4.2.4 Comparative overview

Table 2 presents a comparative overview of BLAST solutions for GPUs. As can be seen, all solutions implement BLASTP, comparing protein query sequences to genomic databases.

Table 2

Summary of GPU Solutions for BLAST Execution

Solution | GPU | Algorithm | Filter | Maximum Speedup
Vouzis and Sahinidis [31] | Tesla C2050 + 6-core CPU | BLASTP | Yes | 4.0
Liu et al. [48] | 6 × Tesla S1060 | BLASTP | No | 26.0
Zhang et al. [43] | K20c | BLASTP | Yes | 7.8

Two solutions [31, 43] act as filters, retrieving only the BLAST score. The BLAST algorithm is executed entirely on the GPU by Ref. [48], giving as output the alignments whose scores are greater than a given threshold. One solution [31] takes advantage of task-parallelism, one solution [43] targets data-parallelism, and one solution [48] exploits both forms of parallelism. When compared to the serial BLAST algorithm, the best speedup achieved was 26.0 [48], using six GPUs.

5 GPU Solutions for Sequence-Profile Comparison

Given the high acceptance of pairwise sequence comparison solutions on GPUs, several implementations of sequence-profile comparison algorithms, targeting GPUs as a parallel computing platform, have also been proposed. In this section, we describe several GPU solutions for the sequence-profile comparison problem.

5.1 GPU Solutions Using the Viterbi Algorithm

We present first the GPU solutions proposed for sequence-profile comparison targeting the Viterbi algorithm. These solutions are based on HMMER2, where the hmmsearch tool uses the Viterbi algorithm as its main step. Because the Viterbi algorithm has quadratic time complexity, it is an excellent candidate for parallelization.

5.1.1 Horn et al. [32]

ClawHMMer [32] is an implementation of the HMMER2 hmmsearch tool on GPU, exploiting sequence parallelism. It is based on the Brook stream programming language for GPUs rather than on the CUDA programming model, which was not yet available at the time of the implementation. It implements the Viterbi algorithm based on profile HMMs that do not have the complete Plan7 structure: there are no N and C states, so local alignments (with respect to the sequence) between the sequence and the profile HMM are not allowed.

In a preprocessing step performed on the CPU, the sequence database was sorted by sequence length in order to provide load balance. The sequence database was also divided into smaller parts (usually denominated batches) that fit in the GPU memory. This way, sequences in the same batch had similar lengths, and different batches were executed sequentially on the GPU. The GPU solution worked as a filter, storing only two rows of the score matrices. Therefore, if a homology hit was detected, the Viterbi algorithm was reexecuted on the CPU, followed by the traceback operation. The transition probabilities were allocated into registers, while the emission probabilities were stored in the GPU texture memory.
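
The sorting and batching step runs entirely on the host. The sketch below shows one possible way to build length-sorted batches that fit under a GPU memory budget; all names and the batch-size policy are illustrative, not taken from ClawHMMer.

#include <algorithm>
#include <string>
#include <vector>

struct Sequence { std::string id; std::string residues; };

// Sort the database by sequence length and split it into batches whose total
// residue count stays below an assumed GPU memory budget.
std::vector<std::vector<Sequence>> make_batches(std::vector<Sequence> db,
                                                size_t gpuMemBudgetBytes)
{
    // Sorting by length keeps sequences of similar size in the same batch,
    // which improves load balance among the GPU threads.
    std::sort(db.begin(), db.end(),
              [](const Sequence &a, const Sequence &b) {
                  return a.residues.size() < b.residues.size();
              });

    std::vector<std::vector<Sequence>> batches(1);
    size_t used = 0;
    for (auto &s : db) {
        size_t need = s.residues.size();
        if (used + need > gpuMemBudgetBytes && !batches.back().empty()) {
            batches.emplace_back();     // start a new batch; batches run sequentially on the GPU
            used = 0;
        }
        used += need;
        batches.back().push_back(std::move(s));
    }
    return batches;
}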

The performance evaluation executed the GPU solution on different GPUs (ATI 9800XT, ATI X800XT PE, NVIDIA 6800 Ultra, NVIDIA 7800 GTX, ATI R520) and compared it to the HMMER 2.3.2 hmmsearch tool running on an Intel Pentium 4 Xeon and on a PowerPC G5 (with and without AltiVec vector instructions enabled). The experiments compared 2.4 million proteins from the NCBI protein database [56] to profile HMMs (with 139–3271 states, not nodes) from the Pfam database [21]. The best results for ClawHMMer were obtained on the ATI R520 GPU, which outperformed HMMER2 on the three platforms. It reached average speedups of 26.5 and 2.7 compared to HMMER2 executing on the Intel Pentium 4 and on the PowerPC G5 (with AltiVec vector instructions enabled), respectively.

The authors also performed experiments on a 16-node cluster with a GPU Radeon 9800 Pro in each node. A scheme to distribute the sequence database among the nodes was applied in order to provide load balance and achieve scalability of the solution. According to the authors, for 16 nodes, the achieved speedup was 95% of the ideal speedup.

5.1.2 Du et al. [44]

Du et al. [44] implemented the Viterbi algorithm on GPU, based on a profile HMM structure with only the states M, I, and D. They used the CUDA programming model and exploited a data-parallelism approach. Since the profile HMM structure does not have the state J, the data dependency chain M[i−1, 1…Q] → J[i−1] → M[i, j], described in Section 2.2.2, is broken. Therefore the cells in the same antidiagonal of the dynamic programming score matrices M, I, and D become independent of each other. As a result, all cells in the same antidiagonal are computed in parallel, while successive antidiagonals are computed sequentially, respecting the dependencies. In this case, we say that the solution exploits antidiagonal parallelism. This approach enables fine-grained parallelism at the expense of accuracy, because multihit alignments between the sequence and the profile HMM are not possible without the state J.

The authors implemented three different approaches: wavefront, streaming, and tile-based. In the wavefront approach, the score matrices are stored completely in the GPU memory and computed exploiting antidiagonal parallelism. These matrices are organized in a skewed form, so that an antidiagonal can be treated as a row and the cells in the same antidiagonal occupy contiguous positions in memory. As a consequence, when the parallel threads access the cells of a same antidiagonal, contiguous memory positions are accessed, producing more efficient memory transfers. Global memory is used, respecting the coalesced memory access pattern. The traceback operation is performed on the GPU. Because the entire score matrices are allocated in the GPU memory, this first solution is not able to handle sequences longer than 1000 residues.
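
The skewed layout can be sketched as follows: each antidiagonal is stored contiguously, so thread k of a kernel launch reads its three neighbours from fixed offsets in the two previous antidiagonals. A simplified single-matrix recurrence stands in for the full M/I/D equations, and the buffers are assumed to be padded to a common length and initialized with a very small value outside their valid range; only the storage pattern follows the description above.

__global__ void wavefront_step(const float *prev2,   // antidiagonal d-2, stored contiguously
                               const float *prev1,   // antidiagonal d-1, stored contiguously
                               float       *curr,    // antidiagonal d being computed
                               const float *emit,    // per-cell scores for antidiagonal d
                               int          len)     // number of valid cells on antidiagonal d
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;   // position along the antidiagonal
    if (k >= len) return;

    // With cell (i, j) of antidiagonal d = i + j stored at position j, its
    // neighbours lie at positions j and j-1 of the two previous antidiagonals.
    float up   = prev1[k];                            // cell (i-1, j)
    float left = (k > 0) ? prev1[k - 1] : -1.0e30f;   // cell (i, j-1)
    float diag = (k > 0) ? prev2[k - 1] : -1.0e30f;   // cell (i-1, j-1)

    // Simplified recurrence: the real Viterbi update combines M, I, and D.
    curr[k] = emit[k] + fmaxf(diag, fmaxf(up, left));
}

// On the host, antidiagonals are processed one after another, rotating the three
// buffers between kernel launches: successive antidiagonals remain sequential,
// while all cells of one antidiagonal are computed in parallel.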

In the streaming solution, only three antidiagonals of the score matrices are allocated in the GPU memory, the current one being calculated and the two previous ones. The complete matrices are stored in the CPU memory. When the GPU finishes computing an antidiagonal, it is transferred to the CPU, while the next antidiagonal is computed, overlapping GPU computation and GPU-CPU transfers. Because the GPU memory does not store the whole score matrices, the traceback operation is executed on the CPU. The streaming approach provides a solution that is able to handle longer sequences; however, the GPU-CPU transfers can have a negative impact on performance.

The tile-based solution applies a preprocessing step on the CPU to find homologous segments between the sequences. The homologous segments are locally aligned and used to divide the score matrices into smaller and independent tiles, which correspond to submatrices of the score matrices. As long as a tile fits entirely in the GPU memory, the tiles are compared on the GPU, sequentially or in parallel, depending on the tile sizes. The tile-based approach can handle longer sequences and is faster than the streaming approach, because it needs to calculate only a portion of the score matrices. The authors describe this approach for pairwise comparison only; its extension to sequence-profile comparison is not provided.

The performance evaluation executed the GPU solutions on a GPU NVIDIA GeForce 9800 GTX and compared them to a serial implementation of the Viterbi algorithm running on an Intel dual-core processor. The experiments compared artificially generated sequences (lengths between 100 and 1000) and profile HMMs. The wavefront, streaming, and tile-based approaches achieved speedups of 22.3, 20.8, and 24.5, on average, respectively.

5.1.3 Walters et al. [33]

Walters et al. [33] proposed an implementation of HMMER2 hmmsearch tool on GPU, using the CUDA programming model. They implemented the Viterbi algorithm based on Plan7 profile HMMs and exploited sequence parallelism, with each thread operating on a different sequence from the sequence database. The number of parallel threads (and, as a consequence, the number of sequences that can be compared in parallel) was limited by the number of registers used by each thread. In a preprocessing step performed on the CPU, the sequence database was sorted by sequence length, in order to provide load balance.

Only two rows of the score matrices were stored in GPU global memory, and the GPU solution worked as a filter. Therefore, if a homology hit was detected, the Viterbi algorithm was reexecuted on the CPU, followed by the traceback operation. Loop unrolling was applied to the inner loop (which iterates over the profile HMM nodes); however, it increased the register pressure.

The allocation of the data structures in the GPU memory hierarchy was carefully designed. Score matrices were organized in a way to provide coalesced access to global memory. Transition and emission probabilities were kept in GPU constant and texture memories, while the sequences were allocated in texture memory. Shared memory was used to store temporarily the current symbol of the sequence of each thread, which was used to index the emission probabilities.
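
A schematic CUDA kernel for this kind of sequence parallelism is shown below. Each thread scans one database sequence, keeps only two rows of its score matrix, and writes a single score used for filtering. Only a simplified match-state recurrence is shown, and all names, layouts, and the workspace organization are assumptions rather than the actual implementation of [33].

__global__ void viterbi_filter(const char *d_seqs, const int *d_off, const int *d_len,
                               int numSeqs, int Q,          // Q = number of profile HMM nodes
                               const float *d_match,        // match emission scores, Q x 20
                               float *d_rows,               // workspace: 2 rows of (Q+1) floats per thread
                               float *d_best)               // best score per sequence (filter output)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;          // one thread per database sequence
    if (s >= numSeqs) return;

    // Two-row workspace for this thread: previous and current row of its score matrix.
    float *prev = d_rows + (size_t)s * 2 * (Q + 1);
    float *curr = prev + (Q + 1);
    for (int j = 0; j <= Q; ++j) prev[j] = -1.0e30f;
    prev[0] = 0.0f;

    float best = -1.0e30f;
    const char *seq = d_seqs + d_off[s];
    for (int i = 0; i < d_len[s]; ++i) {                    // rows: residues of the sequence
        int aa = seq[i];                                    // residues assumed preencoded as 0..19
        curr[0] = -1.0e30f;
        for (int j = 1; j <= Q; ++j) {                      // columns: profile HMM nodes
            float m = prev[j - 1] + d_match[(j - 1) * 20 + aa];
            curr[j] = m;
            best = fmaxf(best, m);
        }
        float *t = prev; prev = curr; curr = t;             // reuse the two rows
    }
    d_best[s] = best;   // if this score exceeds a threshold, the CPU redoes the full Viterbi
}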

The performance evaluation executed the GPU solution on a GPU NVIDIA GeForce 8800 GTX Ultra and compared it to a serial implementation of the Viterbi algorithm running on an AMD Athlon 275. The experiments compared over 5.5 million sequences from the NCBI nonredundant database [56] (with sequence lengths from 6 to 37,000) to five profile HMMs (with 779–1431 nodes) from the HMMER suite and the Pfam database [21]. The final optimized solution achieved a speedup of 27.7, on average.

5.1.4 Yao et al. [34]

CuHMMer [34] also presents an implementation of the HMMER2 hmmsearch tool on GPU, using the CUDA programming model. It implements the Viterbi algorithm based on Plan7 profile HMMs and exploits sequence parallelism. In a preprocessing step performed on the CPU, the sequences from the sequence database were organized, based on their length, in groups corresponding to length ranges. In the same way as in the sequence database sorting technique, the goal was to provide load balance.

The GPU solution worked as a filter; consequently, if a homology hit was detected, the Viterbi algorithm was reexecuted on the CPU, followed by the traceback operation. The sequences were allocated in GPU global memory, while the score matrices were stored in shared and texture memories. The authors developed a multithreaded solution for the CPU, which could also process sequences while the GPU was executing. This way, the workload was partitioned between the CPU and the GPU, and the CPU did not stay idle while the GPU was executing.

The performance evaluation executed CuHMMer on a GPU NVIDIA GeForce 8800 GTX attached to an Intel Core2 E7200, and compared it to the HMMER2 hmmsearch tool running on an AMD Athlon64 X2 Dual Core 3800+. The experiments compared four different sequence databases to four profile HMMs (with 37–461 nodes) from the Pfam database [21]. The best results were achieved with the UniProtKB/Swiss-Prot sequence database [57], producing a speedup of 35.5, on average.

5.1.5 Ganesan et al. [49]

Ganesan et al. [49] proposed an implementation of the HMMER2 hmmsearch tool on GPU, using the CUDA programming model. They implemented the Viterbi algorithm based on Plan7 profile HMMs and exploited task- and data-parallelism. The comparison of a sequence from the sequence database to the profile HMM was assigned to multiple threads, and different sequences from the database were compared in parallel by different sets of threads.

The authors implemented a method to parallelize the evaluations of the Viterbi algorithm recurrence equations by partitioning the chain of dependencies in a regular manner. Using this method, they calculated cells of the same row of score matrices in parallel, while successive rows were computed sequentially. The method started by dividing each row into partitions, where a partition is a set of contiguous cells, and by computing the anchors, where an anchor is the initial cell of each partition. The anchors were computed sequentially, with each anchor depending only on the previous one in the row. Then all cells in the row were computed in parallel, each one depending only on the anchor of its partition. Coalesced accesses to global memory were achieved by storing lookup data of the profile HMM contiguously.

The performance evaluation executed the GPU solution on four NVIDIA Tesla C1060 GPUs and compared it to the Walters et al. [33] proposal running on the same platform, as well as to a serial implementation of the Viterbi algorithm running on an AMD Opteron. The experiments compared the NCBI nonredundant database [56] (with 10.54 million sequences) to three profile HMMs (with 128–507 nodes) from the Pfam database [21]. The proposed approach achieved speedups of 5–8 with respect to [33] and of 100 compared to the serial implementation.

5.1.6 Ferraz and Moreano [36]

Ferraz and Moreano [36] presented an implementation of the HMMER2 hmmsearch tool on GPU, using the OpenCL [58] programming model, and evaluated several optimizations. They implemented the Viterbi algorithm based on Plan7 profile HMMs and exploited sequence parallelism. In a preprocessing step performed on the CPU, the sequences from the sequence database were sorted, based on their length. Only two rows of the score matrices were stored in GPU memory, and the GPU solution worked as a filter. Therefore if a homology hit was detected, the Viterbi algorithm was reexecuted on the CPU, followed by the traceback operation.

The memory hierarchy of the GPU was carefully studied in order to decide where to allocate the data structures. The score matrices were allocated in GPU local memory, which provided coalesced access in a more effective way and simplified index computations, in comparison to global memory. The sequences were stored in GPU global memory and organized in a way to provide coalesced access. Regular transition probabilities (for states M, I, and D) were stored in constant memory, while special transition probabilities (for special states) were kept in shared memory and replicated for each warp, so that no barrier synchronization was needed. Emission probabilities were stored in global memory.

The authors applied instruction scheduling and register renaming in the GPU implementation of the Viterbi algorithm in order to increase the distance between dependent instructions and to eliminate false dependencies between instructions. They also applied loop unrolling to the Viterbi algorithm inner loop (which iterates over the profile HMM nodes) with unroll factor 8, combined with instruction scheduling, in order to reduce the loop overhead and facilitate instruction scheduling.

The performance evaluation executed the GPU solution on a GPU NVIDIA GeForce GTX 460 and compared it to the Walters et al. [33] proposal running on the same GPU, and to the HMMER2 hmmsearch tool running on an AMD Athlon II X3. The experiments compared the entire UniProtKB/Swiss-Prot sequence database [57] (with more than 500,000 sequences) to the Top-20 profile HMMs (with 23–488 nodes) from the Pfam database [21]. The optimized solution achieved speedups of 15.3 and 49.3, on average, compared to [33] and to HMMER2, respectively. The authors also reported a maximum of 0.21, 0.67, and 10.05 GCUPS for HMMER2, [33], and the proposed GPU solution, respectively.

5.2 GPU Solutions Using the MSV Algorithm

We present here the GPU solutions for sequence-profile comparison targeting the MSV algorithm. These solutions are based on HMMER3, where the hmmsearch tool uses the MSV algorithm as its first filter in the chain of filters. The MSV algorithm is a simplified version of the Viterbi algorithm and also has quadratic time complexity; therefore it is an excellent candidate for parallelization, too.

5.2.1 Li et al. [35]

Li et al. [35] implemented the HMMER3 hmmsearch tool on GPU, using the CUDA programming model and exploiting sequence parallelism. In a preprocessing step performed on the CPU, the sequences from the sequence database were sorted, in decreasing order, based on their length. The authors developed a multithreaded solution for the CPU, which can also process sequences while the GPU is executing, overlapping GPU computation and GPU-CPU transfers. The GPU executed only the MSV algorithm, which is the first filter in the chain of filters of HMMER3, and the CPU executed the MSV algorithm and the other filters in the chain.

The score matrices of the MSV algorithm were stored in GPU global memory and organized in a way to provide coalesced access. Emission probabilities were allocated in texture memory. The outer loop of the MSV algorithm (which iterates over the sequence residues) was unrolled by a factor of 2. The score B[i] of the second unrolled iteration was computed speculatively, without considering the transition J → B. This way, several memory accesses were eliminated, because intermediate values were kept in registers. The correct value of B[i] was also computed and compared to the speculative value. If the speculation failed, the sequence was tagged for recalculation, and the MSV algorithm was executed again for this sequence on the CPU. According to the authors, at most 1% of the sequences in their test cases failed the speculation.
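
The speculation can be summarized by a small device function: the speculative B[i] omits the J → B term, the exact value is computed as well, and a mismatch marks the sequence for recomputation on the CPU. Names follow the MSV notation used in the text; this is a hedged sketch, not the code of [35].

__device__ float speculative_B(float N_i,         // N state score at row i
                               float J_i,         // J state score at row i
                               float tJB,         // J -> B transition score
                               bool *recompute)   // set when the speculation fails
{
    float B_spec = N_i;                        // speculative value: the J -> B term is ignored
    float B_true = fmaxf(N_i, J_i + tJB);      // exact value
    if (B_true != B_spec)
        *recompute = true;                     // the sequence is rerun on the CPU
    return B_spec;
}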

The performance evaluation executed the GPU solution on a GPU NVIDIA Tesla C2050 attached to an Intel Xeon E5506 Quad Core and compared it to the HMMER3 hmmsearch tool running on a single core with SSE instructions enabled. The experiments compared the NCBI nonredundant database [56] (with 15.2 million sequences with lengths from 5 to 41,943) to six profile HMMs (with 63–1774 nodes) from the Pfam database [21]. The best speedup achieved was 6.5, obtained with a profile HMM of length 423.

5.2.2 Cheng and Butler [37]

Cheng and Butler [37] implemented the HMMER3 MSV algorithm on GPU, using the CUDA programming model and exploiting sequence parallelism. In a preprocessing step performed on the CPU, the sequences from the sequence database were sorted, in decreasing order, based on their length. The CPU also processed sequences while the GPU was executing, instead of staying idle. The workload was partitioned between the CPU and the GPU in such a way that the longest sequences from the database were assigned to the CPU, in order to save the GPU memory allocated to sequences and reduce the CPU-GPU transfer overhead. The score matrices were stored in GPU global memory and organized so that coalesced access was provided. Emission probabilities were allocated in texture memory. SIMD video instructions of the GPU were used, providing a limited form of data-parallelism.
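
The CUDA SIMD video intrinsics operate on four 8-bit lanes packed into one 32-bit register, so four score cells can be updated per instruction. The device function below is a simplified stand-in for one MSV cell update using such intrinsics (__vmaxs4 and __vaddss4); the lane packing and saturation handling of the actual implementation may differ.

__device__ unsigned int msv_update4(unsigned int prevM4,  // four packed M scores from the previous row
                                    unsigned int B4,      // B[i-1] replicated in the four lanes
                                    unsigned int emit4)   // four packed emission scores
{
    unsigned int best = __vmaxs4(prevM4, B4);   // per-byte signed maximum of the two sources
    return __vaddss4(best, emit4);              // per-byte signed saturated addition of the emissions
}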

The performance evaluation executed the GPU solution on a GPU NVIDIA Quadro K4000 attached to an Intel Core i7-3960X and compared it to the HMMER3 hmmsearch tool running on a single core with SSE instructions enabled. The experiments compared the entire UniProtKB/Swiss-Prot sequence database [57] (with 540,958 sequences with lengths from 2 to 35,213) to the Pfam family database [21] (with 14,831 profile HMMs with 7–2207 nodes). The solution achieved a speedup of 1.8, on average, compared to HMMER3. The authors also reported the results of 8.05 and 14.25 GCUPS for HMMER3 and the proposed GPU solution, respectively.

5.2.3 Araújo Neto and Moreano [50]

Araújo Neto and Moreano [50] implemented the HMMER3 hmmsearch tool on GPU, using the CUDA programming model, and evaluated several optimizations. In a preprocessing step performed on the CPU, the sequences from the sequence database were sorted based on their length. The GPU executed the MSV algorithm, which is the first filter in the chain of filters of HMMER3, and the CPU executed the other filters in the chain. The authors exploited task-parallelism by assigning distinct sequences from the sequence database to distinct CUDA blocks; therefore several sequences were compared simultaneously by executing several blocks concurrently on the GPU. The cells in the same row of the score matrix M were computed in parallel by the threads of a block, exploiting data-parallelism, while successive rows were calculated sequentially.

After computing each row of matrix M, a reduction operation was performed to obtain the score E[i] as the maximum among the cells in this row (as described in Section 2.2.3). A reduction operation on Q values is usually performed in parallel, in ⌈log2 Q⌉ sequential steps separated by barrier synchronizations. The authors optimized this reduction, applying the loop unrolling technique to the ⌈log2 Q⌉ steps, in order to simplify index computations and reduce loop overhead and, for the last six steps, to remove the barrier synchronizations between them, because the threads in a warp (group of 32 threads) execute synchronously.
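
This optimization corresponds to the classic unrolled shared-memory reduction, sketched below for a block of 256 threads: the first steps still need barriers, while the last six steps (strides 32 down to 1) run within a single warp and use no __syncthreads(). The warp-synchronous idiom matches the description above; on more recent architectures with independent thread scheduling it would additionally require __syncwarp(). The names and the block size are illustrative.

#define BLOCK 256   // illustrative block size

__device__ void warp_reduce_max(volatile float *s, int tid)
{
    // Last six steps of the reduction: a single warp, no barriers needed.
    s[tid] = fmaxf(s[tid], s[tid + 32]);
    s[tid] = fmaxf(s[tid], s[tid + 16]);
    s[tid] = fmaxf(s[tid], s[tid + 8]);
    s[tid] = fmaxf(s[tid], s[tid + 4]);
    s[tid] = fmaxf(s[tid], s[tid + 2]);
    s[tid] = fmaxf(s[tid], s[tid + 1]);
}

__global__ void row_max(const float *M_row, int Q, float *E_i)
{
    __shared__ float s[BLOCK];
    int tid = threadIdx.x;

    // Each thread folds several cells of the row, then the block reduces them.
    float v = -1.0e30f;
    for (int j = tid; j < Q; j += BLOCK) v = fmaxf(v, M_row[j]);
    s[tid] = v;
    __syncthreads();

    if (tid < 128) s[tid] = fmaxf(s[tid], s[tid + 128]);
    __syncthreads();
    if (tid < 64) s[tid] = fmaxf(s[tid], s[tid + 64]);
    __syncthreads();
    if (tid < 32) warp_reduce_max(s, tid);
    if (tid == 0) *E_i = s[0];               // E[i] = maximum of row i of matrix M
}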

The placement of the data structures in the GPU memory hierarchy was chosen by comparing the results of several experiments. Accesses to global memory were coalesced, the score matrix was stored in shared memory, and emission probabilities were allocated in texture memory.

They applied the tiling technique to the GPU solution, with each thread being associated with several cells instead of only one. This way, the solution could handle profile HMMs and sequences whose lengths exceed the maximum number of threads used, and the GPU occupancy improved, because the workload assigned to each thread increased.

Other optimizations applied were: loop unrolling (factor 8) of the outer loop of the MSV algorithm (which iterates over the residues of the sequence); overlapping GPU computation and CPU-GPU transfers; using pinned memory in CPU-GPU transfers; representing scores with natural numbers; vectorized access to emission probabilities; keeping frequently used data in registers; and reducing thread divergences. Two of these optimizations, pinned memory and the overlap of transfers with computation, are sketched below.
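
The sketch uses cudaHostAlloc for a page-locked staging buffer and cudaMemcpyAsync on a stream; the function and buffer names are illustrative, and error checking is omitted.

#include <cstring>
#include <cuda_runtime.h>

void send_batch_async(const char *hostSeqs, size_t bytes, char *devSeqs, cudaStream_t stream)
{
    char *pinned = nullptr;
    cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocDefault);   // page-locked staging buffer
    std::memcpy(pinned, hostSeqs, bytes);

    // Asynchronous copy on a stream: it returns immediately, so a kernel
    // launched next on the same stream runs after the copy completes, while
    // the CPU is free to run the remaining HMMER3 filters or prepare the next batch.
    cudaMemcpyAsync(devSeqs, pinned, bytes, cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);   // wait before the staging buffer is reused or freed
    cudaFreeHost(pinned);
}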

The performance evaluation executed the GPU solution on a GPU NVIDIA GeForce GTX 570 attached to an Intel Core i7-3770S and compared it to the HMMER3 hmmsearch tool running on this CPU with SSE instructions enabled. The experiments compared the entire UniProtKB/Swiss-Prot sequence database [57] (with 540,732 sequences with lengths from 2 to 35,213) to 10 profile HMMs (with 256–1774 nodes) from the Pfam family database [21]. The solution achieved speedups of 5.3 and 16.0, on average, compared to HMMER3 1-core/SSE and 4-cores/SSE configurations, respectively. The authors also reported a maximum of 26.27, 82.30, and 372.06 GCUPS for HMMER3 1-core/SSE, HMMER3 4-cores/SSE, and the proposed GPU solution, respectively.

5.3 Comparative Overview

Table 3 summarizes the GPU-based sequence-profile comparison proposals discussed in this section, presenting a comparative overview of them. Only two solutions [32, 49] use more than one GPU, and three solutions [34, 35, 37] also execute the sequence-profile comparison algorithm (Viterbi or MSV) on the CPU in parallel to the GPU execution.

Table 3

Summary of GPU Solutions for Sequence-Profile Comparison

Solution | GPU | Algorithm and HMM Structure | Filter | Maximum Speedup
Horn et al. [32] | ATI R520 | Viterbi, incomplete Plan7 | Yes | 26.5
Du et al. [44] | 9800 GTX | Viterbi, only M, I, D states | Yes | 24.5
Walters et al. [33] | 8800 GTX Ultra | Viterbi, Plan7 | Yes | 27.7
Yao et al. [34] | 8800 GTX + Intel Core 2 | Viterbi, Plan7 | Yes | 35.5
Ganesan et al. [49] | 4 × Tesla C1060 | Viterbi, Plan7 | Yes | 100.0
Ferraz and Moreano [36] | GTX 460 | Viterbi, Plan7 | Yes | 49.3
Li et al. [35] | Tesla C2050 + Intel Xeon Quad-core | MSV | Yes | 6.5
Cheng and Butler [37] | Quadro K4000 + Intel Core i7 | MSV | Yes | 1.8
Araújo Neto and Moreano [50] | GTX 570 | MSV | Yes | 16.0

All solutions based on the Viterbi algorithm act as a filter, retrieving only the best alignment score. Because the MSV algorithm is the first filter in the chain of filters of HMMER3, the solutions based on it produce only a similarity score, based on which the sequence is discarded or forwarded to the next filter in the chain. One of the approaches proposed in Ref. [44] performs the traceback operation on the GPU and therefore obtains the optimal alignment; however, it is not able to handle long sequences.

The reported speedups were measured by each individual research group through different experiments. Different input data sets (profile HMMs and sequence databases) were used, and the GPU solutions were compared to different baseline solutions, executed on different computers (exclusively serial solutions, multicore solutions, solutions using SIMD instructions, etc.). Unfortunately, most works do not present throughput results in GCUPS, which would enable a more adequate comparison between them; only three solutions [36, 37, 50] present GCUPS results.

Even though the achieved speedups cannot be directly compared due to the variations in the experimental setups, some observations can be drawn from this table. Among the solutions targeting the Viterbi algorithm, the best result is produced by the only proposal [49] that exploits both task- and data-parallelism. Also, this result is based on the use of four GPUs, while all other results are based on the use of only one GPU. Among the solutions targeting the MSV algorithm, the best result is also produced by the only proposal [50] that exploits both task- and data-parallelism.

Finally, there is a significant difference between the performance results achieved by the GPU solutions based on the Viterbi algorithm and those targeting the MSV algorithm. The main reason is that the results of the solutions based on the MSV algorithm are compared to the HMMER3 implementation of this algorithm, which uses SIMD SSE instructions, while the results of the solutions targeting the Viterbi algorithm are compared to a serial implementation of the algorithm or to the HMMER2 implementation, without SIMD instructions.

6 Conclusion and Perspectives

In Bioinformatics, we find several problems that involve huge processing times and memory consumption because of the sizable biological data sets and the inherent complexity of the corresponding algorithms. Therefore exploiting parallelism becomes necessary in problems such as sequence comparison. Current GPUs are highly parallel platforms, offering performance and power efficiency at comparatively low cost. In this chapter we showed how GPUs can be used efficiently to accelerate pairwise sequence comparison and sequence-profile comparison.

Even though the first GPU-based solutions for sequence comparison appeared in 2005 and significant speedups have been obtained since then, this is still a very active research area, with many challenging problems yet to be investigated. An important issue that must be addressed is the long development time to produce correct and high-performance GPU programs. Although the software tools and programming models (such as CUDA and OpenCL) have made programming on GPUs much easier than before, the design of an efficient GPU solution still requires both a long development time and good skills in parallel programming, algorithm and compiler optimization techniques, and computer architectures. Higher-level programming models (such as OpenACC) offer an alternative; however, they also bring a trade-off between productivity and high performance.

Today’s developers also face the problem of performance portability on high-performance platforms: because architectural features, such as memory resources, vary among different GPU generations, it is hard to keep a single implementation efficient across them.

Usually, the performance and scalability of the GPU solutions are limited by GPU memory capacity and by the bandwidth of the CPU-GPU connecting bus. If the bandwidth to feed the GPU with sequences and profile HMMs is insufficient, it will likely become a severe bottleneck, with the GPU remaining bandwidth-constrained and unable to process data at its maximum potential capacity.

A current trend in high-performance computing is the adoption of heterogeneous computing systems, that is, large-scale computing clusters of heterogeneous nodes equipped with multicore CPUs, GPUs, and FPGAs. However, such systems require a combination of multiple and different programming paradigms, making application development very challenging. Therefore there is a need for a complete programming framework that comprehends the difficulties of parallel systems with heterogeneous computing units and that combines a simplified programming model, support for different devices, and support for high-performance internode communication.

References

[1] Mandoiu I.I., Zelikovsky A. Bioinformatics Algorithmics—Techniques and Applications. New Jersey: John Wiley & Sons, Inc., 2008.

[2] Krane D.E., Raymer M.L. Fundamental Concepts of Bioinformatics. San Francisco: Benjamin Cummings; 2003.

[3] Durbin R., Eddy S., Krogh A., Mitchison G. Biological Sequence Analysis. Cambridge: Cambridge University Press; 1998.

[4] Baxevanis A.D., Ouellette B.F.F. Bioinformatics—A Practical Guide to the Analysis of Genes and Proteins. New York: John Wiley & Sons, Inc., 2005.

[5] Gusfield D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge: Cambridge University Press; 1997.

[6] Mount D.W. Bioinformatics: sequence and genome analysis. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press; 2004.

[7] Needleman S.B., Wunsch C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970;48(3):443–453.

[8] Smith T.F., Waterman M.S. Identification of common molecular subsequences. J. Mol. Biol. 1981;147(1):195–197.

[9] Gotoh O. An improved algorithm for matching biological sequences. J. Mol. Biol. 1982;162(3):705–708.

[10] Myers E.W., Miller W. Optimal alignments in linear space. Comput. Appl. Biosci. 1988;4(1):11–17.

[11] Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. Basic Local Alignment Search Tool. J. Mol. Biol. 1990;215(3):403–410.

[12] Lipman D.J., Pearson W.R. Rapid and sensitive protein similarity search. Science. 1985;227:1435–1441.

[13] Hirschberg D.S. A linear space algorithm for computing maximal common subsequences. Commun. ACM. 1975;18(6):341–343.

[14] States D.J., Gish W. Combined use of sequence similarity and codon bias for coding region identification. J. Comput. Biol. 1994;1(1):39–50.

[15] Altschul S.F. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 1997;25(17):3389–3402.

[16] Zhang Z., Schwartz S., Wagner L., Miller W. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 2000;7(1):203–214.

[17] Schwartz S. Human-mouse alignments using BLASTZ. Genome Res. 2003;13(1):103–107.

[18] Korf I., Gish W. MPBLAST: improved BLAST performance with multiplexed queries. Bioinformatics. 2000;16(11):1052–1053.

[19] Jones N.C., Pevzner P.A. An Introduction to Bioinformatics Algorithms. Cambridge: The MIT Press; 2004.

[20] Forney Jr. G.D. The Viterbi Algorithm. Proc. IEEE. 1973;61:268–278.

[21] Finn R.D., Bateman A., Clements J., Coggill P., Eberhardt R.Y., Eddy S.R., Heger A., Hetherington K., Holm L., Mistry J., Sonnhammer E.L.L., Tate J., Punta M. Pfam: the protein families database. Nucl. Acids Res. 2014;42(D1):222–230.

[22] Howard Hughes Medical Institute, HMMER: biosequence analysis using profile hidden Markov models, http://hmmer.janelia.org/ (accessed July 14, 2015).

[23] Eddy S.R. Accelerated profile HMM searches. PLoS Comput. Biol. 2011;7(10):1–16.

[24] Rabiner L.R. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE. 1989;77(2):257–286.

[25] Krogh A., Brown M., Mian I.S., Sjölander K., Haussler D. Hidden Markov models in computational biology—applications to protein modeling. J. Mol. Biol. 1994;235(5):1501–1531.

[26] Eddy S.R. Profile hidden Markov models. Bioinformatics Rev. 1998;14(9):755–763.

[27] Manavski S., Valle G. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics. 2008;9(Suppl 2).

[28] Ino F., Kotani Y., Munekawa Y., Hagihara K. Harnessing the power of idle GPUs for acceleration of biological sequence alignment. Parallel Process. Lett. 2009;19(4):513.

[29] Blazewicz J., Frohmberg W., Kierzynka M., Pesch E., Wojciechowski P. Protein alignment algorithms with an efficient backtracking routine on multiple GPUs. BMC Bioinformatics. 2011;12(1):181.

[30] Ino F., Munekawa Y., Hagihara K. Sequence homology search using fine grained cycle sharing of idle GPUs. IEEE Trans. Parallel Distrib. Syst. 2012;23(4):751–759.

[31] Vouzis P.D., Sahinidis N. GPU-BLAST: using graphics processors to accelerate sequence alignment. Bioinformatics. 2011;27(2):182–188.

[32] Horn D.R., Houston M., Hanrahan P. ClawHMMER: a streaming HMMer-search implementation. In: Proceedings of the ACM/IEEE Supercomputing Conference (SC). 2005:11.

[33] Walters J.P., Balu V., Kompalli S., Chaudhary V. Evaluating the use of GPUs in liver image segmentation and HMMER database searches. In: Proceedings of the IEEE International Parallel & Distributed Processing Symposium (IPDPS). 2009:1–12.

[34] Yao P., An H., Xu M., Liu G., Li X., Wang Y., Han W. CuHMMer: a load-balanced CPU-GPU cooperative bioinformatics application. In: Proceedings of the International Conference on High Performance Computing and Simulation (HPCS). 2010:24–30.

[35] Li X., Han W., Liu G., An H., Xu M., Zhou W., Li Q. A speculative HMMER search implementation on GPU. In: Proceedings of the IEEE International Parallel & Distributed Processing Symposium Workshops & PhD Forum (IPDPSW). 2012:735–741.

[36] Ferraz S., Moreano N. Evaluating optimization strategies for HMMer acceleration on GPU. In: Proceedings of the International Conference on Parallel and Distributed Systems (ICPADS). 2013:59–68.

[37] Cheng L., Butler G. Accelerating search of protein sequence databases using CUDA-enabled GPU. In: Database Systems for Advanced Applications. Lecture Notes in Computer Science, vol. 9049. Springer International Publishing; 2015:279–298.

[38] Ligowski L., Rudnicki W. An efficient implementation of Smith-Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases. In: Proceedings of the IEEE International Symposium on Parallel & Distributed Processing (IPDPS-HiCOMB). 2009:1–8.

[39] Sandes E.F.O., de Melo A.C.M.A. CUDAlign: using GPU to accelerate the comparison of megabase genomic sequences. In: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP). 2010:137–146.

[40] Sandes E.F.O., de Melo A.C.M.A. Smith-Waterman alignment of huge sequences with GPU in linear space. In: Proceedings of the IEEE International Symposium on Parallel and Distributed Processing (IPDPS). 2011:1199–1211.

[41] Sandes E.F.O., de Melo A.C.M.A. Retrieving Smith-Waterman alignments with optimizations for megabase biological sequences using GPU. IEEE Trans. Parallel Distrib. Syst. 2013;24(5):1009–1021.

[42] Sandes E.F.O., Miranda G., de Melo A.C.M.A., Martorell X., Ayguade E. CUDAlign 3.0: parallel biological sequence comparison in large GPU clusters. In: Proceedings of the IEEE/ACM Symposium on Cluster, Cloud and Grid Computing (CCGrid). 2014:160–169.

[43] Zhang J., Wang H., Lin H., Feng W. cuBLASTP: fine-grained parallelization of protein sequence search on a GPU. In: Proceedings of the IEEE International Parallel & Distributed Processing Symposium (IPDPS). 2014:251–260.

[44] Du Z., Yin Z., Bader D.A. A tile-based parallel Viterbi algorithm for biological sequence alignment on GPU with CUDA. In: Proceedings of the IEEE International Parallel & Distributed Processing Symposium (IPDPS). 2010:1–8.

[45] Liu Y., Maskell D., Schmidt B. CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units. BMC Res. Notes. 2009;2(1):73.

[46] Liu Y., Schmidt B., Maskell D. CUDASW++2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions. BMC Res. Notes. 2010;3(1):93.

[47] Liu Y., Wirawan A., Schmidt B. CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions. BMC Bioinformatics. 2013;14:117.

[48] Liu W., Schmidt B., Liu Y., Voss G., Muller-Wittig W. Mapping of BLASTP algorithm onto GPU clusters. In: Proceedings of the International Conference on Parallel and Distributed Systems (ICPADS). 2011:236–246.

[49] Ganesan N., Chamberlain R.D., Buhler J., Taufer M. Accelerating HMMER on GPUs by implementing hybrid data and task parallelism. In: Proceedings of the ACM International Conference on Bioinformatics and Computational Biology (BCB). 2010:418–421.

[50] Araújo Neto A., Moreano N. Acceleration of single- and multiple-segment Viterbi algorithms for biological sequence-profile comparison on GPU. In: Proceedings of the 21st International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA). 2015.

[51] Kirk D., Hwu W. Programming Massively Parallel Processors: A Hands-On Approach. Morgan Kaufmann; 2010.

[52] Li J., Ranka S., Sahni S. Pairwise sequence alignment for very long sequences on GPUs. In: Proceedings of the International Conference on Computational Advances in Bio and Medical Sciences (ICCABS). 2012:1–6.

[53] Walters J.P., Chaudhary V., Schmidt B. Database searching with profile-hidden Markov models on reconfigurable and many-core architectures. In: Schmidt B., ed. Bioinformatics—High Performance Parallel Computer Architectures. CRC Press; 2010:203–222.

[54] Rognes T., Seeberg E. Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics. 2000;16(8):699–706.

[55] Farrar M. Striped Smith-Waterman speeds database searches six times over other SIMD implementations. Bioinformatics. 2007;23(2):156–161.

[56] National Center for Biotechnology Information. The NCBI Handbook. second ed. 2013. http://www.ncbi.nlm.nih.gov/books/NBK143764.

[57] The UniProt Consortium. UniProt: a hub for protein information. Nucl. Acids Res. 2015;43(D1):D204–D212.

[58] Khronos Group, OpenCL—The Open Standard for Parallel Programming of Heterogeneous Systems, https://www.khronos.org/opencl/ (accessed July 14, 2015).
