Finding the Longest Common Subsequence of DNA Strands

The Problem: Finding Common Ground in DNA

In bioinformatics, comparing DNA sequences is a fundamental task. One common problem is to find the longest common subsequence (LCS) between two sequences. This can help us understand the similarity between two DNA strands, which can have implications for everything from genetic disease to evolutionary biology.

This blog post walks through a C++ implementation I wrote to solve the LCS problem for two DNA sequences. The implementation follows the classic dynamic programming approach, which is a powerful technique for solving problems that can be broken down into smaller, overlapping subproblems.

The Strategy: Dynamic Programming

At its core, the LCS problem has an optimal substructure, meaning that the optimal solution to the problem contains within it optimal solutions to subproblems. This makes it a perfect candidate for dynamic programming.

The basic idea is to build a table that stores the lengths of the LCS for all possible prefixes of the two input strings. Let’s say our input strings are X and Y. We create a table C where C[i][j] is the length of the LCS of X[1...i] and Y[1...j].

We can compute the values in this table as follows:

If X[i] == Y[j], then the LCS of X[1...i] and Y[1...j] is the LCS of X[1...i-1] and Y[1...j-1] followed by X[i]. So, C[i][j] = C[i-1][j-1] + 1.
If X[i] != Y[j], then the LCS of X[1...i] and Y[1...j] is the longer of the LCS of X[1...i-1] and Y[1...j] and the LCS of X[1...i] and Y[1...j-1]. So, C[i][j] = max(C[i-1][j], C[i][j-1]).

The Implementation

This implementation follows this logic closely. Let’s break it down.

Reading the Input

First, we need to get our two DNA sequences. The read_sequence function handles this. It’s a simple utility that reads two lines from a file named dna.txt, which has 2 lines, each representing a DNA sequence.

void read_sequence(const std::string path, std::string &s1, std::string &s2) {
    /* Open file */
    std::ifstream file(path);
    std::string line;

    /* Ensure file can be opened */
    if (!file.is_open()) {
        std::cout << "File not found" << std::endl;
        return;
    }
    /* Get 2 lines from file to compare */
    getline(file, s1, '\n');
    getline(file, s2, '\n');
}

The `lcs` Function

The lcs function is where the dynamic programming magic happens. It takes the two sequences, s1 and s2, as input and creates two 2D vectors: c_matrix and b_matrix.

c_matrix stores the lengths of the LCS for the subproblems, as described above.
b_matrix is a helper matrix that stores the “direction” we came from to get the optimal solution at each step (diagonal, up, or left). This is crucial for reconstructing the LCS later.

void lcs(std::string s1, std::string s2) {
    unsigned m = s1.size();
    unsigned n = s2.size();

    /* init b and c matrices to necessary size */
    std::vector<std::vector<int>> b_matrix(m, std::vector<int> (n));
    std::vector<std::vector<int>> c_matrix(m+1, std::vector<int> (n+1));

    /* init c's first row and column to 0s (this is wholly unnecessary, C++ initializes all values to 0 by default.)*/
    for (unsigned i = 0; i <= m; i++)
      c_matrix[i][0] = 0;
    for (unsigned j = 0; j <= n; j++)
      c_matrix[0][j] = 0;

    /* Filling matrices */
    for (unsigned i = 1; i < m + 1; i++) {
        for (unsigned j = 1; j < n + 1; j++) {
            if (s1.at(i - 1) == s2.at(j - 1)) {
                c_matrix[i][j] = c_matrix[i - 1][j - 1] + 1;
                b_matrix[i - 1][j - 1] = DIAGONAL;
            } else if (c_matrix[i - 1][j] >= c_matrix[i][j - 1]) {
                c_matrix[i][j] = c_matrix[i - 1][j];
                b_matrix[i - 1][j - 1] = UP;
            } else {
                c_matrix[i][j] = c_matrix[i][j - 1];
                b_matrix[i - 1][j - 1] = LEFT;
            }
        }
    }
    /* Print the sequence found */
    print_lcs(s1, b_matrix, m - 1, n - 1);
}

Reconstructing the LCS

Once the c_matrix and b_matrix are filled, the print_lcs function is called to reconstruct the actual LCS. It does this by backtracking through the b_matrix, starting from the bottom-right corner. If the direction is DIAGONAL, it means we found a character in the LCS. If it’s UP or LEFT, we move accordingly without printing a character.

This is a recursive function, which makes the backtracking logic clean and easy to understand.

void print_lcs(std::string s, std::vector<std::vector<int>> b_matrix, unsigned i, unsigned j) {
    /* Base case, print character at i, then return */
    if (i == 0 || j == 0) {
        std::cout << s.at(i);
        return;
    }
    /* Recursive cases */
    if (b_matrix[i][j] == DIAGONAL) {
        // Case 1, s1[i] == s2[i], Go Diagonal
        print_lcs(s, b_matrix, i - 1, j - 1);
        std::cout << s.at(i);
    } else if (b_matrix[i][j] == UP) {
        // Case 2: Go Up
        print_lcs(s, b_matrix, i - 1, j);
    } else {
        // Case 3: Go Left
        print_lcs(s, b_matrix, i, j - 1);
    }
}

Complexity

Time Complexity: O(m × n)

We fill a table with m×n entries, and each entry takes constant time to compute. print_lcs() is $O(m + n)$ in the worst case, as it follows a path from m - 1, n - 1, to 0, 0. If it is moving diagonally every time, it takes $O(\min(m, n))$, which is the best case. In the worst case it would never move diagonally, which would be $O(m + n)$.

Space Complexity: O(m × n)

We are storing both the c_matrix and b_matrix. It is possible to optimize this to $O(\min(m, n))$, but only if we only need the length and not the actual sequence, since we’re reconstructing the LCS, we need to keep the full b_matrix.

The Result

After running the program with two randomly generated DNA strings, I verified the result with an online LCS tool, and it matched perfectly. This confirms that the implementation is correct. Here is a sample output with two 1000 character long sequences:

❯ g++ -o out hw4.cpp                            ~/Projects/longest-common-sequence 

❯ ./out                                         ~/Projects/longest-common-sequence 
TACATTTTTCGGTTAGAGACGGTAGGAAGCAATCTCGGGTTCGCTACTGGGGACACTGTCCCCATCCGCTCTGGTTGGTCAT
TTCAGTTCAAGCGCGATTTAGTACGCCGATAAGCACGTTTCAGGTGAGATAGCTCTCAAATTAAGTGGTCTTAACCCGGGTT
AATTTCATCGATTCTCCCTAATTAAGGGGAATTTCCGTCTCTGACGTTAGACCAAGACGTTGTTTCTTCCAGAGCGTAAAGT
CGTGACCACAGTATTCGCAGCATGGGGAGAAGCGAGCCACTGTTGGGTTGTCACCGGCAACCGATCAGAGGTGGTAAGCCTA
AAAGTTATGCCCCCCCTACCCTCATACAGGTAGGAGGTTAGCTTAATGGATTCGCGTTTCTGCTACAAAGTTTAAAAACTTA
CTTATTTCCTGGTTTAATAGCTGACGATCTGCGTTTGTTGAAACCTCAGTCTTGTCTCGTCCGTGAACCCGTCAATCTTCTG
CGGCCCCTTTGTGCAGGTCCCGGAGAATATTGAAACTACGCTCGGGACCGTCAAGGAATTGCGCTTTGGCGGTGGTGCACCC
TCGAGCGAATCAATTTTGCTATAGCGGCGTCGCTATAAAGACTTTCAGGCAGACACTTTGTAAACGCAAT%                                                                                                                                                                       
❯                                       ~/Projects/longest-common-sequence took 4s

Conclusion

This project was a great exercise in implementing a classic dynamic programming algorithm. It’s a powerful technique that has applications in many areas of computer science, and seeing it work in a real-world problem like DNA sequence comparison is very rewarding.

The Problem: Finding Common Ground in DNA#

The Strategy: Dynamic Programming#

The Implementation#

Reading the Input#

The lcs Function#

Reconstructing the LCS#

Complexity#

Time Complexity: O(m × n)#

Space Complexity: O(m × n)#

The Result#

Conclusion#