Tuesday 13 December 2011

Classic Longest common subsequence via Dynamic programming

Run time Profiling results of classic longest common subsequence algorithm. 

An example application of this algorithm is finding the difference of two text files. We need to find the longest common subsequence of two sequences. i.e for "bdca" and "bcbda" the longest common subsequence is of length 3 and is  "bda"
The core algorithm using dynamic programming for this is as follows

Lcs(i,j) = {      Lcs(i-1, j-1) + 1 ; if match at i & j.
               Max( Lcs(i-1, j), Lcs(i, j-1) ); if mismatch at i & j
        }

This can be implemented recursively using a two dimensional array m x n where m, n are sequence lengths. For example, two sequences "bdca" and "bcbda" we have the matrix for the algorithm above

0 1 2 3 4

b c b d a
     --------------------------------------
0 b  | 1   1   1   -1   -1 
     |
1 d  | 1   1   1   2   -1
     |
2 c  | -1   2   2   2   -1   
     |
3 a  | -1   -1   -1   -1   3
     --------------------------------------  
   
So here 3 is the length of the longest common subsequence. base on the core algorithm above.
Runtime complexity is about O(mn). This particular basic program yeilds a time as show here

Improvement:

By analysing the recurssion tree of this particular problem there is an improvement that can be made to the runtime. The addition to the core algorithm is as follows

If Lcs(i, j) has not been calculated so far ?

    Lcs(i,j) = { 
                        Lcs(i-1, j-1) + 1 ; if match at i & j.
        Max( Lcs(i-1, j), Lcs(i, j-1) ); if mismatch at i & j
            }

else return the calculated Lcs(i, j);

This small addition saves the code from going to calculate duplicate recursion trees. The time saved (~ 30 ms) is evident here

The sample recurssion tree is shown below. There are a lot of repeated duplicate tasks.

     4, 4

   3,4     4, 3

   2, 4, 4,3          3, 3   4, 2   

1, 4 2, 3 .... 2, 3   3,2 .... 3, 2 4,1
     
Reconstruction is based again on the algorithm. We backtrack based on the two steps of the algorithm. ie if match occurred or the Max of sub-problems.

1. We start at the right end of the tree
2. if there was a match at i & j then we take sequence[i] (= sequence[j] as in step 1 of algorithm) and go to (i-1, j-1)
3. else we can go either to (i-1, j) or (i, j-1) in the matrix. The logical thing to do is to go where there is maximum value.
4. if matrix values at  (i-1, j) or (i, j-1) are same we may have to investigate both.

The disadvantage of this approach is that, as the length of the sequences go up the memory for the matrix also goes up as m x n. One way to improve the memory is to use only two rows since we look at only those every time. But, we may need to keep track of which way we step to reconstruct the lcs sequence. 

No comments: