subsequence - 演算法筆記

longest common subsequence

longest common subsequence（LCS）

「最長共同子序列」。出現於每一個序列、而且是最長的子序列。可能有許多個。

s1: 2 5 7 9 3 1 2
s2: 3 5 3 2 8

LCS(s1, s2) = 5 3 2

s1: a b c d b c e e a
s2: c a b d e f g a
s3: d c e a

LCS(s1, s2, s3) =  {c e a, d e a}

求出兩個序列的LCS，P問題。接下來將介紹各種演算法。

求出一群序列的LCS，NP-hard問題，沒有快速的演算法。

簡單的方式是窮舉法：窮舉s1的所有子序列，檢查s2...sN是否都有該子序列。時間複雜度O(s1! ⋅ (s2 + ... + sN))。

longest common subsequence: dynamic programming

演算法（Needleman–Wunsch algorithm）

削減尾端元素，遞迴縮小問題。

s1和s2的最後一個元素，叫做e1和e2，拆解出來。

s1 = sub1 + e1
s2 = sub2 + e2

四種情況：

一、LCS包含e1不含e2：LCS(s1, s2) = LCS(s1, sub2)
二、LCS不含e1包含e2：LCS(s1, s2) = LCS(sub1, s2)
　　（e1毫無用處，想找LCS只需要找sub1。）
三、LCS不含e1 e2：LCS(s1, s2) = LCS(sub1, sub2)
四、LCS包含e1 e2：LCS(s1, s2) = LCS(sub1, sub2) + e1
　　（LCS的最後一個元素一定是e1，也等於e2。）

遞迴公式：

LCS(s1, s2) =
 { max( LCS(sub1, s2), LCS (s1, sub2), LCS(sub1, sub2) )
 {                                       when e1 ≠ e2
 { LCS(sub1, sub2) + e1                  when e1 = e2

簡化：

LCS(s1, s2) =
 { max( LCS(sub1, s2), LCS(s1, sub2) )   when e1 ≠ e2
 { LCS(sub1, sub2) + e1                  when e1 = e2

初始值：

LCS(s1, s2) = Ø                          when s1 = Ø or s2 = Ø

時間複雜度O(NM)，空間複雜度O(NM)。N和M是序列長度。

計算longest common subsequence的長度

二維陣列length[i][j]，代表「s1前i個元素」和「s2前j個元素」的LCS長度。

有個節省記憶體空間的方法。填表格，只需要上格、左格、左上格。計算順序，設定為從左往右、再從上往下，如此只需要一條陣列（上排）、以及一個數值（左上格）。

長序列為直向，短序列為橫向，陣列長度最短。空間複雜度降為O(min(N,M))。

找出一個longest common subsequence

使用一個二維陣列，記錄每一格的結果是從哪一格而來。

往回追溯，每當發現某一格length[i][j]是由length[i-1][j-1] + 1而來，就印出s1[i]（也是s2[j]）。

有個節省記憶體空間的方法。只計算LCS長度。填表格，自頂往中央，自底往中央，相遇於中央，將答案相加。中央橫排，找到最大值位置，分成左上區與右下區，遞迴縮小問題。

遞迴下去的區域，只佔半個表格。整個遞迴過程，1 + (1/2) + (1/4) + (1/8) + ... ≤ 2，只佔兩個表格。時間複雜度升為兩倍。空間複雜度降為O(N+M)。

const int N1 = 7, N2 = 5;
int s1[N1+1] = {0, 2, 5, 7, 9, 3, 1, 2};
int s2[N2+1] = {0, 3, 5, 3, 2, 8};
int length[N1+1][N2+1];	// DP表格
int prev[N1+1][N2+1];	// 記錄每一格的的結果是從哪一格而來
int lcs[min(N1,N2)];

void LCS()
{
	for (int i=0; i<=N1; i++) length[i][0] = 0;
	for (int j=0; j<=N2; j++) length[0][j] = 0;

for (int i=1; i<=N1; i++)
		for (int j=1; j<=N2; j++)
			if (s1[i] == s2[j])
			{
				length[i][j] = length[i-1][j-1] + 1;
				prev[i][j] = 0;	// 左上格
			}
			else
			{
				if (length[i-1][j] < length[i][j-1])
				{
					length[i][j] = length[i][j-1];
					prev[i][j] = 1;	// 左格
				}
				else
				{
					length[i][j] = length[i-1][j];
					prev[i][j] = 2;	// 上格
				}
			}

print_LCS(N1, N2);
}

// 遞迴版本
void print_LCS(int i, int j)
{
	// 任一個序列為空集合就停止
	if (i == 0 || j == 0) return;

if (prev[i][j] == 0)		// 左上格
	{
		print_LCS(i-1, j-1);
		cout << s1[i];	// 印出LCS的元素
	}
	else if (prev[i][j] == 1)	// 左格
		print_LCS(i, j-1);
	else if (prev[i][j] == 2)	// 上格
		print_LCS(i-1, j);
}

// 迴圈版本，LCS順序顛倒。
// 暫時儲存於其他陣列，之後再依序印出。
void print_LCS(int i, int j)
{
	int l = length[i][j];			// LCS長度
	while (l > 0)
		if (prev[i][j] == 0)		// 左上格
			l--, lcs[l] = s1[i];
		else if (prev[i][j] == 1)	// 左格
			j--;
		else if (prev[i][j] == 2)	// 上格
			i--;

l = length[i][j];
	for (int i=0; i<l; ++i)
		cout << lcs[i];
}

找出所有的longest common subsequence

填表格的時候，LCS可能同時來自上格、左格、左上格，將這些情形全部記錄下來。往回追溯的時候，將這些情形通通追溯一遍，求出所有LCS。 LCS數量最多O(2ᴺ)個。

UVa 111 531 10066 10192 10405 10723

演算法（Hirschberg's algorithm）

表格分割成左上區、右下區，遞迴縮小問題。前面已介紹。

時間複雜度O(NM)。空間複雜度O(N+M)。

演算法（four-Russians speedup）

表格切割成小方塊，小方塊直接查表。小方塊邊長logN / 4。

時間複雜度O(N² / logN)。不無小補。

longest common subsequence: Hunt–Szymanski algorithm

LCS問題化作二維LIS問題

兩序列找出相同元素，表示成位置數對。

兩序列的LCS＝位置數對的2D LIS。

(1) LCS problem:

  index: 0 1 2 3 4 5 6 7 8 9
  --------------------------
  s1:    a b a c d
  s2:    d b a a b c a

(2) matched positions:

      d b a a b c a
    +---------------       a     a     a     b     b  
  a |     * *     *      (0,2) (0,3) (0,6) (1,1) (1,4)
  b |   *     *
  a |     * *     *        a     a     a     c     d
  c |           *        (2,2) (2,3) (2,6) (3,5) (4,0)
  d | *

(3-1) { LCS == 2D LIS } example:

      d b a a b c a
    +---------------
  a |     x *     *      LCS    :   a     b     a
  b |   *     x          2D LIS : (0,2) (1,4) (2,6)
  a |     * *     x
  c |           *        數對呈嚴格遞增，在表格中則是往右下走。
  d | *

(3-2) { LCS == 2D LIS } another example:

      d b a a b c a
    +---------------
  a |     * *     *      LCS    :   b     a     c
  b |   x     *          2D LIS : (1,1) (2,2) (3,5)
  a |     x *     *
  c |           x        數對呈嚴格遞增，在表格中則是往右下走。
  d | *

二維LIS問題化作一維LIS問題

排序所有數對：第一維由小到大排序；當第一維平手，則第二維由大到小排序。

排序之後，原本數對們的2D LIS＝第二維的1D LIS。

(1) sort all pairs:

    a     a     a     b     b     a     a     a     c     d
  (0,6) (0,3) (0,2) (1,4) (1,1) (2,6) (2,3) (2,2) (3,5) (4,0)

  increasing in 1st components.
  decreasing in 2nd components if ties.

(2) 2nd components:

  6 3 2 4 1 6 3 2 5 0

(3) { 2D LIS == 1D LIS } example:

  6 3 2 4 1 6 3 2 5 0
      * *   *

  LCS    :   a     b     a
  2D LIS : (0,2) (1,4) (2,6)
  1D LIS :   2     4     6
  
(4) all in one:

     a b a c d                            s1
  -> a(6,3,2) b(4,1) a(6,3,2) c(5) d(0)   sorted matched positions
  -> 6 3 2 4 1 6 3 2 5 0                  2nd component
  ->     2 4   6                          LIS

演算法

LCS問題，化作2D LIS問題，再化作1D LIS問題，最後套用Robinson–Schensted–Knuth algorithm。

排序所有數對，使用counting sort。掃描一遍s2，把每個字元的位置紀錄下來。

較短的序列當作s1，時間複雜度是O(Klog(min(N,M)) + R)。K是數對數目，N和M是序列長度，R是數字範圍。

K至多是NM，最差情況下比先前的演算法還慢，平均情況下比先前的演算法快上許多。R源自counting sort。

計算longest common subsequence的長度

UVa 10635 10949

LIS問題與LCS問題可以互相轉換

LIS轉LCS：令原序列A排序後變成B。A和B的LCS，就是A的LIS。

LCS轉LIS：將序列A和B當中的相同字母配對通通找出來，表示成位置數對，以特殊規則排序，然後找到LIS，就是A和B的LCS。

longest common increasing subsequence

longest common increasing subsequence（LCIS）

嚴格遞增的LCS。LCS的DP演算法稍作修改，隨時找到最有潛力的LCIS。時間複雜度仍是O(NM)。

      4 2 1 4 2 3 1        4 2 1 4 2 3 1        4 2 1 4 2 3 1 
    +---------------     +---------------     +---------------
  1 |     *       *    1 |     x       *    1 |     x       * 
  2 |   *     *        2 |   *     x        2 |   *     x     
  4 | *     *          4 | *     *          4 | *     *       
  3 |           *      3 |           x      3 |           *   
  4 | *     *          4 | *     *          4 | *     *       

   all matched pairs            LCIS        best-candidate LCIS
                                           when visit pair (3,3)

UVa 12511

cyclic longest common subsequence

ICPC 5890

shortest common supersequence

「最短共同父序列」。包含了每一個序列，而且是最短的父序列。可能有許多個。

求出一群序列的SCS，NP-hard問題，沒有快速的演算法。

求出兩個序列的SCS，P問題，dynamic programming。

SCS長度＋LCS長度＝兩個序列長度總和。

UVa 10723