1、n呂學一(中央研究院 資訊科學所)nwww.iis.sinica.edu.tw/hil/hil/nDont forget to do Homework 2!nConstructing suffix tree in linear time.nApplications of suffix trees Substring problem(暖身)“Exact string matching”revisited Linearization of circular string(挪移乾坤)Longest common substring(異中求同)nIntermission 小巨s magic show:
2、“flaming”nCase-1 StepnAlways occurs at a leaf(growing point).nAt the beginning of iteration k,the path above a leaf has label?,k1.nEach Case-1 step in iteration k simply changes the label?,k1 to?,k.?,k1?,knUse?,-to label the path above each leaf.Thus,no need to do anything for each case-1 step.?,-nC
3、ase-2 Steps:節外生枝nAlways occurs at a growing point that is not a leaf,which is not necessarily an internal node.nWhen the growing point is an internal node Label=k,-,where k is the current iteration index.k,-nWhen the growing point is not an internal node k,-ti,ji+t,ji,i+t-1nCase-3 Steps:若無其事nAlways
4、occurs at a growing point that is not a leaf,which is not necessarily an internal node.nWhen the new position of the growing point is an internal node ti,j0nWhen the new position of the growing point is not an internal node ti,jt+1S=b b a b b a a b1,-2,-3,-3,-3,37,-4,-3,37,-4,-2,37,-4,-11,11 2111nWe
5、 can simply ignore all Case-1 steps.nRecall that the number of Case-2 steps is at most|S|.nQ:Is this good enough?Case 1:長此以往Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b nGive a sequence S such that the number of Case-3 steps throughout
6、the process of growing its suffix trie(or suffix tree)is W(|S|2).nCompletely ignore themn以若無其事的態度處理若無其事的狀況ti,j0ti,jt+11.For correctly maintaining the position of each growing point.(Why?)2.For correctly running Case-2 steps.(By what?)k,-ti,ji+t,ji,i+t-1nSaving the book keeping efforts in all Case-3
7、steps Case 1:長此以往Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b nJust keep one current growing point throughout the execution.nDeriving the new position of the current growing point from its previous position(with the help of suffix links
8、(斷頭指標)Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b The challenges:How do we derive the position of the current growing point?Vertically(case 2)Horizontally(case 3)nQ:Which one is easier?nMoving from iteration k 1 to iteration k.nThe gro
9、wing point does not move!nThis is the easier case.Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b 1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b Case 1:長此以往Case 2:節外生枝Case 3:若無其事nMoving from Step i t
10、o Step i+1 in the same iteration.nThe growing point moves dramatically.nThis is the tougher case.Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b 1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b Case 1:
11、長此以往Case 2:節外生枝Case 3:若無其事n前人種樹後人涼的哲學n每次千辛萬苦找到vertical movement的目的時,把這個movement的起點與終點用一個link記錄下來.n下回遇到這個起點時,就可以直接走到終點去,不用再重新找一次.n這些link就叫做suffix link(斷頭指標).n終點所對應的字串,是起點所對應之字串的斷頭字串(second suffix)Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a
12、b b n每個斷頭指標的起點一定是個internal node,不會葉子 不會是某個suffix tree edge的中間nWhy?Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b n每個internal node一定是某個斷頭指標的起點nWhy?Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a
13、 a b a a b a b b S=b b a b b a a b1,-2,-3,-3,-3,37,-4,-3,37,-4,-2,37,-4,-11,1121111 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b n遇到必須節外生枝(case 2),就在長好新樹枝之後,移到斷頭位置,繼續從那裡檢查目前這個Sk是case 2還是case 3.n遇到若無其事(case 3),就直接按照Sk的方向移動,繼續從那裡檢查下一個字元Sk+1是case 2還是case 3.11,13,
14、1,12,3,127,2,34,7,14,17,3,34,1210,4,56,110,2,23,323453,3nGoing up to a closest internal node(whose suffix link must be available).Suppose this upward traversal passes through t characters.nFollowing the suffix link that starts from this internal node.ti,jnGoing down by matching the t-character subst
15、ring Si,i+t 1 of S.ti,jnNavely:O(t).nCleverly:O(1+d),where d is the number of internal nodes being went through during phase(2).ti,jnSuppose di is the d in the i-th Case-2-step traversal.nIt suffices to show d1+d2+d|S|=O(|S|).Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a
16、 a b b b a a b b a a b a a b a b b nThe slack means the distance between a position P and the closest internal node above P.ti,jnEach case-3 traversal(i.e.,horizontal movement)can only increase the value of by at most one.n(It can even decrease the value of.)Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b
17、a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b nThe i-th case-2 traversal(i.e.,vertical movement)decreases the value of by at least di.Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b nInitial =O(1).n can be increased b
18、y one for at most|S|times(because there are at most|S|horizontal movements(i.e.,case-3 traversals).nSince is always non-negative,the above bound is proved.nConstructing suffix tree in linear time.nApplications of suffix trees Substring problem(暖身)“Exact string matching”revisited Linearization of cir
19、cular string(挪移乾坤)Longest common substring(異中求同)nIntermission 小巨s magic shownSubstring Problem(recap as a warm-up)nInput:two strings P and S,where S is allowed to be preprocessed in O(|S|)time.nOutput:an occurrence of P in S.nObjective:done in O(|P|)time.1,13,1,12,3,127,2,34,7,14,17,3,34,13,31,17,2,
20、34,7,4,7,3,34,3,3Q:Where are abba,baa,bb?121,1347,2,34,57,4,67,3,34,3,33211Q:Where are abba,baa,bb?nExact String MatchingnInput:two strings P and S,where S is allowed to be preprocessed in O(|S|)time.nOutput:all occurrences of P in S.nChallenge:solving this in O(|P|+k)time,where k is the number of o
21、ccurrences of P in S.nEach internal node keeps the labels of all its descendant leaves.121,1347,2,34,57,4,67,3,34,3,3Q:Somethings missing?6,35,24,15,2,4,1Q:How do we fix this problem?121,1347,2,34,57,4,67,3,34,3,35,24,15,2,4,1,8179,4,45,89,99,6,3,7Q:Obtainable in O(|S|)time?nS=a a a a a$1654321,2,3,
22、4,51,2,3,41,2,31,2nConsider the sequence L of leaves from left to right.The descendant leaves of each internal node has to be consecutive in L.121,1347,2,34,57,4,67,3,33,35,24,15,2,4,1,879,4,45,89,99,6,3,71,34,84,56,7nCircular String Linearization(挪移乾坤)nLet挪(S,i)denote the string Si|S|S1i 1.Si挪(S,i)
23、nInput a string S.nOutput an index i that maximizes the alphabetical order of 挪(S,i).1 2 3 4 5 6 7 8挪(S,1)=b b a b b a a b挪(S,2)=b a b b a a b b挪(S,3)=a b b a a b b b挪(S,4)=b b a a b b b a挪(S,5)=b a a b b b a b挪(S,6)=a a b b b a b b挪(S,7)=a b b b a b b a挪(S,8)=b b b a b b a alet j=1;for i=2 to|S|do
24、if(挪(S,i)挪(S,j)let j=i;output j;Time complexity?1 2 3 4 5 6 7 8挪(S,1)=b b a b b a a b挪(S,2)=b a b b a a b b挪(S,3)=a b b a a b b b挪(S,4)=b b a a b b b a挪(S,5)=b a a b b b a b挪(S,6)=a a b b b a b b挪(S,7)=a b b b a b b a挪(S,8)=b b b a b b a a1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b
25、a a b b a a b a a b a b b Q:How to fix the problem?nSuffix tree for SSnEach length-|S|substring of SS is a 挪(S,j)for some index j with 1 j|S|.nEach 挪(S,j)with 1 j|S|is a length-|S|substring of SS.11,13,1,12,3,127,2,34,7,14,17,3,34,1210,4,56,110,2,23,323453,31,17,4,7,4,7,3,310,4,56,10,2,23,33,3Q:How
26、to use this suffix tree?nOutput the index i such that SSi|SS|corresponds to the rightmost leaf of the suffix tree for SS.Clearly,this takes O(|S|)time.nLongest common substring(異中求同)nInput:two strings A and B.nOutput:a longest string C that occurs in both A and B.nA=b b b a b b a a bnB=b a a b b a b
27、 b a bnC=b bnC=b a a bnC=a b b anC=b b a b b abuild suffix tree for B;for L=|A|downto 1 do for i=1 to|A|-L+1 do if Aii+L-1 occurs in B output Aii+L-1 and halt;output“no common substring”;Time complexity?()3|12|1|1|)()1|(|AOAiiAOiOiAALALAL=+-=+-=nThe for-loop takes timeCan we do better than this?buil
28、d suffix tree for B;for i=1 to|A|do find the largest integer L(i)such that Aii+L(i)-1 occurs in B by binary search;output AiL(i)for the i with the largest L(i);Time complexity?nThe for-loop takes O(|A|2 log|A|)time.Each binary search takes time O(|A|log|A|).There are overall O(|A|)binary searches.Ca
29、n we do better than this?nit is impossible to solve this longest common substring problem in O(|A|+|B|)time.nin O(|A|+|B|)time via suffix treenConstruct a suffix tree T for A#B$,where#and$are two characters not in A and B.nThere are exactly|A|+|B|+2 leaves in T,each leaf corresponds to a suffix of A
30、#B$.A-leaf:with label in 1,2,|A|ncorresponds to an A-suffix.B-leaf:with label in|A|+2,|A|+|B|+1ncorresponds to a B-suffix.$#ABA-suffixB-suffixnLet v be an arbitrary position of T(i.e.,v is not necessarily a node of T.)v has a descendant A-leaf if and only if v corresponds to a prefix of an A-suffix
31、of A#B$.v has a descendant B-leaf if and only if v corresponds to a prefix of a B-suffix of A#B$.rootvnLet v be a position of T.v has descendant A-leaf and B-suffix if and only if v corresponds to a common substring of A and B.root$#ABA-suffixB-suffixvnDo we really need#to separate A and B in the co
32、ncatenated string A#B$?root$ABA-suffixB-suffixvnConstruct the suffix tree T of A#B$.nOutput the string corresponding to a deepest internal node v such that the subtree of T rooted at v contains both A-leaf and B-leaf.nQ:why not checking leaves?nQ:why not checking positions of T that are not internal
33、 nodes of T?nIf the position v contains both kinds of descendant leaves,then so does its closest internal node below.rootvnO(|A|+|B|)time for constructing T.nO(|A|+|B|)time for marking the colors of each node,including each leaf and each internal nodesnO(|A|+|B|)time for computing the depths of all
34、nodesnO(|A|+|B|)time to find a deepest internal node with both colors.nQ:Can we further improve the time and space complexity?n“No”for the time complexity.n“Yes”for the space complexity.nInput:two strings A and B.nOutput:a longest string C that occurs in both A and B.nObjective:O(|A|+|B|)time O(|A|)
35、spacenIdea:Construct the suffix tree of A only.nConstruct the suffix tree T of A,keeping all the suffix links.nFor i=1 to|B|do Find the largest integer 深(i)such that Bii+深(i)1 occurs in B.nOutput Bii+深(i)1 where i is the index with maximum 深(i).nFinding 深(i)for each i takes O(深(i)+1)time by traversi
36、ng T from the root.nBut all these|B|iterations would take O(|A|B|)time in total.depth=深(i)n深(i+1)深(i)1.深(i)深(i)1What if this suffix link does not exist?121,1347,82,34,857,84,867,83,34,83,31Record:深(1)=31Record:深(3)=4Record:深(5)=6nClearly,the space complexity is O(|A|).nThe time complexity still O(|A
37、|+|B|).We first show that the time is O(|A|+|B|)without considering that of suffix link traversal.We then show that the time for suffix link traversal is also O(|A|+|B|).nThe for-loop has exactly|B|iterations.nSuppose the i-th iteration moves the arrow to the right by d(i)units.nd(1)+d(2)+d(|B|)=|B|
38、,because the arrow never goes left.nThe i-th iteration takes time O(d(i)+1).nSo,the overall time complexity is O(|B|).nLet(i)denote the distance between the position of at the end of the i-th iteration and the closest internal node above.nLet t(i)be the number of internal nodes touched in the downward suffix link traversal of the i-th iteration.n(i)(i 1)+d(i)t(i).d(i)t(i)nt(i)(i 1)+d(i)(i)nTherefore,t(1)+t(2)+t(|B|)is at most d(1)+d(2)+d(|B|)+(0)(|B|),which is clearly O(|B|+|A|).(0)|A|,and d(1)+d(2)+d(|B|)=|B|.