生物资讯相关演算法AlgorithmsinBioinformatics-课件.ppt

上传人(卖家):晟晟文业 文档编号:5205248 上传时间:2023-02-17 格式:PPT 页数:95 大小:735.06KB
下载 相关 举报
生物资讯相关演算法AlgorithmsinBioinformatics-课件.ppt_第1页
第1页 / 共95页
生物资讯相关演算法AlgorithmsinBioinformatics-课件.ppt_第2页
第2页 / 共95页
生物资讯相关演算法AlgorithmsinBioinformatics-课件.ppt_第3页
第3页 / 共95页
生物资讯相关演算法AlgorithmsinBioinformatics-课件.ppt_第4页
第4页 / 共95页
生物资讯相关演算法AlgorithmsinBioinformatics-课件.ppt_第5页
第5页 / 共95页
点击查看更多>>
资源描述

1、n呂學一(中央研究院 資訊科學所)nwww.iis.sinica.edu.tw/hil/hil/nDont forget to do Homework 2!nConstructing suffix tree in linear time.nApplications of suffix trees Substring problem(暖身)“Exact string matching”revisited Linearization of circular string(挪移乾坤)Longest common substring(異中求同)nIntermission 小巨s magic show:

2、“flaming”nCase-1 StepnAlways occurs at a leaf(growing point).nAt the beginning of iteration k,the path above a leaf has label?,k1.nEach Case-1 step in iteration k simply changes the label?,k1 to?,k.?,k1?,knUse?,-to label the path above each leaf.Thus,no need to do anything for each case-1 step.?,-nC

3、ase-2 Steps:節外生枝nAlways occurs at a growing point that is not a leaf,which is not necessarily an internal node.nWhen the growing point is an internal node Label=k,-,where k is the current iteration index.k,-nWhen the growing point is not an internal node k,-ti,ji+t,ji,i+t-1nCase-3 Steps:若無其事nAlways

4、occurs at a growing point that is not a leaf,which is not necessarily an internal node.nWhen the new position of the growing point is an internal node ti,j0nWhen the new position of the growing point is not an internal node ti,jt+1S=b b a b b a a b1,-2,-3,-3,-3,37,-4,-3,37,-4,-2,37,-4,-11,11 2111nWe

5、 can simply ignore all Case-1 steps.nRecall that the number of Case-2 steps is at most|S|.nQ:Is this good enough?Case 1:長此以往Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b nGive a sequence S such that the number of Case-3 steps throughout

6、the process of growing its suffix trie(or suffix tree)is W(|S|2).nCompletely ignore themn以若無其事的態度處理若無其事的狀況ti,j0ti,jt+11.For correctly maintaining the position of each growing point.(Why?)2.For correctly running Case-2 steps.(By what?)k,-ti,ji+t,ji,i+t-1nSaving the book keeping efforts in all Case-3

7、steps Case 1:長此以往Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b nJust keep one current growing point throughout the execution.nDeriving the new position of the current growing point from its previous position(with the help of suffix links

8、(斷頭指標)Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b The challenges:How do we derive the position of the current growing point?Vertically(case 2)Horizontally(case 3)nQ:Which one is easier?nMoving from iteration k 1 to iteration k.nThe gro

9、wing point does not move!nThis is the easier case.Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b 1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b Case 1:長此以往Case 2:節外生枝Case 3:若無其事nMoving from Step i t

10、o Step i+1 in the same iteration.nThe growing point moves dramatically.nThis is the tougher case.Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b 1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b Case 1:

11、長此以往Case 2:節外生枝Case 3:若無其事n前人種樹後人涼的哲學n每次千辛萬苦找到vertical movement的目的時,把這個movement的起點與終點用一個link記錄下來.n下回遇到這個起點時,就可以直接走到終點去,不用再重新找一次.n這些link就叫做suffix link(斷頭指標).n終點所對應的字串,是起點所對應之字串的斷頭字串(second suffix)Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a

12、b b n每個斷頭指標的起點一定是個internal node,不會葉子 不會是某個suffix tree edge的中間nWhy?Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b n每個internal node一定是某個斷頭指標的起點nWhy?Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a

13、 a b a a b a b b S=b b a b b a a b1,-2,-3,-3,-3,37,-4,-3,37,-4,-2,37,-4,-11,1121111 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b n遇到必須節外生枝(case 2),就在長好新樹枝之後,移到斷頭位置,繼續從那裡檢查目前這個Sk是case 2還是case 3.n遇到若無其事(case 3),就直接按照Sk的方向移動,繼續從那裡檢查下一個字元Sk+1是case 2還是case 3.11,13,

14、1,12,3,127,2,34,7,14,17,3,34,1210,4,56,110,2,23,323453,3nGoing up to a closest internal node(whose suffix link must be available).Suppose this upward traversal passes through t characters.nFollowing the suffix link that starts from this internal node.ti,jnGoing down by matching the t-character subst

15、ring Si,i+t 1 of S.ti,jnNavely:O(t).nCleverly:O(1+d),where d is the number of internal nodes being went through during phase(2).ti,jnSuppose di is the d in the i-th Case-2-step traversal.nIt suffices to show d1+d2+d|S|=O(|S|).Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a

16、 a b b b a a b b a a b a a b a b b nThe slack means the distance between a position P and the closest internal node above P.ti,jnEach case-3 traversal(i.e.,horizontal movement)can only increase the value of by at most one.n(It can even decrease the value of.)Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b

17、a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b nThe i-th case-2 traversal(i.e.,vertical movement)decreases the value of by at least di.Case 2:節外生枝Case 3:若無其事1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b a a b b a a b a a b a b b nInitial =O(1).n can be increased b

18、y one for at most|S|times(because there are at most|S|horizontal movements(i.e.,case-3 traversals).nSince is always non-negative,the above bound is proved.nConstructing suffix tree in linear time.nApplications of suffix trees Substring problem(暖身)“Exact string matching”revisited Linearization of cir

19、cular string(挪移乾坤)Longest common substring(異中求同)nIntermission 小巨s magic shownSubstring Problem(recap as a warm-up)nInput:two strings P and S,where S is allowed to be preprocessed in O(|S|)time.nOutput:an occurrence of P in S.nObjective:done in O(|P|)time.1,13,1,12,3,127,2,34,7,14,17,3,34,13,31,17,2,

20、34,7,4,7,3,34,3,3Q:Where are abba,baa,bb?121,1347,2,34,57,4,67,3,34,3,33211Q:Where are abba,baa,bb?nExact String MatchingnInput:two strings P and S,where S is allowed to be preprocessed in O(|S|)time.nOutput:all occurrences of P in S.nChallenge:solving this in O(|P|+k)time,where k is the number of o

21、ccurrences of P in S.nEach internal node keeps the labels of all its descendant leaves.121,1347,2,34,57,4,67,3,34,3,3Q:Somethings missing?6,35,24,15,2,4,1Q:How do we fix this problem?121,1347,2,34,57,4,67,3,34,3,35,24,15,2,4,1,8179,4,45,89,99,6,3,7Q:Obtainable in O(|S|)time?nS=a a a a a$1654321,2,3,

22、4,51,2,3,41,2,31,2nConsider the sequence L of leaves from left to right.The descendant leaves of each internal node has to be consecutive in L.121,1347,2,34,57,4,67,3,33,35,24,15,2,4,1,879,4,45,89,99,6,3,71,34,84,56,7nCircular String Linearization(挪移乾坤)nLet挪(S,i)denote the string Si|S|S1i 1.Si挪(S,i)

23、nInput a string S.nOutput an index i that maximizes the alphabetical order of 挪(S,i).1 2 3 4 5 6 7 8挪(S,1)=b b a b b a a b挪(S,2)=b a b b a a b b挪(S,3)=a b b a a b b b挪(S,4)=b b a a b b b a挪(S,5)=b a a b b b a b挪(S,6)=a a b b b a b b挪(S,7)=a b b b a b b a挪(S,8)=b b b a b b a alet j=1;for i=2 to|S|do

24、if(挪(S,i)挪(S,j)let j=i;output j;Time complexity?1 2 3 4 5 6 7 8挪(S,1)=b b a b b a a b挪(S,2)=b a b b a a b b挪(S,3)=a b b a a b b b挪(S,4)=b b a a b b b a挪(S,5)=b a a b b b a b挪(S,6)=a a b b b a b b挪(S,7)=a b b b a b b a挪(S,8)=b b b a b b a a1 2 3 4 5 6 7 8b b a b b a a b b a b b a a b a b b a a b b b

25、a a b b a a b a a b a b b Q:How to fix the problem?nSuffix tree for SSnEach length-|S|substring of SS is a 挪(S,j)for some index j with 1 j|S|.nEach 挪(S,j)with 1 j|S|is a length-|S|substring of SS.11,13,1,12,3,127,2,34,7,14,17,3,34,1210,4,56,110,2,23,323453,31,17,4,7,4,7,3,310,4,56,10,2,23,33,3Q:How

26、to use this suffix tree?nOutput the index i such that SSi|SS|corresponds to the rightmost leaf of the suffix tree for SS.Clearly,this takes O(|S|)time.nLongest common substring(異中求同)nInput:two strings A and B.nOutput:a longest string C that occurs in both A and B.nA=b b b a b b a a bnB=b a a b b a b

27、 b a bnC=b bnC=b a a bnC=a b b anC=b b a b b abuild suffix tree for B;for L=|A|downto 1 do for i=1 to|A|-L+1 do if Aii+L-1 occurs in B output Aii+L-1 and halt;output“no common substring”;Time complexity?()3|12|1|1|)()1|(|AOAiiAOiOiAALALAL=+-=+-=nThe for-loop takes timeCan we do better than this?buil

28、d suffix tree for B;for i=1 to|A|do find the largest integer L(i)such that Aii+L(i)-1 occurs in B by binary search;output AiL(i)for the i with the largest L(i);Time complexity?nThe for-loop takes O(|A|2 log|A|)time.Each binary search takes time O(|A|log|A|).There are overall O(|A|)binary searches.Ca

29、n we do better than this?nit is impossible to solve this longest common substring problem in O(|A|+|B|)time.nin O(|A|+|B|)time via suffix treenConstruct a suffix tree T for A#B$,where#and$are two characters not in A and B.nThere are exactly|A|+|B|+2 leaves in T,each leaf corresponds to a suffix of A

30、#B$.A-leaf:with label in 1,2,|A|ncorresponds to an A-suffix.B-leaf:with label in|A|+2,|A|+|B|+1ncorresponds to a B-suffix.$#ABA-suffixB-suffixnLet v be an arbitrary position of T(i.e.,v is not necessarily a node of T.)v has a descendant A-leaf if and only if v corresponds to a prefix of an A-suffix

31、of A#B$.v has a descendant B-leaf if and only if v corresponds to a prefix of a B-suffix of A#B$.rootvnLet v be a position of T.v has descendant A-leaf and B-suffix if and only if v corresponds to a common substring of A and B.root$#ABA-suffixB-suffixvnDo we really need#to separate A and B in the co

32、ncatenated string A#B$?root$ABA-suffixB-suffixvnConstruct the suffix tree T of A#B$.nOutput the string corresponding to a deepest internal node v such that the subtree of T rooted at v contains both A-leaf and B-leaf.nQ:why not checking leaves?nQ:why not checking positions of T that are not internal

33、 nodes of T?nIf the position v contains both kinds of descendant leaves,then so does its closest internal node below.rootvnO(|A|+|B|)time for constructing T.nO(|A|+|B|)time for marking the colors of each node,including each leaf and each internal nodesnO(|A|+|B|)time for computing the depths of all

34、nodesnO(|A|+|B|)time to find a deepest internal node with both colors.nQ:Can we further improve the time and space complexity?n“No”for the time complexity.n“Yes”for the space complexity.nInput:two strings A and B.nOutput:a longest string C that occurs in both A and B.nObjective:O(|A|+|B|)time O(|A|)

35、spacenIdea:Construct the suffix tree of A only.nConstruct the suffix tree T of A,keeping all the suffix links.nFor i=1 to|B|do Find the largest integer 深(i)such that Bii+深(i)1 occurs in B.nOutput Bii+深(i)1 where i is the index with maximum 深(i).nFinding 深(i)for each i takes O(深(i)+1)time by traversi

36、ng T from the root.nBut all these|B|iterations would take O(|A|B|)time in total.depth=深(i)n深(i+1)深(i)1.深(i)深(i)1What if this suffix link does not exist?121,1347,82,34,857,84,867,83,34,83,31Record:深(1)=31Record:深(3)=4Record:深(5)=6nClearly,the space complexity is O(|A|).nThe time complexity still O(|A

37、|+|B|).We first show that the time is O(|A|+|B|)without considering that of suffix link traversal.We then show that the time for suffix link traversal is also O(|A|+|B|).nThe for-loop has exactly|B|iterations.nSuppose the i-th iteration moves the arrow to the right by d(i)units.nd(1)+d(2)+d(|B|)=|B|

38、,because the arrow never goes left.nThe i-th iteration takes time O(d(i)+1).nSo,the overall time complexity is O(|B|).nLet(i)denote the distance between the position of at the end of the i-th iteration and the closest internal node above.nLet t(i)be the number of internal nodes touched in the downward suffix link traversal of the i-th iteration.n(i)(i 1)+d(i)t(i).d(i)t(i)nt(i)(i 1)+d(i)(i)nTherefore,t(1)+t(2)+t(|B|)is at most d(1)+d(2)+d(|B|)+(0)(|B|),which is clearly O(|B|+|A|).(0)|A|,and d(1)+d(2)+d(|B|)=|B|.

展开阅读全文
相关资源
猜你喜欢
相关搜索

当前位置:首页 > 办公、行业 > 各类PPT课件(模板)
版权提示 | 免责声明

1,本文(生物资讯相关演算法AlgorithmsinBioinformatics-课件.ppt)为本站会员(晟晟文业)主动上传,163文库仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。
2,用户下载本文档,所消耗的文币(积分)将全额增加到上传者的账号。
3, 若此文所含内容侵犯了您的版权或隐私,请立即通知163文库(发送邮件至3464097650@qq.com或直接QQ联系客服),我们立即给予删除!


侵权处理QQ:3464097650--上传资料QQ:3464097650

【声明】本站为“文档C2C交易模式”,即用户上传的文档直接卖给(下载)用户,本站只是网络空间服务平台,本站所有原创文档下载所得归上传人所有,如您发现上传作品侵犯了您的版权,请立刻联系我们并提供证据,我们将在3个工作日内予以改正。


163文库-Www.163Wenku.Com |网站地图|