1、位置依存行列例推定阿久津達也takutsukuicr.kyoto-u.ac.jp京都大学化学研究所位置依存行列A T C G C A T T T A C T T G G C T G T T SEQ1 SEQ2 SEQ3 SEQ4 A A A T G C A A:3.5 C:-2.5 G:-3.5 T:4.0 A:-1.0 C:2.8 G:3.5 T:-1.5 A:2.0 C:1.8 G:-2.0 T:-3.0 PSSM Threshold:score=8.0 SEQ1 G C T G C A SEQ4 関数推定問題定義n入力n正例:X1,X2,X3,n負例:Y1,Y2,Y3,n出力:以下満 関
2、数nX1,X2,X3,最適(、値以上)nY1,Y2,Y3,非最適(値以下)関数推定関理論的結果最適化/同定関数学習PP (ICALP98)HardHard (ICALP98)RNA二次構造予測PP (ICALP98)質HardHard (ICALP98)疎水性指標PP (new)位置依存行列 PHard (new)混合分布PHard (new)位置依存行列推定n入力:nPOS (上文字列集合、正例)nNEG (上文字列集合、負例)nL(領域長)n出力:PSSM f 値 s.t.nFor all S in POS,there is a substring S of S for which f(S
3、)=nFor all S in NEG,for all substring S of S,f(S)=8.0 SEQ1 G C T G C A SEQ4 PSSM,Score,Linear InequalitynPSSM:f(S)nf(S)=fi(Si)nwhereS=S1SmnXi,a=fi(a),S=TGC f(S)=X1,T+X2,G+X3,C=領域既知線形判別簡単 A:3.5 C:-2.5 G:-3.5 T:4.0 A:-1.0 C:2.8 G:3.5 T:-1.5 A:2.0 C:1.8 G:-2.0 T:-3.0 PSSM f 1 f 2 f 3 G C T S=f(S)=(G)+(
4、C)(T)+f 1 f 3 f 2 NP-困難証明nNP-hard if L is not bounded(|=2)nReduction from 3SATn3SAT:C=c1,c2,cn over X=x1,xn nS(i,j,):string of length 4n nSi=1,Sj=1,Sk=0 for the other position hnNEG=S(),S(4n),S(i,j),S(i),S(2i-1,2i,4n)nPOS=S(g(i1),4n)S()S(g(i2),4n)S()S(g(i3),4n)for clause c=li1 or li2 or li3 where g(
5、ik)=2ik-1 if li1 is positive literal,otherwise g(ik)=2ik0 0 0 .0 X 1 X n X 1 X n 4n 2n-1 2n S(.)NEG S()0 0 0 .0 S(4n)0 0 0 .1 S(i,j)0 0 0 .0 1 1 S(i)0 0 0 .0 S(2i-1,2i,4n)POS Xi Xj Xk 1 1 1 1 1 1 S(2i-1,4n)S(2j-1,4n)S(2k-1,4n)S()S()(for clause Xi Xj Xk or or)1 1 Xi Xi L固定場合多項式時間nConstruct an arrange
6、ment of hyperplanes in(|L+1)-dimensional Euclidean space for the hyperplanes:n=f(S)for each substring S of length L of each sequence in POS U NEG nCheck each cell in the arrangementApplicable to derivation of hydropathic indices because f1=f2=fL.Arrangement of HyperplanesnCombinatorial and Computati
7、onal Comexities:O(nd)for n planes in d-dimensionsnThe sign of y-fi(x)does not change within each celly=f3(x)y=f2(x)y=f1(x)y=f4(x)L固定場合補足n、L固定場合、考慮文字列個数定数個多項式時間trivialn位置依存行列(例、疎水性指標)場合、文字列個数定数個無(行列)n疎水性指標学習:通常膜貫通領域領域既知n今回結果領域既知無推定可能示唆PSSMMixture学習n入力:POS,NEG,N(#PSSM)全配列同長n出力:以下満N個PSSM組(f 1,f N)nPOS中全配列S、PSSM f k 存在、f k(S)=nNEG中全配列S,全PSSM f k、f k(S)=for each S in POSnf(S)=、score(Yi,Yi)n既存手法n頻度基(PAM、BLOSUM)n最適化基手法(Goldstein,蓬来)n問題点:必要n結果n文字数(残基数)制約無場合、NP困難PSSM関nPSSM固定多項式時間n固定NP困難n2個PSSMMixture NP困難n課題n実用的行列推定法開発n学習対分類最小化実際実行可能解無