Application of Support Vector Machine to detect an association between a disease or trait and multiple SNP variations

Author: Gene Kim, MyungHo Kim
Advisor: Dr. Hsu
Graduate: Ching-Wen Hong

Outline
1. Motivation
2. Objective
3. What is a SNP (single nucleotide polymorphism)?
4. How to find SNP variations
5. A review of Support Vector Machine
6. A representation of multiple SNP variations as a vector
7. The remarks
8. Inseparable case
9. Test results with clinical data
10. Personal opinion

Motivation
Studying the differences in each person's single nucleotide polymorphisms (SNPs) can help identify disease-causing genes and even predict whether a drug will be effective for a given individual, which in turn makes tailor-made drugs possible. This has a major impact on the development of new drugs. SNP research is a main trend in the biotechnology industry of the post-genomic era.

Objective
We present a method of detecting whether there is an association between multiple SNP variations and a trait or disease. The method exploits the Support Vector Machine (SVM), which has been attracting a lot of attention recently.

What is a SNP?
Although the chromosomes of individuals of the same species differ very little, on average one in every 1,000 base pairs carries a mutation. These variations are called SNPs (single nucleotide polymorphisms), and they account for differences between individuals in drug sensitivity, blood type, height and so on. SNPs are also associated with diseases such as cancer, cardiovascular disease and autoimmune disorders. In Taiwan, Vita Genomics (賽亞基因) is currently working with National Taiwan University Hospital on hepatitis C SNP research, trying to identify patients' SNPs in order to predict whether a drug will be effective for a patient.

Genetic markers M1, M2, ... are locations in the DNA. The different variants of DNA that different people have at a marker are called alleles, denoted by 1, 2, 3, ... The number of alleles per marker is small: typically fewer than ten (for so-called microsatellite markers) or exactly two (for so-called SNPs).

How to find SNP variations
The problem of determining whether a set of SNP variations causes a specific disease or trait can be formulated as follows. For a given disease or trait:
1. For each set of SNP variations, find its representation as a vector in a Euclidean space (haplotype data, clinical data, ...; this is discussed under "A representation of multiple SNP variations as a vector").
2. Find a systematic way of distinguishing the SNP genotypes of normal people from those of people with the disease or trait.
We will use the Support Vector Machine (SVM) to separate the SNP vectors into two groups (normal, sick).

A review of Support Vector Machine
What is an SVM? A family of learning algorithms for the classification of objects into two classes.
Input: a training set $(x_1, y_1), \ldots, (x_l, y_l)$ of objects $x_i \in \mathbb{R}^n$ (an n-dimensional vector space) and their known classes $y_i \in \{-1, +1\}$.
Output: a classifier $f: \mathbb{R}^n \to \{-1, +1\}$, which predicts the class $f(x)$ for any (new) object $x \in \mathbb{R}^n$.

(1) Linear SVM for separable training sets
Given a training set $S = \{(x_1, y_1), \ldots, (x_l, y_l)\}$ with $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, +1\}$, the optimal hyperplane is defined by the pair $(w, b)$. Solve the optimization problem
$$\min_{w,b} \; \|w\|^2 \quad \text{s.t.} \quad y_i (x_i \cdot w + b) - 1 \ge 0, \quad i = 1, \ldots, l.$$
This is a classical quadratic (convex) program.

(2) Linear SVM for non-separable training sets
Solve the optimization problem
$$\min_{w,b,\xi} \; \|w\|^2 + C \sum_{i=1}^{l} \xi_i \quad \text{s.t.} \quad y_i (x_i \cdot w + b) - 1 + \xi_i \ge 0, \quad \xi_i \ge 0, \quad i = 1, \ldots, l,$$
where $C$ is an extremely large value. In the dual problem the multipliers satisfy $0 \le \alpha_i \le C$.

A representation of multiple SNP variations as a vector
Scheme: given a disease or trait and a collection of SNP data that depends on genotype in a consistent way (haplotype data, clinical data), proceed in seven steps:
1. Assume that there is no environmental factor.
2. The SNP locations are assumed to be known for the disease or trait.
3. Assume that there is a reference SNP data set (good health records).
4. By giving scores based on the difference from the reference data, assign a vector to each SNP data set. The dimension of the vector is the number of SNPs related to the disease or trait.
5. Choose a training set for the disease or trait, in other words, SNP genotype data of the normal and sick populations.
6. Using Step 4, compute the SNP vectors of the training data set $(x_i, y_i)$, where $x_i$ is a SNP data vector and $y_i = 1$ (sick) or $-1$ (normal).
7. Use the SVM to get a hyperplane dividing the vectors into two groups (sick, normal).

The remarks
1. The reference data can be built by collecting SNP genotypes from the healthy normal population.
2. The hyperplane obtained can be considered as a criterion and, given a new data set, it can be used for testing whether the person the data belongs to is susceptible to the disease or trait.
3. The representation of an object as a vector might be critical for making use of the SVM. How to incorporate domain knowledge into the vector representation is one of the major issues.
4. The idea of difference scoring could be applied to other data sets (visual data such as X-ray or MRI images, ...), and in particular to haplotype data, to find a linkage between SNPs and the disease or trait.
5. Once a group of SNP patterns is identified, the contribution score of each of those SNPs to the disease or trait can be computed.

Inseparable case
For the inseparable case, the iterated use of SVM enables us to divide a collection of labelled vectors into several clustering groups.
1. Set a threshold value, say 80%.
2. Use SVM to separate the collection of labelled vectors into two groups A and B.
3. Check whether each group contains more than 80% of either +1 or -1 labelled vectors. Suppose A does not. Then apply SVM to A again to split it into two subgroups.
4. Repeat this procedure until each subgroup has a majority of more than 80%.
5. For each subgroup, determine a range.

Test results with clinical data
The clinical data is a data set of cardiac patient records: height, age, sex, weight, ethnic background, medical history, birth place, blood pressure (systolic and diastolic), lipid measurements, etc. are converted to numerical values. Each record is labelled +1 for a patient with heart attack, stroke or heart failure, and -1 otherwise.
We used Thorsten Joachims' implementation of SVM.

Personal opinion
Applying SVM is effective, but it is difficult to solve nonlinear problems with it. How to incorporate domain knowledge into the vector representation is one of the major issues.
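
Worked sketch
The seven-step scheme and the iterated splitting for the inseparable case can be illustrated with a small Python sketch. It is not the authors' implementation: the slides used Thorsten Joachims' SVM software, whereas this example swaps in a tiny subgradient-descent trainer for a soft-margin linear SVM so that it runs with the standard library only. The genotype strings, the scoring rule (number of alleles that differ from the reference), the toy data and all function names and hyperparameters are assumptions made for illustration.

```python
from collections import Counter

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def score_vector(genotypes, reference):
    """Step 4 (assumed scoring rule): one coordinate per SNP location,
    equal to the number of alleles not shared with the reference genotype."""
    return [len(g) - sum((Counter(g) & Counter(r)).values())
            for g, r in zip(genotypes, reference)]

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=2000):
    """Soft-margin linear SVM via full-batch subgradient descent on
    lam/2 * ||w||^2 + mean(max(0, 1 - y_i * (w . x_i + b)))."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * (dot(w, xi) + b) < 1:      # margin violated
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    return 1 if dot(w, x) + b >= 0 else -1

def iterated_split(X, y, threshold=0.8, depth=0, max_depth=5):
    """Inseparable case: keep splitting a group with the SVM until one
    label holds more than `threshold` of it (or no further split is possible)."""
    majority = max(y.count(1), y.count(-1)) / len(y)
    if majority > threshold or depth == max_depth:
        return [(X, y)]
    w, b = train_linear_svm(X, y)
    side = [predict(w, b, x) for x in X]
    if len(set(side)) < 2:                     # everything on one side
        return [(X, y)]
    groups = []
    for s in (1, -1):
        Xs = [x for x, t in zip(X, side) if t == s]
        ys = [t for t, u in zip(y, side) if u == s]
        groups += iterated_split(Xs, ys, threshold, depth + 1, max_depth)
    return groups

# Hypothetical toy data: three SNP locations, reference from healthy records.
reference = ["AA", "CC", "GG"]
people = [
    (["AA", "CC", "GG"], -1), (["AA", "CC", "GT"], -1), (["AA", "CT", "GG"], -1),
    (["GG", "CC", "GG"], 1), (["GG", "CT", "GG"], 1), (["GG", "TT", "GG"], 1),
]
X = [score_vector(g, reference) for g, _ in people]   # steps 4 and 6
y = [label for _, label in people]
w, b = train_linear_svm(X, y)                          # step 7
groups = iterated_split(X, y)
print("w =", w, "b =", b)
print("group sizes:", [len(gy) for _, gy in groups])
```

The hyperplane (w, b) plays the role of the criterion in remark 2: a new person's genotypes are scored against the same reference with score_vector and passed to predict. With real data, a dedicated SVM package would normally replace the toy trainer.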