Microarray and Bioinformatics
Dr Jingfu Qiu (邱景富)
School of Public Health

Aims for the Microarray Bioinformatics
- Understand basic microarray technology and its use in gene expression analysis.
- Learn basic data analysis methods and how to apply them in the analysis of gene expression data:
  - Data acquisition
  - Data normalization
  - Data analysis
  - Data clustering

Vocabulary Review
- Gene: hereditary DNA sequence at a specific location on a chromosome.
- Genetics: study of heredity and variation in organisms.
- Genome: an organism's total DNA content (full DNA sequence).
- Genomics: study of organisms in terms of their genome.
- In April 2003, after 13 years of work and a cost of billions of US dollars, the Human Genome Project was declared complete, with about 99% of the human genome sequence reported.

Vocabulary Review (continued)
- Protein: sequence of amino acids that "does something".
- Proteomics: study of all of the proteins that can come from an organism's genome.
- Bioinformatics: the collection, organization and analysis of large-scale, complex biological data.
- Functional genomics: study aimed at obtaining an overall picture of genome functions, including the expression profiles at the mRNA level and the protein level.

Microarray
- A high-throughput technology that allows detection of thousands of genes simultaneously
- Also called gene chip, biochip, array
- Relies heavily on computer aids
- Central platform for functional genomics

Types of Microarrays
- DNA microarrays, such as cDNA microarrays and oligonucleotide microarrays
- MMChips, for surveillance of microRNA populations
- Protein microarrays
- Tissue microarrays
- Cellular microarrays (also called transfection microarrays)
- Chemical compound microarrays
- Antibody microarrays

Types of DNA Microarrays
1. cDNA chip (DNA microarray, two-channel array):
   - Probe cDNA (500-5,000 bases long) is immobilized on a solid surface such as glass by robotic spotting
   - Traditionally called DNA microarray
   - First developed at Stanford University
2. Gene chip (DNA chip, Affymetrix chip):
   - Oligonucleotide probes (20-80-mer oligos) are synthesized either in situ (on-chip) or by conventional synthesis followed by on-chip immobilization
   - Historically called DNA chips
   - Developed at Affymetrix, Inc., under the GeneChip trademark
   - Many companies manufacture oligonucleotide-based chips using alternative technologies

History
- HGP (Human Genome Project): suggested by Dulbecco on Mar. 7, 1986, started in Oct. 1990; it called for rapid and sensitive techniques for human genome information analysis
- 1980s: the idea was suggested by analogy with the computer chip; W. Bains made the first attempt.
- 1990s: Stephen Fodor (now President of Affymetrix) made it work successfully.
- 1995: "Quantitative monitoring of gene expression patterns with a complementary DNA microarray"
- End of 1996: the first DNA chip

Microarrays are Popular
- NYU Med Center is now collecting about 3 GB of microarray data per week (60 chips, 6-10 different experiments)
- PubMed search for "microarray" = 24,431 papers

What problems can it solve?
- Differing expression of genes over time, between tissues, and between disease states
- Identification of complex genetic diseases
- Drug discovery and toxicology studies
- Mutation/polymorphism detection (SNPs)
- Pathogen analysis

Features
- Parallelism: thousands of genes simultaneously
- Miniaturization: small chip size
- Multiplexing: multiple samples at the same time
- Automation: chip manufacturing, reagents

Differential Gene Expression
A few examples:
- Cell type specific, e.g. skin cell vs. brain cell
- Developmental stage, e.g. embryonic skin cell vs. adult skin cell
- Disease state, e.g. normal skin cell vs. skin tumor cell
- Environment-specific, e.g. skin cell untreated vs. treated (drugs, toxins)

What are its pitfalls?
- Detects transcription (mRNA level), not translation (protein level)
- Many factors (variations) can affect the result:
  - Chip and probe design
  - Experiment design
  - Sample preparation
  - Image acquisition
  - Data normalization
  - Data analysis
  - ...
- Key to success: you know both the biological problem and the computer aids (software, statistics).

Requirements
- Array spotter
- Array scanner
- Chemistry systems (hybridization system)
- Software

Market Forecast
- 1999: 1 billion USD
- Within 5 years: 20 billion USD
- 2005: 5 billion USD (USA)
- 2010: 40 billion USD (USA)
- Figures do not include disease diagnostics
- Projected to replace microelectronics as the largest industry

Principle
- Similar to Northern blotting
- Base pairing: hybridization between nucleic acids
- Major differences from Northern (microarray vs. Northern):
  - Detects thousands of genes simultaneously vs. individual genes
  - Probes fixed on a glass slide vs. a nylon membrane
  - Target samples labeled with fluorescent vs. radioactive dNTPs

Designing the Probes
- Specificity: the probes need to be of high specificity to avoid hybridization with the wrong target molecules.
- Readability: the probes need to generate an output that is easy to read (spots lie in defined positions and are of regular size and shape, with even spacing).
- Sensitivity: the probes must have high sensitivity to detect the mRNA, and the intensity of the spot signal must be distinguishable from background noise.
- Reproducibility: results must be reproducible across multiple experiments.

Spotting Process
[Figure: spotting pins and spotting robot. Cheung et al. 1999]

Affymetrix GeneChip
[Figure: detection of differential expression]

Comparison of Probe Types

Oligonucleotide probes
Advantages
- No need to isolate and purify cDNAs because oligonucleotides can be synthesized.
- Short oligonucleotides are less likely to have cross-reactivity with other sequences in the target DNA.
- Density of chips is higher than with cDNAs.
Limitations
- The sequence has to be known.
- Synthesis can be expensive and time-consuming.
- The short sequences are not as specific for target DNA, so appropriate controls must be added.

PCR products / cDNA probes (compared with in-situ synthesis / oligos)
Advantages
- Flexibility to study cDNAs from any source.
- cDNAs do not require any a priori information about the corresponding genes.
- Longer sequences increase hybridization specificity, which reduces false positives.
Limitations
- Isolation of individual cDNAs to immobilize on each spot can be cumbersome.
- Density is lower than when synthesizing oligonucleotides on the surface of the chip.
- cDNAs are longer sequences and are more likely to randomly contain sequences found in target DNA, which results in cross-reactivity.
Many other variations of the technology exist, such as the use of longer oligos, the use of fibre optics, etc.

Spotted Arrays vs. Affymetrix Arrays
  Spotted arrays                          Affymetrix arrays
  Homemade                                Commercially available
  Tailored                                "Off the rack"
  Cheaper?                                More expensive?
  Maximum 24,000 features per array       Maximum 500,000 features per array
  Prone to variability                    Less variability

Process of Manufacturing a Microarray
- Start with individual genes, e.g. the 4,200 genes of the genome of Y. pestis
- Amplify all of them using the polymerase chain reaction (PCR)
- "Spot" them on a medium, e.g. an ordinary glass microscope slide
- Each spot is about 100 µm in diameter
- Spotting is done by a robot
- Complex and potentially expensive task

[Figure: array layout with blocks B1 to B32 arranged as a 4 × 8 matrix of spot grids; 8,448 spots in total; 4,005 Y. pestis genes plus several control DNAs; each sample is spotted as two adjacent replicate spots.]

[Figure: whole-genome chip development. Gene selection (4,015 genes) -> primer design -> PCR amplification of the genes -> product purification and concentration (4,005 genes) -> chip spotting.]

[Figure: analysis workflow. Biological question -> experimental design -> microarray experiment -> image analysis (16-bit TIFF files; (Rfg, Rbg), (Gfg, Gbg)) -> normalization (R, G) -> estimation, testing, clustering, discrimination (differentially expressed genes, sample class prediction, etc.) -> biological verification and interpretation.]

Microarray Steps
- Experiment and data acquisition
  - Chip manufacturing
  - Sampling and labeling
  - Hybridization
  - Image scanning
  - Data acquisition
- Data normalization
- Data analysis
- Biological interpretation

Reading an Array (cont.)
  Block  Column  Row  Gene Name  Red    Green  Red:Green Ratio
  1      1       1    tub1       2,345  2,467  0.95
  1      1       2    tub2       3,589  2,158  1.66
  1      1       3    sec1       4,109  1,469  2.80
  1      1       4    sec2       1,500  3,589  0.42
  1      1       5    sec3       1,246  1,258  0.99
  1      1       6    act1       1,937  2,104  0.92
  1      1       7    act2       2,561  1,562  1.64
  1      1       8    fus1       2,962  3,012  0.98
  1      1       9    idp2       3,585  1,209  2.97
  1      1       10   idp1       2,796  1,005  2.78
  1      1       11   idh1       2,170  4,245  0.51
  1      1       12   idh2       1,896  2,996  0.63
  1      1       13   erd1       1,023  3,354  0.31
  1      1       14   erd2       1,698  2,896  0.59

Color Coding
- Tables are difficult to read
- Data is presented with a color scale
- Coding scheme:
  - Green = repressed (less mRNA) gene in experiment
  - Red = induced (more mRNA) gene in experiment
  - Black = no change (1:1 ratio)
- Or:
  - Green = control condition (e.g. aerobic)
  - Red = experimental condition (e.g. anaerobic)
- We only use the ratio

Noise
Noise sources:
- Sample preparation, labeling, amplification
- Reaction variations
- Environment
- Target volume
- Hybridization parameters (temperature, time, ...)
- Nonspecific hybridization
- Dust
- Scanner settings
- Quantization

Other Image Processing Problems
Spot quality problems:
- Uneven grid positions
- Curves within a grid
- Variable spot size or shape
- Variable distance between spots

Typical Problems of Raw Output
[Figure: two slides, P04 vs. P01 (pg2) and A1 vs. P01 (pg2)]

Noise Filtering
- Gridding: identify spot locations
- Segmentation: distinguish foreground from background
  - Fixed circle: put a circle around the foreground area
  - Seeded region growing: identify initial spot "seeds" and grow high-intensity regions
  - Edge detection algorithms
- Background cancellation: Intensity = FGintensity - BGintensity

Normalization
- The word normalization describes techniques used to suitably transform the data before they are analysed.
- The goal is to correct for systematic differences
  - between samples on the same slide, or
  - between slides,
  which do not represent true biological variation between samples.

Normalization (continued)
- Normalize data to correct for artifactual variance
- Red = FGred - BGred
- Green = FGgreen - BGgreen
- PixelValue = log2(Red/Green) - log2(Redavg/Greenavg)
- Pixel color: green if pixel value < 0, red if pixel value > 0

Normalization (continued)
[Figure: two panels. Calibrated: red and green equally detected. Uncalibrated: red light under-detected.]

The Origin of Systematic Differences
Systematic differences are due to:
- Dye biases, which vary with spot intensity
- Location on the array
- Plate origin
- Printing quality, which may vary between
  - pins
  - time of printing
- Scanning parameters

DNA Array Data Acquisition
- Image analysis software packages exist for the analysis of the output of custom-made chips (e.g. GenePix Pro, ArrayVision, TIGR Spotfinder, etc.)
- A chip description file (CDF) is needed for probe locations

Introduction of Software: SAM
- Significance Analysis of Microarrays
- Tusher, Tibshirani and Chu (2001): Significance analysis of microarrays applied to the ionizing radiation response. PNAS 2001 98:5116-5121 (Apr 24).
- Excel plugin
- Free
- Permutation based
- Most widely published method of microarray data analysis
- Example: choosing the threshold Δ (delta) = 0.5 produced about 65 significant genes and about 5.9 false positives on average.
- The choice of Δ is up to the user, depending on how many false positives he/she is comfortable with.
- The False Discovery Rate (FDR) is computed as the median (or 90th percentile) of the number of falsely called genes divided by the number of genes called significant.

Handling Missing Data
There are currently two options for imputing missing values in SAM.
- Row Average: each missing value is imputed with the average of the non-missing values for that gene.
- K-Nearest Neighbor: in the other (default) option, missing values are imputed using a k-nearest neighbor average in gene space (default k = 10).

Clustering
Hypothesis: genes with similar function have similar expression profiles
- Find groups of genes with similar expression profiles
- Find groups of individuals with similar expression profiles within a population
Clustering = group identification

Clustering Steps
- Choose a similarity metric to compare the transcriptional response or the expression profiles (feature extraction and pattern representation):
  - Pearson correlation
  - Spearman correlation
  - Euclidean distance
- Choose a clustering algorithm:
  - Hierarchical
  - K-means

Cluster Algorithms
- Unsupervised analysis:
  - Hierarchical
  - K-means
  - Self-organizing maps
  - Others
- Supervised analysis: classification rules

Hierarchical Clustering Example
[Figure: Eisen et al. (1998), PNAS 95(25):14863-14868. http://www.pnas.org/cgi/content/full/95/25/14863]

Steps of Hierarchical Clustering
1. Treat each of the n samples as its own cluster.
2. Compute the pairwise distances between the n samples to form a distance matrix.
3. Merge the two closest clusters into a new cluster.
4. Compute the distances between the new cluster and all other current clusters. Repeat merging and recomputing until only one cluster remains.
5. Draw the dendrogram, choose the distance cut point and the cluster groups, and interpret them.
In SPSS: Analyze -> Classify -> Hierarchical Cluster

Hierarchical Clustering
        g1     g2     g3     g4     g5
  g1           0.23   0.00   0.95   -0.63
  g2                  0.91   0.56   0.56
  g3                         0.32   0.77
  g4                                -0.36
  g5
Find the largest value in the similarity matrix. Join the clusters together (here g1 and g4). Recompute the matrix and iterate.

Hierarchical Clustering (continued)
          g1,g4   g2     g3     g5
  g1,g4           0.37   0.16   -0.52
  g2                     0.91   0.56
  g3                            0.77
  g5
Find the largest value in the similarity matrix. Join the clusters together (here g2 and g3). Recompute the matrix and iterate.

Hierarchical Clustering (continued)
          g1,g4   g2,g3   g5
  g1,g4           0.27    -0.52
  g2,g3                   0.68
  g5
Find the largest value in the similarity matrix. Join the clusters together (here g2,g3 and g5). Recompute the similarity matrix and iterate.

Interpreting the Results
[Figure: dendrogram with leaves g1, g4, g2, g3, g5]
2 clusters? 3 clusters?

k-means Clustering
- k-means clustering is a well-known method that assigns cluster membership by minimizing the differences among the items within a cluster while maximizing the distance between clusters. The "means" in k-means refers to the centroid of the cluster: an arbitrarily chosen data point that is refined iteratively until it represents the true mean of all data points in the cluster. The "k" refers to an arbitrary number of points used to seed the clustering process. The k-means algorithm computes the squared Euclidean distances between the data records in a cluster and the vector that represents the cluster mean, and converges on a final set of k clusters when that sum reaches its minimum.
- The k-means algorithm assigns each data point to exactly one cluster and does not allow for uncertainty in membership. Membership in a cluster is expressed as a distance from the centroid.
- Typically, the k-means algorithm is used to create clusters of continuous attributes, where computing the distance to a mean is straightforward. The Microsoft implementation, however, adapts the k-means method to cluster discrete attributes by using probabilities. For discrete attributes, the distance of a data point from a particular cluster is calculated as:
  1 - P(data point, cluster)

Recommended Texts
- General overview of microarray data analysis
  - "Microarray Gene Expression Data Analysis: A Beginner's Guide" (Causton, Quackenbush and Brazma)
  - "Microarray Bioinformatics" (Stekel)
- Data mining
  - "Data Mining: Concepts and Techniques" (Han)

Some Useful Links
- Affymetrix
- Michael Eisen Lab at LBL (hierarchical clustering software "Cluster" and "TreeView" (Windows)): rana.lbl.gov/
- Stanford MicroArray Database ("Xcluster" (Linux)): genome-www4.stanford.edu/MicroArray/SMD/
- Review of Currently Available Microarray Software: www.the-
- DB: www.biologie.ens.fr/en/genetiqu/puces/bddeng.html

Some Papers
- Eisen, M.B., et al. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95(25):14863-8.
- Wen, X., et al. (1998). Large-scale temporal gene expression mapping of central nervous system development. Proc Natl Acad Sci U S A 95(1):334-9.
- Alon, U., et al. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci U S A 96:6745-6750.
- Spellman, P.T., et al. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9(12):3273-97.

Thanks for your attention!
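
Worked example: normalization (sketch)
The formulas on the Normalization slides can be tried out on the rows of the "Reading an array" table. This is a minimal Python sketch, not the method of any particular image analysis package. It assumes the Red and Green columns of the table are already background-subtracted intensities, and it takes Redavg and Greenavg to be the arithmetic means of each channel over all spots (the slides do not say how the averages are computed). The function names are hypothetical.

```python
import math

# (gene, red, green, red:green ratio) copied from the "Reading an array" table.
# Assumption: red and green are already background-subtracted intensities.
SPOTS = [
    ("tub1", 2345, 2467, 0.95), ("tub2", 3589, 2158, 1.66),
    ("sec1", 4109, 1469, 2.80), ("sec2", 1500, 3589, 0.42),
    ("sec3", 1246, 1258, 0.99), ("act1", 1937, 2104, 0.92),
    ("act2", 2561, 1562, 1.64), ("fus1", 2962, 3012, 0.98),
    ("idp2", 3585, 1209, 2.97), ("idp1", 2796, 1005, 2.78),
    ("idh1", 2170, 4245, 0.51), ("idh2", 1896, 2996, 0.63),
    ("erd1", 1023, 3354, 0.31), ("erd2", 1698, 2896, 0.59),
]


def background_subtract(foreground, background):
    """Background cancellation from the slides: Intensity = FG - BG."""
    return foreground - background


def normalized_log_ratios(spots):
    """PixelValue = log2(Red/Green) - log2(Red_avg/Green_avg) for each spot.

    Assumption: Red_avg and Green_avg are arithmetic means over all spots.
    """
    red_avg = sum(red for _, red, _, _ in spots) / len(spots)
    green_avg = sum(green for _, _, green, _ in spots) / len(spots)
    offset = math.log2(red_avg / green_avg)
    return {gene: math.log2(red / green) - offset
            for gene, red, green, _ in spots}


def pixel_color(value):
    """Color coding from the slides: green < 0, red > 0, black = no change."""
    if value < 0:
        return "green"
    if value > 0:
        return "red"
    return "black"


if __name__ == "__main__":
    for gene, value in normalized_log_ratios(SPOTS).items():
        print(f"{gene}: {value:+.2f} {pixel_color(value)}")
```

Working in log2 makes induction and repression symmetric: a two-fold increase gives +1 and a two-fold decrease gives -1, which is why the sign of the pixel value maps directly onto the red/green color code.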
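
Worked example: hierarchical clustering (sketch)
The "find the largest value, join, recompute, iterate" loop from the Hierarchical Clustering slides can be sketched on the g1 to g5 similarity matrix. The slides do not state how similarities are recomputed after a merge, so this sketch assumes average linkage (the mean similarity over all cross-cluster gene pairs). Under that assumption the recomputed values differ slightly from those on the slides (for example 0.395 instead of 0.37 for g1,g4 vs. g2), but the order of the merges is the same. All function names are hypothetical.

```python
from itertools import combinations

GENES = ["g1", "g2", "g3", "g4", "g5"]

# Upper triangle of the similarity matrix from the first clustering slide.
SIM = {
    ("g1", "g2"): 0.23, ("g1", "g3"): 0.00, ("g1", "g4"): 0.95, ("g1", "g5"): -0.63,
    ("g2", "g3"): 0.91, ("g2", "g4"): 0.56, ("g2", "g5"): 0.56,
    ("g3", "g4"): 0.32, ("g3", "g5"): 0.77,
    ("g4", "g5"): -0.36,
}


def gene_similarity(a, b):
    """Look up a pairwise similarity regardless of argument order."""
    return SIM[(a, b)] if (a, b) in SIM else SIM[(b, a)]


def cluster_similarity(c1, c2):
    """Average linkage (an assumption): mean over all cross-cluster pairs."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(gene_similarity(a, b) for a, b in pairs) / len(pairs)


def agglomerate(genes):
    """Repeatedly join the two most similar clusters until one remains.

    Returns the merges in order as (cluster_a, cluster_b, similarity).
    """
    clusters = [(g,) for g in genes]  # step 1: every gene is its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the largest value among the current cluster similarities.
        c1, c2 = max(combinations(clusters, 2),
                     key=lambda pair: cluster_similarity(*pair))
        merges.append((c1, c2, round(cluster_similarity(c1, c2), 4)))
        # Join the two clusters; similarities are recomputed on the next pass.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


MERGES = agglomerate(GENES)

if __name__ == "__main__":
    for c1, c2, sim in MERGES:
        print(f"join {c1} + {c2} at similarity {sim}")
```

Cutting the resulting tree before the last merge gives the two clusters (g1, g4) and (g2, g3, g5); cutting one merge earlier gives three clusters, which is the choice the "Interpreting the Results" slide asks about.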