Multivariate Statistical Analysis Course Slides (.ppt)


Preface to the 1st Edition

Most of the observable phenomena in the empirical sciences are of a multivariate nature. In financial studies, assets in stock markets are observed simultaneously and their joint development is analyzed to better understand general tendencies and to track indices. The underlying theoretical structure of these and many other quantitative studies of applied sciences is multivariate. This book on Applied Multivariate Statistical Analysis presents the tools and concepts of multivariate data analysis with a strong focus on applications.

The aim of the book is to present multivariate data analysis in a way that is understandable for non-mathematicians and practitioners who are confronted by statistical data analysis. This is achieved by focusing on the practical relevance and through the e-book character of this text. All practical examples may be recalculated and modified by the reader using a standard web browser, without reference to or application of any specific software.

The book is divided into three main parts. The first part is devoted to graphical techniques describing the distributions of the variables involved. The second part deals with multivariate random variables and presents, from a theoretical point of view, distributions, estimators and tests for various practical situations. The last part is on multivariate techniques and introduces the reader to the wide selection of tools available for multivariate data analysis.

All data sets are given in the appendix and are downloadable from www.md-. The text contains a wide variety of exercises, the solutions of which are given in a separate textbook. In addition, a full set of transparencies is provided on www.md-, making it easier for an instructor to present the materials in this book. All transparencies contain hyperlinks to the statistical web service so that students and instructors alike may recompute all examples via a standard web browser.

Weeks 1-2: UNIT-I Descriptive Techniques

1 Comparison of Batches
1.1 Boxplots
1.2 Histograms
1.3 Scatterplots
1.4 Data Set: Boston Housing

1 Comparison of Batches

Multivariate statistical analysis is concerned with analyzing and understanding data in high dimensions. We suppose that we are given a set $\{x_i\}_{i=1}^n$ of $n$ observations of a variable vector $X$ in $\mathbb{R}^p$. That is, we suppose that each observation $x_i$ has $p$ dimensions, $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$, and that it is an observed value of a variable vector $X \in \mathbb{R}^p$. Therefore, $X$ is composed of $p$ random variables, $X = (X_1, X_2, \ldots, X_p)$, where $X_j$, for $j = 1, \ldots, p$, is a one-dimensional random variable.

How do we begin to analyze this kind of data? Before we investigate questions on what inferences we can reach from the data, we should think about how to look at the data. This involves descriptive techniques. Questions that we could answer by descriptive techniques are:

Are there components of $X$ that are more spread out than others?
Are there some elements of $X$ that indicate subgroups of the data?
Are there outliers in the components of $X$?
How "normal" is the distribution of the data?

1.1 Boxplots

[Figure: boxplots; recovered labels: Genuine, X6, X1]

Summary

The median and mean bars are measures of location.
The relative location of the median (and the mean) in the box is a measure of skewness.
The length of the box and whiskers is a measure of spread.
The length of the whiskers indicates the tail length of the distribution.
Outlying points are marked with one of two distinct symbols, depending on whether they lie outside $F_{U/L} \pm 1.5\,d_F$ or $F_{U/L} \pm 3\,d_F$, respectively.
The boxplots do not indicate multimodality or clusters.
If we compare the relative size and location of the boxes, we are comparing distributions.
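The quantities a boxplot displays can be computed directly. The following is a minimal sketch, not the slides' own procedure: it assumes that the quartiles returned by `np.percentile` stand in for the lower and upper fourths $F_L$ and $F_U$, that $d_F = F_U - F_L$, and it uses a hypothetical batch of ten values. The function name `boxplot_summary` is made up for illustration.

```python
import numpy as np


def boxplot_summary(x):
    """Location, spread and outlier flags that a boxplot displays."""
    x = np.asarray(x, dtype=float)
    # Assumption: quartiles stand in for the lower and upper fourths.
    f_l, median, f_u = np.percentile(x, [25, 50, 75])
    d_f = f_u - f_l  # length of the box
    # Points beyond 1.5 and 3 box lengths from the box edges.
    outside = x[(x < f_l - 1.5 * d_f) | (x > f_u + 1.5 * d_f)]
    far_out = x[(x < f_l - 3.0 * d_f) | (x > f_u + 3.0 * d_f)]
    return {
        "median": median,
        "mean": x.mean(),
        "d_f": d_f,
        "outside": outside,
        "far_out": far_out,
    }


# Hypothetical batch with one extreme value.
batch = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])
summary = boxplot_summary(batch)
print(summary)
```

In this batch the mean lies above the median, which hints at right skewness, and the value 30 is flagged at both the 1.5 and the 3 box-length level.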

Reading material

21. data capacity (数据容量)
22. data handling (数据处理)
23. data reduction (数据缩减分析)
24. data transformation (数据变换)
25. density function (密度函数)
26. description (描述)
27. descriptive (描述性的)
28. deviation from average (均值离差)
29. Df.Fit (拟合差值)
30. df. (degree of freedom) (自由度)
31. distribution shape (分布形状)
32. double logarithmic (双对数)
33. eigenvector (特征向量)
34. error of estimate (估计误差)
35. estimation (估计量)
36. Euclidean distance (欧式距离)
37. expected value (期望值)
38. experimental sampling (实验抽样)

39. explanatory variable (说明变量)
40. explore Summarize (探索摘要)

1.2 Histograms

Histograms are density estimates. A density estimate gives a good impression of the distribution of the data. In contrast to boxplots, density estimates show possible multimodality of the data. The idea is to locally represent the data density by counting the number of observations in a sequence of consecutive intervals (bins) with origin $x_0$. Let $B_j(x_0, h)$ denote the bin of length $h$ which is the element of a bin grid starting at $x_0$:

$$B_j(x_0, h) = [x_0 + (j-1)h,\; x_0 + jh), \quad j \in \mathbb{Z},$$

where $[\cdot,\cdot)$ denotes a left-closed and right-open interval. If $\{x_i\}_{i=1}^n$ is an i.i.d. sample with density $f$, the histogram is defined as follows:

$$\hat f_h(x) = n^{-1} h^{-1} \sum_{j \in \mathbb{Z}} \sum_{i=1}^n I\{x_i \in B_j(x_0, h)\}\, I\{x \in B_j(x_0, h)\}. \tag{1.7}$$

In sum (1.7) the first indicator function $I\{x_i \in B_j(x_0, h)\}$ counts the number of observations falling into bin $B_j(x_0, h)$. The second indicator function is responsible for "localizing" the counts around $x$. The parameter $h$ is a smoothing or localizing parameter and controls the width of the histogram bins. An $h$ that is too large leads to very big blocks and thus to a very unstructured histogram. On the other hand, an $h$ that is too small gives a very variable estimate with many unimportant peaks. The effect of $h$ is shown in detail in Figure 1.6.

[Figure 1.6: histograms of the diagonal; panels h = 0.1, h = 0.2, h = 0.3, h = 0.4]
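The histogram estimate described above can be evaluated point by point: find the bin that contains $x$, count the observations in the same bin, and divide by $nh$. The sketch below follows that reading; the function name `histogram_density` and the five-point sample are hypothetical and are not taken from the bank note data.

```python
import numpy as np


def histogram_density(x, data, x0, h):
    """Histogram estimate at points x for origin x0 and binwidth h."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    # Index j of the bin [x0 + (j-1)h, x0 + jh) containing each value.
    j_x = np.floor((x - x0) / h).astype(int) + 1
    j_data = np.floor((data - x0) / h).astype(int) + 1
    # For each x, count the observations that share its bin.
    counts = (j_x[:, None] == j_data[None, :]).sum(axis=1)
    return counts / (data.size * h)


# Hypothetical sample, not the bank note data.
data = np.array([0.1, 0.2, 0.35, 0.8, 1.4])
grid = np.array([0.25, 0.75, 1.25, 2.0])

fine = histogram_density(grid, data, x0=0.0, h=0.5)
coarse = histogram_density(grid, data, x0=0.0, h=2.0)
print(fine)    # one value per grid point, small binwidth
print(coarse)  # same grid, large binwidth
```

With the large binwidth all five observations fall into a single block, so the estimate is constant over that block; with the small binwidth the three bins show different heights.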

Figure 1.6 contains the histogram (upper left) for the diagonal of the counterfeit bank notes for $x_0 = 137.8$ (the minimum of these observations) and $h = 0.1$. Increasing $h$ to $h = 0.2$ and using the same origin, $x_0 = 137.8$, results in the histogram shown in the lower left of the figure. This density histogram is somewhat smoother due to the larger $h$. The binwidth is next set to $h = 0.3$ (upper right). From this histogram, one has the impression that the distribution of the diagonal is bimodal with peaks at about 138.5 and 139.9. The detection of modes requires a fine tuning of the binwidth. Using methods from smoothing methodology one can find an "optimal" binwidth $h$ for $n$ observations:

$$h_{opt} = \left(\frac{24\sqrt{\pi}}{n}\right)^{1/3}.$$

In Figure 1.7, we show histograms with $x_0 = 137.65$ (upper left), $x_0 = 137.75$ (lower left), $x_0 = 137.85$ (upper right), and $x_0 = 137.95$ (lower right). All the graphs have been scaled equally on the y-axis to allow comparison. One sees that, despite the fixed binwidth $h$, the interpretation is not facilitated. The shift of the origin $x_0$ (to 4 different locations) created 4 different histograms. This property of histograms strongly contradicts the goal of presenting data features.
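The dependence on the origin is easy to reproduce on a small sample. The sketch below is illustrative only. It assumes the binwidth rule reads $h = (24\sqrt{\pi}/n)^{1/3}$, which to my understanding is derived for standard normal data, so data on another scale would need rescaling first. The names `optimal_binwidth`, `bin_counts` and `averaged_histogram` are hypothetical, and averaging over equally spaced shifted origins is one possible reading of the "averaged histograms" that the summary recommends.

```python
import numpy as np


def optimal_binwidth(n):
    """Binwidth h = (24 * sqrt(pi) / n) ** (1/3) for n observations."""
    return (24.0 * np.sqrt(np.pi) / n) ** (1.0 / 3.0)


def bin_counts(data, x0, h):
    """Counts in the bins [x0 + (j-1)h, x0 + jh), j = 1, 2, ...

    Assumes x0 <= min(data) so that every bin index is non-negative.
    """
    data = np.asarray(data, dtype=float)
    j = np.floor((data - x0) / h).astype(int)  # zero-based bin index
    return np.bincount(j)


def averaged_histogram(x, data, x0, h, m):
    """Average of m histogram estimates with origins x0 + k*h/m."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    total = np.zeros(x.shape)
    for k in range(m):
        origin = x0 + k * h / m
        j_x = np.floor((x - origin) / h)
        j_data = np.floor((data - origin) / h)
        same_bin = (j_x[:, None] == j_data[None, :]).sum(axis=1)
        total += same_bin / (data.size * h)
    return total / m


# Hypothetical sample, not the bank note data.
data = np.array([0.1, 0.2, 0.35, 0.8, 1.4])

print(optimal_binwidth(100))
print(bin_counts(data, x0=0.0, h=0.5))    # one origin
print(bin_counts(data, x0=-0.25, h=0.5))  # shifted origin, same binwidth
print(averaged_histogram([0.3], data, x0=0.0, h=0.5, m=2))
```

Shifting the origin by half a binwidth changes the bin counts, and therefore the shape of the histogram, even though the data and $h$ are unchanged. Averaging over several shifted grids removes the need to pick a single origin.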

Summary

Modes of the density are detected with a histogram.
Modes correspond to strong peaks in the histogram.
Histograms with the same $h$ need not be identical. They also depend on the origin $x_0$ of the grid.
The influence of the origin $x_0$ is drastic. Changing $x_0$ creates different looking histograms.
The consequence of an $h$ that is too large is an unstructured histogram that is too flat.
A binwidth $h$ that is too small results in an unstable histogram.
There is an "optimal" $h = (24\sqrt{\pi}/n)^{1/3}$.
It is recommended to use averaged histograms. They are kernel densities.

1.4 Scatterplots

Scatterplots are bivariate or trivariate plots of variables against each other. They help us understand relationships among the variables of a data set. A downward-sloping scatter indicates that as we increase the variable on the horizontal axis, the variable on the vertical axis decreases. An analogous statement can be made for upward-sloping scatters.

Figure 1.12 plots the 5th column (upper inner frame) of the bank data against the 6th column (diagonal). The scatter is downward-sloping. As we already know from the previous section on marginal comparison, a good separation between genuine and counterfeit bank notes is visible for the diagonal variable. The sub-cloud in the upper half (circles) of Figure 1.12 corresponds to the true bank notes. As noted before, this separation is not distinct, since the two groups overlap somewhat.

Summary

Scatterplots in two and three dimensions help in identifying separated points, outliers or sub-clusters.
Scatterplots help us in judging positive or negative dependencies.
Draftman scatterplot matrices help detect structures conditioned on values of other variables.
As the brush of a scatterplot matrix moves through a point cloud, we can study conditional dependence.
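Whether a scatter slopes upward or downward can also be checked numerically. As a rough sketch, the sign of the sample correlation coefficient is used here as a stand-in for the visual impression. The function name `scatter_direction` and the five data pairs are hypothetical and are not the bank data.

```python
import numpy as np


def scatter_direction(x, y):
    """Label a scatter by the sign of the sample correlation."""
    r = np.corrcoef(x, y)[0, 1]
    if r > 0:
        return "upward-sloping", r
    if r < 0:
        return "downward-sloping", r
    return "no linear trend", r


# Hypothetical pairs: y falls as x rises.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([9.8, 8.1, 6.2, 3.9, 2.0])

label, r = scatter_direction(x, y)
print(label, r)
```

The correlation only captures linear dependence, so it does not replace looking at the plot: sub-clusters and outliers, which the scatterplot reveals, can hide behind a single coefficient.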

1.8 Data Set: Boston Housing

First Step: New Words
Category 1: high-frequency words (160 in total)

1. absolute deviation (绝对离差)
2. absolute residuals (绝对残差)
3. among groups (组间)
4. analysis of correlation (相关分析)
5. analysis of covariance (协方差分析)
6. analysis of regression (回归分析)
7. Bayesian estimation (Bayes 估计)
8. bivariate (双变量的)
9. bivariate Correlate (二变量相关)
10. boxplot (箱线图)
11. canonical correlation (典型相关)
12. categorical variable (分类变量)
13. central tendency (集中趋势)
14. chance statistics (随机统计量)
15. chance variable (随机变量)
16. classified variable (分类变量)
17. coefficient of skewness (偏度系数)
18. confidence limit (置信限)
19. cumulative probability (累计概率)
20. curvature (曲率)
21. data capacity (数据容量)
22. data handling (数据处理)
23. data reduction (数据缩减分析)
24. data transformation (数据变换)
25. density function (密度函数)
26. description (描述)
27. descriptive (描述性的)
28. deviation from average (离均差)
29. Df.Fit (拟合差值)
30. df. (degree of freedom) (自由度)
31. distribution shape (分布形状)
32. double logarithmic (双对数)
33. eigenvector (特征向量)
34. error of estimate (估计误差)
35. estimation (估计量)
36. Euclidean distance (欧式距离)
37. expected value (期望值)
38. experimental sampling (实验抽样)
39. explanatory variable (说明变量)
40. explore Summarize (探索摘要)
41. extreme value (极值)
42. factor score (因子得分)
43. factorial designs (因子设计)
44. factorial experiment (因子实验)
45. finite population (有限总体)
46. finite-sample (有限样本)
47. F-test (F检验)
48. function (函数)
49. function relationship (函数关系)
50. gamma distribution (伽马分布)
51. geometric mean (几何均值)
52. goodness-of-fit (拟合优度)
53. group averages (分组平均)
54. grouped data (分组资料)
55. grouped median (组中值)
56. hypothesis (假设)
57. hypothesis test (假设检验)
58. hypothetical universe (假设总体)
59. impossible event (不可能事件)
60. independent samples (独立样本)
61. independent variable (自变量)
62. infinitely great (无穷大)
63. interclass correlation (组内相关)
64. inter-item correlation (样本内相关)
65. item means (样本均值)
