Regression Analysis

Objectives
- Introduce the basic concepts of correlation and regression
- Link regression to the Six Sigma roadmap
- Review the use of multiple regression

Project Tracker (5th edition)
[Figure: DMAIC project tracker with planned and actual completion dates. Project start date 21/01/2004; milestone dates run through 19/07/2004. Legend: planned completion date, actual completion date, check mark when complete.]
Tasks shown on the tracker:
- Define: project category; identify the "Y" variable and draft the project charter; data collection plan for the "Y" variable; project schedule; charter approved and project launched
- Measure: process map; C&E matrix or fault tree analysis (FTA); 30-day MBB review; FMEA or FTA; measurement system analysis (MSA); data collection plan for key "X" variables; MBB review
- Analyze: initial capability study; multi-vari process analysis; MBB review; contract approval
- Improve: single-factor or multi-factor tests; design of experiments (DOE); MBB review
- Control: control plan; final capability study; control-phase FMEA review with revised RPN; MBB review; final project presentation and report; project audit and close-out
- Process redesign (use as needed): voice of the customer / voice of the business survey (VOC/VOB); requirements analysis; solution design; determine the improvement plan; implement the improvement; handover training and process owner sign-off
Notes on the tracker: enter the start date here; the project entry is completed by the project sponsor in the candidate-project database; look for similar projects in the Six Sigma database; the redesign roadmap schedule is calculated independently and is not linked to the DMAIC dates above.

Analyze Roadmap: Single X, Single Y

                 X discrete                    X continuous
Y discrete       Chi-Square                    Logistic Regression
Y continuous     ANOVA, Means/Medians Tests    Regression

Scenario #1
A manager wants to know whether an operator's experience (measured in months) affects the time needed to answer customer hotline calls.
- What is Y? ____  Data type? ____
- What is X? ____  Data type? ____
- Which tool should be used? ____

Correlation
What is correlation?
- Have you ever measured something and then shipped it to your customer, only for them to tell you it doesn't meet spec?
- How well correlated do you think two ice skating judges are at the Olympics?

Correlation: Analyze Roadmap
- Produce a scatter plot
- Calculate the correlation
- Evaluate r and the P value

Correlation Coefficients
So what is the correlation coefficient supposed to be anyway?
The correlation coefficient (r) lies between -1 and 1.
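The "calculate correlation" step can be illustrated with a minimal sketch in plain Python. The data below are invented purely for illustration (the slides give no data for Scenario #1), and `pearson_r` is a hypothetical helper name, not a Minitab function. Minitab also reports a P value for r, which this sketch does not compute.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical Scenario #1 data (NOT from the source slides):
# operator experience in months and average call handling time in minutes.
experience_months = [2, 5, 8, 12, 18, 24, 30, 36]
call_minutes = [9.5, 9.0, 8.2, 7.9, 6.8, 6.1, 5.9, 5.0]

r = pearson_r(experience_months, call_minutes)
print(f"r = {r:.3f}")
```

A value of r close to -1 or 1 indicates a strong linear relationship; the scatter plot should still be inspected first, since r only measures linear association.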
General Rules
Correlation coefficient: r > 0.80 or r < -0.80 …

Regression Analysis: Braking Dist versus Speed

The regression equation is
Braking Distance = 182.8 + 0.4763 Speed

S = 13.5571   R-Sq = 69.5%   R-Sq(adj) = 67.9%

Analysis of Variance
Source       DF        SS        MS       F      P
Regression    1    7955.9   7955.91   43.29  0.000
Error        19    3492.1    183.79
Total        20   11448.0

Minitab: More Output
R2 (same one as before)

R2: What does it mean?
- R2 and P help us put some statistical backing behind our decisions. R2 is called the coefficient of determination.
- R2 is a measure of the amount of variation in the output that is explained by the regression model. It is always a value between 0 and 1 (0% to 100%). The higher this amount, the greater the confidence we have in the model itself.

R2 = 69.5%
- This means 69.5% of the variation in Y (Braking Distance) can be explained by X (Speed).
- The remaining 30.5% is due to something else.
- What is your decision?

[Figure: Fitted Line Plot, Braking Distance = 182.8 + 0.4763 Speed. X axis: Speed; Y axis: Braking Distance. S = 13.5571, R-Sq = 69.5%, R-Sq(adj) = 67.9%]

R2: How big should it be?
- That answer depends on what you are studying, e.g. safety systems or paper clips.
- If you are experimenting with a new safety restraint system, your numbers will probably be reviewed by the Department of Transportation. How "good" should you be?
- Different texts suggest different decision criteria (usually 80% or higher). The important thing to realize is that the higher the R2, the better the model.

What is going on here? Another P value!
The Regression row of the Analysis of Variance table in the Minitab output above reports P = 0.000.
- Ho: slope of the line = 0 (no correlation)
- Ha: slope of the line ≠ 0 (there is correlation)
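The summary statistics in the braking-distance output follow directly from its Analysis of Variance table. The sketch below redoes that arithmetic in plain Python using the SS and DF values printed by Minitab; the variable names are illustrative only.

```python
import math

# Values copied from the Analysis of Variance table for Braking Dist versus Speed.
ss_regression, df_regression = 7955.9, 1
ss_error, df_error = 3492.1, 19
ss_total, df_total = 11448.0, 20

ms_regression = ss_regression / df_regression   # mean square for regression
ms_error = ss_error / df_error                  # mean square for error

r_sq = ss_regression / ss_total                 # share of variation explained
r_sq_adj = 1 - ms_error / (ss_total / df_total) # adjusted for degrees of freedom
f_stat = ms_regression / ms_error               # F statistic behind the P value
s = math.sqrt(ms_error)                         # standard error of the regression

print(f"R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.1%}")
print(f"F = {f_stat:.2f}, S = {s:.4f}")
```

These reproduce the R-Sq, R-Sq(adj), F and S figures in the output. The P value itself comes from the F distribution with 1 and 19 degrees of freedom, which is left to the statistics package.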
Remember: when P is low, Ho must go!
The P value is another hypothesis test.

Minitab: Regression, Residuals & Fits

Speed  Distance      RESI1     FITS1
  336       325   -17.8392   342.839
  418       375    -6.8948   381.895
  355       367    15.1113   351.889
  445       385    -9.7546   394.755
  365       375    18.3484   356.652
  455       395    -4.5175   399.517
  395       395    24.0598   370.940
  405       365   -10.7031   375.703
  346       355     7.3979   347.60…

Minitab: More Output
Columns: Speed, Distance, RESI1 (residual), FITS1 (fit). First row: 336, 325, -17.8392, 342.839.

Residual & Fit: What are they?
[Figure: fitted line with the actual point (Speed 336, Distance 325), the theoretical fit on the line (342.839), and the residual distance between them (-17.8392)]
- Residual: the vertical distance from the point to the fitted line. Negative is below the line, positive is above.
- Fit: the theoretical y value on the fitted line.

Regression: Residuals & Fits, Graphical Summary
- Data should pass the "Fat Pencil Test".
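The residual and fit columns can be reproduced by hand from the regression equation. The sketch below uses the speed and distance values from the Residuals & Fits table; because it uses the rounded coefficients printed in the equation (182.8 and 0.4763), its results agree with Minitab's RESI1 and FITS1 only to about two decimal places.

```python
# Coefficients as printed in the regression equation (rounded by Minitab).
INTERCEPT = 182.8
SLOPE = 0.4763

# (speed, distance) pairs from the Residuals & Fits table.
observations = [(336, 325), (418, 375), (355, 367), (445, 385), (365, 375),
                (455, 395), (395, 395), (405, 365), (346, 355)]

def fit(speed):
    """Theoretical y value on the fitted line for a given speed."""
    return INTERCEPT + SLOPE * speed

def residual(speed, distance):
    """Observed distance minus fitted distance (negative = point below the line)."""
    return distance - fit(speed)

print("Speed  Dist   Residual      Fit")
for speed, distance in observations:
    print(f"{speed:5d} {distance:5d} {residual(speed, distance):10.3f} {fit(speed):8.3f}")
```

The sign convention matches the slide: a point below the fitted line, such as (336, 325), has a negative residual.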
Residual Analysis
- Data should fit a bell-shaped curve.
- Check the P value with a normality test on the residuals.
- Data should be in control; investigate outliers.
- Data should exhibit no patterns.

[Figure: Residual Plots for Braking Distance. Four panels: Normal Probability Plot of the Residuals; Residuals Versus the Fitted Values; Histogram of the Residuals; Residuals Versus the Order of the Data]

Other Examples
Using the Minitab project:
- Exercise #1 (Analyze worksheet): Y = Paint Thickness; X1 = Air Pressure; X2 = Viscosity
- Exercise #2 (Analyze worksheet): Y = Customer Response Time; X1 = Experience Level of Agent; X2 = Distance From Customer Site
- Exercise #3: Analyze

Beware of Stating Causality
- If we establish a correlation between Y and an X, that doesn't necessarily mean variation in X caused variation in Y.
- Other variables may be lurking that cause both X and Y to vary.

Regression Issues: Missing Xs
Research has consistently shown that as hospital size increases, the death rate of patients dramatically increases. So, should we avoid large hospitals?
[Figure: scatter plot, X = hospital size, Y = death rate]

Regression Issues: Missing Xs
Data on a city showed that as the population density of storks increased, so did the town's population. Did storks influence the population?
[Figure: scatter plot, X = number of storks, Y = town population]

Regression Issues: Range of Study Too Small
[Figure: scatter plot, X = Age of Car (0 to 5), Y = Sales Value ($)]
What would this look like now?
[Figure: Value of Car versus Age of Car, with the age axis extended to 50]

Analyze Roadmap
Input (X) data: single X or multiple Xs. Output (Y) data: single Y or multiple Ys.
For multiple Ys, use Multivariate Analysis. (Note: this is not the same as Multi-Vari Analysis.)

Single Y, single X:
- Y discrete, X discrete: Chi-Square
- Y discrete, X continuous: Logistic Regression
- Y continuous, X discrete: T-test, ANOVA, Means/Medians Tests
- Y continuous, X continuous: Regression

Single Y, multiple Xs:
- Y discrete, Xs discrete or continuous: Multiple Logistic Regression
- Y continuous, Xs discrete: 2-, 3-, 4-way ANOVA, Medians Tests
- Y continuous, Xs continuous: Multiple Regression

Multiple Regression Analysis
- Two or more process variables (Xs) may have an influence upon process performance (Y).
- Multiple regression is used whenever there are two or more possible predictor variables.
- The general form of the multiple regression equation is
  $Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$

Example: Brake Sales
Twenty observations regarding Brake Sales are given. There are five known process variables and one performance variable, Y:
- X1 = Year
- X2 = Mktg$ (marketing spend)
- X3 = Sales Rep (number of sales reps this year)
- X4 = LY(Sales Rep) (number of sales reps last year)
- X5 = Product
- Y = Sales
Use the data to mine for the "vital few" process variables that may influence Sales.

Brake Sales Data

Year  Mktg$  SalesRep  LY(SalesRep)  Product  Sales
   1    9.6        30            20       18    130
   2   10.3        20            30       17    157
   3   10.2        15            20       19    129
   4   10.4        25            15       22    129
   5   10.6        30            25       24    162
   6   10.7        15            30       18    154
   7   10.5        25            15       17    132
   8   10.9        35            25       16    172
   9   11.0        40            35       14    207
  10   11.1        20            40       18    204
  11   11.2        25            20       22    144
  12   11.2        35            25       25    175
  13   11.4         5            35       27    167
  14   11.2        12             5       28     97
  15   11.6        16            12       18    122
  16   11.7        21            16       16    139
  17   11.8        22            21       15    153
  18   11.8        24            22       16    156
  19   11.8        26            24       10    172
  20   12.1        28            26       18    178

Brake Sales
Our goal is to fit a multiple regression of the following form:
$Y = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4 + b_5 X_5$
This problem will illustrate the following additional aspects of multiple regression:
(1) elimination of X-variables that have no explanatory power;
(2) residual analysis.
What stays in the model must have controls. In Six Sigma, the goal is to control a few.

Multiple Regression: Analyze Roadmap
- Plan the analysis
- Collect data
- Multicollinearity "X" check (correlation)
- Analyze using regression or best subsets
- Run multiple regression on the reduced model
- Evaluate R2 and the significance of the P values
- Evaluate residuals
- Make decisions
The fitted line plot is no longer used, because there are multiple lines.

Correlated Predictor Variables (Multicollinearity)
$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$
- Correlation between the process output (Y) and the predictor variables (Xs) is good: it helps us identify possible cause and effect relationships.
- Correlation between predictors, in contrast, is a problem.
  - Calculated signs and magnitudes of the coefficients of correlated predictors may be wrong.
  - Calculated P-values may be large.
- High correlation between predictor variables is called "collinearity".

Multicollinearity: Brake Sales
The Brake Sales data from the table above are used again. Predictor variables:
(1) Year; (2) Marketing $; (3) how many sales reps this year; (4) how many sales reps last year; (5) Product.

Multicollinearity: Brake Sales
- Select all five predictor variables and the response variable.
- Use the Minitab menu: Stat > Basic Statistics > Correlation.
- Uncheck the P value option.

The Year and Marketing $ variables are highly correlated! We will have to choose one or the other of the correlated predictor variables (but not both) to use in a regression fit. It is possible that marketing $ is a function of the year, so keep marketing $ and eliminate Year.
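The multicollinearity check amounts to computing the correlation between every pair of predictors and flagging the strong ones. The following is a minimal plain-Python sketch of that idea, assuming the Brake Sales predictor values transcribed from the data table and a cutoff of |r| > 0.8; `pearson_r` and `THRESHOLD` are illustrative names, and Minitab's correlation command is what the slides actually use.

```python
import math

# Brake Sales predictors, transcribed from the data table in this section.
predictors = {
    "Year": list(range(1, 21)),
    "Mktg$": [9.6, 10.3, 10.2, 10.4, 10.6, 10.7, 10.5, 10.9, 11.0, 11.1,
              11.2, 11.2, 11.4, 11.2, 11.6, 11.7, 11.8, 11.8, 11.8, 12.1],
    "Sales Rep": [30, 20, 15, 25, 30, 15, 25, 35, 40, 20,
                  25, 35, 5, 12, 16, 21, 22, 24, 26, 28],
    "LY(Sales Rep)": [20, 30, 20, 15, 25, 30, 15, 25, 35, 40,
                      20, 25, 35, 5, 12, 16, 21, 22, 24, 26],
    "Product": [18, 17, 19, 22, 24, 18, 17, 16, 14, 18,
                22, 25, 27, 28, 18, 16, 15, 16, 10, 18],
}

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

THRESHOLD = 0.8  # assumed cutoff for "highly correlated" predictors
names = list(predictors)
flagged = []
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson_r(predictors[first], predictors[second])
        if abs(r) > THRESHOLD:
            flagged.append((first, second, r))
        print(f"{first:>14} vs {second:<14} r = {r:6.3f}")

print("Highly correlated pairs:", [(a, b) for a, b, _ in flagged])
```

Only one variable from each flagged pair should be carried into the regression fit, which is why Year is dropped in favor of Mktg$.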
Rule of thumb: if the correlation is > 0.8 or < -0.8 …
Minitab menu: Stat > Regression > Best Subsets.

Best Subsets Regression: Brake Sales
Note that Year has been removed from the model.

Best Subsets Regression: Sales versus Mktg$, Sales Rep, ...
Response is Sales
Predictor columns: Mktg$, Sales Rep, LY(Sales Rep), Product

Vars  R-Sq  R-Sq(adj)    C-p        S  Predictors included
   1  79.0       77.8  156.0   12.841  X
   1  20.9       16.6  631.3   24.910  X
   2  90.1       89.0   66.8   9.0570  X X
   2  85.2       83.5  107.0   11.084  X X
   3  98.2       97.8    3.0   4.0222  X X X
   3  90.5       88.7   65.8   9.1570  X X X
   4  98.2       97.7    5.0   4.1540  X X X X
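The best-subsets step can be sketched as fitting an ordinary least-squares model for every combination of the four remaining predictors and comparing R-Sq. This is a plain-Python illustration under stated assumptions, not Minitab's algorithm: it solves the normal equations with a small hand-written Gaussian elimination (`solve` and `fit_ols` are hypothetical helper names), uses the Brake Sales values transcribed from the data table, and reports only R-Sq (no C-p or S).

```python
from itertools import combinations

# Brake Sales data with Year already dropped, transcribed from the data table.
predictors = {
    "Mktg$": [9.6, 10.3, 10.2, 10.4, 10.6, 10.7, 10.5, 10.9, 11.0, 11.1,
              11.2, 11.2, 11.4, 11.2, 11.6, 11.7, 11.8, 11.8, 11.8, 12.1],
    "Sales Rep": [30, 20, 15, 25, 30, 15, 25, 35, 40, 20,
                  25, 35, 5, 12, 16, 21, 22, 24, 26, 28],
    "LY(Sales Rep)": [20, 30, 20, 15, 25, 30, 15, 25, 35, 40,
                      20, 25, 35, 5, 12, 16, 21, 22, 24, 26],
    "Product": [18, 17, 19, 22, 24, 18, 17, 16, 14, 18,
                22, 25, 27, 28, 18, 16, 15, 16, 10, 18],
}
sales = [130, 157, 129, 129, 162, 154, 132, 172, 207, 204,
         144, 175, 167, 97, 122, 139, 153, 156, 172, 178]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ols(columns, y):
    """Least-squares fit with an intercept; returns (coefficients, R-squared)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coefs = solve(xtx, xty)
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_error = sum((yi - sum(c * v for c, v in zip(coefs, r))) ** 2
                   for r, yi in zip(rows, y))
    return coefs, 1 - ss_error / ss_total

# Fit every subset of predictors and record (size, R-Sq, names, coefficients).
results = []
for size in range(1, len(predictors) + 1):
    for names in combinations(predictors, size):
        coefs, r_sq = fit_ols([predictors[name] for name in names], sales)
        results.append((size, r_sq, names, coefs))

for size, r_sq, names, _ in sorted(results, key=lambda t: (t[0], -t[1])):
    print(f"{size}  R-Sq = {r_sq:6.1%}  {', '.join(names)}")
```

Adding a predictor can never lower R-Sq, so the comparison that matters is how little is gained by going from the best three-variable model to all four; that is the argument for the reduced model.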
Regression: Brake Sales
- Select all four predictor variables and the response variable.
- Use the Minitab menu: Stat > Regression > Regression.

Regression Analysis: Brake Sales
- Ho: no relationship between variables
- Ha: some relationship exists between variables

Regression Analysis: Sales versus Mktg$, Sales Rep, ...

The regression equation is
Sales = -66.6 + 11.8 Mktg$ + 1.18 Sales Rep + 2.70 LY(SalesRep) - 0.007 Product

Predictor      Coef  SE Coef      T      P  Decision
Constant     -66.64    19.17  -3.48  0.003
Mktg$        11.838    1.494   7.92  0.000  Ha
Sales Re     1.1751   0.1224   9.60  0.000  Ha
LY(Sales     2.7023   0.1154  23.42  0.000  Ha
Product     -0.0068   0.2337  -0.03  0.977  Ho

S = 4.154   R-Sq = 98.2%   R-Sq(adj) = 97.7%

Regression / Reduced Model: Brake Sales
- Select the three remaining predictor variables and the response variable.
- Use the Minitab menu: Stat > Regression > Regression.
- Remember to check your residual plots.

Regression Analysis: Brake Sales
- Ho: no relationship between variables
- Ha: some relationship exists between variables

Regression Analysis: Sales versus Mktg$, Sales Rep, LY(SalesRep)

The regression equation is
Sales = -66.9 + 11.8 Mktg$ + 1.18 Sales Rep + 2.70 LY(SalesRep)

Predictor      Coef  SE Coef      T      P  Decision
Constant     -66.91    16.22  -4.12  0.001
Mktg$        11.847    1.414   8.38  0.000  Ha
Sales Re     1.1764   0.1106  10.64  0.000  Ha
LY(Sales     2.7027   0.1106  24.44  0.000  Ha

S = 4.022   R-Sq = 98.2%   R-Sq(adj) = 97.8%

The Rest of the Minitab Output: Brake Sales

Analysis of Variance
Source           DF       SS      MS       F      P
Regression        3  13870.1  4623.4  285.78  0.000
Residual Error   16    258.8    16.2
Total            19  14128.9

Source     DF  Seq SS
Mktg$       1   893.9
Sales Re    1  3313.2
LY(Sales    1  9663.0

Unusual Observations
Obs  Mktg$    Sales      Fit  SE Fit  Residual  St Resid
 10   11.1  204.000  196.236   2.161     7.764     2.29R
R denotes an observation with a large standardized residual

Brake Sales: R-Sq (Adjusted)
R-Sq(Adj) = 97.8% of the variation in Y is explained by the three factors included in the r