Simple Linear Regression (Chapter 12)

Objectives
In this chapter, you learn:
- How to use regression analysis to predict the value of a dependent variable based on the value of an independent variable
- To understand the meaning of the regression coefficients b0 and b1
- To evaluate the assumptions of regression analysis and know what to do if the assumptions are violated
- To make inferences about the slope and correlation coefficient
- To estimate mean values and predict individual values

Correlation vs. Regression
- A scatter plot can be used to show the relationship between two variables
- Correlation analysis is used to measure the strength of the association (linear relationship) between two variables
  - Correlation is only concerned with the strength of the relationship
  - No causal effect is implied by correlation
- Scatter plots were first presented in Ch. 2; correlation was first presented in Ch. 3

Types of Relationships
[Figure: scatter plots illustrating linear and curvilinear relationships]

Types of Relationships (continued)
[Figure: scatter plots illustrating strong and weak relationships]

Types of Relationships (continued)
[Figure: scatter plots illustrating no relationship]

Introduction to Regression Analysis
Regression analysis is used to:
- Predict the value of a dependent variable based on the value of at least one independent variable
- Explain the impact of changes in an independent variable on the dependent variable
Dependent variable: the variable we wish to predict or explain
Independent variable: the variable used to predict or explain the dependent variable
Simple Linear Regression Model
- Only one independent variable, X
- The relationship between X and Y is described by a linear function
- Changes in Y are assumed to be related to changes in X

Population model:
  Yi = β0 + β1 Xi + εi
where β0 = population Y intercept, β1 = population slope coefficient, εi = random error term, Yi = dependent variable, and Xi = independent variable. β0 + β1 Xi is the linear component; εi is the random error component.

Simple Linear Regression Model (continued)
[Figure: the population regression line with intercept β0 and slope β1, showing the observed value of Y for Xi, the predicted value of Y for Xi, and the random error εi for that Xi value]

Simple Linear Regression Equation (Prediction Line)
  Ŷi = b0 + b1 Xi
The simple linear regression equation provides an estimate of the population regression line: b0 is the estimate of the regression intercept, b1 is the estimate of the regression slope, Ŷi is the estimated (or predicted) Y value for observation i, and Xi is the value of X for observation i.

The Least Squares Method
b0 and b1 are obtained by finding the values that minimize the sum of the squared differences between Yi and Ŷi:
  min Σ(Yi − Ŷi)² = min Σ(Yi − (b0 + b1 Xi))²

Finding the Least Squares Equation
- The coefficients b0 and b1, and other regression results in this chapter, will be found using Excel or Minitab
- Formulas are shown in the text for those who are interested
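The slides leave this computation to Excel or Minitab; as a rough illustration only, here is a minimal Python sketch of the standard closed-form least-squares formulas (b1 as the ratio of the cross-deviation sum to the squared X-deviation sum, and b0 from the sample means). The function name least_squares is illustrative, not something from the text.

```python
# Minimal sketch of the least-squares formulas (the slides defer these to the text
# and use Excel/Minitab instead). Pure Python, no libraries assumed.

def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared differences Yi - (b0 + b1*Xi)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # slope: sum of cross-deviations divided by sum of squared X deviations
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
         sum((xi - x_bar) ** 2 for xi in x)
    b0 = y_bar - b1 * x_bar   # the fitted line passes through (x_bar, y_bar)
    return b0, b1
```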
Interpretation of the Slope and the Intercept
- b0 is the estimated mean value of Y when the value of X is zero
- b1 is the estimated change in the mean value of Y as a result of a one-unit increase in X

Simple Linear Regression Example
- A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
- A random sample of 10 houses is selected
- Dependent variable (Y) = house price in $1000s
- Independent variable (X) = square feet

Simple Linear Regression Example: Data

  House Price in $1000s (Y)   Square Feet (X)
  245                         1400
  312                         1600
  279                         1700
  308                         1875
  199                         1100
  219                         1550
  405                         2350
  324                         2450
  319                         1425
  255                         1700

Simple Linear Regression Example: Scatter Plot
House price model: scatter plot
[Figure: scatter plot of house price ($1000s) versus square feet for the 10 houses]

Simple Linear Regression Example: Using Excel Data Analysis Function
1. Choose Data
2. Choose Data Analysis
3. Choose Regression
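For readers working outside Excel, Minitab, or PHStat, here is a minimal sketch of the same scatter plot in Python, assuming the matplotlib package is available (the slides themselves do not use Python).

```python
import matplotlib.pyplot as plt

# House-price data from the slide (price in $1000s, size in square feet)
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

# Scatter plot of Y (price) versus X (square feet)
plt.scatter(square_feet, price)
plt.xlabel("Square Feet (X)")
plt.ylabel("House Price in $1000s (Y)")
plt.title("House price model: scatter plot")
plt.show()
```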
Simple Linear Regression Example: Using Excel Data Analysis Function (continued)
Enter the Y range and X range and the desired options

Simple Linear Regression Example: Using PHStat
Add-Ins: PHStat: Regression: Simple Linear Regression

Simple Linear Regression Example: Excel Output

  Regression Statistics
  Multiple R           0.76211
  R Square             0.58082
  Adjusted R Square    0.52842
  Standard Error       41.33032
  Observations         10

  ANOVA        df   SS          MS          F         Significance F
  Regression    1   18934.9348  18934.9348  11.0848   0.01039
  Residual      8   13665.5652   1708.1957
  Total         9   32600.5000

               Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
  Intercept    98.24833      58.03348        1.69296  0.12892  -35.57720  232.07386
  Square Feet   0.10977       0.03297        3.32938  0.01039    0.03374    0.18580

The regression equation is:
  house price = 98.24833 + 0.10977 (square feet)

Simple Linear Regression Example: Minitab Output

  The regression equation is
  Price = 98.2 + 0.110 Square Feet

  Predictor     Coef      SE Coef   T      P
  Constant      98.25     58.03     1.69   0.129
  Square Feet   0.10977   0.03297   3.33   0.010

  S = 41.3303   R-Sq = 58.1%   R-Sq(adj) = 52.8%

  Analysis of Variance
  Source           DF   SS      MS      F       P
  Regression        1   18935   18935   11.08   0.010
  Residual Error    8   13666    1708
  Total             9   32600

The regression equation is:
  house price = 98.24833 + 0.10977 (square feet)
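A Python alternative that reproduces this output is sketched below, assuming the statsmodels package is available; the slides themselves rely on Excel's Data Analysis tool, PHStat, or Minitab.

```python
# Sketch of obtaining comparable regression output in Python (assumes statsmodels is installed).
import statsmodels.api as sm

square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

X = sm.add_constant(square_feet)      # adds the intercept column
model = sm.OLS(price, X).fit()        # ordinary least squares fit

print(model.summary())                # coefficients, standard errors, t stats, R-square, ANOVA
# model.params -> approximately [98.24833, 0.10977], matching the Excel/Minitab output
```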
Simple Linear Regression Example: Graphical Representation
House price model: scatter plot and prediction line
[Figure: scatter plot with the fitted line; slope = 0.10977, intercept = 98.248]
  house price = 98.24833 + 0.10977 (square feet)

Simple Linear Regression Example: Interpretation of b0
  house price = 98.24833 + 0.10977 (square feet)
- b0 is the estimated mean value of Y when the value of X is zero (if X = 0 is in the range of observed X values)
- Because a house cannot have a square footage of 0, b0 has no practical application here

Simple Linear Regression Example: Interpreting b1
  house price = 98.24833 + 0.10977 (square feet)
- b1 estimates the change in the mean value of Y as a result of a one-unit increase in X
- Here, b1 = 0.10977 tells us that the mean value of a house increases by 0.10977 ($1000) = $109.77, on average, for each additional square foot of size

Simple Linear Regression Example: Making Predictions
Predict the price for a house with 2000 square feet:
  house price = 98.25 + 0.1098 (sq. ft.)
              = 98.25 + 0.1098 (2000)
              = 317.85
The predicted price for a house with 2000 square feet is 317.85 ($1000s) = $317,850

Simple Linear Regression Example: Making Predictions (continued)
- When using a regression model for prediction, predict only within the relevant range of the data
- The relevant range is the range of observed Xs (interpolation); do not try to extrapolate beyond the range of observed Xs
[Figure: scatter plot marking the relevant range for interpolation]
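A small Python sketch of this prediction, with an illustrative guard for the relevant range; the predict_price helper is hypothetical, and the coefficients are the values reported in the output above.

```python
# Prediction from the fitted line, warning when the requested X lies outside the
# observed range of the data (extrapolation). Coefficients from the slide's output.
b0, b1 = 98.24833, 0.10977
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

def predict_price(sqft):
    if not (min(square_feet) <= sqft <= max(square_feet)):
        print("Warning: %d sq. ft. is outside the observed range; this is extrapolation." % sqft)
    return b0 + b1 * sqft

print(predict_price(2000))   # about 317.79; the slide's rounded coefficients give 317.85 ($1000s)
```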
Measures of Variation
Total variation is made up of two parts:
  SST = SSR + SSE
  (Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares)
  SST = Σ(Yi − Ȳ)²
  SSR = Σ(Ŷi − Ȳ)²
  SSE = Σ(Yi − Ŷi)²
where Ȳ = mean value of the dependent variable, Yi = observed value of the dependent variable, and Ŷi = predicted value of Y for the given Xi value.

Measures of Variation (continued)
- SST = total sum of squares (total variation): measures the variation of the Yi values around their mean Ȳ
- SSR = regression sum of squares (explained variation): variation attributable to the relationship between X and Y
- SSE = error sum of squares (unexplained variation): variation in Y attributable to factors other than X

Measures of Variation (continued)
[Figure: for one observation (Xi, Yi), the deviations behind SST = Σ(Yi − Ȳ)², SSE = Σ(Yi − Ŷi)², and SSR = Σ(Ŷi − Ȳ)², shown relative to the regression line and Ȳ]

Coefficient of Determination, r²
- The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
- The coefficient of determination is also called r-square and is denoted as r²
  r² = SSR / SST = regression sum of squares / total sum of squares
  note: 0 ≤ r² ≤ 1

Examples of Approximate r² Values
r² = 1: perfect linear relationship between X and Y; 100% of the variation in Y is explained by variation in X
[Figure: scatter plots with all points exactly on a line, r² = 1]
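A minimal Python sketch of these sums of squares and r² for the house-price data, using the coefficients reported in the earlier output (pure Python, no libraries assumed).

```python
# Compute SST, SSR, SSE, and r-square for the 10-house sample.
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
b0, b1 = 98.24833, 0.10977

y_bar = sum(price) / len(price)
y_hat = [b0 + b1 * x for x in square_feet]               # predicted values

sst = sum((y - y_bar) ** 2 for y in price)               # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained variation
sse = sum((y - yh) ** 2 for y, yh in zip(price, y_hat))  # unexplained variation

r_sq = ssr / sst
print(sst, ssr, sse, r_sq)   # roughly 32600.5, 18934.9, 13665.6, 0.5808, as in the output
```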
Examples of Approximate r² Values (continued)
0 < r² < 1: weaker linear relationships between X and Y; some but not all of the variation in Y is explained by variation in X
[Figure: scatter plots with moderate scatter around a line]

Examples of Approximate r² Values (continued)
r² = 0: no linear relationship between X and Y; the value of Y does not depend on X (none of the variation in Y is explained by variation in X)
[Figure: scatter plot with a flat fitted line, r² = 0]

Simple Linear Regression Example: Coefficient of Determination, r², in Excel
[Excel regression output as shown earlier; R Square = 0.58082]
  r² = SSR / SST = 18934.9348 / 32600.5000 = 0.58082
58.08% of the variation in house prices is explained by variation in square feet

Simple Linear Regression Example: Coefficient of Determination, r², in Minitab
[Minitab regression output as shown earlier; R-Sq = 58.1%]
  r² = SSR / SST = 18934.9348 / 32600.5000 = 0.58082
58.08% of the variation in house prices is explained by variation in square feet
Standard Error of Estimate
The standard deviation of the variation of observations around the regression line is estimated by
  S_YX = √( SSE / (n − 2) ) = √( Σ(Yi − Ŷi)² / (n − 2) )
where SSE = error sum of squares and n = sample size.

Simple Linear Regression Example: Standard Error of Estimate in Excel
[Excel regression output as shown earlier; Standard Error = 41.33032]
  S_YX = 41.33032

Simple Linear Regression Example: Standard Error of Estimate in Minitab
[Minitab regression output as shown earlier; S = 41.3303]
  S_YX = 41.33032

Comparing Standard Errors
S_YX is a measure of the variation of observed Y values from the regression line
[Figure: two scatter plots with the same fitted line, one with small S_YX and one with large S_YX]
The magnitude of S_YX should always be judged relative to the size of the Y values in the sample data; i.e., S_YX = $41.33K is moderately small relative to house prices in the $200K-$400K range

Assumptions of Regression (L.I.N.E.)
- Linearity
  - The relationship between X and Y is linear
- Independence of Errors
  - Error values are statistically independent
  - Particularly important when data are collected over a period of time
- Normality of Error
  - Error values are normally distributed for any given value of X
- Equal Variance (also called homoscedasticity)
  - The probability distribution of the errors has constant variance
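The residual analysis that follows works with the residuals ei = Yi − Ŷi. As a quick illustration, here is a self-contained Python sketch that computes them along with the standard error of the estimate S_YX defined above, using the coefficients reported earlier.

```python
import math

# Residuals and S_YX = sqrt(SSE / (n - 2)) for the house-price data.
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
b0, b1 = 98.24833, 0.10977

y_hat = [b0 + b1 * x for x in square_feet]
residuals = [y - yh for y, yh in zip(price, y_hat)]
sse = sum(e ** 2 for e in residuals)

s_yx = math.sqrt(sse / (len(price) - 2))
print(s_yx)   # about 41.33, matching the Excel/Minitab output
```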
Residual Analysis
  ei = Yi − Ŷi
- The residual for observation i, ei, is the difference between its observed and predicted value
- Check the assumptions of regression by examining the residuals:
  - Examine for the linearity assumption
  - Evaluate the independence assumption
  - Evaluate the normal distribution assumption
  - Examine for constant variance for all levels of X (homoscedasticity)
- Graphical analysis of residuals: can plot residuals vs. Xi

Residual Analysis for Linearity
[Figure: residual plots vs. X; a curved pattern in the residuals indicates a non-linear relationship, a patternless band indicates linearity]

Residual Analysis for Independence
[Figure: residual plots vs. X; a cyclical pattern indicates the errors are not independent, no cyclical pattern indicates independence]

Checking for Normality
- Examine the stem-and-leaf display of the residuals
- Examine the boxplot of the residuals
- Examine the histogram of the residuals
- Construct a normal probability plot of the residuals

Residual Analysis for Normality
When using a normal probability plot, normal errors will approximately display in a straight line
[Figure: normal probability plot of percent versus residual]
Residual Analysis for Equal Variance
[Figure: residual plots vs. X; a fan-shaped spread indicates non-constant variance, an even band indicates constant variance]

Simple Linear Regression Example: Excel Residual Output

  RESIDUAL OUTPUT
  Observation   Predicted House Price   Residuals
  1             251.92316                -6.92316
  2             273.87671                38.12329
  3             284.85348                -5.85348
  4             304.06284                 3.93716
  5             218.99284               -19.99284
  6             268.38832               -49.38832
  7             356.20251                48.79749
  8             367.17929               -43.17929
  9             254.66736                64.33264
  10            284.85348               -29.85348

Does not appear to violate any regression assumptions

Simple Linear Regression Example: Minitab Residual Output
[Figure: Minitab residual plots for House Price (Y): normal probability plot, residuals versus fitted values, histogram of residuals, residuals versus observation order]
Does not appear to violate any regression assumptions
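As a rough Python counterpart to the Minitab residual plots, here is a sketch assuming matplotlib and scipy are available; the layout and panel titles are illustrative, not Minitab's.

```python
import matplotlib.pyplot as plt
from scipy import stats

# Data, fitted values, and residuals for the house-price example.
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
b0, b1 = 98.24833, 0.10977
y_hat = [b0 + b1 * x for x in square_feet]
residuals = [y - yh for y, yh in zip(price, y_hat)]

fig, axes = plt.subplots(2, 2, figsize=(9, 7))
stats.probplot(residuals, dist="norm", plot=axes[0, 0])      # normal probability plot
axes[0, 0].set_title("Normal Probability Plot")
axes[0, 1].scatter(y_hat, residuals)                         # residuals versus fitted values
axes[0, 1].axhline(0, linewidth=1)
axes[0, 1].set_title("Versus Fits")
axes[1, 0].hist(residuals)                                   # histogram of residuals
axes[1, 0].set_title("Histogram")
axes[1, 1].plot(range(1, len(residuals) + 1), residuals, marker="o")  # versus observation order
axes[1, 1].axhline(0, linewidth=1)
axes[1, 1].set_title("Versus Order")
plt.tight_layout()
plt.show()
```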
Measuring Autocorrelation: The Durbin-Watson Statistic
- Used when data are collected over time to detect whether autocorrelation is present
- Autocorrelation exists if residuals in one time period are related to residuals in another period

Autocorrelation
- Autocorrelation is correlation of the errors (residuals) over time
- It violates the regression assumption that residuals are random and independent
[Figure: time plot of residuals showing a cyclical, non-random pattern; cyclical patterns are a sign of positive autocorrelation]

The Durbin-Watson Statistic
The Durbin-Watson statistic is used to test for autocorrelation:
  H0: positive autocorrelation does not exist
  H1: positive autocorrelation is present
  D = Σ (i = 2 to n) (ei − e(i−1))² / Σ (i = 1 to n) ei²
- The possible range is 0 ≤ D ≤ 4
- D should be close to 2 if H0 is true
- D less than 2 may signal positive autocorrelation; D greater than 2 may signal negative autocorrelation

Testing for Positive Autocorrelation
  H0: positive autocorrelation does not exist
  H1: positive autocorrelation is present
- Calculate the Durbin-Watson test statistic D (the Durbin-Watson statistic can be found using Excel or Minitab)
- Find the values dL and dU from the Durbin-Watson table (for sample size n and number of independent variables k)
- Decision rule: reject H0 if D < dL; do not reject H0 if D > dU; the test is inconclusive if dL ≤ D ≤ dU

Testing for Positive Autocorrelation (continued)
Suppose we have the following time series data: is there autocorrelation?
[Figure: time series scatter plot with fitted trend line]

Testing for Positive Autocorrelation (continued)
Example with n = 25. Excel/PHStat output:

  Durbin-Watson Calculations
  Sum of Squared Difference of Residuals   3296.18
  Sum of Squared Residuals                 3279.98
  Durbin-Watson Statistic                  1.00494

  D = Σ(ei − e(i−1))² / Σ ei² = 3296.18 / 3279.98 = 1.00494

Testing for Positive Autocorrelation (continued)
- Here, n = 25 and there is k = 1 independent variable
- Using the Durbin-Watson table, dL = 1.29 and dU = 1.45
- D = 1.00494 < dL = 1.29, so reject H0 and conclude that significant positive autocorrelation exists
Decision: reject H0 since D = 1.00494 < dL = 1.29
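A minimal Python sketch of the Durbin-Watson formula shown above; the example figures quoted in the comment come from the slide's Excel/PHStat output.

```python
# Durbin-Watson statistic: D = sum_{i=2..n}(e_i - e_{i-1})^2 / sum(e_i^2),
# for residuals e ordered in time. Pure Python, no libraries assumed.

def durbin_watson(e):
    num = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
    den = sum(ei ** 2 for ei in e)
    return num / den

# For the slide's time-series example, the numerator is 3296.18 and the denominator
# 3279.98, giving D = 1.00494; since D < dL = 1.29 (n = 25, k = 1), H0 is rejected
# and positive autocorrelation is concluded.
```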
Inferences About the Slope
The standard error of the regression slope coefficient (b1) is estimated by
  S_b1 = S_YX / √SSX = S_YX / √( Σ(Xi − X̄)² )
where S_b1 = estimate of the standard error of the slope and S_YX = √( SSE / (n − 2) ) = standard error of the estimate.

Inferences About the Slope: t Test
- t test for a population slope: is there a linear relationship between X and Y?
- Null and alternative hypotheses:
  H0: β1 = 0 (no linear relationship)
  H1: β1 ≠ 0 (linear relationship does exist)
- Test statistic:
  t_STAT = (b1 − β1) / S_b1,   d.f. = n − 2
where b1 = regression slope coefficient, β1 = hypothesized slope, and S_b1 = standard error of the slope.

Inferences About the Slope: t Test Example
(using the 10-house sample of house price in $1000s (Y) and square feet (X) shown earlier)
Estimated regression equation:
  house price = 98.25 + 0.1098 (sq. ft.)
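Although the slides compute this test with Excel or Minitab, a self-contained Python sketch of t_STAT for the house-price example follows (scipy is assumed only for the p-value).

```python
import math
from scipy import stats   # assumed available; the slides read the p-value from Excel/Minitab

square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
b0, b1 = 98.24833, 0.10977

residuals = [y - (b0 + b1 * x) for x, y in zip(square_feet, price)]
n = len(price)
s_yx = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))           # standard error of the estimate

x_bar = sum(square_feet) / n
s_b1 = s_yx / math.sqrt(sum((x - x_bar) ** 2 for x in square_feet))  # standard error of the slope

t_stat = (b1 - 0) / s_b1                  # testing H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), n - 2)
print(t_stat, p_value)                    # about 3.33 and 0.010, matching the earlier output
```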