STATA Learning Series
Regress, continued: regression diagnostics

1. Census data: hands-on processing (model analysis)
2. Auto data: regression diagnostics (graphical methods)
3. Exdata data: a practical application

Basic data conversion between Excel and Stata

1. Importing Excel data into Stata
   Step 1: save the Excel file as a tab-delimited .txt file.
   Step 2: run the command: insheet using d:stata/name.txt
2. Exporting Stata data so that Excel can open it
   Step 1: outsheet using d:/stataname .out (the location where the file is written)
   Step 2: open the .out file in Excel.

1. Census data: hands-on processing

. use d:/stata/census

Data description:

. describe
Contains data from d:stata/census.dta
  obs:            50                          1980 Census data by state
 vars:            12                          6 Jul 2000 17:06
 size:         3,000 (99.4% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
state           str14  %-14s                  State
region          int    %-8.0g      cenreg     Census region
pop             long   %12.0gc                Population
poplt5          long   %12.0gc                Pop, [...]

Model analysis. The first model regresses drate on medage, medagesq, and pcturban:

[...]
                                                       Prob > F      =  0.0000
    Residual |  .000027249    46  5.9236e-07           R-squared     =  0.6724
-------------+------------------------------           Adj R-squared =  0.6510
       Total |  .000083179    49  1.6975e-06           Root MSE      =  .00077

------------------------------------------------------------------------------
       drate |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      medage |   .0004851    .001207     0.40   0.690    -.0019446    .0029147
    medagesq |   2.37e-06   .0000206     0.12   0.909     -.000039    .0000437
    pcturban |  -.0035348   .0008293    -4.26   0.000    -.0052042   -.0018655
       _cons |   -.005598   .0178979    -0.31   0.756    -.0416246    .0304286
------------------------------------------------------------------------------

Note the coefficients on medage and medagesq: neither is individually significant (p = 0.690 and p = 0.909), yet the joint test rejects the hypothesis that both are zero.

. test medage medagesq
 ( 1)  medage = 0.0
 ( 2)  medagesq = 0.0
       F(  2,    46) =   44.03
            Prob > F =    0.0000

. test medage=2*medagesq
 ( 1)  medage - 2.0 medagesq = 0.0
       F(  1,    46) =    0.15
            Prob > F =    0.7021

. test medage=200*medagesq
 ( 1)  medage - 200.0 medagesq = 0.0
       F(  1,    46) =    0.00
            Prob > F =    0.9982

. vce
             |   medage  medagesq  pcturban     _cons
-------------+---------------------------------------
      medage |  1.5e-06
    medagesq | -2.5e-08   4.2e-10
    pcturban |  3.2e-07  -5.7e-09   6.9e-07
       _cons | -.000022   3.7e-07  -5.0e-06    .00032

. vce, rho
             |   medage  medagesq  pcturban     _cons
-------------+---------------------------------------
      medage |   1.0000
    medagesq |  -0.9985    1.0000
    pcturban |   0.3235   -0.3352    1.0000
       _cons |  -0.9984    0.9942   -0.3385    1.0000

The estimates for medage and medagesq are almost perfectly correlated (-0.9985), so the model is refitted without medagesq:

. regress drate medage pcturban
      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  2,    47) =   48.22
       Model |  .000055922     2  .000027961           Prob > F      =  0.0000
    Residual |  .000027256    47  5.7993e-07           R-squared     =  0.6723
-------------+------------------------------           Adj R-squared =  0.6584
       Total |  .000083179    49  1.6975e-06           Root MSE      =  .00076

------------------------------------------------------------------------------
       drate |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      medage |   .0006238   .0000658     9.48   0.000     .0004915    .0007562
    pcturban |  -.0035028   .0007731    -4.53   0.000    -.0050581   -.0019476
       _cons |  -.0076466   .0019034    -4.02   0.000    -.0114756   -.0038175
------------------------------------------------------------------------------

Fitted values from the regression model:

. predict dhat
(option xb assumed; fitted values)

. summarize drate dhat
    Variable |     Obs        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------
       drate |      50     .008436   .0013029   .0039915   .0106902
        dhat |      50     .008436   .0010683   .0044936   .0110485

Influence analysis:

. predict influs, cooksd
(cooksd measures how strongly each observation affects the regression coefficients.)

. summarize influs, detail
                           Cooks D
-------------------------------------------------------------
      Percentiles      Smallest
 1%     1.35e-08       1.35e-08
 5%     6.25e-06       4.54e-06
10%     .0000502       6.25e-06       Obs                  50
25%     .0010358       .0000109       Sum of Wgt.          50

50%     .0043872                      Mean           .0639731
                        Largest       Std. Dev.      .2560158
75%     .0200719       .1914291
90%     .0610564       .3090287       Variance       .0655441
95%     .3090287       .5059252       Skewness       5.857965
99%     1.735909       1.735909       Kurtosis       38.08436

Observations above the cutoff 4/n (here 4/50) are listed:

. list state if influ > 4/50
        state
  2.   Alaska
  9.   Florida
 11.   Hawaii
 44.   Utah

. lvr2plot, s(state) trim(12) border

[Figure: leverage against normalized residual squared, each point labeled with its state name]
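The influence check above can be sketched in current Stata syntax. This is a minimal sketch under stated assumptions, not the deck's own code: it assumes that the census data shipped with Stata (sysuse census) matches the file used here, and that drate, pcturban, and medagesq were built as death/pop, popurban/pop, and medage squared. The deck does not show how they were built. The old options s(state), trim(12), and border are replaced by mlabel(state), which is also an assumption about the intended picture.

```stata
* Assumption: sysuse census stands in for d:/stata/census
sysuse census, clear

* Assumption: these definitions are not shown in the deck
generate drate    = death/pop
generate pcturban = popurban/pop
generate medagesq = medage^2

regress drate medage medagesq pcturban

* Cook's distance for every observation
predict influs, cooksd

* Rule of thumb used in the text: flag observations with D > 4/n
list state influs if influs > 4/e(N) & !missing(influs)

* Leverage against normalized squared residuals, labeled by state
lvr2plot, mlabel(state)
```

Writing the cutoff as 4/e(N) rather than 4/50 keeps it correct if the estimation sample changes.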
. regress drate medage medagesq pcturban if influs [...]
[...]
                                                       Prob > F      =  0.0000
    Residual |  .000024651    45  5.4780e-07           R-squared     =  0.6698
-------------+------------------------------           Adj R-squared =  0.6478
       Total |  .000074657    48  1.5553e-06           Root MSE      =  .00074

------------------------------------------------------------------------------
       drate |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      medage |   .0028685   .0015954     1.80   0.079    -.0003448    .0060817
    medagesq |  -.0000364   .0000266    -1.37   0.178    -.0000899    .0000172
    pcturban |  -.0037377   .0008029    -4.66   0.000    -.0053549   -.0021205
       _cons |  -.0420036    .023994    -1.75   0.087    -.0903301    .0063229
------------------------------------------------------------------------------

2. Auto data: regression diagnostics (graphical analysis)

Three key issues in identifying model sensitivity to individual observations:
1. Residual.
2. Leverage: a point may have a small residual, yet deleting it would change the estimates markedly. Such a point is said to have high leverage.
3. Influence: we might ask which points in our data have a large effect on our estimated a, b, etc.

. use d:/stataauto

. describe
Contains data from d:/stataauto.dta
  obs:            74                          1978 Automobile Data
 vars:            12                          7 Jul 2000 13:51
 size:         3,478 (99.4% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
rep78           int    %8.0g                  Repair Record 1978
headroom        float  %6.1f                  Headroom (in.)
trunk           int    %8.0g                  Trunk space (cu. ft.)
weight          int    %8.0gc                 Weight (lbs.)
length          int    %8.0g                  Length (in.)
turn            int    %8.0g                  Turn Circle (ft.)
displacement    int    %8.0g                  Displacement (cu. in.)
gear_ratio      float  %6.2f                  Gear Ratio
foreign         byte   %8.0g       origin     Car type
-------------------------------------------------------------------------------
Sorted by:  foreign

Purpose of the analysis: the relationship between car price (price) and mileage (mpg), weight (weight), origin (foreign), and the interaction of origin and mileage (forxmpg).

. gen forxmpg= foreign* mpg

. regress price weight mpg forxmpg foreign
      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  4,    69) =   21.22
       Model |   350319665     4  87579916.3           Prob > F      =  0.0000
    Residual |   284745731    69  4126749.72           R-squared     =  0.5516
-------------+------------------------------           Adj R-squared =  0.5256
       Total |   635065396    73  8699525.97           Root MSE      =  2031.4

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   4.613589   .7254961     6.36   0.000     3.166264    6.060914
         mpg |   263.1875   110.7961     2.38   0.020     42.15527    484.2197
     forxmpg |  -307.2166   108.5307    -2.83   0.006    -523.7294   -90.70369
     foreign |   11240.33   2751.681     4.08   0.000     5750.878    16729.78
       _cons |  -14449.58    4425.72    -3.26   0.002    -23278.65    -5620.51
------------------------------------------------------------------------------

. vce, rho
             |   weight       mpg   forxmpg   foreign     _cons
-------------+-------------------------------------------------
      weight |   1.0000
         mpg |   0.8408    1.0000
     forxmpg |  -0.5594   -0.7695    1.0000
     foreign |   0.6431    0.7747   -0.9715    1.0000
       _cons |  -0.9611   -0.9536    0.6861   -0.7407    1.0000

rvfplot graphs a residual-versus-fitted plot, a graph of the residuals versus the fitted values.

. rvfplot, border yline(0)

[Figure: residuals against fitted values]
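For readers on a current Stata release, the same steps might look like the sketch below. It assumes that the auto data shipped with Stata (sysuse auto) is the same 1978 automobile file. It also assumes that estat vce, correlation is an acceptable stand-in for vce, rho, and that the old-graphics border option can simply be dropped.

```stata
* Assumption: sysuse auto stands in for d:/stataauto
sysuse auto, clear

* Interaction of origin and mileage, as in the text
generate forxmpg = foreign*mpg

regress price weight mpg forxmpg foreign

* Correlation matrix of the coefficient estimates
estat vce, correlation

* Residuals against fitted values, with a reference line at zero
rvfplot, yline(0)
```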
Reading the plot:
1. price is linearly related to the regressors.
2. The spread of the residuals increases or decreases with the fitted values, which indicates heteroskedasticity: the increasing or decreasing variation in the residuals with fitted values.

Testing what the plot suggests

ovtest checks whether variables have been omitted from the model.

. ovtest
Ramsey RESET test using powers of the fitted values of price
       Ho:  model has no omitted variables
                  F(3, 66) =      7.77
                  Prob > F =      0.0002

The test rejects Ho, so the model has omitted variables.

. hettest
Cook-Weisberg test for heteroskedasticity using fitted values of price
       Ho:  Constant variance
                   chi2(1) =      6.50
               Prob > chi2 =      0.0108

The test rejects constant variance, so heteroskedasticity is present.

lvr2plot graphs a leverage-versus-squared-residual plot, a graph of leverage against the (normalized) residuals squared.

. lvr2plot, border

[Figure: leverage against normalized residual squared]

. lvr2plot, s(make) trim(12) border

[Figure: leverage against normalized residual squared, each point labeled with the car's make, trimmed to 12 characters]
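In current Stata the two tests are run as estat subcommands after regress. The sketch below assumes that sysuse auto matches the deck's data and that mlabel(make) is a reasonable replacement for the old s(make) trim(12) options. The output headers will read slightly differently from the older output quoted above.

```stata
* Assumption: sysuse auto stands in for the deck's auto file
sysuse auto, clear
generate forxmpg = foreign*mpg
quietly regress price weight mpg forxmpg foreign

* Ramsey RESET test for omitted variables
estat ovtest

* Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
estat hettest

* Leverage against normalized squared residuals, labeled by make
lvr2plot, mlabel(make)
```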
Analysis: the VW Diesel is the only diesel-engined car in the data, and the record for the Plym. Arrow contains a data-entry error. (The plot can be used this way to check the data.)

avplot graphs an added-variable plot (a.k.a. partial-regression leverage plot, a.k.a. partial regression plot, a.k.a. adjusted partial residual plot) after regression.

An added-variable plot has three properties:
1. It is drawn for one Xi against Y, and every point still corresponds to an observation in the original data.
2. The slope of the line in the plot equals the coefficient on Xi in the regression model, and its standard error also matches the original regression.
3. The outlierness of each observation in determining the slope (how far the point lies from the fitted line) is preserved.

. avplot mpg, border

[Figure: added-variable plot of e( price | X ) against e( mpg | X ); coef = 263.18749, se = 110.79612, t = 2.38]

. avplot mpg, border s(make)

[Figure: the same added-variable plot, each point labeled with the car's make]
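The second property can be checked by building the two plotted quantities by hand. The following is a sketch under the assumption that sysuse auto matches the deck's data. The names e_price, e_mpg, and b_full are introduced here for illustration only. The slope of the final regression should reproduce the full-model coefficient on mpg, while its standard error agrees only up to a degrees-of-freedom adjustment.

```stata
* Assumption: sysuse auto stands in for the deck's auto file
sysuse auto, clear
generate forxmpg = foreign*mpg

* Full model: keep the coefficient on mpg for comparison
regress price weight mpg forxmpg foreign
scalar b_full = _b[mpg]

* Added-variable plot for mpg, labeled by make
avplot mpg, mlabel(make)

* Vertical axis: residuals of price on the other regressors
quietly regress price weight forxmpg foreign
predict double e_price, residuals

* Horizontal axis: residuals of mpg on the other regressors
quietly regress mpg weight forxmpg foreign
predict double e_mpg, residuals

* The slope here is the line drawn in the added-variable plot
regress e_price e_mpg
display "full-model coefficient on mpg: " b_full
display "slope of e_price on e_mpg:     " _b[e_mpg]
```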
Interpretation: three points stand out, the Cadillac Eldorado, the Lincoln Versailles, and the Cadillac Seville. These three cars account for 100% of the luxury-car market, which suggests that the original model is misspecified. The Plymouth Arrow in the lower-right corner is, as noted earlier, a data-entry error.

avplots graphs all the added-variable plots in a single image. Use it to view the relationship between Y and each Xi in one figure, to analyze the regression model further, and to check the raw data.

[Figure: four added-variable plots of e( price | X ), one per regressor:
   e( weight | X ):   coef = 4.6135886,  se = .7254961,  t = 6.36
   e( mpg | X ):      coef = 263.18749,  se = 110.79612, t = 2.38
   e( forxmpg | X ):  coef = -307.21656, se = 108.53072, t = -2.83
   e( foreign | X ):  coef = 11240.331,  se = 2751.6808, t = 4.08]

avplot and avplots are well suited to finding outliers, but they cannot be used to study the functional relationship between variables. cprplot (component-plus-residual plot) cannot be used to find outliers, but it can be used to check the functional form of the estimated model (straight line or curve?). What the two have in common: the slope of the line in each plot is the coefficient from the model.

Respecifying the model:

. regress price mpg weight
      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =   14.74
       Model |   186321280     2  93160639.9           Prob > F      =  0.0000
    Residual |   448744116    71  6320339.67           R-squared     =  0.2934
-------------+------------------------------           Adj R-squared =  0.2735
       Total |   635065396    73  8699525.97           Root MSE      =  2514.0

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -49.51222   86.15604    -0.57   0.567    -221.3025     122.278
      weight |   1.746559   .6413538     2.72   0.008      .467736    3.025382
       _cons |   1946.069    3597.05     0.54   0.590    -5226.244    9118.382
------------------------------------------------------------------------------

. cprplot mpg, border c(s) bands(13)

[Figure: component-plus-residual plot, e( price | X,mpg ) + b*mpg against Mileage (mpg), showing the residuals and the linear prediction]

acprplot (augmented component-plus-residual plot) is more sensitive to nonlinearity.

. acprplot mpg, border c(s) bands(13)

[Figure: augmented component plus residual against Mileage (mpg), showing the residuals and the linear prediction]

The next step is to examine whether the effect of mpg on price is linear.
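The two functional-form plots can be drawn in current Stata roughly as follows. This is a sketch that assumes sysuse auto matches the deck's data. It also assumes that mspline msopts(bands(13)) is the modern counterpart of the old c(s) bands(13) options; the border option is dropped.

```stata
* Assumption: sysuse auto stands in for the deck's auto file
sysuse auto, clear
regress price mpg weight

* Component-plus-residual plot with a median spline through 13 bands
cprplot mpg, mspline msopts(bands(13))

* Augmented version, more sensitive to curvature in mpg
acprplot mpg, mspline msopts(bands(13))
```

If the spline bends away from the straight line of the linear prediction, that is a hint that mpg should not enter the model linearly.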
Adding a new variable to the model, mpgsq = mpg*mpg, and refitting the regression gives:

. gen mpgsq= mpg* mpg

. regress price mpg mpgsq weight
      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  3,    70) =   12.70
       Model |   223815416     3  74605138.6           Prob > F      =  0.0000
    Residual |   411249980    70  5874999.72           R-squared     =  0.3524
-------------+------------------------------           Adj R-squared =  0.3247
       Total |   635065396    73  8699525.97           Root MSE      =  2423.8

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -981.0308   377.9748    -2.60   0.011    -1734.878   -227.1838
       mpgsq |   17.32961   6.859794     2.53   0.014     3.648184    31.01104
      weight |   .8344929   .7160289     1.17   0.248    -.5935816    2.262567
       _cons |   16106.35   6591.341     2.44   0.017     2960.333    29252.36
------------------------------------------------------------------------------

Comparing the two tables (price on mpg and weight, then with mpgsq added):
1. The t statistic on mpgsq is 2.53, and the t statistic on mpg changes from -0.57 to -2.60.
2. The coefficient on weight in the second model (0.83) is roughly half its value in the first model (1.75), and it is no longer significant (p = 0.248).
This shows that the effect of mpg on price is not linear.

rvpplot graphs a residual-versus-predictor plot. If the model is correctly specified, the points should be spread evenly and show no increasing or decreasing trend.

. rvpplot mpg, border yline(0)

[Figure: residuals e( price | X,mpg ) against Mileage (mpg)]

Analysis: the residuals decrease as mpg increases, which indicates a problem with the model.
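As an alternative to generating mpgsq by hand, current Stata can build the squared term with factor-variable notation. The sketch below assumes that sysuse auto matches the deck's data. The term c.mpg#c.mpg plays the role of mpgsq, and the test on it is a formal counterpart to comparing the two tables.

```stata
* Assumption: sysuse auto stands in for the deck's auto file
sysuse auto, clear

* c.mpg##c.mpg expands to mpg and mpg squared
regress price c.mpg##c.mpg weight

* Is the squared term needed? Ho: its coefficient is zero
test c.mpg#c.mpg

* Residuals against the predictor, with a reference line at zero
rvpplot mpg, yline(0)
```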
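3. Exdata data: a practical application

Hypothesis 1: the effect of R&D investment differs markedly between the categories.
Hypothesis 2: for the low-technology category, the effect of R&D investment increases throughout.
Hypothesis 3: for the high-technology category, the effect of R&D investment has no single increasing or decreasing trend. It decreases in the early periods of the experiment and increases in the later periods.

Hypothesis 1: the effect of R&D investment differs markedly between the categories.

Data used: insheet using d:/statahvsl.txt

. insheet using d:/statahvsl.txt
(4 vars, 40 obs)

. describe
Contains data
  obs:            40
 vars:             4
 size:           680 (99.7% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
experimentid    str10  %10s                   ExperimentID
period          byte   %8.0g                  Period
rdoutcomel      byte   %8.0g                  R&d outcomel
rdoutcomeh      byte   %8.0g                  R&d outcomeh
-------------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved

. ttest rdoutcomeh =rdoutcomel
Paired t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
 rdoutch |      40         3.5    .2454718      1.5525    3.003486    3.996514
 rdoutcl |      40        12.6    .7602294    4.808113    11.06229    14.13771
---------+--------------------------------------------------------------------
    diff |      40        -9.1    .7661091    4.845299    -10.6496   -7.550398
------------------------------------------------------------------------------
     Ho: mean(rdoutcomeh - rdoutcomel) = mean(diff) = 0

 Ha: mean(diff) < 0         Ha: mean(diff) ~= 0         Ha: mean(diff) > 0
      t = -11.8782               t = -11.8782               t = -11.8782
  P < t =   0.0000          P > |t| =   0.0000           P > t =   1.0000

Hypothesis 2: for the low-technology category, the effect of R&D investment increases throughout.

. insheet using d:/statateam2elow.txt
(8 vars, 60 obs)

. describe
Contains data
  obs:            60
 vars:             8
 size:         1,800 (99.4% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
experimentid    str10  %10s                   ExperimentID
participantid   byte   %8.0g                  ParticipantID
period          byte   %8.0g                  Period
quantity        byte   %8.0g                  Quantity
rdoutcome       byte   %8.0g                  R&d outcome
price           float  %9.0g                  Price
cost            float  %9.0g                  Cost
profit          float  %9.0g                  Profit
-------------------------------------------------------------------------------
Sorted by:
     Note:  dataset has changed since last saved

. regress rdoutcome quantity period
      Source |       SS       df       MS              Number of obs =      60
-------------+------------------------------           F(  2,    57) =   12.42
       Model |  96.9864714     2  48.4932357           Prob > F      =  0.0000
    Residual |  222.613529    57   3.9055005           R-squared     =  0.3035
-------------+------------------------------           Adj R-squared =  0.2790
       Total |      319.60    59  5.41694915           Root MSE      =  1.9762

------------------------------------------------------------------------------
   rdoutcome |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    quantity |    .261062   .0566845     4.61   0.000     .1475533    .3745708
      period |   .2021792   .0511249     3.95   0.000     .0998034    .3045549
       _cons |  -3.575317   1.710724    -2.09   0.041    -7.000983   -.1496508
------------------------------------------------------------------------------

. avplots, border

[Figure: two added-variable plots of e( rdoutcome | X ):
   e( quantity | X ):  coef = .26106203, se = .05668451, t = 4.61
   e( period | X ):    coef = .20217916, se = .05112486, t = 3.95]

. ovtest
Ramsey RESET test using powers of the fitted values of rdoutcome
       Ho:  model has no omitted variables
                  F(3, 54) =      0.47
                  Prob > F =      0.7058

The test does not reject Ho: there is no evidence of omitted variables.

. hettest
Cook-Weisberg test for heteroskedasticity using fitted values of rdoutcome
       Ho:  Constant variance
                   chi2(1) =      2.00
               Prob > chi2 =      0.1578

With Prob > chi2 = 0.1578, constant variance is not rejected at conventional significance levels: there is no evidence of heteroskedasticity.

From the analysis above, the conclusion is that Hypothesis 2 is supported.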
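The insheet and outsheet commands used in this series still run, but current Stata also offers import delimited and export delimited for the same job. The round trip below is a sketch under stated assumptions: it uses the auto data shipped with Stata in place of the experiment files, which are not available here, and the file name auto_demo.txt is hypothetical.

```stata
* Assumption: sysuse auto is used only as stand-in data
sysuse auto, clear

* Write a tab-delimited text file that Excel can open
* (auto_demo.txt is a hypothetical file name)
export delimited using "auto_demo.txt", delimiter(tab) replace

* Read the tab-delimited file back, taking variable names from row 1
import delimited using "auto_demo.txt", delimiters("\t") varnames(1) clear

describe
```

Value-labeled variables such as foreign are written out as their labels by default, so they come back as strings after the round trip.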