Lecture 5: Strategies for Handling Missing Values (course slides)


Outline of the problem

Missing values in longitudinal trials are a big issue.
The first aim should be to reduce their proportion.
Ethics dictate that they cannot be avoided entirely.
There is no magic method to fix the problem.
The magnitude of the problem varies across areas:
8-week depression trial: 25% to 50% may drop out by the final visit
12-week asthma trial: maybe only 5% to 10%

Outline of the lecture

Part I: Missing data
Part II: Multiple imputation
Example: The analgesic trial

Part I: Missing data

In real datasets, such as surveys and clinical trials, it is quite common to have observations with missing values for one or more input features. The first issue in dealing with the problem is determining whether the missing data mechanism has distorted the observed data. Little and Rubin (1987) and Rubin (1987) distinguish three basic missing data mechanisms. Data are said to be missing at random (MAR) if the mechanism resulting in their omission is independent of their (unobserved) value. If the omission is also independent of the observed values, the missingness process is said to be missing completely at random (MCAR). In any other case the process is missing not at random (MNAR), i.e., the missingness process depends on the unobserved values.

http://www.emea.europa.eu/pdfs/human/ewp/177699EN.pdf

1. Introduction to missing data

[Figure: data matrix with variables as columns and cases as rows; "?" = missing]

What is missing data?

The missingness hides a real value that is useful for analysis purposes.
Survey questions:
1. What is your total annual income for FY 2008?

2. Who are you voting for in the 2009 election for the European Parliament?

What is missing data?

Clinical trials:
[Figure: patient timelines running from Start to Finish along a time axis; some are censored at an intermediate point in time]

Missingness

It matters why data are missing. Suppose you are modelling weight (Y) as a function of sex (X). Some respondents wouldn't disclose their weight, so you are missing some values for Y. There are three possible mechanisms for the nondisclosure:
1. There may be no particular reason why some respondents told you their weights and others didn't. That is, the probability that Y is missing has no relationship to X or Y. In this case the data are missing completely at random.
2. One sex may be less likely to disclose its weight. That is, the probability that Y is missing depends only on the value of X. Such data are missing at random.
3. Heavy (or light) people may be less likely to disclose their weight. That is, the probability that Y is missing depends on the unobserved value of Y itself. Such data are not missing at random.

Missing data patterns & mechanisms

Pattern: Which values are missing?
Mechanism: Is missingness related to the response?
$Y = (Y_{ij})$ = data matrix, with COMPLETE DATA
$R = (R_{ij})$ = missing data indicator matrix, where $R_{ij} = 1$ if $Y_{ij}$ is missing and $R_{ij} = 0$ if $Y_{ij}$ is observed
$Y_i^{o}$ = observed part of Y
$Y_i^{m}$ = missing part of Y

Missing data patterns & mechanisms

"Pattern" concerns the distribution of R.
"Mechanism" concerns the distribution of R given Y.
Rubin (Biometrika 1976) distinguishes between:
Missing Completely at Random (MCAR): $P(R \mid Y) = P(R)$ for all $Y$
Missing at Random (MAR): $P(R \mid Y) = P(R \mid Y^{o})$ for all $Y^{m}$
Not Missing at Random (NMAR): $P(R \mid Y)$ depends on $Y^{m}$

Missing At Random (MAR)

What are the most general conditions under which a valid analysis can be done using only the observed data, and no information about the missing value mechanism? The answer is: when, given the observed data, the missingness mechanism does not depend on the unobserved data. Mathematically,

$$P(R \mid Y^{o}, Y^{m}) = P(R \mid Y^{o}).$$

This is termed Missing At Random, and is equivalent to saying that two units who share observed values have the same statistical behaviour on the other observations, whether observed or not.

Example

[Table of units and variables not recovered]
As units 1 and 2 have the same values where both are observed, given these observed values, under MAR, variables 3, 5 and 6 from unit 2 have the same distribution (NB not the same value!) as variables 3, 5 and 6 from unit 1.
Note that under MAR the probability of a value being missing will generally depend on observed values, so it does not correspond to the intuitive notion of "random". The important idea is that the missing value mechanism can be expressed solely in terms of observations that are observed. Unfortunately, this can rarely be definitively determined from the data at hand!

If data are MCAR or MAR, you can ignore the missing data mechanism and use multiple imputation and maximum likelihood.
If data are NMAR, you can't ignore the missing data mechanism; two approaches to NMAR data are selection models and pattern mixture.

Suppose Y is weight in pounds; if someone has a heavy weight, they may be less inclined to report it. So the value of Y affects whether Y is missing; the data are NMAR. Two possible approaches for such data are selection models and pattern mixture.
Selection models. In a selection model, you simultaneously model Y and the probability that Y is missing. Unfortunately, a number of practical difficulties are often encountered in estimating selection models.
Pattern mixture (Rubin 1987). When data are NMAR, an alternative to selection models is multiple imputation with pattern mixture. In this approach, you perform multiple imputations under a variety of assumptions about the missing data mechanism. In ordinary multiple imputation, you assume that those people who report their weights are similar to those who don't. In a pattern-mixture model, you may assume that people who don't report their weights are on average 20 pounds heavier. This is of course an arbitrary assumption; the idea of pattern mixture is to try out a variety of plausible assumptions and see how much they affect your results. Pattern mixture is a more natural, flexible, and interpretable approach.

Simple analysis strategies

(1) Complete Case (CC) analysis
When some variables are not observed for some of the units, one can omit these units from the analysis. These so-called "complete cases" are then analyzed as they are.
[Figure: data matrix in which the complete cases are kept and the rows containing "?" are discarded]
Advantages:
Easy
Does not invent data
Disadvantages:
Inefficient
Discarding data is bad
CC are often biased samples

Analysis strategies

(2) Analyze as incomplete (summary measures, GEE, ...)
[Figure: data matrix showing the complete cases and the rows containing "?"]
Advantages:
Does not invent data
Disadvantages:
Restricted in what you can infer
Maximum likelihood methods may be computationally intensive or not feasible for certain types of models.

Analysis strategies

(3) Analysis after single imputation
[Figure: data matrix showing the complete cases; "?" = imputation]
Advantages:
Rectangular file
Good for multiple users
Disadvantages:
Naive imputations are not good
Invents data: inference is distorted by treating imputations as the truth

Simple methods of analysis of incomplete data
[Figure labels: CC, LOCF]

Various strategies

Notation
[Figure label: DROPOUT]
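The behaviour of complete-case analysis under the three mechanisms of the weight example can be checked with a small simulation. The sketch below is illustrative only: the sample size, the weight model (mean 140 lb for one sex, 180 lb for the other, standard deviation 20) and every missingness probability are assumptions chosen for the demonstration, not values from the lecture.

```python
import random
from statistics import mean

def simulate(n=50_000, seed=1):
    """Simulate weight (Y) by sex (X) and flag Y as observed under three mechanisms.

    All numeric settings are illustrative assumptions, not lecture values.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.randint(0, 1)                 # sex, coded 0/1
        y = rng.gauss(140 + 40 * x, 20)       # weight in pounds
        u_mcar, u_mar, u_mnar = rng.random(), rng.random(), rng.random()
        rows.append({
            "x": x,
            "y": y,
            # MCAR: missingness ignores both X and Y
            "obs_mcar": u_mcar >= 0.3,
            # MAR: missingness depends only on the observed X
            "obs_mar": u_mar >= (0.6 if x == 1 else 0.1),
            # MNAR: missingness depends on the unobserved Y itself
            "obs_mnar": u_mnar >= (0.8 if y > 170 else 0.1),
        })
    return rows

def cc_mean(rows, flag, sex=None):
    """Complete-case mean of Y: use only rows whose Y is observed."""
    return mean(r["y"] for r in rows
                if r[flag] and (sex is None or r["x"] == sex))

def full_mean(rows, sex=None):
    """Mean of Y using every row (possible only in a simulation)."""
    return mean(r["y"] for r in rows if sex is None or r["x"] == sex)

rows = simulate()
for flag in ("obs_mcar", "obs_mar", "obs_mnar"):
    print(flag,
          "overall bias:", round(cc_mean(rows, flag) - full_mean(rows), 2),
          "bias within sex 1:",
          round(cc_mean(rows, flag, 1) - full_mean(rows, 1), 2))
```

Under these assumed settings the overall complete-case mean stays close to the full-data mean only under MCAR. Under MAR the overall mean is shifted but the within-sex means are not, which is why conditioning on X is enough there. Under MNAR even the within-sex means are shifted.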

Ignorability

In a likelihood setting, the term "ignorable" is often used to refer to an MAR mechanism. It is the mechanism which is ignorable, not the missing data!

$$f(x) = \int f(x, y)\,dy$$

$$\text{MAR}: \quad P(D \mid Y_i^{o}, Y_i^{m}) = P(D \mid Y_i^{o})$$

Direct likelihood maximisation

Example 1: Growth data

Example: The depression trial

Patients are evaluated both pretreatment and posttreatment with the 17-item Hamilton Rating Scale for Depression (Ham-D-17).

5. Part II: Multiple imputation

[Diagram: Data set with missing values -> Completed set -> Result]

General principles

Informal justification

The algorithm

Pooling information

Hypothesis testing

MI in practice

A simulation-based approach to missing data
1. Generate M > 1 plausible versions of $Y^{m}$.
2. Analyze each of the M datasets by standard complete-data methods.
3. Combine the results across the M datasets (M = 3-5 is usually OK).
[Figure: data matrix showing the complete cases; "?" = imputation for the m-th dataset]

MI in practice. Step 1

Generate M > 1 plausible versions of $Y^{m}$ via software, i.e. obtain M different datasets.
An assumption we make: the data are MCAR or MAR, i.e. the missing data mechanism is ignorable.
Use as much information as is available in order to achieve the best imputation.
If the percentage of missing data is high, we need to increase M.

How many datasets to create?

The efficiency of an estimator based on M imputations is $(1 + \gamma/M)^{-1}$, where $\gamma$ is the fraction of missing information.

Efficiency of multiple imputation (%)

        gamma
  M    0.1   0.3   0.5   0.7   0.9
  3     97    91    86    81    77
  5     98    94    91    88    85
 10     99    97    95    93    92
 20    100    99    98    97    96

MI in practice. Step 2

Analyze each of the M datasets by standard complete-data methods.
Let $\beta$ be the parameter of interest.
$\hat{\beta}_m$ is the estimate of $\beta$ from the complete-data analysis of the m-th dataset (m = 1, ..., M).
$U_m$ is the variance of $\hat{\beta}_m$ from the analysis of the m-th dataset.
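Once each completed dataset has produced an estimate and its variance, the combining step is a few lines of arithmetic. The sketch below applies Rubin's rules to a scalar parameter and also evaluates the efficiency formula from the "How many datasets to create?" slide. The function names and the example numbers are made up for illustration; a vector-valued parameter would need the matrix form of the between-imputation variance.

```python
from statistics import mean, variance

def mi_efficiency(gamma, m):
    """Relative efficiency (1 + gamma/m)^-1 of an estimate based on m imputations."""
    return 1.0 / (1.0 + gamma / m)

def pool_rubin(estimates, variances):
    """Combine M complete-data results for a scalar parameter (Rubin's rules).

    estimates: the M point estimates, one per completed dataset
    variances: the M squared standard errors of those estimates
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need M >= 2 estimates with matching variances")
    pooled = mean(estimates)          # average of the M estimates
    within = mean(variances)          # W: average within-imputation variance
    between = variance(estimates)     # B: sample variance (divisor M - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical results from M = 3 completed datasets
est, var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(est, var)
print(round(100 * mi_efficiency(0.5, 5)))
```

For gamma = 0.5 and M = 5 the efficiency formula gives about 91%, so a handful of imputations already recovers most of the available precision unless the fraction of missing information is large.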

MI in practice. Step 3

Combine the results across the M datasets.

$$\hat{\beta}^{*} = \frac{1}{M}\sum_{m=1}^{M}\hat{\beta}_m$$

is the combined inference for $\beta$. The variance for $\hat{\beta}^{*}$ is

$$V = W + \frac{M+1}{M}\,B,$$

where

$$W = \frac{1}{M}\sum_{m=1}^{M} U_m \quad \text{(within)}, \qquad B = \frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{\beta}_m - \hat{\beta}^{*}\right)\left(\hat{\beta}_m - \hat{\beta}^{*}\right)^{T} \quad \text{(between)}.$$

Software

1. Joe Schafer's software from his web site ($0)
http://www.stat.psu.edu/%7Ejls/misoftwa.html
Schafer has written publicly available software, primarily for S-plus. There is a stand-alone Windows package for data that are multivariate normal. This web site contains much useful information regarding multiple imputation.

Software

2. SAS software (experimental)
It is part of SAS/STAT version 8.02.
SAS Institute paper on multiple imputation, which gives an example and SAS code: http:/
Documentation on PROC MI: http:/
Documentation on PROC MIANALYZE: http:/

Software

3. SOLAS version 3.0 ($1K)
http://www.statsol.ie/index.php?pageID=5
Windows-based software that performs different types of imputation:
Hot-deck imputation
Predictive OLS/discriminant regression
Nonparametric, based on propensity scores
Last value carried forward
It will also combine parameter results across the M analyses.

MI Analysis of the Orthodontic Growth Data

Properties of methods

MCAR: drop-out independent of response
CC is valid, though it ignores information
LOCF is valid if there are no trends with time
MAR: drop-out depends only on observations
CC, LOCF, GEE invalid
MI, MNLM, weighted GEE valid
MNAR: drop-out depends also on unobserved
CC, LOCF, GEE, MI, MNLM invalid
SM (selection models), PMM (pattern-mixture models) valid if (uncheckable) assumptions are true

References

Allison, P. (2002). Missing data. Thousand Oaks, CA: Sage greenback.
Horton, NJ & Lipsitz, SR. (2001). Multiple imputation in practice: Comparison of software packages for regression models with missing variables. The American Statistician 55(3): 244-254.
Little, R.J.A. (1992). Regression with missing X's: A review. Journal of the American Statistical Association 87(420): 1227-1237.
Little, R.J.A. and Rubin, D.B. (2002). Statistical Analysis with Missing Data, 2nd edition, April 2002.
Applications of Modern Missing Data Methods, by Roderick J.A. Little.
Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data. Joe Schafer's web site: http://www.stat.psu.edu/%7Ejls
Anderson, T.W. (1956). Maximum likelihood estimates for a multivariate normal distribution when some observations are missing.

Further References

Little, RJA & Rubin, DB. (1st ed. 1987, 2nd ed. 2002). Statistical analysis with missing data. New York: Wiley.
Rubin, DB. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
Mallinckrodt et al. (2003). Assessing and interpreting treatment effects in longitudinal clinical trials with missing data. Biological Psychiatry 53, 754-760.
Gueorguieva & Krystal (2004). Move Over ANOVA. Archives of General Psychiatry 61, 310-317.
Mallinckrodt et al. (2004). Choice of the primary analysis in longitudinal clinical trials. Pharmaceutical Statistics 3, 161-169.
Molenberghs et al. (2004). Analyzing incomplete longitudinal clinical trial data (with discussion). Biostatistics 5, 445-464.
Cook, Zeng & Yi (2004). Marginal analysis of incomplete longitudinal binary data: a cautionary note on LOCF imputation. Biometrics 60, 820-828.

Any Questions?
