Sampling: Design & Analysis
Sharon L. Lohr
Arizona State University

Contents

CHAPTER 1 Introduction
  1.1 A Sample Controversy
  1.2 Requirements of a Good Sample
  1.3 Selection Bias
  1.4 Measurement Bias
  1.5 Questionnaire Design
  1.6 Sampling and Nonsampling Errors
CHAPTER 2 Simple Probability Samples
  2.1 Types of Probability Samples
  2.2 Framework for Probability Sampling
  2.3 Simple Random Sampling
  2.4 Confidence Intervals
  2.5 Sample Size Estimation
  2.6 Systematic Sampling
  2.7 Randomization Theory Results for Simple Random Sampling*
  2.8 A Model for Simple Random Sampling*
  2.9 When Should a Simple Random Sample Be Used?
CHAPTER 3 Ratio and Regression Estimation
  3.1 Ratio Estimation
  3.2 Regression Estimation
  3.3 Estimation in Domains
  3.4 Models for Ratio and Regression Estimation*
  3.5 Comparison
CHAPTER 4 Stratified Sampling
  4.1 What Is Stratified Sampling?
  4.2 Theory of Stratified Sampling
  4.3 Sampling Weights
  4.4 Allocating Observations to Strata
  4.5 Defining Strata
  4.6 A Model for Stratified Sampling*
  4.7 Post-stratification
  4.8 Quota Sampling
CHAPTER 5 Cluster Sampling with Equal Probabilities
  5.1 Notation for Cluster Sampling
  5.2 One-Stage Cluster Sampling
  5.3 Two-Stage Cluster Sampling
  5.4 Using Weights in Cluster Samples
  5.5 Designing a Cluster Sample
  5.6 Systematic Sampling
  5.7 Models for Cluster Sampling*

CHAPTER 1 Introduction

When statistics are not based on strictly accurate calculations, they mislead instead of guide. The mind easily lets itself be taken in by the false appearance of exactitude which statistics retain in their mistakes, and confidently adopts errors clothed in the form of mathematical truth.
- Alexis de Tocqueville, Democracy in America

1.1 A Sample Controversy

Shere Hite's book "Women and Love: A Cultural Revolution in Progress" (1987) reports:
- 84% of women are not satisfied emotionally with their relationships (p. 804).
- 70% of all women married five or more years are having sex outside of their marriages (p. 856).
- 95% of women report forms of emotional and psychological harassment from men with whom they are in love relationships (p. 810).
- 84% of women report forms of condescension from the men in their love relationships (p. 809).

* Harassment: to annoy persistently. Sexual harassment: uninvited and unwelcome verbal or physical behavior of a sexual nature, especially by a person in authority toward a subordinate (such as an employee or student).
* Condescension: 1. voluntary descent from one's rank or dignity in relations with an inferior; 2. the act of condescending or an instance of it; 3. patronizingly superior behavior or attitude.
* Vignette: a decorative design placed at the beginning or end of a book or chapter of a book or along the border of a page.

The following characteristics of the survey
make Hite's results unsuitable for generalization:
- The sample was self-selected.
- The questionnaires were mailed to specific groups.
- The questions in the survey are too complicated.
- Many of the questions are vague, using words such as "love".
- Many of the questions are leading.

1.2 Requirements of a Good Sample

A perfect sample should:
1. be a scaled-down version of the population;
2. mirror the characteristics of the whole population.

Some definitions to make the notion of a good sample more precise:
Observation unit: An object on which a measurement is taken.
Target population: The complete collection of observations we want to study.
Sample: A subset of a population.
Sampled population: The collection of all possible observation units that might have been chosen in a sample; the population from which the sample was taken.
Sampling unit: The unit we actually sample.
Sampling frame: The list of sampling units.

[Figure: the target population, the sampling frame population, and the sampled population, with regions labeled "not reachable", "refuse to respond", "not capable of responding", and "not eligible for survey".]

In an ideal survey, the sampled population will be identical to the target population, but this ideal is rarely met exactly.

In the Hite study:
  Target population: all adult women in the United States.
  Sampled population: women belonging to women's organizations who would return the questionnaire.
In the National Crime Victimization Survey:
  Target population: all households in the United States.
  Sampled population: households in the sampling frame that are at home and agree to answer questions.
In the National Pesticide Survey:
  Target population: all community water systems and rural domestic wells in the United States.
  Sampled population: all community water systems and all identifiable domestic wells outside of government reservations that belonged to households willing to cooperate with the survey.
In public opinion polls:
  Target population: persons who will vote in the next election.
  Sampled population: persons who can be reached by telephone and say they are likely to vote in the next election.

1.3 Selection Bias

The following examples indicate some ways in which selection bias can occur:
- Using a sample-selection procedure that, unknown to the investigators, depends on some characteristic associated with the properties of interest.
- Deliberately or purposefully selecting a "representative" sample, for instance a judgment sample.
- Misspecifying the target population.
- Failing to include all of the target population in the sampling frame, called undercoverage.
- Substituting a convenient member of a population for a designated member who is not readily available.
- Failing to obtain responses from the entire chosen sample, called nonresponse.
- Allowing the sample to consist entirely of volunteers.

CASE STUDY: Literary Digest

The Literary Digest, once a very famous magazine in the United States, began taking polls to forecast the outcome of the U.S. presidential election in 1912. Its polls attained a reputation for accuracy because they forecast the correct winner in every election between 1912 and 1932. In 1932, for example, the
poll predicted that Roosevelt would receive 56% of the popular vote and 474 votes in the electoral college; in the actual election, Roosevelt received 58% of the popular vote and 472 votes in the electoral college.

Electoral college: (in the U.S.) a body of people representing the states of the U.S., who formally cast votes for the election of the president and vice president.

The 1936 election went differently:

                          Poll prediction (October 31, 1936)    Actual outcome
  Republican Alf Landon   55%                                    37%
  President Roosevelt     41%                                    61%

Two reasons accounted for the outcome:
- One problem may have been undercoverage in the sampling frame, which relied heavily on telephone directories and automobile registration lists.
- The low response rate (less than 25%) to the survey was likely the source of much of the error.

One lesson to be learned from the Literary Digest poll is that the sheer size of a sample is no guarantee of its accuracy.

1.4 Measurement Bias

Measurement bias is most likely to occur in the following cases:
- People sometimes do not tell the truth.
- People do not understand the questions.
- People forget.
- People give different answers to different interviewers.
- People say what they think the interviewer wants to hear.
- The interviewer may bring his or her own leanings to the survey.
- Certain words may have vague meanings.
- The questionnaire is poorly worded, or its questions are not arranged in a good order.

1.5 Questionnaire Design

- Decide what you want to find out.
- Always test your questions before taking the survey.
- Keep it simple and clear.
- Use specific questions instead of general ones.
- Relate your questions to the concept of interest.
- Decide whether to use open or closed questions (open questions: the respondent is not prompted with categories for response; closed questions: multiple choice).
- Report the actual question asked.
- Avoid questions that prompt or motivate the respondent to say what you would like to hear.
- Use forced-choice questions rather than agree/disagree questions.
- Ask about only one concept in each question.
- Pay attention to question-order effects.

1.6 Sampling and Nonsampling Errors

Sampling error: the error that results from taking one sample instead of examining the whole population.
Nonsampling error: any error that cannot be attributed to sample-to-sample variability. It is caused chiefly by:
- Selection bias
- Incorrect answers
- Incomplete values
- Nonresponse

Selection bias and measurement bias are examples of nonsampling errors. In many cases, nonsampling errors damage the accuracy of a sample far more than sampling errors do.

Why do we use sampling?
- Sampling can provide reliable information at far less cost than a census.
- Data can be collected more quickly, so estimates can be published in a timely fashion.
- Finally, and less well known, estimates based on sample surveys are often more accurate than those based on a census because investigators can be more careful when collecting data.

CHAPTER 2 Simple Probability Samples

Probability sampling: in a probability sample, each unit in the population has a known (but not necessarily equal) probability of selection, and a chance method such as using numbers from a random number table is used to choose the specific units to be included in the sample.

2.1 Types of Probability Samples

1. Simple random sample
2. Stratified sample
3. Cluster sample

A simple random sample (SRS) is the simplest form of probability sample. An SRS of size
n is taken when every possible subset of n units in the population has the same chance of being the sample.

In a stratified random sample, the population is divided into subgroups called strata. Then an SRS is selected from each stratum, and the SRSs in the strata are selected independently.

In a cluster
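sample, observation units in the population are aggregated into larger sampling units, called clusters. Then an SRS of clusters is drawn, with each cluster treated as a single unit.

2.2 Framework for Probability Sampling

The finite population (universe) of N units is indexed as
$$U = \{1, 2, 3, \dots, N\}.$$
A special case is N = 4, which results in:
$$U = \{1, 2, 3, 4\}.$$
Its possible samples of size n = 2 are:
$$S_1 = \{1,2\},\quad S_2 = \{1,3\},\quad S_3 = \{1,4\},\quad S_4 = \{2,3\},\quad S_5 = \{2,4\},\quad S_6 = \{3,4\}.$$

Example 2.1: Let the samples be selected with probabilities
$$P(S_1) = 1/3,\quad P(S_2) = 1/6,\quad P(S_3) = 0,\quad P(S_4) = 0,\quad P(S_5) = 0,\quad P(S_6) = 1/2.$$
The inclusion probability of unit i is $\pi_i = P(\text{unit } i \text{ in sample})$, obtained by adding the probabilities of all samples that contain unit i:
$$\pi_1 = P(S_1) + P(S_2) + P(S_3) = 1/3 + 1/6 + 0 = 1/2$$
$$\pi_2 = P(S_1) + P(S_4) + P(S_5) = 1/3 + 0 + 0 = 1/3$$
$$\pi_3 = P(S_2) + P(S_4) + P(S_6) = 1/6 + 0 + 1/2 = 2/3$$
$$\pi_4 = P(S_3) + P(S_5) + P(S_6) = 0 + 0 + 1/2 = 1/2$$

Example 2.2: Let
$$U = \{1, 2, 3, 4, 5, 6, 7, 8\}.$$
The sampling distribution of an estimator $\hat{t}$ is
$$P\{\hat{t} = k\} = \sum_{S:\ \hat{t}_S = k} P(S).$$
The unit labels i and their values $y_i$ are listed in two rows, labels first and values second:
i:   1, 2, 3, 4, 5, 6, 7, 8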
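y_i: 1, 2, 4, 4, 7, 7, 7, 8

The expected value of $\hat{t}$, $E[\hat{t}]$, is the mean of the sampling distribution of $\hat{t}$:
$$E[\hat{t}] = \sum_{S} \hat{t}_S\, P(S) = \sum_{k} k\, P\{\hat{t} = k\}.$$

The sampling distribution of $\hat{t}$ in Example 2.2 is:

  k:                      22  28  30  32  34  36  38  40  42  44  46  48  50  52  58
  70 * P{t-hat = k}:       1   6   2   3   7   4   6  12   6   4   7   3   2   6   1

The counts in the second row sum to 70, which equals $\binom{8}{4}$; this is consistent with equally likely samples of size n = 4 and $\hat{t} = N\bar{y}$. Then
$$E[\hat{t}] = \frac{1}{70}(22) + \frac{6}{70}(28) + \dots + \frac{1}{70}(58) = 40,$$
i.e. $E[\hat{t}] = t$, since the population total is $t = 40$.

The variance of the sampling distribution of $\hat{t}$, i.e. $V(\hat{t})$, is:
$$V(\hat{t}) = E\left[(\hat{t} - E[\hat{t}])^2\right] = \sum_{S} P(S)\,\left(\hat{t}_S - E[\hat{t}]\right)^2.$$
For Example 2.2:
$$V(\hat{t}) = \frac{1}{70}(22 - 40)^2 + \dots + \frac{1}{70}(58 - 40)^2 = 54.86,$$
$$SE(\hat{t}) = \sqrt{V(\hat{t})} = \sqrt{54.86} = 7.41.$$

The accuracy of an estimator is measured by the mean squared error (MSE) rather than by the variance:
$$\begin{aligned}
MSE[\hat{t}] &= E\left[(\hat{t} - t)^2\right] \\
&= E\left[(\hat{t} - E[\hat{t}] + E[\hat{t}] - t)^2\right] \\
&= E\left[(\hat{t} - E[\hat{t}])^2\right] + \left(E[\hat{t}] - t\right)^2 + 2\,E\left[(\hat{t} - E[\hat{t}])(E[\hat{t}] - t)\right] \\
&= V(\hat{t}) + \left[\mathrm{Bias}(\hat{t})\right]^2,
\end{aligned}$$
where $\mathrm{Bias}(\hat{t}) = E[\hat{t}] - t$ and the cross term vanishes because $E\left[\hat{t} - E[\hat{t}]\right] = 0$.

An estimator is unbiased if:
$$E[\hat{t}] = t.$$
An estimator is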
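precise if the following is small:
$$V(\hat{t}) = E\left[(\hat{t} - E[\hat{t}])^2\right].$$
An estimator is accurate if the following is small:
$$MSE[\hat{t}] = E\left[(\hat{t} - t)^2\right].$$

Some summary quantities for the population:

The population total is:
$$t = \sum_{i=1}^{N} y_i.$$
The mean of the population is:
$$\bar{y}_U = \frac{1}{N}\sum_{i=1}^{N} y_i.$$
The variance of the population values about the mean is:
$$S^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y}_U)^2.$$
The standard deviation of the population values about the mean is:
$$S = \sqrt{S^2} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y}_U)^2}.$$
The coefficient of variation (CV) is:
$$CV(y) = \frac{S}{\bar{y}_U}, \qquad \bar{y}_U \neq 0.$$

A proportion is a special case of a mean. The distinction is that for a mean the variable can take more than two values, whereas for a proportion it takes exactly two values:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad p = \frac{1}{n}\sum_{i=1}^{n} x_i \ \text{ with each } x_i \in \{0, 1\},$$
where the variable is
$$x_i = \begin{cases} 1, & \text{if the } i\text{th unit has the specific characteristic} \\ 0, & \text{otherwise} \end{cases} \qquad (0 < i \le n).$$

2.3 Simple Random Sampling

There are two types of simple random sample:
1. Simple random sample with replacement (SRSWR). In this case there are $N \times N \times \dots \times N = N^n$ possible samples, and we may get duplicates.
2. Simple random sample without replacement (SRS). In this case there are $\binom{N}{n} = \frac{N!}{n!\,(N-n)!}$ possible samples, and we cannot get duplicates.

For estimating the population mean in an SRS, we use the sample mean:
$$\bar{y} = \bar{y}_S = \frac{1}{n}\sum_{i \in S} y_i.$$
The sample mean $\bar{y}$ is an unbiased estimator of the population mean $\bar{y}_U$, and the variance of $\bar{y}$ is:
$$V(\bar{y}) = \frac{S^2}{n}\left(1 - \frac{n}{N}\right),$$
in which $\left(1 - \frac{n}{N}\right)$ is called the finite population correction (fpc).

For estimating the population variance $S^2$, we use the sample variance:
$$s^2 = \frac{1}{n-1}\sum_{i \in S} (y_i - \bar{y})^2.$$
An unbiased estimator of $V(\bar{y})$ is as follows:
$$\hat{V}(\bar{y}) = \left(1 - \frac{n}{N}\right)\frac{s^2}{n}.$$
The estimated variance of $\bar{y}$ is usually reported through its standard error (SE):
$$SE(\bar{y}) = \sqrt{\hat{V}(\bar{y})} = \sqrt{\left(1 - \frac{n}{N}\right)\frac{s^2}{n}}.$$
The estimated coefficient of variation of an estimate is:
$$CV(\bar{y}) = \frac{\sqrt{\hat{V}(\bar{y})}}{\bar{y}} = \frac{SE(\bar{y})}{\bar{y}}.$$

All these results apply to the estimation of a population total, t, since:
$$t = \sum_{i=1}^{N} y_i = N\bar{y}_U.$$
The unbiased estimator of t is:
$$\hat{t} = N\bar{y}.$$
Its variance is:
$$V(\hat{t}) = N^2\, V(\bar{y}) = N^2\left(1 - \frac{n}{N}\right)\frac{S^2}{n}.$$
An unbiased estimator of this variance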
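is:
$$\hat{V}(\hat{t}) = N^2\left(1 - \frac{n}{N}\right)\frac{s^2}{n}.$$

For a proportion variable (each $y_i$ is 0 or 1), the parameters are:
$$p = \bar{y}_U = \frac{1}{N}\sum_{i=1}^{N} y_i, \qquad \sum_{i=1}^{N} y_i^2 = \sum_{i=1}^{N} y_i = Np.$$
Since
$$S^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - p)^2 = \frac{1}{N-1}\left(\sum_{i=1}^{N} y_i^2 - 2p\sum_{i=1}^{N} y_i + Np^2\right),$$
we have
$$S^2 = \frac{Np - 2Np^2 + Np^2}{N-1} = \frac{N}{N-1}\,p(1-p).$$
The estimators are:
$$\hat{p} = \bar{y}, \qquad s^2 = \frac{1}{n-1}\sum_{i \in S} (y_i - \hat{p})^2 = \frac{n}{n-1}\,\hat{p}(1-\hat{p}),$$
and the variance of $\hat{p}$ is
$$V(\hat{p}) = \left(1 - \frac{n}{N}\right)\frac{S^2}{n} = \frac{1}{n}\left(1 - \frac{n}{N}\right)\frac{N}{N-1}\,p(1-p) = \frac{N-n}{N-1}\cdot\frac{p(1-p)}{n}.$$

2.4 Confidence Intervals

Used to indicate how accurate our estimates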
are. A confidence interval usually appears in this form:
$$\left[\hat{x} - z_{\alpha/2}\,SE(\hat{x}),\ \ \hat{x} + z_{\alpha/2}\,SE(\hat{x})\right].$$
If we take $\bar{y}$ as an example, then we have:
$$\left[\bar{y} - z_{\alpha/2}\,SE(\bar{y}),\ \ \bar{y} + z_{\alpha/2}\,SE(\bar{y})\right].$$

The distinction between a distribution and a sampling distribution: a "distribution" refers to the original distribution of a variable y, whereas a "sampling distribution" refers to a distribution generated from the original one, such as the distribution of an estimator like $\bar{y}$.

Using the data of Example 2.2:
  y:      1, 2, 4, 4, 7, 7, 7, 8 (8 units, each equally likely)
  y-bar:  11/4, 14/4, 15/4, 16/4, 17/4, ..., 29/4 (15 distinct values, with unequal probabilities)

The distinction between the law of large numbers and the central limit theorem: the law of large numbers says that the sample mean differs very little from the population mean, with high probability, if n is sufficiently large (versions exist for independent and for some dependent observations, with the same or different distributions); whereas the central limit theorem says that the distribution of the sample
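mean, once standardized, converges to a normal distribution if n is sufficiently large (versions exist for observations with the same or different distributions).

Bernoulli's law of large numbers is:
$$\lim_{n \to \infty} P\left(\left|\frac{\mu_n}{n} - p\right| < \varepsilon\right) = 1 \quad \text{for every } \varepsilon > 0,$$
where $\mu_n$ is the number of successes in n independent trials, each with success probability p.

The Lindeberg-Lévy central limit theorem is:
$$\lim_{n \to \infty} P\left(\frac{X_1 + X_2 + \dots + X_n - n\mu}{\sigma\sqrt{n}} \le y\right) = \int_{-\infty}^{y} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt,$$
where $X_1, X_2, \dots$ are independent and identically distributed with mean $\mu$ and finite variance $\sigma^2 > 0$.

2.5 Sample Size Estimation

An investigator often measures several variables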
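and has a number of goals for a survey. Anyone designing an SRS must decide what amount of sampling error in the estimates is tolerable and must balance the precision of the estimates with the cost of the survey.

Follow these steps to estimate the sample size:
1. Ask questions such as:
   A: What is expected of the sample?
   B: How much precision do I need?
   C: What are the consequences of the sample results?
   D: How much error is tolerable?
2. Find an equation relating the sample size n and our expectations of the sample.
3. Estimate any unknown quantities and solve for n.
4. If the sample size calculated in the last step is much larger than you can afford, go back, adjust some of the expectations for the survey, and try again.

Specify the tolerable error e, either in absolute or in relative terms:
$$P\left(\left|\bar{y} - \bar{y}_U\right| \le e\right) = 1 - \alpha \qquad \text{or} \qquad P\left(\left|\frac{\bar{y} - \bar{y}_U}{\bar{y}_U}\right| \le e\right) = 1 - \alpha.$$
Find an equation:
$$e = z_{\alpha/2}\sqrt{1 - \frac{n}{N}}\;\frac{S}{\sqrt{n}}.$$
Solving for n, we have:
$$n = \frac{z_{\alpha/2}^2 S^2}{e^2 + \dfrac{z_{\alpha/2}^2 S^2}{N}} = \frac{n_0}{1 + \dfrac{n_0}{N}}, \qquad \text{where } n_0 = \left(\frac{z_{\alpha/2}\, S}{e}\right)^2.$$
In the relative precision case, we have:
$$n = \frac{z_{\alpha/2}^2 S^2}{(e\,\bar{y}_U)^2 + \dfrac{z_{\alpha/2}^2 S^2}{N}} = \frac{z_{\alpha/2}^2\, CV(y)^2}{e^2 + \dfrac{z_{\alpha/2}^2\, CV(y)^2}{N}}.$$

2.7 Randomization Theory Results for Simple Random Sampling*

To verify
$$E[\bar{y}] = \bar{y}_U \qquad \text{and} \qquad V(\bar{y}) = \frac{S^2}{n}\left(1 - \frac{n}{N}\right),$$
define
$$Z_i = \begin{cases} 1, & \text{if unit } i \text{ is in the sample} \\ 0, & \text{otherwise,} \end{cases}$$
then we have
$$\bar{y} = \sum_{i \in S} \frac{y_i}{n} = \sum_{i=1}^{N} Z_i\,\frac{y_i}{n}.$$
The probability that unit i is selected is
$$P(Z_i = 1) = P(\text{select unit } i \text{ in sample}) = \frac{\text{number of samples including unit } i}{\text{number of all possible samples}} = \frac{\binom{N-1}{n-1}}{\binom{N}{n}} = \frac{n}{N}.$$

As a consequence of equation (2.18), in order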