about how the data came into our possession. The last part of this book deals with statistical inference—
making statements about a population based on samples drawn from the population. In both data analysis
and inference, we would like to believe that our analyses, or inferences, are meaningful. If we make a
claim about a population based on a sample, we want that claim to be true. Our ability to do meaningful
analyses and make reliable inferences is a function of the data we collect. To the extent that the sample
data we deal with are representative of the population of interest, we are on solid ground. No
interpretation of data that are poorly collected or systematically biased will be meaningful. We need to
understand how to gather quality data before proceeding to inference. In this chapter, we study
techniques for gathering data so that we have reasonable confidence that they are representative of our
population of interest.
Census
We usually want to know something about the entire population of interest. The way to find that out for
sure is to conduct a census, a procedure by which every member of a population is selected for study.
Doing a census, especially when the population of interest is quite large, is often impractical, too
time-consuming, or too expensive. Interestingly enough, relatively small samples can give quite good estimates
of population values if the samples are selected properly. For example, it can be shown that a random
sample of approximately 1500 voters can give reliable information about the entire voting population of
the United States, typically estimating a proportion to within about 3 percentage points.
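The claim about 1500 voters can be checked with a quick calculation. The sketch below assumes the usual
normal-approximation formula for the margin of error of a sample proportion, z * sqrt(p(1 - p)/n), with
z = 1.96 for 95% confidence and the worst case p = 0.5. The function name margin_of_error is our own
choice for illustration. The formula also assumes a simple random sample from a population much larger
than the sample, which is why the population size does not appear in it.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample of size n from a much larger population.
    p = 0.5 is the worst case; z = 1.96 corresponds to 95% confidence.
    """
    return z * math.sqrt(p * (1 - p) / n)

# Compare several sample sizes, including the 1500 voters mentioned in the text.
for n in (100, 500, 1500, 10_000):
    print(f"n = {n:>6}: margin of error is about {margin_of_error(n):.3f}")
```

For n = 1500 the margin of error is about 0.025, or roughly 2.5 percentage points. Going from 1500 to
10,000 voters shrinks it only to about 1 percentage point, which is why much larger samples are rarely
worth the cost.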
The goal of sampling is to produce a representative sample, one that has the essential
characteristics of the population being studied and is free of any type of bias. We can never be certain that
our sample has the characteristics of the population from which it was drawn. Our best chance of making
a sample representative is to use some sort of random process in selecting it. It is important to note that
“bias” does not mean the same thing as “nonrepresentative.” Bias refers to a method that produces
estimates that are either too high on average, or too low on average. Nonrepresentative refers to a
particular sample that differs from the population.
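The distinction between a biased method and a nonrepresentative sample can be seen in a small simulation.
The sketch below is only an illustration: the population (10,000 made-up commute times), the sample size
of 25, and the "biased" method (sampling only from the larger half of the population) are all our own
assumptions, not examples from the text.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is reproducible

# Hypothetical population: 10,000 commute times in minutes.
population = [rng.gauss(30, 10) for _ in range(10_000)]
true_mean = statistics.mean(population)

# The biased method can only ever reach the larger half of the population.
upper_half = sorted(population)[len(population) // 2:]

def srs_mean(n):
    """Mean of a simple random sample of size n (an unbiased method)."""
    return statistics.mean(rng.sample(population, n))

def biased_mean(n):
    """Mean of a sample of size n drawn only from the larger half."""
    return statistics.mean(rng.sample(upper_half, n))

# Repeat each method many times to see how it behaves on average.
srs_estimates = [srs_mean(25) for _ in range(2000)]
biased_estimates = [biased_mean(25) for _ in range(2000)]

avg_srs = statistics.mean(srs_estimates)
avg_biased = statistics.mean(biased_estimates)

# The single random sample whose mean landed farthest from the truth.
worst_srs = max(srs_estimates, key=lambda m: abs(m - true_mean))

print(f"population mean:          {true_mean:.2f}")
print(f"average of SRS estimates: {avg_srs:.2f}")
print(f"average of biased method: {avg_biased:.2f}")
print(f"least typical SRS mean:   {worst_srs:.2f}")
```

The random method is unbiased: its estimates average out very close to the population mean, even though
an individual sample (such as the least typical one printed last) can still be nonrepresentative. The
second method is biased: its estimates are too high on average, no matter how many samples we take.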
Probability Sample
A list of all members of the population from which we can draw a sample is called a sampling frame.
We would like the sampling frame to be the same set of individuals we are studying. Unfortunately, this is
often not the case. (Think, for example, about how the individuals listed in a phonebook are not the same
as all adult residents of a city!) A probability sample is one in which each member of the population has
a known probability of being in the sample. Each member of the population may or may not have an equal
chance of being selected. Probability samples are used to avoid the bias that can arise in a nonprobability
sample (such as when a researcher selects the subjects she will use). Probability samples use some sort
of random mechanism to choose the members of the sample. The following list includes some types of
probability samples.
• random sample: Each member of the population is equally likely to be included.
• simple random sample (SRS): A sample of a given size is chosen in such a way that every possible
sample of that size is equally likely to be chosen. Note that a sample can be a random sample and not
be a simple random sample (SRS). For example, suppose you want a sample of 64 NFL football
players. One way to produce a random sample would be to randomly select two players from each of
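the 32 teams. This is a random sample but not a simple random sample because not every sample of
size 64 can be chosen (for example, no sample can contain three players from the same team), so not all
samples of that size are equally likely.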
• systematic sample: The first member of the sample is chosen according to some random procedure,