We’ll create a function called extractpums() to read in a PUMS file and
create a data frame from its Person records. The user specifies the filename
and a list of the fields to extract, along with the names to assign to those
fields.
We also want to retain the household serial number. This is good to
have because data for persons in the same household may be correlated and
we may want to add that aspect to our statistical model. Also, the household
data may provide important covariates. (In the latter case, we would want to
retain the covariate data as well.)
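As a small illustration of why the serial number is worth keeping, here is a minimal sketch that groups persons by household. The data frame persons and its values are made up for illustration; only the column name serno follows the text.

```r
# Toy data, made up for illustration: one row per person, with the
# household serial number in serno
persons <- data.frame(serno = c(101, 101, 205, 310, 310, 310),
                      Age   = c(34, 36, 71, 45, 12, 9))

# Number of persons recorded in each household
hh_size <- table(persons$serno)

# Mean age within each household, a simple household-level summary
hh_mean_age <- tapply(persons$Age, persons$serno, mean)

print(hh_size)
print(hh_mean_age)
```

With the serial number in hand, summaries like these can be merged back onto the person rows, or serno can serve as a grouping factor in a model.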
Before looking at the function code, let’s see what the function does.
In this data set, gender is in column 23 and age in columns 25 and 26. In
the example, our filename is pumsa. The following call creates a data frame
consisting of those two variables.
pumsdf <- extractpums("pumsa",list(Gender=c(23,23),Age=c(25,26)))
Note that we are stating here the names we want the columns to have
in the resulting data frame. We can use any names we want—say Sex and
Ancientness.
Here is the first part of that data frame:
> head(pumsdf)
  serno Gender Age
2   195      2  19
3   407      1  38
4   407      1  14
5   610      2  65
6  1609      1  50
7  1609      2  49
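To make the column-range idea concrete, here is a minimal sketch of how a named list of start and end columns can be turned into named data frame columns. The helper extract_fields() and the record rec are hypothetical, invented for this illustration; this is not the code of extractpums() itself, and it assumes each field holds an integer.

```r
# Hypothetical helper (not the book's code): pull fixed-width fields out of
# one record line, using a named list of c(start, end) column ranges.
extract_fields <- function(rec, flds) {
  # substr() takes 1-based, inclusive start and stop positions
  vals <- lapply(flds, function(rng) as.integer(substr(rec, rng[1], rng[2])))
  # lapply() preserves the names of flds, so they become the column names
  as.data.frame(vals)
}

# Made-up record: 22 filler characters, then gender in column 23,
# one filler character in column 24, and age in columns 25-26
rec <- paste0(strrep("x", 22), "2", "x", "19")

res1 <- extract_fields(rec, list(Gender = c(23, 23), Age = c(25, 26)))
res2 <- extract_fields(rec, list(Sex = c(23, 23), Ancientness = c(25, 26)))
print(res1)
print(res2)
```

The two calls read the same columns and differ only in the names given in the list, which is the behavior described above for the Sex and Ancientness example.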
The following is the code for the extractpums() function.
# reads in PUMS file pf, extracting the Person records, returning a data
# frame; each row of the output will consist of the Household serial
# number and the fields specified in the list flds; the columns of
# the data frame will have the names of the indices in flds

extractpums <- function(pf,flds) {
   dtf <- data.frame()  # data frame to be built
   con <- file(pf,"r")  # connection
   # process the input file
   repeat {
      hrec <- readLines(con,1)  # read Household record
      if (length(hrec) == 0) break  # end of file, leave loop
      # get household serial number
      serno <- intextract(hrec,c(2,8))