The Art of R Programming


10.2.4 Extended Example: Reading PUMS Census Files


The U.S. Census Bureau makes census data available in the form of Public
Use Microdata Samples (PUMS). The term microdata here means that we
are dealing with raw data and each record is for a real person, as opposed to
statistical summaries. Data on many, many variables are included.
The data is organized by household. For each unit, there is first a
Household record, describing the various characteristics of that household,
followed by one Person record for each person in the household. Character
positions 106 and 107 (with numbering starting at 1) in the Household
record state the number of Person records for that household. (The number
can be very large, since some institutions count as households.)
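Since R's substr() also numbers characters starting at 1, the count field can be pulled out directly. The following is only a minimal sketch: the function name person_count and the synthetic record fake_h are made up for illustration and are not part of the PUMS data or of any package.

```r
# Hypothetical helper: number of Person records declared by a Household record.
# Positions 106-107 are 1-based, matching substr()'s own indexing.
person_count <- function(hrec) {
  as.integer(substr(hrec, 106, 107))
}

# Synthetic record for illustration only (not real PUMS data):
# "H" in column 1, zeros through column 105, then "03" in columns 106-107.
fake_h <- paste0("H", strrep("0", 104), "03", strrep("0", 10))
person_count(fake_h)
```

Converting with as.integer() means that a malformed field yields NA (with a warning) rather than a silently wrong count, which is convenient when checking the file.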
To enhance the integrity of the data, character position 1 contains H or
P to confirm that this is a Household or Person record. So, if you read an
H record, and it tells you there are three people in the household, then the
following three records should be P records, followed by another H record;
if not, you’ve encountered an error.
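The consistency check just described might be sketched as follows. This is an illustrative sketch, not necessarily the approach developed later in this example; the names check_structure, mk_h, and mk_p are hypothetical, and the records built by mk_h and mk_p are synthetic stand-ins rather than real PUMS data.

```r
# Hypothetical checker: walk the records and confirm that each H record is
# followed by exactly the number of P records it declares.
# Returns "ok" or a message describing the first problem found.
check_structure <- function(recs) {
  i <- 1L
  n <- length(recs)
  while (i <= n) {
    if (substr(recs[i], 1, 1) != "H")
      return(sprintf("record %d: expected an H record", i))
    np <- as.integer(substr(recs[i], 106, 107))
    if (is.na(np))
      return(sprintf("record %d: unreadable person count", i))
    if (np > 0) {
      if (i + np > n)
        return(sprintf("record %d: input ends before %d P records", i, np))
      # The next np records must all be P records.
      ptypes <- substr(recs[(i + 1L):(i + np)], 1, 1)
      if (any(ptypes != "P"))
        return(sprintf("record %d: expected %d P records to follow", i, np))
    }
    i <- i + np + 1L   # jump to what should be the next H record
  }
  "ok"
}

# Synthetic records for illustration only.
mk_h <- function(np) paste0("H", strrep("0", 104), sprintf("%02d", np))
mk_p <- function() paste0("P", strrep("0", 50))

good <- c(mk_h(2), mk_p(), mk_p(), mk_h(0), mk_h(1), mk_p())
bad  <- c(mk_h(2), mk_p(), mk_h(0))   # second P record is missing
check_structure(good)
check_structure(bad)
```

Note that a file cut off at a fixed number of records may end in the middle of a household, in which case this sketch reports the truncation as a problem at the last H record.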
As our test file, we’ll take the first 1,000 records of the 1 percent
sample for the year 2000. The first few records look like this:


H000019510649 06010 99979997 70 631973
15758 59967658436650000012000000 0000000000000000
0 0 0 0 0 0 0 0000 0 0 0 0 0 00000000000000000000000000000
00000000000000000000000000
P00001950100010923000420190010110000010147050600206011099999904200000 0040010000
00300280 28600 70 9997 9997202020202020220000040000000000000006000000
00000 00 0000 00000000000000000132241057904MS 476041-20311010310
07000049010000000000900100000100000100000100000010000001000139010000490000
H000040710649 06010 99979997 70 631973
15758 599676584365300800200000300106060503010101010102010 01200006000000100001
00600020 0 0 0 0 0000 0 0 0 0 0 02000102010102200000000010750
02321125100004000000040000
P00004070100005301000010380010110000010147030400100009005199901200000 0006010000
00100000 00000 00 0000 0000202020202020220000040000000000000001000060
06010 70 9997 99970101004900100000001018703221 770051-10111010500
40004000000000000000000000000000000000000000000000000000004000000040000349
P00004070200005303011010140010110000010147050000204004005199901200000 0006010000
00100000 00000 00 0000 000020202020 0 0200000000000000000000000050000
00000 00 0000 000000000000000000000000000000000000000000-00000000000
000 0 0 0 0 0 0 0 0 00000000349
H000061010649 06010 99979997 70 631973
15758 599676584360801190100000200204030502010101010102010 00770004800064000001
1 0 030 0 0 0 0340 00660000000170 0 06010000000004410039601000000
00021100000004940000000000


The records are very wide and thus wrap around. Each one occupies
four lines on the page here.
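A test file of this kind can be obtained by reading only the first n lines of the full file with readLines(), after which nchar() shows how wide each record really is. The sketch below is an assumption-laden illustration: the helper name first_records is hypothetical, and because the name of the real file is not given here, the demonstration uses a small synthetic temporary file.

```r
# Hypothetical helper: read the first n records (lines) of a PUMS file.
first_records <- function(path, n = 1000) {
  readLines(path, n = n)
}

# Demonstration on a small synthetic file (not real PUMS data).
tmp <- tempfile()
writeLines(c(paste0("H", strrep("0", 120)),
             paste0("P", strrep("0", 120)),
             paste0("P", strrep("0", 120))), tmp)

recs <- first_records(tmp, n = 2)
length(recs)         # number of records actually read
nchar(recs)          # width of each record, in characters
substr(recs, 1, 1)   # record types, from character position 1
unlink(tmp)
```

Each element of the returned character vector holds one complete record, so the wrapping seen on the page is purely a display matter.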

