The Art of R Programming


5.2.5 Extended Example: A Salary Study


In a study of engineers and programmers, I considered the question, “How
many of these workers are the best and the brightest—that is, people of
extraordinary ability?” (Some of the details have been changed here.)
The government data I had available was limited. One (admittedly
imperfect) way to determine whether a worker is of extraordinary ability is
to look at the ratio of actual salary to the government prevailing wage for
that job and location. If that ratio is substantially higher than 1.0, you can
reasonably assume that this worker has a high level of talent.
I used R to prepare and analyze the data and will present excerpts of my
preparation code here. First, I read in the data file:

all2006 <- read.csv("2006.csv",header=TRUE,as.is=TRUE)

The function read.csv() is essentially identical to read.table() except
that the input data is in the CSV format exported by spreadsheets, which is
the way the data set was prepared by the US Department of Labor (DOL).
The as.is argument is the negation of stringsAsFactors, which you saw
earlier in Section 5.1. So, setting as.is to TRUE here is simply an alternate
way to achieve stringsAsFactors=FALSE.
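The equivalence can be checked on a tiny example. The following is a minimal sketch, not part of the original analysis: the two-row CSV text is made up for illustration and is passed inline through the text argument of read.csv() instead of being read from a file.

```r
# Hypothetical two-row CSV, supplied inline rather than read from disk
csv_text <- "Job_Title,Wage_Per\nEngineer,Year\nProgrammer,Hour"

d1 <- read.csv(text = csv_text, header = TRUE, as.is = TRUE)
d2 <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)
d3 <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

class(d1$Wage_Per)   # "character": strings are left as they are
class(d3$Wage_Per)   # "factor": strings were converted to factors
identical(d1, d2)    # TRUE: the two settings give the same data frame
```

Either spelling keeps the string columns as character vectors, which is what makes comparisons such as Wage_Per=="Year" straightforward.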
At this point, I had a data frame, all2006, consisting of all the data for
the year 2006. I then did some filtering:

all2006 <- all2006[all2006$Wage_Per=="Year",] # exclude hourly-wagers
all2006 <- all2006[all2006$Wage_Offered_From > 20000,] # exclude weird cases
all2006 <- all2006[all2006$Prevailing_Wage_Amount > 200,] # exclude hourly prevailing wages

These operations are typical data cleaning. Most large data sets contain
some outlandish values: some are obvious errors, others use different
measurement systems, and so on. I needed to remedy this situation before doing
any analysis.
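To see what the three filters do, they can be run on a small made-up data frame. This is a sketch under stated assumptions: the column names are those used above, but every value in toy is invented for illustration. It also shows one behavior worth knowing: when a comparison yields NA, logical indexing keeps an all-NA row instead of dropping it. The which() variant at the end is an alternative, not the method used in the study.

```r
# Invented values; only the column names come from the real data set
toy <- data.frame(
  Wage_Per = c("Year", "Hour", "Year", "Year", "Year"),
  Wage_Offered_From = c(85000, 40, 1200, 60000, NA),
  Prevailing_Wage_Amount = c(70000, 35, 50000, 28.5, 65000),
  stringsAsFactors = FALSE
)

# The same three filters as in the text, applied one after another
kept <- toy[toy$Wage_Per == "Year", ]
kept <- kept[kept$Wage_Offered_From > 20000, ]
kept <- kept[kept$Prevailing_Wage_Amount > 200, ]
nrow(kept)   # 2: one genuine row plus one row consisting entirely of NAs

# Variant: which() returns only the positions that are TRUE,
# so rows whose comparison is NA are dropped
strict <- toy[which(toy$Wage_Per == "Year" &
                    toy$Wage_Offered_From > 20000 &
                    toy$Prevailing_Wage_Amount > 200), ]
nrow(strict) # 1
```

With the first approach, the NA rows stay in the data frame, so later summaries must be told to skip them.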
I also needed to create a new column for the ratio between actual wage
and prevailing wage:

all2006$rat <- all2006$Wage_Offered_From / all2006$Prevailing_Wage_Amount
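The division is vectorized, so a single statement fills the whole column. As a quick illustration with invented numbers (the data frame toy below is hypothetical, not DOL data):

```r
# Invented wages; the column names match those in the real data set
toy <- data.frame(
  Wage_Offered_From = c(85000, 60000, NA),
  Prevailing_Wage_Amount = c(70000, 60000, 65000)
)

# Element-by-element ratio of offered wage to prevailing wage
toy$rat <- toy$Wage_Offered_From / toy$Prevailing_Wage_Amount

toy$rat > 1.0   # TRUE FALSE NA
```

A missing wage produces a missing ratio, which carries through to any later comparison.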

Since I knew I would be calculating the median in this new column for
many subsets of the data, I defined a function to do the work:

medrat <- function(dataframe) {
  return(median(dataframe$rat,na.rm=TRUE))
}

Note the need to exclude NA values, which are common in government
data sets.
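The following sketch shows how such a function might be applied to subsets. The data frame toy and its Employer column are hypothetical and exist only for this illustration. The split()/sapply() combination is one possible way to cover every subset at once, not necessarily the approach used in the study.

```r
medrat <- function(dataframe) {
  return(median(dataframe$rat, na.rm = TRUE))
}

# Invented ratios; Employer is a hypothetical grouping column
toy <- data.frame(
  Employer = c("A", "A", "A", "B", "B"),
  rat = c(1.0, 1.2, NA, 0.9, 1.5),
  stringsAsFactors = FALSE
)

median(toy$rat)                      # NA: one missing value spoils the result
medrat(toy)                          # 1.1, the NA is ignored
medrat(toy[toy$Employer == "A", ])   # 1.1, median of 1.0 and 1.2

# One median per employer: split the rows by group, then apply medrat
by_employer <- sapply(split(toy, toy$Employer), medrat)
by_employer                          # A is 1.1, B is 1.2
```

Without na.rm=TRUE, a single missing ratio in a subset would make that subset's median NA.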
