The Art of R Programming

5.2.5 Extended Example: A Salary Study

In a study of engineers and programmers, I considered the question, “How many of these workers are the best and the brightest—that is, people of extraordinary ability?” (Some of the details have been changed here.) The government data I had available was limited. One (admittedly imperfect) way to determine whether a worker is of extraordinary ability is to look at the ratio of actual salary to the government prevailing wage for that job and location. If that ratio is substantially higher than 1.0, you can reasonably assume that this worker has a high level of talent. I used R to prepare and analyze the data and will present excerpts of my preparation code here. First, I read in the data file:

all2006 <- read.csv("2006.csv",header=TRUE,as.is=TRUE)

The function read.csv() is essentially identical to read.table() except that the input data is in the CSV format exported by spreadsheets, which is the way the data set was prepared by the US Department of Labor (DOL). The as.is argument is the negation of stringsAsFactors, which you saw earlier in Section 5.1. So, setting as.is to TRUE here is simply an alternate way to achieve stringsAsFactors=FALSE. At this point, I had a data frame, all2006, consisting of all the data for the year 2006. I then did some filtering:
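As a quick check of the as.is/stringsAsFactors relationship, here is a minimal sketch on a made-up two-row CSV. The inline text and the column names name and job are illustrative assumptions and are not part of the DOL file.

```r
# Hypothetical inline CSV standing in for a real file such as 2006.csv
csv_text <- "name,job\nAnn,Engineer\nBob,Programmer"

# as.is=TRUE keeps character columns as character vectors
d1 <- read.csv(text = csv_text, header = TRUE, as.is = TRUE)

# stringsAsFactors=FALSE should give the same result
d2 <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

# stringsAsFactors=TRUE converts character columns to factors
d3 <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

print(class(d1$job))
print(class(d3$job))
print(identical(d1, d2))
```

If the two arguments really are negations of each other, d1 and d2 should come out identical, while d3 holds factors instead of character strings.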

all2006 <- all2006[all2006$Wage_Per=="Year",] # exclude hourly-wagers
all2006 <- all2006[all2006$Wage_Offered_From > 20000,] # exclude weird cases
all2006 <- all2006[all2006$Prevailing_Wage_Amount > 200,] # exclude hrly prv wg
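To see what these three filters do, the following sketch applies them to a small invented data frame. The column names match the ones used above, but the four rows of values are made up for illustration only.

```r
# Toy stand-in for all2006; the values are hypothetical
toy <- data.frame(
  Wage_Per = c("Year", "Hour", "Year", "Year"),
  Wage_Offered_From = c(60000, 25, 15000, 85000),
  Prevailing_Wage_Amount = c(55000, 20, 14000, 30),
  stringsAsFactors = FALSE
)

# Each step keeps only the rows where the logical condition is TRUE
toy <- toy[toy$Wage_Per == "Year", ]              # drops the hourly row
toy <- toy[toy$Wage_Offered_From > 20000, ]       # drops the 15000 offer
toy <- toy[toy$Prevailing_Wage_Amount > 200, ]    # drops the hourly prevailing wage

print(toy)
```

Only the first row passes all three conditions. Note that with this indexing style, an NA in the tested column produces a row of NAs rather than dropping the row, which is one reason later steps need to handle NA values explicitly.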

These operations are typical data cleaning. Most large data sets contain some outlandish values: some are obvious errors, others use different measurement systems, and so on. I needed to remedy this situation before doing any analysis. I also needed to create a new column for the ratio between actual wage and prevailing wage:

all2006$rat <- all2006$Wage_Offered_From / all2006$Prevailing_Wage_Amount

Since I knew I would be calculating the median in this new column for many subsets of the data, I defined a function to do the work:

medrat <- function(dataframe) {
   return(median(dataframe$rat,na.rm=TRUE))
}

Note the need to exclude NA values, which are common in government data sets.
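The effect of na.rm=TRUE can be seen in a small sketch. The data frame below is invented for illustration; only the column names and medrat() come from the text above.

```r
# Toy data with one missing offered wage; the values are hypothetical
toy <- data.frame(
  Wage_Offered_From = c(60000, 90000, NA, 50000),
  Prevailing_Wage_Amount = c(50000, 60000, 70000, 50000)
)

# Ratio of actual to prevailing wage; the NA propagates into rat
toy$rat <- toy$Wage_Offered_From / toy$Prevailing_Wage_Amount

medrat <- function(dataframe) {
  return(median(dataframe$rat, na.rm = TRUE))
}

plain_median <- median(toy$rat)  # NA, because one ratio is missing
clean_median <- medrat(toy)      # median of the three non-missing ratios

print(plain_median)
print(clean_median)
```

Without na.rm=TRUE, a single missing ratio makes the whole median NA. With it, the median is taken over the remaining values only.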

