Statistical Methods for Psychology

The first step in examining the data has already been carried out in Figure 15.1 with graphical presentations of important variables. At that point we noted that most of the variables were fairly messy, with the percentage of students taking the SAT being decidedly bimodal. SAT scores were also somewhat bimodal, and much of that can probably be related to the bimodal nature of PctSAT. For reasons that will become clear shortly, we used the log of PctSAT rather than PctSAT itself. This at least had the effect of reducing the curvilinear relationship between the SAT scores and the percentage of students in each state taking the SAT.

None of our variables had extreme outliers, especially after we used a log transformation of PctSAT. The fact that we don’t have more outliers when we look at the variables individually does not necessarily mean that all is well. There is still the possibility of having multivariate outliers. A case might seem to have reasonable scores on each of the variables taken separately but have an unusual combination of scores on two or more variables. For example, it is not uncommon to be 6 feet tall, nor is it uncommon to weigh 125 pounds. But it clearly would be unusual to be 6 feet tall and weigh 125 pounds.

Having temporarily satisfied ourselves that the data set does not contain unreasonable data points and that the distributions are not seriously distorted, a useful second step is to conduct a preliminary regression analysis using all the variables, as we have done. I say “preliminary” because the point here is to use that analysis to examine the data rather than as an end in itself.

Instead of jumping directly into the educational expenditure data set, we will first investigate diagnostic tools with a smaller data set created to illustrate the use of those tools. These data are shown below and are plotted in Figure 15.5.

X:   1   1   3   3   3   4   5   5   7   6  10  13
Y:   1   2   3   5   7   6   8  10  10   5   4  14
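The earlier point about multivariate outliers (each score reasonable on its own, the combination unusual) can be checked numerically. The sketch below uses the squared Mahalanobis distance, a technique this passage does not name and that is brought in here only for illustration. It assumes the flattened data listing reads as 12 (X, Y) pairs; that pairing is an interpretation of the listing, and the function name `mahalanobis_sq` is hypothetical.

```python
import numpy as np

# ASSUMPTION: the data listing is read as 12 (X, Y) pairs in this order.
x = np.array([1, 1, 3, 3, 3, 4, 5, 5, 7, 6, 10, 13], dtype=float)
y = np.array([1, 2, 3, 5, 7, 6, 8, 10, 10, 5, 4, 14], dtype=float)


def mahalanobis_sq(data):
    """Squared Mahalanobis distance of each row from the column means."""
    centered = data - data.mean(axis=0)
    cov = np.cov(data, rowvar=False)   # sample covariance (n - 1 divisor)
    inv_cov = np.linalg.inv(cov)
    # Row-wise quadratic form: d_i^2 = c_i' S^{-1} c_i
    return np.einsum("ij,jk,ik->i", centered, inv_cov, centered)


data = np.column_stack([x, y])
d2 = mahalanobis_sq(data)

# Univariate z scores, for comparison with the joint distance.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

for (xi, yi), (zx, zy), di in zip(data, z, d2):
    print(f"X={xi:4.0f}  Y={yi:4.0f}  zX={zx:6.2f}  zY={zy:6.2f}  D^2={di:5.2f}")
```

Under this reading of the data, the case with X = 10 and Y = 4 has the largest joint distance even though neither of its scores is the most extreme value on its own variable, which is the pattern the height and weight example describes.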

The three primary classes of diagnostic statistics, each of which is represented in Figure 15.5, are

1. Distance, which is useful in identifying potential outliers in the dependent variable (Y).
2. Leverage ($h_i$), which is useful in identifying potential outliers in the independent variables ($X_1, X_2, \dots, X_p$).
3. Influence, which combines distance and leverage to identify unusually influential observations. An observation is influential if the location of the regression surface would change markedly depending on the presence or absence of that observation.

Our most common measure of distance is the residual ($Y_i - \hat{Y}_i$). It measures the vertical distance between any point and the regression line. Points A and C in Figure 15.5 have

540 Chapter 15 Multiple Regression

Figure 15.5 Scatterplot of Y on X (both axes run from 0 to 15; points A, B, and C are labeled)
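The three classes of diagnostics (distance, leverage, influence) can be sketched for the small data set as follows. This is a minimal illustration under two assumptions: the data listing is read as 12 (X, Y) pairs, and influence is measured with Cook's D, a common influence statistic that this passage does not itself name. The function name `regression_diagnostics` is hypothetical.

```python
import numpy as np

# ASSUMPTION: the data listing is read as 12 (X, Y) pairs in this order.
x = np.array([1, 1, 3, 3, 3, 4, 5, 5, 7, 6, 10, 13], dtype=float)
y = np.array([1, 2, 3, 5, 7, 6, 8, 10, 10, 5, 4, 14], dtype=float)


def regression_diagnostics(x, y):
    """Return coefficients, residuals, leverage, and Cook's D for Y on X."""
    n = len(x)
    X = np.column_stack([np.ones(n), x])        # design matrix with intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Distance: residual Y_i - Yhat_i
    resid = y - X @ beta

    # Leverage: diagonal of the hat matrix H = X (X'X)^{-1} X'
    hat = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(hat)

    # Influence (ASSUMPTION: Cook's D), combining residual and leverage
    p = X.shape[1]                              # parameters, incl. intercept
    mse = resid @ resid / (n - p)
    cooks_d = resid**2 * h / (p * mse * (1 - h) ** 2)
    return beta, resid, h, cooks_d


beta, resid, h, cooks_d = regression_diagnostics(x, y)

print(f"intercept = {beta[0]:.3f}, slope = {beta[1]:.3f}")
for xi, yi, e, hi, d in zip(x, y, resid, h, cooks_d):
    print(f"X={xi:4.0f}  Y={yi:4.0f}  resid={e:6.2f}  h={hi:5.3f}  D={d:5.3f}")
```

Reading the three columns side by side shows how the classes differ: a case can have a large residual with modest leverage, high leverage with a small residual, or both at once, and only the last combination produces a large influence value.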

