Basic Statistics

12.5.2 Effect of Outliers from the Regression Line

An outlier is an observation that deviates appreciably from the other observations in
the sample. When a single variable X is measured, we often consider any observation
that is a long way from the median or mean to be a possible outlier. Outliers can be
caused by errors in measurement or in recording data or they may simply be unusual
values. For example, a weight of 400 lb may be a typing error or may be the weight
of an unusually heavy person.
Concern about outliers in regression analysis arises because they may have a large
effect on the estimates of a and b and consequently affect the fit of the line to the
majority of the points. In linear regression, one of the best tools for finding outliers
is to examine the scatter diagram. If one or two outliers result in a line not fitting the
other points, it is often advisable to check each outlier and consider removing it from
the analysis.
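
The effect of a single outlier on the estimates can be illustrated with a short sketch. It assumes the fitted line is written Y = a + bX and uses the usual least-squares formulas; the data values and the helper name fit_line are made up for illustration and do not come from the text.

```python
def fit_line(x, y):
    """Least-squares estimates (a, b) for the line Y = a + b*X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data: five points lying exactly on Y = 1 + 2X.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]
a_clean, b_clean = fit_line(x, y)

# The same data plus one made-up outlying point at (10, 0).
a_out, b_out = fit_line(x + [10], y + [0])

print("without outlier: a = %.3f, b = %.3f" % (a_clean, b_clean))
print("with outlier:    a = %.3f, b = %.3f" % (a_out, b_out))
```

In this constructed example the one added point is enough to turn the positive slope negative, so the fitted line no longer describes the other five points.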
Regression outliers have been classified as outliers in X, outliers in Y, and points
that have a large effect on the slope of the line, often called influential points. Outliers
in Y are located well above or well below the regression line. Many statistical
programs print out the residuals from the line; outliers in Y may be detected by their
large residuals.
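
As a rough sketch of how residuals point to an outlier in Y, the following computes each residual (observed Y minus the value on the fitted line) and reports the largest in absolute value. The data and the helper fit_line are hypothetical; a statistical package would normally print the residuals directly.

```python
def fit_line(x, y):
    """Least-squares estimates (a, b) for the line Y = a + b*X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Hypothetical data: points near Y = 1 + 2X, except the fourth Y value,
# which is far too high.
x = [1, 2, 3, 4, 5, 6, 7]
y = [3.1, 4.9, 7.2, 20.0, 10.8, 13.1, 15.0]

a, b = fit_line(x, y)

# Residual = observed Y minus the Y predicted by the fitted line.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Index of the observation with the largest absolute residual.
largest = max(range(len(x)), key=lambda i: abs(residuals[i]))
print("largest residual at x =", x[largest], "residual =", round(residuals[largest], 2))
```

How large a residual must be before the point deserves checking is a judgment call that depends on the spread of the other residuals.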
Outliers in X are values that are far away from the mean of X. Outliers in X possess the
potential for having a large effect on the regression line. If a point is an outlier in
X and is appreciably above or below the line, it is called an influential value since
it can have a major effect on the slope and intercept. In general, a point that is an
outlier in both X and Y tends to cause more problems in the fit of the line. For further
discussion and illustrations, see Fox and Long [1990], Fox [1991], Chatterjee and
Hadi [1988], or Afifi et al. [2004].
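
One informal way to see which points are influential is to refit the line with each point left out in turn and compare the slopes. This leave-one-out check is an illustrative assumption, not a method described in the text, and the data and function names below are hypothetical.

```python
def fit_line(x, y):
    """Least-squares estimates (a, b) for the line Y = a + b*X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def slope_changes(x, y):
    """Change in the slope b when each point is left out in turn."""
    _, b_all = fit_line(x, y)
    changes = []
    for i in range(len(x)):
        _, b_rest = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        changes.append(b_rest - b_all)
    return changes


# Hypothetical data. The point at index 3 is an outlier in Y only
# (its X is close to the mean of X). The point at index 6 is an outlier
# in X and also lies well below the trend of the other points.
x = [1, 2, 3, 4, 5, 6, 15]
y = [3, 5, 7, 15, 11, 13, 10]

changes = slope_changes(x, y)
most_influential = max(range(len(x)), key=lambda i: abs(changes[i]))
for xi, yi, ch in zip(x, y, changes):
    print("drop (%d, %d): slope changes by %+.3f" % (xi, yi, ch))
```

In this made-up data set, dropping the point that is an outlier in both X and Y changes the slope far more than dropping the point that is an outlier in Y alone.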

12.5.3 Multiple Regression

In multiple regression one has a single dependent variable Y and several independent
X variables. For example, suppose that one wanted to measure the effect of both
age and weight in combination on systolic blood pressure. When there are two
or more independent X variables, multiple regression is used. Statistical packages
are always used since the computations are more extensive. Options are available
in these programs to assist the user in deciding which X variables to include in
the regression analysis. Many excellent texts on regression analysis are available
for readers who want a fuller treatment.
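
To make the age and weight example concrete, here is a minimal sketch that fits Y = a + b1*X1 + b2*X2 by solving the least-squares normal equations in plain Python. The data are invented (systolic pressure is constructed exactly from a known equation so the result can be checked), and the function names solve and fit_multiple are hypothetical. In practice a statistical package would do this computation.

```python
def solve(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def fit_multiple(xs, y):
    """Least-squares coefficients [a, b1, b2, ...] for Y = a + b1*X1 + b2*X2 + ...

    xs holds one row of X values per observation.
    """
    rows = [[1.0] + list(r) for r in xs]  # leading 1 carries the intercept a
    p = len(rows[0])
    # Normal equations: (X'X) coef = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


# Made-up data, not from the text: age in years and weight in lb.
ages = [35, 42, 50, 58, 63, 47, 55, 39]
weights = [150, 180, 165, 200, 170, 190, 160, 175]
# Systolic pressure built exactly as 60 + 0.8*age + 0.2*weight.
sbp = [60 + 0.8 * age + 0.2 * wt for age, wt in zip(ages, weights)]

coef = fit_multiple(list(zip(ages, weights)), sbp)
print("a = %.3f, b_age = %.3f, b_weight = %.3f" % tuple(coef))
```

Because the Y values were generated without error, the fit should recover the constructing coefficients up to rounding; real data would leave nonzero residuals.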

12.1 The following rates are for all deaths from firearms per 100,000 persons in the
United States for the years 1985-1995. The information was taken from the
National Center for Health Statistics, Health, United States 1996-97 and Injury