Test Assumptions for Regression Analysis
Inferences based on hypothesis tests for the regression slope, intercept, and overall model
fit require the assumptions listed below to be met. (The reader should also look up the
assumptions required for the Pearson correlation, r, with particular reference to the
distinction between descriptive and inferential use of regression and correlation
statistics.)
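The verification steps named in the assumptions listed below (residuals against predicted values, a normal probability plot of residuals, residuals against case number) can be sketched numerically. The following is a minimal illustration for a single explanatory variable, using only the Python standard library. The function names, the example data, and the plotting positions (i - 0.5)/n used for the normal quantiles are assumptions made for this sketch, not part of the text.

```python
from statistics import NormalDist, mean


def fit_simple_regression(x, y):
    """Least-squares intercept and slope for the line y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residual_diagnostics(x, y):
    """Return the quantities needed for the residual plots described in the text."""
    intercept, slope = fit_simple_regression(x, y)
    predicted = [intercept + slope * xi for xi in x]
    residuals = [yi - pi for yi, pi in zip(y, predicted)]
    n = len(residuals)
    # Normal probability plot coordinates: sorted residuals paired with
    # standard normal quantiles at positions (i - 0.5) / n (an assumed,
    # commonly used convention; other plotting positions exist).
    quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    normal_plot = list(zip(quantiles, sorted(residuals)))
    return {
        "intercept": intercept,
        "slope": slope,
        "predicted": predicted,   # x-axis of the constant-variance plot
        "residuals": residuals,   # in case order, for the independence plot
        "normal_plot": normal_plot,
    }


# Hypothetical data, invented purely for illustration.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
diag = residual_diagnostics(x, y)
```

Plotting `diag["residuals"]` against `diag["predicted"]` gives the constant-variance check, plotting the pairs in `diag["normal_plot"]` gives the normal probability plot, and plotting `diag["residuals"]` in list order gives the residuals-against-case-number check, provided the cases are listed in the order they were collected.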
- Data should consist of measures on at least a pair of variables, a response variable Y and
an explanatory variable X. Measurement of the response variable should be at least
theoretically continuous. (It is possible, for example, to use scores on a rating scale:
0, 1, 2, 3, ..., n.) In multiple regression, one or more of the explanatory variables may
be binary (in regression these are called dummy variables; for example, the binary
variable sex may be coded 0=male, 1=female).
- The relationship between response and explanatory variables should be approximately
linear. (Verify by plotting the response variable against each independent variable in
the model. Strong correlation is indicated by an obvious straight line trend in the
scatter of points. To check for correlations between independent variables in multiple
regression, plot pairs of independent variables against each other. The computed
correlation also indicates the strength of any linear relationship; see section 8.3.)
- The error term in the regression model, ε, should have a normal probability distribution.
The residuals in a regression analysis represent the sample estimates of the error
terms. These should have a mean of zero and constant variance (this is called
homoscedasticity). Note that neither the response variable nor the explanatory variables
are required to have a normal distribution; it is the fitted residuals that should be
normal. (Verify the normality assumption by doing a normal probability plot of
residuals. The distribution of residuals only provides an indication of the underlying
error distribution in the population and may be unreliable with small sample sizes.
Interpret the normal probability plot in the same way as described in Chapter 5
section 5.5 ‘Checking for Normality’.
Verify the assumption of constant variance by plotting residuals against predicted
values. A random scatter of points about the mean of zero indicates constant
variance and satisfies this assumption. A funnel shaped pattern indicates
nonconstant variance. Outlier observations are easily spotted on this plot.)
- The error terms (residuals) associated with pairs of Y and X variables should be
independent. (Verify by checking that each pair of measurements comes from a
different independent subject, i.e., no repeated measures on the same subject.
If data are collected over time, there may be a time series (trend) in the data: data
points close in time may be more highly correlated and are therefore not
independent. Verify by plotting residuals against case number (ID).)
- The model should be adequate and correctly specified. Strictly, this is not an assumption
but part of the diagnostic procedure for checking model fit. (Verify model fit and the
possible requirement for more terms in the model, such as a quadratic term (the value
of an independent variable squared) or more variables by using an overlay plot of
predicted values vs. values of the independent variable (this gives the linear fitted