Statistical Analysis for Education and Psychology Researchers

Test Assumptions for Regression Analysis

Inferences based on hypothesis tests for the regression slope, intercept, and overall model
fit require the assumptions listed below to be met. (The reader should also look up the
assumptions required for the Pearson correlation, r, with particular reference to the
distinction between descriptive and inferential use of regression and correlation
statistics.)



  • Data should consist of measures on at least a pair of variables, a response variable Y and
    an explanatory variable X. Measurement of the response variable should be at least
    theoretically continuous (it is possible, for example, to use scores on a rating scale:
    0, 1, 2, 3...n). In multiple regression one or more of the explanatory variables may
    be binary (in regression these are called dummy variables; for example, the binary
    variable sex may be coded 0=male, 1=female).

  • The relationship between response and explanatory variables should be approximately
    linear. (Verify by plotting the response variable against each independent variable in
    the model. Strong correlation is indicated by an obvious straight-line trend in the
    scatter of points. To check for correlations between independent variables in multiple
    regression, plot pairs of independent variables against each other. The computed
    correlation also indicates the strength of any linear relationship; see section 8.3.)

  • The error term in the regression model, ε, should have a normal probability distribution.
    The residuals in a regression analysis represent the sample estimates of the error
    terms. These should have a mean of zero and constant variance (this is called
    homoscedasticity). Note that neither the response variable nor the explanatory variables
    are required to have a normal distribution; it is the residuals that should be
    normal. (Verify the normality assumption by doing a normal probability plot of
    residuals. The distribution of residuals only provides an indication of the underlying
    error distribution in the population and may be unreliable with small sample sizes.
    Interpret the normal probability plot in the same way as described in Chapter 5
    section 5.5 ‘Checking for Normality’.
    Verify the assumption of constant variance by plotting residuals against predicted
    values. A random scatter of points about the mean of zero indicates constant
    variance and satisfies this assumption. A funnel shaped pattern indicates
    nonconstant variance. Outlier observations are easily spotted on this plot.)

  • The error terms (residuals) associated with pairs of Y and X variables should be
    independent. (Verify by checking that each pair of measurements comes from a
    different, independent subject, i.e., that there are no repeated measures on the same
    subject. If data are collected over time there may be a time series (trend) in the
    data: data points close in time may be more highly correlated and are certainly not
    independent. Check for this by plotting residuals against case number (ID).)

  • The model should be adequate and correctly specified. Strictly, this is not an assumption
    but part of the diagnostic procedure for checking model fit. (Verify model fit and the
    possible requirement for more terms in the model, such as a quadratic term (the value
    of an independent variable squared) or more variables by using an overlay plot of
    predicted values vs. values of the independent variable (this gives the linear fitted
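
The residual checks described in the list above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the data are invented for the example, the function and variable names (fit_line, pearson_r, normal_plot_points and so on) are hypothetical, and the plotting positions used for the normal probability plot are one common convention rather than the one the text necessarily has in mind. The correlation of the normal-plot points (r_qq) is a numeric summary added here for convenience; the text itself asks only for the plot.

```python
from statistics import NormalDist, mean

# Hypothetical illustrative data (not from the text): one explanatory
# variable x and one response variable y, listed in case (ID) order.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    saa = sum((ai - ma) ** 2 for ai in a)
    sbb = sum((bi - mb) ** 2 for bi in b)
    sab = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    return sab / (saa * sbb) ** 0.5


def normal_plot_points(residuals):
    """Pairs (theoretical normal quantile, ordered residual) for a normal plot."""
    n = len(residuals)
    # Plotting positions (i - 0.375) / (n + 0.25): an assumed, common convention.
    quantiles = [
        NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)
    ]
    return list(zip(quantiles, sorted(residuals)))


slope, intercept = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Linearity: correlation between response and explanatory variable.
r_xy = pearson_r(x, y)

# Normality: points for a normal probability plot of the residuals, plus
# their correlation (closer to 1 means closer to a straight line).
qq = normal_plot_points(residuals)
r_qq = pearson_r([q for q, _ in qq], [e for _, e in qq])

# Constant variance: points for a plot of residuals against predicted values.
resid_vs_fitted = list(zip(fitted, residuals))

# Independence: points for a plot of residuals against case number (ID).
resid_vs_case = list(enumerate(residuals, start=1))

print(f"slope={slope:.4f} intercept={intercept:.4f}")
print(f"mean residual={mean(residuals):.2e} r_xy={r_xy:.4f} r_qq={r_qq:.4f}")
```

The lists qq, resid_vs_fitted and resid_vs_case hold the points for the three plots the text recommends and can be passed to any plotting tool. With only ten cases, as here, the plots give at best a rough indication, in line with the text's caution about small samples.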

