linear and both variables are normally distributed. Correlations should always be
examined prior to more sophisticated multivariate analyses such as factor analysis or
principal component analysis. The extent of a linear relationship between two variables
may be difficult to judge from a scatterplot and a correlation coefficient provides a more
succinct summary. However, it would be unwise to attempt to calculate a correlation
when a scatterplot depicts a clear non-linear relationship. When a researcher is
interested in both the extent and the significance of a correlation then r is used in an
inferential way as an estimate of the population correlation, ρ (rho).
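The warning about non-linear relationships can be illustrated with a minimal sketch. The function name pearson_r and both small data sets below are hypothetical, invented for this illustration, and r is computed directly from the usual sum-of-products definition. In the first data set Y is completely determined by X through a curve, yet r is zero; the second data set shows a roughly linear trend.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation computed from the sum-of-products definition."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and contain at least two values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares about the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a perfect but curved (quadratic) relationship
x_curved = [-3, -2, -1, 0, 1, 2, 3]
y_curved = [v ** 2 for v in x_curved]
r_curved = pearson_r(x_curved, y_curved)   # 0.0: no linear relationship at all

# Hypothetical data: a roughly linear relationship
x_linear = [1, 2, 3, 4, 5]
y_linear = [2, 4, 5, 4, 5]
r_linear = pearson_r(x_linear, y_linear)   # about 0.77

print(f"curved: r = {r_curved:.3f}, linear: r = {r_linear:.3f}")
```

A value of r near zero therefore says only that there is no linear relationship, which is why the scatterplot should be inspected first.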
Statistical Inference and Null Hypothesis
As well as estimating the size of the population correlation we may want to test whether
it is statistically significant. In testing this hypothesis the same logic is followed as that
described in Chapter 7 when testing the significance of a nonparametric correlation. The
null hypothesis is H0: ρ=0, that is, the variable X is not linearly related to the variable Y.
The alternative hypothesis is H1: ρ≠0. The significance test asks whether any
apparent relationship between the variables X and Y could have arisen by chance. The
sampling distribution of r is not normal when the population correlation deviates from
zero and when sample sizes are small (n<30). For tests of significance r is transformed to
another statistic called Fisher’s z (which is not the same as the Z deviate for a normal
distribution).
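As a rough sketch of how the transformation is commonly used, the following assumes the standard large-sample approximation: Fisher's z is the inverse hyperbolic tangent of r, with approximate standard error 1/sqrt(n-3). The function name fisher_z_test, the example values r=0.5 and n=28, and the 1.96 critical value for a 95% interval are assumptions made for this illustration, not values taken from the text.

```python
import math

def fisher_z_test(r, n, z_crit=1.96):
    """Approximate two-sided test of H0: rho = 0 using Fisher's z transformation.

    Returns (fisher_z, test_statistic, p_value, confidence_interval_for_rho).
    Assumes the usual approximation that Fisher's z is roughly normal with
    standard error 1 / sqrt(n - 3).
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    fisher_z = math.atanh(r)            # 0.5 * ln((1 + r) / (1 - r))
    se = 1.0 / math.sqrt(n - 3)         # approximate standard error of Fisher's z
    stat = fisher_z / se                # compared with the standard normal under H0
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    # Interval built on the z scale, then transformed back to the r scale
    lower = math.tanh(fisher_z - z_crit * se)
    upper = math.tanh(fisher_z + z_crit * se)
    return fisher_z, stat, p_value, (lower, upper)

# Hypothetical example: r = 0.5 observed in a sample of n = 28 pairs
fisher_z, stat, p_value, ci = fisher_z_test(0.5, 28)
print(f"Fisher z = {fisher_z:.4f}, statistic = {stat:.3f}, p = {p_value:.4f}")
print(f"approximate 95% interval for rho: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Because the interval is constructed on the z scale and then back-transformed, it is not symmetric about r, which reflects the skewed sampling distribution of r described above.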
Assumptions
In some statistical texts for social scientists it is asserted that to use the Pearson
correlation both variables should have a normal distribution, yet other texts state that
the distributions of both variables should be symmetrical and unimodal but not
necessarily normal. These ideas cause great confusion to researchers and need to be
clarified. If the correlation statistic is to be used for descriptive purposes only, then
normality assumptions about the form of the data distributions are not necessary. The
only assumptions required are that:
- quantitative measures (interval or ratio level of measurement) are taken simultaneously
on two or more random variables;
- paired measurements for each subject are independent.
The results obtained would describe the extent to which a linear relationship would apply
to the sample data.
This same idea applies to the descriptive use of regression statistics. Should the
researcher wish to make any inference about the extent of a population linear relationship
between two variables or in a regression context to make a prediction which went beyond
the sample data, the following assumptions should be met:
- Two random variables should be linearly related, but perfect linearity is not required as
long as there is an obvious linear trend indicated by an elliptical scatter of points
without any obvious curvature (look at the scatterplot).
- The underlying probability distribution should be bivariate normal, that is, the
distribution of the variable X and the distribution of the variable Y should be normal