50 Mathematical Ideas You Really Need to Know


36 Connecting data


How are two sets of data connected? Statisticians of a hundred years ago thought they
had the answer. Correlation and regression go together like a horse and carriage, but
like this pairing, they are different and have their own jobs to do. Correlation measures
how well two quantities such as weight and height are related to each other.
Regression can be used to predict the values of one property (say weight) from the
other (in this case, height).


Pearson’s correlation


The term correlation was introduced by Francis Galton in the 1880s. He
originally termed it ‘co-relation’, a better word for explaining its meaning. Galton,
a Victorian gentleman of science, had a desire to measure everything and applied
correlation to his investigations into pairs of variables: the wing length and tail
length of birds, for instance. The Pearson correlation coefficient, named after
Galton’s biographer and protégé Karl Pearson, is measured on a scale between
minus one and plus one. If its numerical value is high, say +0.9, there is said to
be a strong correlation between the variables. The correlation coefficient
measures the tendency for data to lie along a straight line. If the coefficient is
near zero, the correlation is practically non-existent.
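
As a rough illustration of how such a coefficient might be computed, here is a minimal Python sketch using the standard textbook formula (the sum of products of deviations from the means, divided by the square root of the product of the summed squared deviations). The function name `pearson_r` and the height and weight figures are invented for this example and do not come from any real survey.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Deviations of each observation from its own mean.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]

    # Sum of products of deviations, and sums of squared deviations.
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_xx = sum(a * a for a in dx)
    sum_yy = sum(b * b for b in dy)

    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("correlation is undefined when one variable is constant")

    return sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical measurements (illustrative only): heights in cm, weights in kg.
heights_cm = [150, 160, 165, 170, 180, 185]
weights_kg = [52, 58, 63, 66, 77, 80]

r = pearson_r(heights_cm, weights_kg)
print(f"height vs weight: r = {r:.3f}")

# Points on a V shape have no straight-line tendency, so r comes out at zero
# even though y clearly depends on x.
print(f"V-shaped data:    r = {pearson_r([-1, 0, 1], [1, 0, 1]):.3f}")
```

The invented height and weight figures rise together almost in a straight line, so the coefficient comes out close to +1. The V-shaped points give zero, a reminder that the coefficient only detects straight-line relationships.
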
We frequently wish to work out the correlation between two variables to see
how strongly they are connected. Let’s take the example of the sales of
sunglasses and see how this relates to the sales of ice creams. San Francisco
would be a good place in which to conduct our study and we shall gather data
each month in that city. If we plot points on a graph where the x (horizontal)
coordinate represents sales of sunglasses and the y (vertical) coordinate gives the
sales of ice creams, each month we will have a data point (x, y) representing
both pieces of data. For example, with sales measured in units of $10,000, the
point (3, 4) could mean the May sales of sunglasses were $30,000 while sales of
ice creams in the city were $40,000 in that same month. We can plot the monthly
data points (x, y) for a whole year on
a scatter diagram. For this example, the value of the Pearson correlation
coefficient would be around +0.9 indicating a strong correlation. The data has a
