Single link (or nearest neighbour) method: $d(C_i, C_j)$ is the smallest distance
between the observations of the two clusters
($d(C_i, C_j) = \min d(\vec{w}_i, \vec{w}_j)$, $\forall \vec{w}_i \in C_i$, $\forall \vec{w}_j \in C_j$).
Complete link (or furthest neighbour) method: $d(C_i, C_j)$ is the largest distance
between the observations of the two clusters
($d(C_i, C_j) = \max d(\vec{w}_i, \vec{w}_j)$, $\forall \vec{w}_i \in C_i$, $\forall \vec{w}_j \in C_j$).
Centroid method: $d(C_i, C_j)$ is the distance between the centroids of the two clusters
($d(C_i, C_j) = d(c_i, c_j)$).
Average link (or unweighted pair-group average) method: $d(C_i, C_j)$ is calculated
as the average distance between all pairs of observations in the two clusters
($d(C_i, C_j) = \mathrm{mean}\{ d(\vec{w}_i, \vec{w}_j) \}$, $\forall \vec{w}_i \in C_i$, $\forall \vec{w}_j \in C_j$).
Ward method: this takes into account, within each group, the dispersion of the
observations in relation to the centroid
($E_p = \sum_i d^2(\vec{w}_i, c_p)$, $\forall \vec{w}_i \in C_p$). The
clusters ($C_p$ and $C_q$) are joined, from step (3), if $E_{(p,q)} - E_p - E_q$ is minimum,
where $E_{(p,q)}$ is the dispersion of the cluster that results from merging them.
In general, this method is regarded as very efficient, although it tends to create
small clusters.
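The five linkage rules above can be compared on a small example. The following is a minimal sketch, assuming the Euclidean distance for $d$; the function names and the two toy clusters are hypothetical and are not taken from the chapter's data.

```python
from itertools import product
from math import dist  # Euclidean distance (Python 3.8+); an assumed choice of d


def centroid(cluster):
    """Component-wise mean of the observations in a cluster."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))


def dispersion(cluster):
    """E_p: sum of squared distances from each observation to the centroid."""
    c = centroid(cluster)
    return sum(dist(w, c) ** 2 for w in cluster)


def single_link(ci, cj):
    """Smallest distance over all pairs of observations."""
    return min(dist(wi, wj) for wi, wj in product(ci, cj))


def complete_link(ci, cj):
    """Largest distance over all pairs of observations."""
    return max(dist(wi, wj) for wi, wj in product(ci, cj))


def centroid_link(ci, cj):
    """Distance between the two centroids."""
    return dist(centroid(ci), centroid(cj))


def average_link(ci, cj):
    """Mean distance over all pairs of observations."""
    total = sum(dist(wi, wj) for wi, wj in product(ci, cj))
    return total / (len(ci) * len(cj))


def ward_cost(cp, cq):
    """Increase in dispersion caused by the merge: E_(p,q) - E_p - E_q."""
    return dispersion(cp + cq) - dispersion(cp) - dispersion(cq)


# Hypothetical toy clusters described by two variables.
c1 = [(0.0, 0.0), (0.0, 2.0)]
c2 = [(3.0, 0.0), (3.0, 2.0)]

for rule in (single_link, complete_link, centroid_link, average_link, ward_cost):
    print(f"{rule.__name__:14s} {rule(c1, c2):.4f}")
```

At each step of the agglomeration, the pair of clusters with the smallest value of the chosen rule is the one that is merged. Note that the Ward criterion is expressed in squared-distance units, so its values are not directly comparable with those of the other four rules.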
Before applying these hierarchical methods, the data matrix is usually standardised
to give equal importance to all variables. The sequence of steps of the algorithm
is illustrated graphically in the dendrogram, in which the groups obtained can be
observed.
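As an illustration of these two points, the sketch below standardises a small data matrix and then records the sequence of merges that a dendrogram would display. It is only a sketch under stated assumptions: the single link rule and the Euclidean distance are arbitrary choices made here for brevity, and the function names and the four observations are hypothetical.

```python
from math import dist
from statistics import mean, stdev


def standardise(rows):
    """Scale each column to zero mean and unit (sample) standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [tuple((x - m) / s for x, m, s in zip(row, mus, sds)) for row in rows]


def agglomerate(rows):
    """Naive single-link agglomeration.

    Returns the merges in order as (members_a, members_b, height), which is
    the information a dendrogram draws.
    """
    clusters = [[i] for i in range(len(rows))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(rows[i], rows[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Hypothetical observations: the second variable has a much larger scale.
data = [(1.0, 100.0), (1.2, 110.0), (5.0, 900.0), (5.1, 950.0)]
z = standardise(data)
merges = agglomerate(z)
for left, right, height in merges:
    print(left, "+", right, f"at height {height:.3f}")
```

Without the standardisation step, the second variable would dominate every distance simply because of its units.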
In the case of clustering of variables, the algorithm is similar, using one minus
the correlation coefficient to measure the distance between variables.
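A minimal sketch of this distance between variables is given below, assuming the Pearson correlation coefficient; the helper names and the three toy variables are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def variable_distances(columns):
    """Distance 1 - r for every pair of variables in a {name: values} mapping."""
    names = list(columns)
    return {
        (u, v): 1.0 - pearson(columns[u], columns[v])
        for i, u in enumerate(names)
        for v in names[i + 1:]
    }


# Hypothetical variables measured on the same four samples.
variables = {
    "X1": [1.0, 2.0, 3.0, 4.0],
    "X2": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with X1
    "X3": [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with X1
}
d = variable_distances(variables)
for pair, value in d.items():
    print(pair, round(value, 3))
```

With this definition the distance ranges from 0 (perfect positive correlation) to 2 (perfect negative correlation), so strongly anti-correlated variables are placed far apart.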
Applications
We have used CA to discover natural groupings of the wine samples and to obtain
a preliminary view of the main source of variation among them (Moreno-Arribas
et al. 1998, 1999; Pozo-Bayón et al. 2003b, 2005; Marcobal et al. 2005; Hernández
et al. 2006; Alcaide-Hidalgo et al. 2007). As an example, the application of cluster
analysis to the 10 volatile compounds analyzed in 16 varietal wines (Pozo-Bayón
et al. 2001) produces the dendrogram shown in Fig. 13.5, obtained with the
STATISTICA program (Cluster Analysis procedure in the Multivariate Exploratory
Techniques module). The squared Euclidean distance was taken as the measure of
proximity between two samples, and Ward's method was used as the linkage rule. The
variables were previously standardized. Two groups are observed, one comprising the
wines of the red Monastrell and Trepat varieties, and the other formed by the white
varieties Airén and Malvar. Within these two groups, the wines are grouped according
to variety and, in turn, the wines of each variety are grouped by year of harvest.
It can therefore be observed that the greatest cause of variation among the samples
is the variety, followed by the harvest.
13.3.3 Multivariate Statistical Supervised Techniques
To apply these techniques, we have $k$ groups with observations in the same $p$
variables ($X_1, X_2, \ldots, X_p$), from $k$ populations $W_i$, with mean vectors
$\vec{\mu}_i$ and covariance matrices $\Sigma_i$: