These two models tend to find totally different types of semantic relatedness. Syntax-based models have proved to be most accurate and efficient in the calculation of semantic similarity (Lin 1998, Padó and Lapata 2007, Peirsman, Heylen, and Speelman 2007). They generally model paradigmatic relations between words, like synonymy, hyponymy or hyperonymy. Document-based models are better geared towards the modeling of syntagmatic relations, as between doctor and hospital or car and drive (Sahlgren 2006).
In order to make this more concrete, let us take a look at the workings of a small syntax-based model. Suppose we have at our disposal a large, syntactically analyzed corpus of English, which contains the following sentences:
Every day, he has a glass of red wine before he goes to bed.
I drank too much wine yesterday.
I brought home twenty bottles of red wine from France.
Men drink more beer than women.
I gave him twelve bottles of Belgian beer for his birthday.
He has sworn to never drink beer again.
She bought a new car last year.
My mum prefers red cars to blue ones.
I parked my car a few blocks from your flat.
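To preview the procedure described below, here is a minimal Python sketch of such a model, built over the nine sentences above. It is an illustration under explicit assumptions, not the setup of the studies cited earlier: the syntactic relations are annotated by hand rather than by a parser, the feature labels (obj_of_drink for 'occurs as the direct object of drink', mod_by_red for 'is modified by red', and so on) are our own invented names, and cosine is only one possible similarity measure, so the counts need not match the table below.

    from collections import Counter
    from math import sqrt

    # Hand-annotated (target word, syntactic feature) pairs, read off
    # the nine sentences above. The feature labels are hypothetical.
    observations = [
        ("wine", "obj_of_have"),     # he has a glass of red wine
        ("wine", "mod_by_red"),
        ("wine", "obj_of_drink"),    # I drank too much wine
        ("wine", "of_bottle"),       # twenty bottles of red wine
        ("wine", "mod_by_red"),
        ("beer", "obj_of_drink"),    # men drink more beer than women
        ("beer", "of_bottle"),       # twelve bottles of Belgian beer
        ("beer", "mod_by_Belgian"),
        ("beer", "obj_of_drink"),    # sworn to never drink beer again
        ("car", "obj_of_buy"),       # she bought a new car
        ("car", "mod_by_new"),
        ("car", "obj_of_prefer"),    # my mum prefers red cars
        ("car", "mod_by_red"),
        ("car", "obj_of_park"),      # I parked my car
    ]

    # The contextual features are the dimensions of the word space.
    features = sorted({f for _, f in observations})

    def context_vector(target):
        # Count how often the target occurs with each feature.
        counts = Counter(f for t, f in observations if t == target)
        return [counts[f] for f in features]

    def cosine(v, w):
        # Cosine of the angle between two frequency vectors.
        dot = sum(a * b for a, b in zip(v, w))
        norms = sqrt(sum(a * a for a in v)) * sqrt(sum(b * b for b in w))
        return dot / norms if norms else 0.0

    vectors = {t: context_vector(t) for t in ("wine", "beer", "car")}
    print(round(cosine(vectors["wine"], vectors["beer"]), 2))  # 0.46
    print(round(cosine(vectors["wine"], vectors["car"]), 2))   # 0.34
    print(round(cosine(vectors["beer"], vectors["car"]), 2))   # 0.0

As desired, wine and beer come out as more similar to each other than either of them is to car. On realistic corpora the annotation would come from an automatic parser, the vectors would have many thousands of dimensions, and raw frequencies would typically be replaced by association weights such as pointwise mutual information.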
Say now we are interested in the semantic relatedness between the target words wine, beer and car. Obviously, we would like to find that beer and wine are more semantically related to each other than to car. This can be done by studying the behavior of each target word with respect to a number of pre-defined contextual features. In theory, contextual features can be any characteristic of the context that may be relevant to the meaning of the target word. Semantically 'empty' words like have or my are therefore often ignored. For instance, we might count how often our target words appear in a specific syntactic relation, such as the direct object of the verb drink. In this way we will mostly find paradigmatic relationships. We store our figures in long lists of frequencies – one for each target word. These 'lists' are referred to as context vectors, and the contextual features are their dimensions. For our three target words and nine syntactic features, our toy corpus gives the following context vectors: