6.2 Vector Space Retrieval
term frequency, inverse document frequency (TFIDF) weighting scheme. This
scheme is by far the most commonly used term-weighting scheme in information
retrieval using the vector space model.
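To make the weighting concrete, the short Python sketch below computes TFIDF vectors for a toy corpus, assuming the common tf × log(N/df) form of the weight; the exact variant is the one defined above, and the corpus and term names here are illustrative only.

    import math
    from collections import Counter

    # Toy corpus; the documents and terms are illustrative, not real citations.
    docs = [
        ["integrin", "fibronectin", "binding", "integrin"],
        ["primate", "genome", "integrin"],
        ["fibronectin", "matrix", "binding"],
    ]

    N = len(docs)
    df = Counter()                     # document frequency of each term
    for doc in docs:
        df.update(set(doc))

    def tfidf_vector(doc):
        tf = Counter(doc)              # term frequency within the document
        # Assumed weighting: tf * log(N / df); zero for terms in every document.
        return {t: tf[t] * math.log(N / df[t]) for t in tf}

    for i, doc in enumerate(docs):
        print(i, tfidf_vector(doc))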
The interpretation of documents and queries as vectors gives a versatile
and intuitive geometric approach to information retrieval. The analysis
above made a number of assumptions that generally do not hold in practice.
However, the vector space model can be adjusted to some extent when these
assumptions fail.
- Documents can be decomposed into statistically independent terms. Statistical
dependencies among terms are manifest in the vector space model as axes
in the vector space that are not orthogonal. For example, “integrin” oc-
curs in about 28,000 PubMed citations, “primate” occurs in about 116,000
citations, and “fibronectin” occurs in about 23,000 citations. Given that
PubMed has about 15 million citations, one would expect that about 200
citations would have both “integrin” and “primate” and that about 40
citations would have both “integrin” and “fibronectin” (under independence,
the expected co-occurrence count is the product of the two individual counts
divided by the collection size; see the sketch after this list). In fact, there are
about 300 citations in the former case and almost 5000 citations in the
latter. This suggests that “integrin” and “primate” are nearly indepen-
dent, but that “integrin” and “fibronectin” are significantly correlated.
One can incorporate correlations into the vector space model by using
a nonorthogonal basis. In other words, the terms are no longer geometri-
cally at right angles to one another.
- Queries can be decomposed into statistically independent terms. Query terms
may have dependencies just as document terms can be dependent. How-
ever, queries are usually much smaller, typically involving just a few terms,
and any dependencies can be presumed to be the same as the ones for
documents.
- Queries are highly specific. In other words, the set of relevant documents is
relatively small compared with the entire collection of documents. This
holds when the queries are small (i.e., have very few terms), but it is less
accurate when queries are large (e.g., when one compares documents with
other documents). However, modern corpora (such as Medline or the
World Wide Web) are becoming so immense that even very large documents
are small compared with the corpus.
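The independence check sketched in the first item above can be carried out directly: under independence, the expected number of citations containing both terms is the product of the two individual counts divided by the collection size. The following Python sketch uses the approximate counts quoted above (the observed co-occurrence figures are likewise the approximate values from the text, not exact PubMed numbers).

    # Approximate citation counts quoted in the text.
    N = 15_000_000                      # total PubMed citations
    counts = {"integrin": 28_000, "primate": 116_000, "fibronectin": 23_000}

    # Approximate observed co-occurrence counts, also from the text.
    observed = {
        ("integrin", "primate"): 300,
        ("integrin", "fibronectin"): 5_000,
    }

    for (a, b), seen in observed.items():
        expected = counts[a] * counts[b] / N   # expected count under independence
        print(f"{a} & {b}: expected ~{expected:.0f}, observed ~{seen}, "
              f"ratio {seen / expected:.1f}")

The first pair comes out close to its expected value, while the second exceeds it by roughly two orders of magnitude, which is what motivates a nonorthogonal basis for correlated terms.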
The dot product has a nice geometric interpretation. If the two vectors
have unit length, then the dot product is the cosine of the angle between the
two vectors. For any nonzero vector v there is exactly one vector that has unit