untitled

(ff) #1

6.2 Vector Space Retrieval 137


term frequency, inverse document frequency(TFIDF) weighting scheme. This
scheme is by far the most commonly used term-weighting scheme informa-
tion retrieval using the vector space model.
The interpretation of documents and queries as vectors gives a versatile
and intuitive geometric approach for information retrieval. The analysis
above made a number of assumptions which generally do not hold in prac-
tice. However, the vector space model can be adjusted to some extent when
these assumptions do not hold.



  1. Documents can be decomposed into statistically independent terms.Statistical
    dependencies among terms are manifest in the vector space model as axes
    in the vector space that are not orthogonal. For example, “integrin” oc-
    curs in about 28,000 PubMed citations, “primate” occurs in about 116,000
    citations, and “fibronectin” occurs in about 23,000 citations. Given that
    PubMed has about 15 million citations, one would expect that about 200
    citations would have both “integrin” and “primate” and that about 40 ci-
    tations would have both “integrin” and “fibronectin.” In fact, there are
    about 300 citations in the former case and almost 5000 citations in the
    latter. This suggests that “integrin” and “primate” are nearly indepen-
    dent, but that “integrin” and “fibronectin” are significantly correlated.
    One can incorporate correlations into the vector space model by using
    a nonorthogonal basis. In other words, the terms are no longer geometri-
    cally at right angles to one another.

  2. Queries can be decomposed into statistically independent terms.Query terms
    may have dependencies just as document terms can be dependent. How-
    ever, queries are usually much smaller, typically involving just a few terms,
    and any dependencies can be presumed to be the same as the ones for
    documents.

  3. Queries are highly specific.In other words, the set of relevant documents is
    relatively small compared with the entire collection of documents. This
    holds when the queries are small (i.e., have very few terms), but it is less
    accurate when queries are large (e.g., when one compares documents with
    other documents). However, modern corpora (such as Medline or the
    World Wide Web) are becoming so immense, that even very large docu-
    ments are small compared with the corpus.
    The dot product has a nice geometric interpretation. If the two vectors
    have unit length, then the dot product is the cosine of the angle between the
    two vectors. For any nonzero vector v there is exactly one vector that has unit

Free download pdf