
6.2 Vector Space Retrieval


It is difficult to map the inflected forms of an English word to a single
concept because inflection is highly irregular and ambiguous.


  1. The vector space model treats the document as just a collection of
    unconnected and unrelated terms. There is no meaning beyond the terms
    themselves.

  2. It presumes that the terms are statistically independent, both in the
    collection as a whole and in each document. The vector space model in
    general allows for terms that are correlated, but it is computationally
    difficult even to find correlations between pairs of terms, let alone sets
    of three or more terms, so very few retrieval engines attempt to find or
    to make use of such correlations.

  3. By focusing exclusively on terms, it cannot take advantage of document
    structure. Webpages and XML documents have a hierarchical structure
    whose elements are tagged. XML document elements are especially
    meaningful, but none of this meaning is expressible in the vector space
    model.

  4. By treating documents as independent entities, the vector space model
    cannot take advantage of interdocument links such as the citations that
    occur in scientific research papers and the hypertext links that occur in
    webpages.
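
The first limitation above can be illustrated with a minimal sketch. It assumes a simple whitespace tokenizer and raw term counts; the helper name term_vector and the two sample sentences are hypothetical and do not come from the text. Two sentences with opposite meanings produce exactly the same term vector, because the model keeps only the terms and their counts.

```python
from collections import Counter

def term_vector(text):
    """Return a bag-of-words vector (term -> count) for a text.

    Assumption: terms are lowercased, whitespace-separated tokens.
    """
    return Counter(text.lower().split())

# Hypothetical example sentences with opposite meanings.
a = term_vector("the dog bites the man")
b = term_vector("the man bites the dog")

# Word order is discarded, so the two vectors are indistinguishable.
same = (a == b)
print(same)
```

Any retrieval method built only on such vectors has no way to rank one of these sentences above the other for a given query.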


Some systems attempt to alleviate these problems by adding dependencies
between terms, such as how close the terms are to each other in the document.
However, these improvements do not address the fundamental weaknesses
of this approach.
Ontologies can be useful tools for dealing with these deficiencies, and
some of the techniques are introduced in the next section.
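
One way to picture the term-proximity idea mentioned above is the sketch below. It is only an illustration under stated assumptions: the function name min_term_distance, the whitespace tokenization, and the sample sentence are all hypothetical, and real systems may measure closeness differently. The function returns the smallest gap, in token positions, between any occurrence of one term and any occurrence of another.

```python
def min_term_distance(tokens, term_a, term_b):
    """Smallest distance in token positions between two terms.

    Returns None when either term does not occur in the token list.
    """
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)

# Hypothetical document, tokenized on whitespace.
tokens = "gene expression data for the yeast gene".split()

near = min_term_distance(tokens, "yeast", "gene")        # adjacent terms
far = min_term_distance(tokens, "expression", "yeast")   # terms further apart
missing = min_term_distance(tokens, "protein", "gene")   # term not present
```

A system could use such a distance to boost documents in which the query terms occur near each other, but the score still depends only on terms and positions, not on what the terms mean.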


Summary



  • Words have different degrees of selectivity.

  • In the vector space model, each document and each query is represented
    by a vector in which each component is the term weight for a word that
    can occur in a document.

  • The most common term weight is the TFIDF weight, which is the product of
    the number of times that the word occurs in the document and the
    logarithm of the inverse of the fraction of documents that have the word.
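
The TFIDF weight summarized above can be sketched as follows. This is a minimal illustration, not a definitive implementation: it assumes the common variant in which the inverse document frequency is log(N / df), with N the number of documents and df the number of documents containing the term. The function name tfidf_vectors and the three toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute one {term: weight} dict per document.

    docs is a list of token lists. Assumed weighting:
    weight = (count of term in document) * log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

# Hypothetical toy collection.
docs = [
    "protein binds dna".split(),
    "protein folds".split(),
    "dna repair protein".split(),
]
vectors = tfidf_vectors(docs)
```

Under this weighting, a word that occurs in every document (here "protein") gets weight zero, while a word found in only one document gets the largest weight, which matches the observation that words have different degrees of selectivity.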
