relational databases. Formal query languages are programming languages
specialized for retrieval. The advantage of using a formal query language
is that one always has perfect retrieval: 100% coverage and 100% selectivity.
This holds because the retrieval criteria are unambiguous. But there are
disadvantages. One must learn to program in the query language,
which can require significant effort, and the technique applies only to a
corpus that is highly structured, such as a database or a collection of XML
documents. Formal query languages for XML documents are discussed in
chapter 8.
Summary
- Online search engines are based on the standard model for information
retrieval.
- In the standard model, a query is matched against a corpus and the most
relevant documents are retrieved.
- The quality of the retrieval is measured by the coverage and selectivity.
6.2 Vector Space Retrieval
The simplest search technique is to look for documents that contain the words
specified in a query. From this point of view a document is simply a set of
words, and the same is true of a query. Search consists of finding the
documents that contain the words of the query. Many retrieval systems use this
basic technique, but it is effective only for relatively small repositories. The
problem is that the number of matches to a query can be very large, so some
mechanism must be provided that selects among the matching documents
or arranges the documents so that the best matches appear first.
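
The word-matching technique just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names words and search, the three-document corpus, and the simple tokenization rule are all invented here, and "contain the words of the query" is read as "contain every word of the query."

```python
import re

def words(text):
    """Reduce a text to the set of lowercase words it contains."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search(query, corpus):
    """Return the ids of documents that contain every word of the query.

    Assumption: a document matches only if all query words occur in it.
    """
    query_words = words(query)
    return [doc_id for doc_id, text in corpus.items()
            if query_words <= words(text)]  # subset test on word sets

# Hypothetical corpus, invented for illustration only.
corpus = {
    "d1": "The gene is expressed in the liver.",
    "d2": "Expression of the protein in liver tissue.",
    "d3": "The protein binds the gene promoter.",
}

matches = search("the gene", corpus)
```

Because both the query and each document are reduced to sets, the match is a plain subset test. A query consisting only of a very common word matches every document in this toy corpus, which is the difficulty the text raises: on a large repository the list of matches becomes too long to be useful without some way of selecting or ordering them.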
Simply arranging the matching documents by the number of matching
words is not very effective because words differ in their selectivity. A word
such as "the" in English has little use in search by word matching because
nearly every document that uses English will have this word. For example,
PubMed (NIH 2004b) is a very large corpus containing titles, abstracts, and
other information about medical research articles. Table 6.1 gives the number
of times that the most common words occur in PubMed. The second column
of this table gives the number of times that the word occurs in the text parts
of the PubMed citations. The third column gives the number of documents