frequently there are not all that many of them. Other words occur so rarely that
they are unlikely to be useful for classification. Paradoxically, infrequent words
are common—nearly half the words in a typical document or corpus of docu-
ments occur just once. Nevertheless, such an overwhelming number of words
remains after these word classes are removed that further feature selection may be
necessary using the methods described in Section 7.1. Another issue is that the
bag-of-words (or set-of-words) model neglects word order and contextual effects. There
is a strong case for detecting common phrases and treating them as single units.
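The loss of word order under this representation can be seen in a few lines of Python (the helper name here is purely illustrative, not taken from any particular toolkit):

```python
from collections import Counter

def bag_of_words(text):
    # Reduce a document to word counts; all ordering information is discarded
    return Counter(text.lower().split())

# Two documents with opposite meanings receive identical representations:
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")
print(a == b)  # prints True
```

Detecting "bit the man" as a phrase and treating it as a single vocabulary item would restore exactly the kind of distinction this representation loses.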
Document classification is supervised learning: the categories are known
beforehand and given in advance for each training document. The unsupervised
version of the problem is called document clustering. Here there is no predefined
class, but groups of cognate documents are sought. Document clustering can
assist information retrieval by creating links between similar documents, which
in turn allows related documents to be retrieved once one of the documents has
been deemed relevant to a query.
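One simple way to create such links, sketched below under invented document contents and an arbitrary similarity threshold, is to compare word-count vectors with cosine similarity; a real system would typically add TF-IDF weighting and stopword removal:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy collection; the documents and identifiers are invented for illustration.
docs = {
    "d1": Counter("stocks rise as markets rally".split()),
    "d2": Counter("markets rally and stocks climb".split()),
    "d3": Counter("new recipe for chocolate cake".split()),
}

def linked_documents(doc_id, threshold=0.3):
    """Documents similar enough to doc_id to be worth linking."""
    query = docs[doc_id]
    return [d for d in docs
            if d != doc_id and cosine(query, docs[d]) >= threshold]
```

Once `d1` has been judged relevant to a query, the link to `d2` (similarity 0.6 here) lets it be retrieved as well, while the unrelated `d3` stays unlinked.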
There are many applications of document classification. A relatively easy
categorization task, language identification, provides an important piece of
metadata for documents in international collections. A simple representation
that works well for language identification is to characterize each document
by a profile that consists of the n-grams, or sequences of n consecutive letters,
that appear in it. The most frequent 300 or so n-grams are highly correlated
with the language. A more challenging application is authorship ascription, in
which a document’s author is uncertain and must be guessed from the text.
Here, the stopwords, not the content words, are the giveaway, because their dis-
tribution is author dependent but topic independent. A third problem is the
assignment of key phrases to documents from a controlled vocabulary of possible
phrases, given a large number of training documents that are tagged from
this vocabulary.
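The n-gram profiles used for language identification are easy to build, and profiles can be compared by how far each n-gram's rank moves between them. The sketch below is one plausible realization of that idea; the distance measure and the penalty for unseen n-grams are assumptions of this example rather than details given in the text:

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Rank the most frequent letter n-grams in a text."""
    text = text.lower()
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return [g for g, _ in Counter(grams).most_common(top)]

def profile_distance(doc_profile, lang_profile):
    """Sum of rank displacements; n-grams absent from the language
    profile pay a maximum penalty (an assumption of this sketch)."""
    penalty = len(lang_profile)
    return sum(abs(rank - lang_profile.index(g)) if g in lang_profile else penalty
               for rank, g in enumerate(doc_profile))
```

A document would then be assigned the language whose stored profile lies at the smallest distance from its own.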
Another general class of text mining problems is metadata extraction. Metadata
was mentioned previously as data about data: in the realm of text the term
generally refers to salient features of a work, such as its author, title, subject clas-
sification, subject headings, and keywords. Metadata is a kind of highly struc-
tured (and therefore actionable) document summary. The idea of metadata is
often expanded to encompass words or phrases that stand for objects or “entities”
in the world, leading to the notion of entity extraction. Ordinary documents
are full of such terms: phone numbers, fax numbers, street addresses, email
addresses, email signatures, abstracts, tables of contents, lists of references,
tables, figures, captions, meeting announcements, Web addresses, and more. In
addition, there are countless domain-specific entities, such as international
standard book numbers (ISBNs), stock symbols, chemical structures, and
mathematical equations. These terms act as single vocabulary items, and many
document processing tasks can be significantly improved if they are identified
8.3 TEXT AND WEB MINING 353
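For regular entities such as email addresses, ISBNs, and Web addresses, identification can often be approximated with pattern matching. The sketch below uses deliberately simplified regular expressions and an invented document string; production extractors handle far more formats and edge cases:

```python
import re

# Simplified, illustrative patterns for three of the entity types above.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "isbn": re.compile(r"\bISBN[- ]?(?:97[89][- ]?)?(?:\d[- ]?){9}[\dXx]\b"),
    "url": re.compile(r"https?://\S+"),
}

def extract_entities(text):
    """Return all matches of each entity pattern, keyed by entity type."""
    return {kind: pattern.findall(text) for kind, pattern in PATTERNS.items()}

doc = "Contact ada@example.org about ISBN 0-13-790395-2 or https://example.org/book"
```

Marking each match as a single vocabulary item of its type is what lets later document processing steps treat these terms as units.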