Fig. 10.1.19 Classification of documents.
As an example of document classification, suppose the corporation has a document on
deepwater drilling. The database entry that would be produced looks like the following:
Document, byte, document type—exploration, document name
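The entry above can be sketched as a small record type. This is only an illustration under stated assumptions: the field named byte is taken here to mean the byte offset at which the classifying text was found, and the keyword table, the function name classify, and the identifier "DOC-001" are hypothetical and not part of any actual textual ETL product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DocumentClassification:
    document: str        # identifier of the source document
    byte: int            # assumption: byte offset of the classifying text
    document_type: str   # for example "exploration"
    document_name: str


# Hypothetical mapping of document types to the phrases that indicate them.
TYPE_KEYWORDS = {
    "exploration": ["deepwater drilling", "seismic survey"],
    "refining": ["distillation", "cracking unit"],
}


def classify(document: str, document_name: str, text: str) -> Optional[DocumentClassification]:
    """Return a classification entry for the first keyword found, else None."""
    raw = text.lower().encode("utf-8")
    for document_type, keywords in TYPE_KEYWORDS.items():
        for keyword in keywords:
            offset = raw.find(keyword.encode("utf-8"))
            if offset >= 0:
                return DocumentClassification(document, offset, document_type, document_name)
    return None


entry = classify(
    "DOC-001",
    "Deepwater Drilling Report",
    "A study of deepwater drilling in the Gulf.",
)
```

In this sketch a document that matches no keyword yields no entry at all; a real system might instead record an "unclassified" type.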
Proximity Analysis
Occasionally, the analyst needs to look at words or taxonomies that are in proximity to
each other. For example, when a person sees the words “New York Yankees” together, the
reference is clearly to a baseball team. But when the words “New York” and “Yankees” are
separated by two or three pages of text, they most likely refer to something entirely
different. Therefore, it is useful to be able to do what is referred to as “proximity
analysis” in textual ETL.
Proximity analysis operates on actual words or taxonomies (or any combination of these
elements).
The analyst specifies the words/taxonomies that are to be analyzed, gives a proximity
value for how close the words need to be in the text, and gives the proximity variable a
name.
Fig. 10.1.20 shows proximity analysis operating against raw text.
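The three inputs described above (the terms to analyze, a proximity value, and a name for the proximity variable) can be sketched in a few lines of Python. This is a minimal illustration, not the way any particular textual ETL tool works. It assumes that proximity is measured as the number of words between the end of the first term and the start of the second, and it handles only literal words and phrases; a taxonomy could be supported by expanding it into its member words first. The names proximity_analysis, ProximityHit, and "baseball_team" are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class ProximityHit:
    name: str      # name the analyst gave the proximity variable
    position: int  # word index at which the first term begins
    text: str      # span of (lowercased) text that satisfied the rule


def tokenize(text):
    # Lowercased word tokens; a simplification of real textual parsing.
    return re.findall(r"[a-z0-9']+", text.lower())


def find_phrase(tokens, phrase):
    # Start indexes at which the (possibly multiword) phrase occurs.
    words = phrase.lower().split()
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]


def proximity_analysis(text, first, second, max_distance, name):
    """Find places where `second` starts at most `max_distance` words
    after the end of `first`."""
    tokens = tokenize(text)
    first_len = len(first.split())
    second_len = len(second.split())
    second_starts = find_phrase(tokens, second)
    hits = []
    for start in find_phrase(tokens, first):
        end = start + first_len  # index just past the first term
        for s in second_starts:
            gap = s - end  # number of words between the two terms
            if 0 <= gap <= max_distance:
                span = " ".join(tokens[start:s + second_len])
                hits.append(ProximityHit(name, start, span))
    return hits


near = "The New York Yankees won."
far = "New York is large. " + "filler " * 500 + "The Yankees arrived."

near_hits = proximity_analysis(near, "New York", "Yankees", 2, "baseball_team")
far_hits = proximity_analysis(far, "New York", "Yankees", 2, "baseball_team")
```

With a proximity value of 2, the first text produces one hit and the second produces none, because hundreds of words separate the two terms. The sketch looks only for the second term after the first; checking both orders would be a straightforward extension.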
Chapter 10.1: Nonrepetitive Data