Note that textual ETL operates on taxonomies and ontologies as if each were a
series of simple word pairs. In fact, taxonomies and ontologies are much more complex than
simple word pairs. But even the most sophisticated taxonomy can be decomposed into a series
of simple word pairs.
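As a rough illustration of that decomposition, the sketch below flattens a small nested taxonomy into (term, category) word pairs. The taxonomy contents and the function name `taxonomy_to_word_pairs` are hypothetical and chosen only for this example; a real textual ETL product may represent taxonomies quite differently.

```python
def taxonomy_to_word_pairs(taxonomy, parent=None):
    """Flatten a nested taxonomy into a list of (term, category) pairs.

    Assumption: the taxonomy is a dict whose values are either nested
    dicts (subcategories) or lists of leaf terms.
    """
    pairs = []
    for category, children in taxonomy.items():
        # A subcategory is itself a term classified by its parent.
        if parent is not None:
            pairs.append((category, parent))
        if isinstance(children, dict):
            # Recurse into the next level of the hierarchy.
            pairs.extend(taxonomy_to_word_pairs(children, category))
        else:
            # Leaf terms are classified by the enclosing category.
            for term in children:
                pairs.append((term, category))
    return pairs


# Hypothetical two-level taxonomy, for illustration only.
taxonomy = {
    "vehicle": {
        "car": ["Porsche", "Honda"],
        "truck": ["Kenworth"],
    }
}

pairs = taxonomy_to_word_pairs(taxonomy)
for term, category in pairs:
    print(f"{term} -> {category}")
```

Each pair can then be applied to raw text on its own: wherever the term appears, the category supplies its context.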
In general, the use of taxonomies as a form of contextualization is the most powerful
tool the analyst has for determining the context of raw text.
Custom Variables
Another very useful form of contextualization is the identification and creation
of what can be termed “custom variables.” Almost every organization has custom
variables. A custom variable is a word or phrase that is recognizable entirely from
its format. As a simple example, a manufacturer may have part numbers of the form
“AK-876-uy.” The generic form of such a part number would be “CC-999-cc.” In this
notation, “C” indicates an uppercase letter, “-” indicates the literal “-”, “9”
indicates any numeric digit, and “c” indicates a lowercase letter.
By looking at the format of a word or phrase, the analyst can immediately tell the context
of the variable.
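One way to picture format-based recognition is to translate a generic form such as “CC-999-cc” into a regular expression and scan raw text with it. The sketch below is a minimal illustration, not the method used by any particular textual ETL tool; the names `SYMBOL_PATTERNS`, `format_to_regex`, and `find_custom_variables`, the sample sentence, and the restriction to ASCII letters are all assumptions made for the example.

```python
import re

# Assumed mapping from the format symbols described in the text to
# regular-expression fragments (ASCII letters and digits only).
SYMBOL_PATTERNS = {
    "C": "[A-Z]",  # uppercase letter
    "c": "[a-z]",  # lowercase letter
    "9": "[0-9]",  # any numeric digit
}


def format_to_regex(fmt):
    """Compile a generic format such as 'CC-999-cc' into a regex.

    Any character that is not a format symbol is treated as a literal.
    """
    parts = [SYMBOL_PATTERNS.get(ch, re.escape(ch)) for ch in fmt]
    # The lookarounds keep the pattern from matching inside a longer token.
    return re.compile(
        r"(?<![A-Za-z0-9])" + "".join(parts) + r"(?![A-Za-z0-9])"
    )


def find_custom_variables(text, fmt, name):
    """Return (matched text, variable name) for every match of fmt in text."""
    pattern = format_to_regex(fmt)
    return [(m.group(0), name) for m in pattern.finditer(text)]


# Hypothetical raw text, for illustration only.
raw_text = "Shipped part AK-876-uy today; part ak-876-UY was rejected."
matches = find_custom_variables(raw_text, "CC-999-cc", "part number")
print(matches)
```

In this sketch only the string whose letter case and digit positions fit the format is tagged as a part number; the second string in the sample sentence has the wrong case and is ignored.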
Fig. 10.1.8 shows how raw text is processed using custom variables.
Fig. 10.1.8 Custom variable format processing.
Chapter 10.1: Nonrepetitive Data