The mechanics of textual disambiguation are shown in Fig. 10.1.4.
Fig. 10.1.4 The mechanics of textual disambiguation.
The general flow of processing in textual ETL is as follows. The first step is to find and read the
data. Normally, this step is straightforward. Occasionally, however, the data have to be
“untangled” before further processing can continue. In some cases, the data are stored on
a unit-by-unit basis. This is the “normal” (or easy) case. In other cases, the units of
data are combined into a single document, and each unit must be isolated within the
document before it can be processed.
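The "untangling" described above can be illustrated with a minimal Python sketch. It is not how any particular textual ETL product works; it assumes, purely for illustration, that each unit in a combined document begins with a header line of the form "Record N". The names `UNIT_BOUNDARY` and `isolate_units` are hypothetical, and a real document would need its own boundary rule.

```python
import re
from typing import List

# Assumption (hypothetical): every unit starts with a line such as "Record 17".
UNIT_BOUNDARY = re.compile(r"^Record \d+[ \t]*$", re.MULTILINE)


def isolate_units(document: str) -> List[str]:
    """Split a combined document into its individual units of data."""
    starts = [m.start() for m in UNIT_BOUNDARY.finditer(document)]
    if not starts:
        # The "normal" case: the document is already a single unit.
        stripped = document.strip()
        return [stripped] if stripped else []
    # Each unit runs from its own header to the next header (or end of text).
    # Any text before the first header is ignored in this sketch.
    ends = starts[1:] + [len(document)]
    return [document[s:e].strip() for s, e in zip(starts, ends)]


combined = (
    "Record 1\nPatient reports chest pain.\n"
    "Record 2\nFollow-up visit, no complaints.\n"
)
for unit in isolate_units(combined):
    print(repr(unit))
```

When no boundary line is found, the sketch treats the whole input as one unit, which corresponds to the easy case in which the data already arrive unit by unit.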
The second step is to examine the unit of data and determine what data need to be
processed. In some cases, all the data need to be processed. In other cases, only certain
data need to be processed. In general, this step is very straightforward.
The third step is to “parse” the nonrepetitive data. The word “parse” is a little misleading
because it is in this step that the system applies a great deal of logic. The word
Chapter 10.1: Nonrepetitive Data