
Chapter 4.6


Textual Disambiguation


Abstract


There are different definitions of big data. The definition used here is that big data encompasses very large volumes of data, is based on inexpensive storage, manages data by the “Roman census” method, and stores data in an unstructured format. There are two major types of big data: repetitive big data and nonrepetitive big data. Only a small fraction of repetitive big data has business value, whereas almost all nonrepetitive big data has business value. In order to achieve that business value, the context of the data must be determined. Contextualization of repetitive big data is easily achieved, but contextualization of nonrepetitive data must be done by means of textual disambiguation.


Keywords


Big data; Roman census method; Unstructured data; Repetitive data; Nonrepetitive data;
Contextualization; Textual disambiguation


The process of contextualizing nonrepetitive unstructured data is accomplished by a technology known as “textual disambiguation” (or “textual ETL”). Textual disambiguation has an analogous process in the structured world: ETL, or extract/transform/load. The difference is that ETL transforms old legacy system data, whereas textual ETL transforms text. At a very high level the two are analogous, but in the actual details of processing they are very different.
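
To make the contrast concrete, consider the following minimal sketch in Python. The record layouts, field names, and the tiny taxonomy are hypothetical illustrations, not the actual mechanics of any textual ETL product: classic ETL maps fields whose meaning is known in advance, while textual ETL must first infer what a word in free text means.

    # Classic ETL: the structure and meaning of the input are known in advance.
    # (CUST_NO and BAL_CENTS are hypothetical legacy field names.)
    def etl_transform(legacy_record: dict) -> dict:
        """Map a legacy system record into a warehouse schema."""
        return {
            "customer_id": legacy_record["CUST_NO"].strip(),
            "balance_usd": int(legacy_record["BAL_CENTS"]) / 100,
        }

    # Textual ETL: meaning must be inferred from the words themselves.
    # A toy taxonomy stands in for real disambiguation rules.
    TAXONOMY = {"aspirin": "medication", "fracture": "diagnosis"}

    def textual_etl_transform(text: str) -> list[dict]:
        """Attach context to each recognized word in raw narrative text."""
        rows = []
        for offset, raw in enumerate(text.lower().split()):
            context = TAXONOMY.get(raw.strip(".,;:"))
            if context:
                rows.append({"offset": offset, "word": raw, "context": context})
        return rows

The essential difference shows up in the signatures: the structured transform receives a record whose fields already carry meaning, while the textual transform receives undifferentiated text and must produce the meaning itself.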


From Narrative Into an Analytic Database


The purpose of textual disambiguation is to read raw narrative text and to turn that text into an analytic database. Fig. 4.6.1 shows the general flow of data in textual disambiguation.
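
The following sketch illustrates that flow end to end, under stated assumptions: the mini-taxonomy, the text_fact table layout, and the simple keyword matching are illustrative stand-ins for the taxonomy, pattern, and proximity processing a real textual ETL engine performs. Raw narrative goes in; contextualized rows land in a relational table where ordinary SQL analytics can reach them.

    import sqlite3

    # Hypothetical mini-taxonomy; a real engine applies large external
    # taxonomies and rules, not a literal word lookup.
    TAXONOMY = {
        "chest": "body part",
        "pain": "symptom",
        "nitroglycerin": "medication",
    }

    def disambiguate(doc_id: str, narrative: str):
        """Yield (doc_id, word_offset, word, context) rows from raw text."""
        for offset, raw in enumerate(narrative.lower().split()):
            word = raw.strip(".,;:")
            if word in TAXONOMY:
                yield (doc_id, offset, word, TAXONOMY[word])

    # Load the contextualized rows into a relational table, turning
    # free-form narrative into a queryable analytic database.
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE text_fact (doc_id TEXT, offset INT, word TEXT, context TEXT)"
    )
    note = "Patient reports chest pain. Nitroglycerin administered at 14:05."
    con.executemany(
        "INSERT INTO text_fact VALUES (?, ?, ?, ?)", disambiguate("note-001", note)
    )

    for row in con.execute("SELECT * FROM text_fact WHERE context = 'symptom'"):
        print(row)  # ('note-001', 3, 'pain', 'symptom')

Once the narrative has been reduced to rows of this shape, the distinction between structured and unstructured data disappears for the analyst: the same query tools that work against transaction data now work against what began as raw text.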

