Fig. 10.1.3 The only similarities are accidental.

As an example of two units of nonrepetitive data being identical, suppose there are two
e-mails that each contain a single word, the word “yes.” In this case, the e-mails are
identical. But the fact that they are identical is pure chance.


In general, when text finds its way into the big data environment, the units of data stored
there are nonrepetitive.
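
To make the distinction concrete, here is a minimal Python sketch, using hypothetical data, that measures how often units of data repeat exactly. Repetitive data such as clickstream records duplicate as a matter of course, while in free-form text an exact match, like the two “yes” e-mails above, is accidental.

```python
from collections import Counter

def repetition_rate(units):
    """Fraction of data units that are exact duplicates of another unit."""
    counts = Counter(units)
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / len(units) if units else 0.0

# Repetitive data: the same values recur constantly (hypothetical records).
clicks = ["page=/home", "page=/home", "page=/cart", "page=/home"]

# Nonrepetitive data: identical units occur only by accident.
emails = ["yes", "Can we meet Tuesday?", "yes", "Invoice attached.",
          "Running late, start without me."]

print(repetition_rate(clicks))  # 0.75 -- repetition is the norm
print(repetition_rate(emails))  # 0.4  -- the two "yes" e-mails match only by chance
```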


One approach to processing nonrepetitive data is to use a search technology. While
search technology accomplishes the task of scanning the data, it leaves a lot to be
desired. Its two primary shortcomings are that a search does not leave behind a database
that can subsequently be used for analytic purposes and that search technology does not
examine or provide context for the text being analyzed. There are other limitations of
search technology as well.
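
Both shortcomings can be seen in a small sketch. The hypothetical keyword search below (not any particular search product) scans the text and returns a hit list, but the hit list is transient: it is not a database that later analysis can build on, and it says nothing about the context in which the term appeared.

```python
documents = {
    "email_1": "Yes, the contract is approved.",
    "email_2": "The patient reported chest pain last night.",
    "email_3": "No pain reported after the procedure.",
}

def keyword_search(docs, term):
    """Return the IDs of documents containing the term -- and nothing more."""
    return [doc_id for doc_id, text in docs.items()
            if term.lower() in text.lower()]

print(keyword_search(documents, "pain"))
# ['email_2', 'email_3'] -- a transient hit list, not a reusable database.
# The search also cannot tell that in email_3 the pain is negated;
# it supplies no context for the matches it returns.
```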


In order to do extensive analytic processing against nonrepetitive data, it is necessary to
read the nonrepetitive data and turn them into a standard database format. Sometimes,
this process is described as taking unstructured data and turning them into structured
data, and that is indeed a good description of what occurs.


The process of reading nonrepetitive data and turning them into a database is called
“textual disambiguation” or “textual ETL.” Textual disambiguation is, of necessity, a
complex process, because the language it processes is complex; there is no getting
around that fact.
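
As a rough illustration, the sketch below turns a few hypothetical free-text notes into uniform rows. It is only a stand-in for textual disambiguation, and the pattern matching is deliberately simple, but it shows the essential move: reading raw text and resolving some of its context into database columns.

```python
import re

raw_notes = [
    "Patient complained of chest pain. BP 140/90.",
    "Patient denies pain. BP 120/80.",
]

def textual_etl(notes):
    """Turn free-form notes into uniform rows (a toy stand-in for real
    textual ETL, which must resolve far richer context than this)."""
    rows = []
    for note_id, text in enumerate(notes):
        lowered = text.lower()
        bp = re.search(r"BP (\d+)/(\d+)", text)
        rows.append({
            "note_id": note_id,
            "mentions_pain": "pain" in lowered,
            # Context matters: "denies pain" is not the same as "pain".
            "pain_negated": bool(re.search(r"denies .*pain", lowered)),
            "systolic": int(bp.group(1)) if bp else None,
            "diastolic": int(bp.group(2)) if bp else None,
        })
    return rows

for row in textual_etl(raw_notes):
    print(row)
```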


The result of processing nonrepetitive data in big data with textual disambiguation is the
creation of a standard database. Once the data are in the form of a standard database,
they can be analyzed using standard analytic technology.
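
For instance, once rows like those above land in a relational table, ordinary SQL is sufficient. The sketch below loads a few illustrative rows into an in-memory SQLite table and runs a standard query against them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notes (note_id INTEGER, mentions_pain INTEGER, systolic INTEGER)"
)
# Illustrative rows of the kind textual ETL might produce.
conn.executemany(
    "INSERT INTO notes VALUES (?, ?, ?)",
    [(0, 1, 140), (1, 0, 120)],
)

# Standard analytic technology (here, plain SQL) works on the text-derived rows.
cursor = conn.execute("SELECT AVG(systolic) FROM notes WHERE mentions_pain = 1")
print(cursor.fetchone()[0])  # 140.0
```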

