Chapter 4.5


Contextualizing Repetitive Unstructured Data


Abstract


There are different definitions of big data. The definition used here is that big data
encompasses a very large volume of data, is based on inexpensive storage, manages data by
the “Roman census” method, and stores data in an unstructured format. There are two major
types of big data: repetitive and nonrepetitive. Only a small fraction of repetitive big
data has business value, whereas almost all nonrepetitive big data does. To achieve
business value, the context of the data must be determined. Contextualizing repetitive big
data is easily achieved, whereas contextualizing nonrepetitive data requires textual
disambiguation.


Keywords


Big data; Roman census method; Unstructured data; Repetitive data; Nonrepetitive data;
Contextualization; Textual disambiguation


To be used for analysis, all unstructured data must be contextualized. This is as true for
repetitive unstructured data as it is for nonrepetitive unstructured data. The two differ
greatly, however, in how hard that is to do: contextualizing repetitive unstructured data
is easy and straightforward, whereas contextualizing nonrepetitive unstructured data is
anything but easy.


Parsing Repetitive Unstructured Data


In the case of repetitive unstructured data, the data are read, usually in Hadoop. After a
block of data is read, it is parsed. Because the data are repetitive, parsing is
straightforward: each record is small, and the context of the record is easy to find.


The process of parsing and contextualizing the data found in big data can be done with a
commercial utility or with a custom-written program.
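

As an illustration only, a custom-written program of this kind might look like the
following minimal Python sketch. It assumes a hypothetical comma-delimited record layout
(timestamp, caller, callee, call duration); the field names, the sample block, and the
contextualize function are invented for this example and do not come from any particular
utility. Because every record has the same shape, context is attached simply by pairing
each value with its field name.

```python
import csv
from io import StringIO

# Hypothetical record layout, assumed purely for illustration.
FIELD_NAMES = ["timestamp", "caller", "callee", "duration_seconds"]


def contextualize(block):
    """Parse a block of delimited records and attach a field name to every value."""
    records = []
    for row in csv.reader(StringIO(block)):
        if len(row) != len(FIELD_NAMES):
            continue  # skip blank lines and records that do not match the layout
        record = dict(zip(FIELD_NAMES, (value.strip() for value in row)))
        # The last field is numeric in this assumed layout.
        record["duration_seconds"] = int(record["duration_seconds"])
        records.append(record)
    return records


# A small block of repetitive records standing in for data read from storage.
SAMPLE_BLOCK = """2016-03-01T09:15:00,555-0101,555-0199,62
2016-03-01T09:16:30,555-0142,555-0107,305
"""

parsed = contextualize(SAMPLE_BLOCK)
for record in parsed:
    print(record)
```

The sketch reads its block from an in-memory string to stay self-contained; in practice
the block would come from wherever the repetitive data are stored.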

