
is not a mistake.


To explain this anomaly and why it is important, consider the following
real example.


In general, unstructured data can be considered to be either repetitive or nonrepetitive.
Repetitive unstructured data are unstructured data whose content and structure are highly
repetitive. Into this classification fall clickstream data, analog data, metering data,
and so forth. Into the other classification, nonrepetitive unstructured data, fall all
narrative data: e-mails, call center data, customer feedback, contracts, and a whole host
of other written and spoken narrative data.


Now, consider that in the classification of narrative data, there appears a further
subclassification of data. For all written data, there can be nonrepetitive written data and
repetitive written data. For example, lawyers who write contracts use what is called
“boilerplate.” A boilerplate contract is a contract where the primary body of the contract
is predetermined. The lawyer fills in only a few details, such as the name,
address, and social security number of the recipient of the contract. There may be a few
other terms that are negotiated, but at the end of the day, boilerplate contracts are
all very similar to one another.


This then is an example of a repetitive nonrepetitive occurrence of data. The contract is
nonrepetitive because it is in narrative form. But it is repetitive because it is essentially
boilerplate.
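
One way to make the boilerplate distinction concrete is to measure how closely a narrative document follows a known template. The sketch below is only an illustration and is not a method described in the text. The template wording, the sample documents, the function names, and the 0.7 threshold are all assumptions invented for the example, and the word-level comparison uses Python's standard difflib module.

```python
from difflib import SequenceMatcher

# Hypothetical boilerplate template; the wording is invented for illustration.
TEMPLATE = (
    "This agreement is made between the lender and {name}, residing at {address}. "
    "The borrower agrees to repay the principal in monthly installments."
)


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two texts, compared word by word."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def is_repetitive(text: str, template: str = TEMPLATE, threshold: float = 0.7) -> bool:
    """Treat narrative text as repetitive when it closely follows the template.

    The threshold is an assumed cutoff, not a value taken from the text.
    """
    return similarity(text, template) >= threshold


# A contract in which only the name and address were filled in.
contract = (
    "This agreement is made between the lender and Jane Doe, residing at 12 Elm Street. "
    "The borrower agrees to repay the principal in monthly installments."
)

# Free-form narrative with no underlying template.
email = "Hi team, the customer called again about the late delivery and wants a refund."

print(is_repetitive(contract))  # the filled-in contract follows the template closely
print(is_repetitive(email))     # the e-mail shares almost no wording with it
```

In this sketch, the contract is narrative text that would nonetheless be flagged as repetitive, while the e-mail would be left as nonrepetitive text. In practice the cutoff would need tuning against real documents.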


The reason for making the distinction between nonrepetitive nonrepetitive text and
nonrepetitive repetitive text is that taxonomies apply to nonrepetitive nonrepetitive text.
Some examples are needed here to explain this anomaly.


Applicability of Taxonomies


Taxonomies are most applicable to text such as e-mails, call center information,
conversations, and other free-form narrative text. In free-form text, it is necessary to
classify words using only the context supplied by the taxonomy. As an example, suppose the
term “ice cream” is encountered in an e-mail. Ice cream belongs in the taxonomy of “dessert,”
so it is assumed that the e-mail is about food, meals, and desserts. Another e-mail mentions
cake. Cake too is a dessert. So, the e-mails are related to each other, even though the
words—“ice cream” and “cake”—are very different. Using taxonomic classification in


Chapter 4.7: Taxonomies