A fundamental issue with data is how individual units of data are linked to each other.
The issue arises in big data just as it has in other forms of information processing.
In classical information systems, data were linked by matching data values. As an
example, one record contained a social security number, and another record contained the
same social security number. The two units of data could then be linked because the same
value resided in both records. The analyst could be all but certain that there was a basis
for linkage. (Certainty still falls short of 100%: a number can be mistyped, misreported,
or used fraudulently, so a matching value does not strictly guarantee that the linkage is
real.)
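The exact-value linkage described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `link_exact`, the record layout, and the sample values (including the made-up social security numbers) are all assumptions introduced for the example.

```python
from collections import defaultdict

def link_exact(left, right, key):
    """Pair records from two lists whose `key` values are identical."""
    # Index the right-hand records by key value for constant-time lookup.
    index = defaultdict(list)
    for rec in right:
        index[rec[key]].append(rec)
    # A pair is linked only when both records carry exactly the same value.
    return [(l, r) for l in left for r in index.get(l[key], [])]

# Hypothetical records; the SSN values are invented for illustration.
payroll = [
    {"ssn": "000-00-0001", "name": "A. Smith"},
    {"ssn": "000-00-0002", "name": "B. Jones"},
]
claims = [
    {"ssn": "000-00-0002", "claim": 17},
    {"ssn": "000-00-0003", "claim": 18},
]

links = link_exact(payroll, claims, "ssn")
```

Here only the two records sharing the value `000-00-0002` are linked. The decision is binary: the values either match or they do not.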
But with the unstructured data (i.e., textual data) that come with big data, another type
of relationship must be accommodated: what can be called a probabilistic linkage of data.
A probabilistic linkage is a linkage based on the probability of a match rather than on an
exact matching value.
Probabilistic linkages arise wherever there is text.
As an example of a probabilistic linkage, consider the linkages of data based on name.
Suppose there are two names in different records—Bill Inmon and William Inmon.
Should these values be linked? There is a high probability that these names should be
linked. But it is only a probability, not a certainty. Suppose there are two records where
the name William Inmon is found. Should these records be linked?
One record refers to a serial killer in Arizona, and another record refers to a data
warehouse writer in Colorado. (This is a true example—look it up on the Internet to
verify.) Both individuals have the same name. But they are very different people.
When text is involved, linkage is accomplished on the basis of probability of a match, not
the certainty of a match.
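The name example above can be sketched as a scoring function that returns a degree of confidence rather than a yes/no answer. This is only a rough sketch under stated assumptions: the nickname table, the `state` field, and the weights 0.9 and 0.5 are hypothetical choices made for illustration, and the result is an uncalibrated score, not a true probability. The similarity measure is `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; a realistic one would be far larger.
NICKNAMES = {"bill": "william", "will": "william", "bob": "robert"}

def normalize(name):
    """Lowercase the name and expand known nicknames token by token."""
    return " ".join(NICKNAMES.get(tok, tok) for tok in name.lower().split())

def link_score(rec_a, rec_b):
    """Return a rough 0..1 score that two records describe the same person."""
    name_sim = SequenceMatcher(
        None, normalize(rec_a["name"]), normalize(rec_b["name"])
    ).ratio()
    # Corroborating context. The weights are illustrative assumptions,
    # and both records are assumed to carry a "state" field.
    same_state = rec_a.get("state") == rec_b.get("state")
    return name_sim * (0.9 if same_state else 0.5)

a = {"name": "Bill Inmon", "state": "Colorado"}
b = {"name": "William Inmon", "state": "Colorado"}
c = {"name": "William Inmon", "state": "Arizona"}

score_ab = link_score(a, b)  # nickname resolves to the same name, same state
score_bc = link_score(b, c)  # identical names, different states
```

Even when the names are identical, as for `b` and `c`, the score stays below 1: a matching name raises the likelihood of a link but never settles it. An analyst would typically pick a threshold above which records are treated as linked, accepting that some links will be wrong.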
Fig. 9.2.14 depicts the different kinds of linkages that are found in big data.