As an example of the use of custom variables, consider the following raw text:
...I want to order two more cases of TR-0987-BY to be delivered on...
Upon processing the raw text, the following entry would be created inside the analytic
database:
Doc name, byte, context—part number, value—TR-0987-BY
Note that there are a few common custom variables in common use. One (in the United
States) is 999-999-9999, which is the common pattern for telephone number. Or there is
999-99-9999 that is the generic pattern for social security number.
The analyst can create whatever pattern he/she wishes for processing against the raw
text. The only “gotcha” that sometimes occurs is the case where on occasion more than
one type of variable will have the same format as another variable. In this case, there will
be confusion in trying to use custom variables.
Homographic Resolution
A powerful form of contextualization is that known as “homographic resolution.” In
order to understand homographic resolution, consider the following (very real) example.
Some doctors are trying to interpret doctor's notes. The term “ha” gives the doctors a
problem. When a cardiologist writes “ha,” the cardiologist refers to “heart attack.” When
an endocrinologist writes “ha,” the endocrinologist refers to “hepatitis A.” When a
general practitioner writes “ha,” the general practitioner refers to “headache.”
In order to create a proper analytic database, the term “ha” must be interpreted properly.
If the term “ha” is not interpreted properly, then people that have had heart attacks,
hepatitis A, and headaches will all be mixed together, and that surely will produce a
faulty analysis.
There are several elements to homographic resolution. The first element is the homograph
itself. In this case, the homograph is “ha.” The second element is the homograph class.
The homograph class in this case includes cardiologist, endocrinologist, and general
practitioner. The homographic resolution is that for cardiologists “ha” means “heart
attack”; for endocrinologists, “ha” means “hepatitis A”; and that for general
practitioners, “ha” means “head ache.”
Chapter 10.1: Nonrepetitive Data