As an example of negation analysis, consider the raw text “...John Jones did not have a
heart attack....”


Textual ETL would generate data that looks like the following:


Document name, byte, context—negation, value—no
Document name, byte, context—condition, value—heart attack

Care must be taken with negation analysis because not all forms of negation are easily
handled. The good news is that most forms of negation in language are straightforward.
The bad news is that some forms of negation require elaborate techniques for textual
ETL management.
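
The straightforward case described above can be sketched in a few lines of Python. This is only an illustration, not how any particular textual ETL product works: the cue phrases, the condition vocabulary, the function name negation_records, and the document name doc1.txt are all assumptions made for the example. The sketch looks for a negation cue followed within a few words by a known condition and emits one record for each, in the document name, byte, context, value layout shown earlier.

```python
import re

# Hypothetical vocabularies for illustration only; a real system would
# rely on much larger taxonomies and more careful linguistic rules.
NEGATION_CUES = ["did not have", "no evidence of", "denies"]
CONDITIONS = ["heart attack", "stroke"]


def byte_offset(text, char_index):
    """Convert a character index into a UTF-8 byte offset."""
    return len(text[:char_index].encode("utf-8"))


def negation_records(doc_name, text):
    """Return (document name, byte, context, value) tuples."""
    records = []
    for cue in NEGATION_CUES:
        for condition in CONDITIONS:
            # A cue, then at most three intervening words, then a condition.
            pattern = re.compile(
                re.escape(cue) + r"(?:\s+\w+){0,3}?\s+" + re.escape(condition),
                re.IGNORECASE,
            )
            for match in pattern.finditer(text):
                cond_start = match.end() - len(condition)
                records.append(
                    (doc_name, byte_offset(text, match.start()), "negation", "no")
                )
                records.append(
                    (doc_name, byte_offset(text, cond_start), "condition", condition)
                )
    return records


records = negation_records("doc1.txt", "John Jones did not have a heart attack.")
for record in records:
    print(record)
```

A pattern this simple covers only the easy forms of negation. Double negatives, negation that spans sentences, or a cue that applies to a different condition than the nearest one would all need the more elaborate techniques mentioned above.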


Numeric Tagging


Another useful form of contextualization is numeric tagging. It is normal for a
document to contain multiple numeric values. It is also normal for one numeric value
to mean one thing and another numeric value to mean something else.


For example, a document may have the following:


Payment amount
Late fee charge
Interest amount
Payoff amount
And so forth

It is most helpful to the analyst if the different numeric values in the document are
“tagged.” The analyst can then simply refer to a numeric value by its meaning, which
makes the analysis of documents that contain multiple numeric values quite convenient.
(Stated differently, if the tagging is not done at the time of textual ETL processing,
the analyst using the document will have to work out what each number means at the
time of analysis, which is a time-consuming and tedious process. It is much simpler to
tag a numeric value at the moment of textual ETL processing.)
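
As a rough illustration of the idea, the following Python sketch tags each number by the label that precedes it. It assumes that every value appears directly after its label (optionally with a colon and a dollar sign), which real documents will not always honor. The label list is taken from the example above; the function name tag_numbers, the document name loan.txt, and the sample text are hypothetical.

```python
import re

# Labels from the example above; a real document set would need its own list.
LABELS = ["Payment amount", "Late fee charge", "Interest amount", "Payoff amount"]


def tag_numbers(doc_name, text):
    """Return (document name, byte, tag, value) tuples in document order."""
    records = []
    for label in LABELS:
        # Label, optional colon and dollar sign, then a number such as 1,250.00.
        pattern = re.compile(
            re.escape(label) + r"\s*:?\s*\$?\s*([0-9][0-9,]*(?:\.[0-9]+)?)",
            re.IGNORECASE,
        )
        for match in pattern.finditer(text):
            value = float(match.group(1).replace(",", ""))
            # Byte offset of the number itself, not of the label.
            offset = len(text[:match.start(1)].encode("utf-8"))
            records.append((doc_name, offset, label.lower(), value))
    return sorted(records, key=lambda record: record[1])


sample = "Payment amount: $1,250.00  Late fee charge: $35.00  Payoff amount: $48,310.75"
tagged = tag_numbers("loan.txt", sample)

# Once tagged, a value can be looked up by its meaning rather than its position.
by_tag = {tag: value for _, _, tag, value in tagged}
print(by_tag["late fee charge"])
```

Because the tag is stored alongside the value, the analyst's query names the meaning ("late fee charge") and never has to re-read the raw text.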


Fig. 10.1.12 shows how raw text is read and how tags are created for numeric values.


Chapter 10.1: Nonrepetitive Data