Chapter 17.1
Managing Text
Abstract
In most organizations, text accounts for the majority of the data in the corporation. Yet
many corporations do little or nothing with it. For many years, there were
technological reasons why text was difficult to handle. Today, however, text is
readily manageable, and organizations find that there is a wealth of value to be gained
by addressing and employing the text that lies within the corporate walls.
Keywords
Text; DBMS; NLP; Stemming; Soundex; Taxonomy; Blob; Stop word; Context; Textual
ETL; In line contextualization; Post processing; Preprocessing
Text is the Wednesday's child of technology. It has been forgotten and abandoned to the
point that organizations act as if they have no text at all, much less text that contains
important data. Yet in most corporations, some of the most important information is
bound up in text.
For years, it was not possible to read text automatically and use it in the decision-making
process. But that has changed. Today, it is possible to read text and to include it in
standard databases. As a result, text has become an important source of data in the
corporate decision-making process.
The Challenge of Text
There are many valid reasons why text is difficult to work with and manage. The
primary reason is that text does not fit well into a standard database management
system (DBMS). Stated differently, the fit between text and a DBMS is
awkward at best and a total mismatch at worst.
A standard database management system requires data to be tightly structured. The
DBMS requires that fields of data be uniform in size, that attributes be
definable, and that keys be readily available in order to store the data. The very essence of