The next step in dealing with text that needed to go into a database was to
employ the practice of “stemming.” Stemming is the practice of grouping words that are
related at the root stem. For example, the word move is related to the words moving,
mover, moved, and so forth. The word move is the stem of the other words.
Stemming was the first real step toward the systematic analysis of words. However,
stemming was more an interesting exercise than a useful tool, and it had little
practical value.
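
To make the idea of a stem concrete, here is a minimal sketch of suffix stripping in Python. It is not a production stemmer: the suffix list, the `naive_stem` and `group_by_stem` names, and the minimum stem length are all assumptions made for illustration, and real stemmers apply many more rules.

```python
from collections import defaultdict

# Hypothetical suffix list for illustration only; longer suffixes are
# listed first so they are tried before shorter ones.
SUFFIXES = ("ing", "ed", "er", "es", "s", "e")


def naive_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip at most one known suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Map each stem to the list of words that reduce to it."""
    groups = defaultdict(list)
    for word in words:
        groups[naive_stem(word)].append(word)
    return dict(groups)


# All four words reduce to the same key, so they land in one group.
groups = group_by_stem(["move", "moving", "mover", "moved"])
print(groups)
```

Note that the shared key here is the truncated form "mov" rather than the dictionary word "move." That is typical of simple suffix strippers, which only need related words to reduce to the same string.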


Along with stemming came the practice of Soundex. In Soundex, words are encoded and
classified according to their sound, so that words that sound alike are grouped together
even when they are spelled differently. Like stemming, Soundex had few practical
applications. However, both stemming and Soundex were the first steps in starting to deal
with text systematically.
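
The text does not say which variant of Soundex is meant. As a sketch, the following Python function follows the commonly described American Soundex rules: keep the first letter, map the remaining consonants to digits, collapse adjacent duplicates, and pad to four characters. The `soundex` function name and the `CODES` table layout are choices made for this example.

```python
# Consonant groups and their Soundex digits. Vowels, y, h, and w have no digit.
CODES = {}
for letters, digit in (
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return a four-character Soundex code, or "" if the word has no letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")  # the first letter's digit still suppresses repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a digit
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated digit after it is kept
        if len(result) == 4:
            break
    return result.ljust(4, "0")


print(soundex("Robert"), soundex("Rupert"))
```

Words that sound alike, such as "Robert" and "Rupert," receive the same code, which is the classification by sound that the paragraph describes.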


The next step was the practice of identifying and removing stop words. Stop
words are words that are needed for proper communication but which are
extraneous to the meaning of what is being said. Typical stop words are words such as
“a,” “and,” “the,” and “to.”


In a way, stop word removal was the first significant practical step toward dealing with
text. Stop word removal erased words that “got in the way” and removed unnecessary
text from further consideration.
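
Stop word removal can be sketched in a few lines of Python. The stop word set below contains only the four examples given in the text, and the `remove_stop_words` name and the tokenizing pattern are assumptions made for illustration. A real stop word list would be much longer.

```python
import re

# Only the four examples named in the text; real lists are far larger.
STOP_WORDS = frozenset({"a", "and", "the", "to"})


def remove_stop_words(text: str, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop the stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in stop_words]


kept = remove_stop_words("The mover went to the store and bought a lamp")
print(kept)
```

What remains is the shorter list of words that carry the meaning of the sentence, which is what later steps work on.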


After stop word removal came tagging. Tagging is the practice of examining a document
and finding and identifying desired words in the document. Tagging words inside a
document is a good and effective way to start to understand what is inside a document.
However, tagging has several drawbacks. The first drawback of tagging is that in order to
know how to tag a document, you have to know what words you are looking for before
you ever do the tagging. This presupposes that you know what the person is going to say
before they say it. And in most circumstances, that is a fallacious assumption. The second
drawback of tagging is that there is a lot more to understand about text than the mere
identification of words.


Nevertheless, tagging was a real step forward in the management of text.
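
A minimal Python sketch of tagging, under the assumption that a tag is simply a desired word plus the character offset where it was found. The `tag_document` name and the shape of its return value are hypothetical. The example also shows the first drawback in action: only words chosen in advance are found.

```python
import re


def tag_document(text: str, target_words):
    """Return (word, offset) pairs for each occurrence of a desired word."""
    targets = {word.lower() for word in target_words}
    tags = []
    for match in re.finditer(r"[A-Za-z']+", text):
        word = match.group().lower()
        if word in targets:
            tags.append((word, match.start()))
    return tags


document = "The mover moved the piano. Pianos are heavy."
tags = tag_document(document, ["mover", "piano"])
print(tags)
```

Here "Pianos" goes untagged because only the exact word "piano" was on the list. The tagger finds what it was told to look for and nothing else.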


The next step in the progression toward putting text into a database was the use of
taxonomies to analyze sentences. Taxonomic resolution occurs when a taxonomy
is created and then matched against the raw text. In matching the text, words
can be classified. In many regards, the use of taxonomies was the secret that began to
unlock the process of textual analysis. There are MANY things that can be done with text
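
Taxonomic resolution as described above can be sketched in Python as a lookup from terms to categories. The taxonomy contents, the `TAXONOMY` layout, and the `classify_words` name are invented for this example. The source does not specify how a taxonomy is stored or matched.

```python
import re

# A tiny hypothetical taxonomy: each category lists the terms that belong to it.
TAXONOMY = {
    "vehicle": {"car", "truck", "van"},
    "tree": {"oak", "elm", "pine"},
}


def classify_words(text: str, taxonomy):
    """Match raw text against the taxonomy and return (word, category) pairs."""
    # Invert the taxonomy so each term points at its category.
    lookup = {term: category for category, terms in taxonomy.items() for term in terms}
    tokens = re.findall(r"[a-z]+", text.lower())
    return [(token, lookup[token]) for token in tokens if token in lookup]


classified = classify_words("The truck hit an oak near the van", TAXONOMY)
print(classified)
```

Unlike plain tagging, each matched word now carries a classification, so "truck" and "van" can be treated as the same kind of thing in later analysis.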


Chapter 17.1: Managing Text