Fig. 10.1.16 Stop word processing.
To see how stop word processing works, consider the following raw text:
...he walked up the steps, looking to make sure he carried the bag properly...
After stop words are removed, the resulting text would look like the following:
...walked steps looking carried bag...
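The removal step above can be sketched in a few lines of Python. This is only an illustration, not how any particular textual ETL product implements it. The stop word list here is an assumption, chosen so that the output matches the example in the text; a real list would be larger and tuned to the documents being processed. The function name remove_stop_words is likewise hypothetical.

```python
import re

# Hypothetical stop word list, chosen only so that the sample sentence
# reduces to the same result shown in the text.
STOP_WORDS = {"he", "up", "the", "to", "make", "sure", "properly"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the words of `text` that are not stop words, in original order."""
    # Lowercase the text and pull out runs of letters, dropping punctuation.
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word not in stop_words]


raw = "he walked up the steps, looking to make sure he carried the bag properly"
print(" ".join(remove_stop_words(raw)))  # walked steps looking carried bag
```

Keeping the stop words in a set makes each lookup cheap, which matters when the same list is applied to a large volume of raw text.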
Word Stemming
Another editing feature of textual ETL that is sometimes useful is stemming. Latin-based
words have word stems, and there are usually many forms of the same word. Consider the
stem “mov.” The different forms of the word stem mov include move, mover, moves,
moving, and moved. Note that the stem itself may or may not be an actual word.
It is often useful to associate pieces of text that use the same word stem. It is
easy to reduce a word down to its word stem in textual ETL, as seen in Fig. 10.1.17.
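As a rough illustration of the idea, the sketch below strips a handful of common English suffixes so that the forms listed above all reduce to the stem "mov", and then groups words that share a stem. It is a deliberately naive approach, not an established stemming algorithm and not the method used by textual ETL. The suffix list, the minimum stem length, and the names naive_stem and group_by_stem are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical suffix list, checked in this order so that longer endings
# such as "es" are tried before shorter ones such as "e" or "s".
SUFFIXES = ("ing", "ed", "er", "es", "e", "s")


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least `min_stem_len` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no suffix applied; the word is its own stem


def group_by_stem(words):
    """Associate words that reduce to the same stem."""
    groups = defaultdict(list)
    for word in words:
        groups[naive_stem(word)].append(word)
    return dict(groups)


forms = ["move", "mover", "moves", "moving", "moved"]
print(group_by_stem(forms))
# {'mov': ['move', 'mover', 'moves', 'moving', 'moved']}
```

A suffix stripper this simple will mis-stem many words, so production systems generally rely on a tested stemming algorithm or a dictionary of word forms instead.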
Chapter 10.1: Nonrepetitive Data