Fig. 4.6.9 Preprocessing text.

E-mails—A Special Case


E-mails are a special case of nonrepetitive unstructured data. E-mails are special because
everybody has them and because there are so many of them. Another reason why e-mails
are special is that they carry with them an enormous amount of system overhead that
is useful to the system and to no one else. E-mails also carry a lot of valuable information
when it comes to customers' attitudes and activities.


It is possible to simply send e-mails directly into textual disambiguation. But such an
exercise is fruitless because of the spam and blather that are found in e-mails. Spam is the
information, irrelevant to the business, that is generated outside the corporation. Blather is
the internally generated correspondence that is not business related. For example, blather
includes the jokes that are sent throughout the corporation.


In order to use textual disambiguation effectively, the spam, blather, and system
information need to be filtered out. Otherwise, the system becomes overwhelmed
with meaningless information.


Fig. 4.6.10 shows that there is a filter to remove unnecessary information from the stream
of e-mails before the e-mails are processed by textual disambiguation.
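
As a rough illustration of such a filter, the following Python sketch drops spam and blather and strips the system overhead (the mail headers) before anything is passed on to textual disambiguation. It is a minimal sketch under stated assumptions, not the filter used by any particular product: the corporate domain example.com, the keyword list, and the function names classify and filter_emails are all hypothetical, and a real filter would need a far better relevance test than keyword matching.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Assumptions for illustration only: the corporate domain and the keyword
# list are hypothetical stand-ins for a real relevance test.
INTERNAL_DOMAIN = "example.com"
NONBUSINESS_WORDS = {"joke", "jokes", "lottery", "prize"}


def classify(sender, text):
    """Label one e-mail as 'business', 'spam', or 'blather'."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    if not words & NONBUSINESS_WORDS:
        return "business"
    # Irrelevant mail from inside the corporation is blather;
    # irrelevant mail from outside is spam.
    internal = sender.lower().endswith("@" + INTERNAL_DOMAIN)
    return "blather" if internal else "spam"


def filter_emails(raw_emails):
    """Keep only business e-mails, reduced to sender, subject, and body."""
    kept = []
    for raw in raw_emails:
        msg = message_from_string(raw)
        _, sender = parseaddr(msg.get("From", ""))
        subject = msg.get("Subject", "")
        # Simplification: only single-part, plain-text messages are read.
        body = "" if msg.is_multipart() else msg.get_payload()
        if classify(sender, subject + " " + body) == "business":
            # All other headers (routing, mailer, and so on) are the
            # system overhead and are deliberately left behind.
            kept.append({"from": sender,
                         "subject": subject,
                         "body": body.strip()})
    return kept


if __name__ == "__main__":
    sample = [
        "From: Ann <ann@example.com>\nSubject: Joke of the day\n\nA joke.\n",
        "From: win@promo.test\nSubject: You won\n\nClaim your prize now\n",
        "From: Bob <bob@customer.test>\nReceived: by mx1\n"
        "Subject: Order delayed\n\nMy order has not arrived.\n",
    ]
    for mail in filter_emails(sample):
        print(mail)
```

Only the third sample message survives the filter, and it arrives without its routing headers. The design choice here is to filter before disambiguation rather than after, so that the more expensive textual processing is never spent on messages that carry no business value.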


Chapter 4.6: Textual Disambiguation