Data Mining: Practical Machine Learning Tools and Techniques, Second Edition

cleaned up. The idea of companywide database integration is known as data
warehousing. Data warehouses provide a single consistent point of access to
corporate or organizational data, transcending departmental divisions. They are
the place where old data is published in a way that can be used to inform busi-
ness decisions. The movement toward data warehousing is a recognition of the
fact that the fragmented information that an organization uses to support day-
to-day operations at a departmental level can have immense strategic value
when brought together. Clearly, the presence of a data warehouse is a very useful
precursor to data mining, and if it is not available, many of the steps involved
in data warehousing will have to be undertaken to prepare the data for mining.
Often even a data warehouse will not contain all the necessary data, and you
may have to reach outside the organization to bring in data relevant to the
problem at hand. For example, weather data had to be obtained in the load
forecasting example in the last chapter, and demographic data is needed for
marketing and sales applications. Sometimes called overlay data, this is not
normally collected by an organization but is clearly relevant to the data mining
problem. It, too, must be cleaned up and integrated with the other data that has
been collected.
Another practical question when assembling the data is the degree of aggre-
gation that is appropriate. When a dairy farmer decides which cows to sell, the
milk production records—which an automatic milking machine records twice
a day—must be aggregated. Similarly, raw telephone call data is of little use when
telecommunications companies study their clients’ behavior: the data must be
aggregated to the customer level. But do you want usage by month or by quarter,
and for how many months or quarters in arrears? Selecting the right type and
level of aggregation is usually critical for success.
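The choice between monthly and quarterly usage can be made concrete with a minimal sketch. The record layout (customer, date, minutes), the sample values, and the function name aggregate_usage are all hypothetical, chosen only for illustration; a real telecommunications dataset would have many more fields.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw call records: (customer_id, call_date, duration_minutes).
calls = [
    ("c1", date(2004, 1, 5), 12.0),
    ("c1", date(2004, 1, 20), 3.5),
    ("c1", date(2004, 2, 2), 7.0),
    ("c2", date(2004, 1, 9), 30.0),
    ("c2", date(2004, 4, 1), 4.0),
]


def aggregate_usage(records, level="month"):
    """Sum call minutes per customer per period (month or quarter)."""
    if level not in ("month", "quarter"):
        raise ValueError("level must be 'month' or 'quarter'")
    totals = defaultdict(float)
    for customer, day, minutes in records:
        if level == "month":
            period = (day.year, day.month)
        else:
            # Quarters are numbered 1-4 within each calendar year.
            period = (day.year, (day.month - 1) // 3 + 1)
        totals[(customer, period)] += minutes
    return dict(totals)


monthly = aggregate_usage(calls, "month")
quarterly = aggregate_usage(calls, "quarter")
```

The same raw records yield different instances at each level: customer c1 has two monthly totals for the first quarter but a single quarterly one, so the level chosen determines what patterns a learning scheme can see.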
Because so many different issues are involved, you can’t expect to get it right
the first time. This is why data assembly, integration, cleaning, aggregating, and
general preparation take so long.


ARFF format

We now look at the ARFF file, a standard way of representing datasets that
consist of independent, unordered instances and do not involve relationships
among instances.
Figure 2.2 shows an ARFF file for the weather data in Table 1.3, the version
with some numeric features. Lines beginning with a % sign are comments.
Following the comments at the beginning of the file are the name of the rela-
tion (weather) and a block defining the attributes (outlook, temperature, humid-
ity, windy, play?). Nominal attributes are followed by the set of values they can
take on, enclosed in curly braces. Values can include spaces; if so, they must be
placed within quotation marks. Numeric attributes are followed by the keyword
numeric.
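The header conventions just described can be sketched in a few lines of Python. This is an illustrative helper, not part of any ARFF library: the function names quote and arff_header are hypothetical, the comment text is invented, and the nominal values listed for the weather attributes are assumed from the weather data rather than copied from Figure 2.2. Only the header is produced; the data section is not covered here.

```python
def quote(value):
    """Wrap a value in quotation marks if it contains a space."""
    return f'"{value}"' if " " in value else value


def arff_header(relation, attributes, comments=()):
    """Build the comment, relation, and attribute lines of an ARFF file.

    Each attribute is a (name, spec) pair, where spec is either the
    string "numeric" or a list of the nominal values it can take on.
    """
    lines = [f"% {text}" for text in comments]
    lines.append(f"@relation {quote(relation)}")
    lines.append("")
    for name, spec in attributes:
        if spec == "numeric":
            lines.append(f"@attribute {quote(name)} numeric")
        else:
            values = ", ".join(quote(v) for v in spec)
            lines.append(f"@attribute {quote(name)} {{ {values} }}")
    return "\n".join(lines)


# Assumed attribute definitions for the weather data.
header = arff_header(
    "weather",
    [
        ("outlook", ["sunny", "overcast", "rainy"]),
        ("temperature", "numeric"),
        ("humidity", "numeric"),
        ("windy", ["true", "false"]),
        ("play?", ["yes", "no"]),
    ],
    comments=["ARFF file for the weather data with some numeric features"],
)
print(header)
```

A nominal value such as very hot would be emitted as "very hot", following the rule that values containing spaces must be quoted.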

