- Feature Selection. Often, not all features gathered are useful. Some
may be irrelevant, or there may be a lack of computational power to
make use of all of them, among many other reasons. In these cases, a
subset of features is selected that ideally enhances the performance
of the selected data mining algorithm. In our example, the customer's
name is irrelevant to the value of the class attribute and to the task
of predicting whether the individual will buy the given book.
- Feature Extraction. In contrast to feature selection, feature extraction
converts the current set of features to a new set of features that can
perform the data mining task better. A transformation is performed
on the data, and a new set of features is extracted. The example we
provided for aggregation is also an example of feature extraction
where a new feature (area) is constructed from two other features (width and height).
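As a minimal sketch, the area feature above can be constructed as
follows (the width and height values are hypothetical):

    # A minimal sketch of feature extraction: building a new feature
    # (area) from two existing features (width and height).
    import numpy as np

    width = np.array([2.0, 3.5, 1.2])   # existing feature
    height = np.array([4.0, 2.0, 5.5])  # existing feature
    area = width * height               # extracted feature

    # The new feature can replace the original pair.
    X_new = area.reshape(-1, 1)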
- Sampling. Often, processing the whole dataset is expensive. With the
massive growth of social media, processing large streams of data is
nearly impossible. This motivates the need for sampling. In sampling,
a small random subset of instances is selected and processed instead
of the whole dataset. The selection process should guarantee that the
sample is representative of the distribution that governs the data,
thereby ensuring that results obtained on the sample are close to ones
obtained on the whole dataset. The following are three major sampling
techniques:
- Random sampling. In random sampling, instances are selected
uniformly from the dataset. In other words, in a dataset of size
n, all instances have an equal probability 1/n of being selected. Note
that other probability distributions can also be used to sample the
dataset, and the distribution can be different from uniform.
- Sampling with or without replacement. In sampling with
replacement, an instance can be selected multiple times in the
sample. In sampling without replacement, instances are removed
from the selection pool once selected.
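As a minimal sketch, uniform random sampling with and without
replacement can be performed with NumPy (the toy dataset and the
sample size are hypothetical):

    # A minimal sketch of uniform random sampling, with and without
    # replacement.
    import numpy as np

    rng = np.random.default_rng(seed=42)
    data = np.arange(100)  # a toy dataset of n = 100 instances

    # Each instance is drawn with equal probability 1/n.
    with_replacement = rng.choice(data, size=10, replace=True)
    without_replacement = rng.choice(data, size=10, replace=False)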
- Stratified sampling. In stratified sampling, the dataset is first
partitioned into multiple bins; then a fixed number of instances
are selected from each bin using random sampling. This technique
is particularly useful when the dataset does not have a uniform
distribution for class attribute values (i.e., class imbalance). For
instance, consider a set of 10 females and 5 males. A sample of