Social Media Mining: An Introduction


CUUS2079-05 CUUS2079-Zafarani 978 1 107 01885 3 January 13, 2014 19:23


5.1 Data

                   Attributes                          Class
Name   Money Spent   Bought Similar   Visits       Will Buy
John   High          Yes              Frequently   ?
Mary   High          Yes              Rarely       Yes

A dataset is represented using a set of features, and an instance (also
called a point, data point, or observation) is represented using values
assigned to these features. Features are also known as measurements or
attributes. In this example, the features are Name, Money Spent, Bought
Similar, and Visits; feature values for the first instance are John, High,
Yes, and Frequently. Given the feature values for one instance, one tries
to predict its class (or class attribute) value. In our example, the class
attribute is Will Buy, and our class value prediction for the first
instance is Yes. An instance such as John in which the class attribute
value is unknown is called an unlabeled instance. Similarly, a labeled
instance is an instance in which the class attribute value is known. Mary
in this dataset represents a labeled instance. The class attribute is
optional in a dataset and is only necessary for prediction purposes. One
can have a dataset in which no class attribute is present, such as a list
of customers and their characteristics.
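The definitions above can be sketched in Python. This is a minimal
illustration, assuming that instances are stored as plain dictionaries and
that an unknown class value (the "?" in the table) is encoded as None; both
choices are assumptions made for the example, not part of the text.

```python
# Feature names and the class attribute from the example table.
FEATURES = ["Name", "Money Spent", "Bought Similar", "Visits"]
CLASS_ATTRIBUTE = "Will Buy"

# Each instance assigns a value to every feature. None marks an unknown
# class value (an assumed encoding for the "?" in the table).
dataset = [
    {"Name": "John", "Money Spent": "High", "Bought Similar": "Yes",
     "Visits": "Frequently", "Will Buy": None},
    {"Name": "Mary", "Money Spent": "High", "Bought Similar": "Yes",
     "Visits": "Rarely", "Will Buy": "Yes"},
]

def is_labeled(instance):
    """Return True when the class attribute value is known."""
    return instance.get(CLASS_ATTRIBUTE) is not None

labeled = [inst for inst in dataset if is_labeled(inst)]
unlabeled = [inst for inst in dataset if not is_labeled(inst)]

print([inst["Name"] for inst in labeled])    # ['Mary']
print([inst["Name"] for inst in unlabeled])  # ['John']
```

A dataset with no class attribute at all would simply omit the Will Buy
key from every dictionary.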
There are different types of features, based on the characteristics of the
feature and the values it can take. For instance, Money Spent can be
represented using numeric values, such as $25. In that case, we have a
continuous feature, whereas in our example it is a discrete feature, which
can take a number of ordered values: {High, Normal, Low}.
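The two representations of Money Spent can be related by a short Python
sketch. The dollar cut-offs used here are hypothetical and chosen only for
illustration; the text does not say how a continuous amount would map onto
{High, Normal, Low}.

```python
# Ordered values of the discrete feature, from lowest to highest.
LEVELS = ["Low", "Normal", "High"]

def discretize_money_spent(amount, low_max=20.0, normal_max=100.0):
    """Map a continuous dollar amount to an ordered discrete value.

    low_max and normal_max are hypothetical cut-offs, not from the text.
    """
    if amount < low_max:
        return "Low"
    if amount < normal_max:
        return "Normal"
    return "High"

def rank(level):
    """Position of a discrete value in the ordering Low < Normal < High."""
    return LEVELS.index(level)

print(discretize_money_spent(25.0))   # Normal
print(rank("High") > rank("Normal"))  # True
```

Because the discrete values are ordered, comparing their ranks is
meaningful even though the exact dollar amounts are no longer available.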
Different types of features were first introduced by psychologist Stanley
Smith Stevens [1946] as “levels of measurement” in the theory of scales.
He claimed that there are four types of features. For each feature type,
there exists a set of permissible operations (statistics) using the feature
values and transformations that are allowed.
 Nominal (categorical). These features take values that are often
represented as strings. For instance, a customer’s name is a nominal
feature. In general, only a few statistics can be computed on nominal
features. Examples are the chi-square statistic (χ²) and the mode (most
common feature value). For example, one can find the most common first
name among customers. The only possible transformation on the data is
comparison. For example, we can check whether our customer’s name is
John or not. Nominal feature values are often presented in a set format.
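
The operations permitted on a nominal feature can be illustrated with a
small Python sketch. The list of first names is made up for the example,
and using collections.Counter to find the mode is just one convenient
choice.

```python
from collections import Counter

# Hypothetical list of customer first names (a nominal feature).
names = ["John", "Mary", "John", "Ana", "Mary", "John"]

# Mode: the most common feature value and how often it occurs.
mode_name, mode_count = Counter(names).most_common(1)[0]
print(mode_name, mode_count)  # John 3

# Comparison: equality checks are meaningful; ordering is not.
print(names[0] == "John")  # True

# Set format: the distinct values the feature takes.
distinct_names = set(names)
print(distinct_names == {"John", "Mary", "Ana"})  # True
```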