APC Australia - September 2019

saying is that the rules produced by the
JRip algorithm are able to correctly
classify an average of just over 92
email records out of every 100, which is
actually pretty darn good.
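
To put that percentage into raw counts, here is a quick back-of-the-envelope sketch. It assumes the 92.393% accuracy figure and the 4,601-record total quoted elsewhere in this article come from the same evaluation run; the variable names are purely illustrative.

```python
# Figures quoted in the article; assumed to describe the same evaluation run.
total_records = 4601
accuracy_pct = 92.393

# Convert the percentage into whole-record counts.
correct = round(total_records * accuracy_pct / 100)
misclassified = total_records - correct

print(f"{correct} classified correctly, {misclassified} misclassified")
```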

WHAT ARE ‘DECISION RULES’?
Ever played around with the web
service ‘If This Then That’ (ifttt.com)?
It allows you to create conditions or
‘rules’ by which you can set activities to
occur, such as automatically lighting
the path for the pizza guy when he
comes to drop off your pizza. It uses the
simple ‘IF [this] THEN [that]’
rule scheme. Decision rules work the
same way, except that the [this]
can be multiple attributes having
certain values or ranges of values and
the [that] is the appropriate class
value, which, in our case, is spam (1) or
not spam (0).
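
For the coders, here is a minimal sketch of how such a rule could be represented: a set of attribute ranges that must all hold, plus the class value to return. The attribute names and thresholds below are made up for illustration, and the 'first matching rule wins, otherwise fall back to a default class' behaviour is an assumption about how an ordered rule list is usually applied, not something taken from Weka's code.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Range = Tuple[float, float]  # inclusive (low, high) bounds for one attribute
INF = float("inf")


@dataclass
class DecisionRule:
    conditions: Dict[str, Range]  # the IF part: every range must be satisfied
    label: int                    # the THEN part: 1 = spam, 0 = not spam

    def matches(self, record: Dict[str, float]) -> bool:
        # A rule fires only when ALL of its attribute conditions hold.
        return all(
            attr in record and low <= record[attr] <= high
            for attr, (low, high) in self.conditions.items()
        )


def classify(record: Dict[str, float], rules: List[DecisionRule], default: int = 0) -> int:
    # Try the rules in order; fall back to the default class if none fires.
    for rule in rules:
        if rule.matches(record):
            return rule.label
    return default


# Hypothetical rule: lots of dollar signs AND a long run of capitals => spam.
rules = [
    DecisionRule({"dollar_sign_freq": (0.5, INF), "longest_capital_run": (20, INF)}, label=1),
]

print(classify({"dollar_sign_freq": 0.9, "longest_capital_run": 30}, rules))
print(classify({"dollar_sign_freq": 0.9, "longest_capital_run": 5}, rules))
```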
Scroll back up the Weka output
screen until you see the ‘JRIP rules:’
header. JRip found 17 rules for
determining the classification of the
4,601 email records. The first rule is:

(char_freq_%21 >= 0.079) and (char_freq_%24 >= 0.013) and
(capital_run_length_longest >= 43) and (char_freq_%23 >= 0.008)
=> class=1 (337.0/0.0)

This rule says that if the frequency
of character ‘%21’ (exclamation mark, !)
is greater than or equal to 0.079, the
frequency of character ‘%24’ (dollar
sign, $) is greater than or equal to
0.013, the longest run of capitalised
letters is 43 or more and the frequency
of character ‘%23’ (hash, #) is greater than or equal to 0.008, then the email record is considered ‘spam’ (class=1). The (337.0/0.0) at the end indicates that 337 records had this combination of attributes and values, with zero cases where the class was not ‘spam’. We don’t have the space to work through the remaining 16 rules, but each one can be applied to each record in the same way and, for 92.393% of records, these rules will get you the right answer.

From here, if you were into coding, you could use these decision rules as the basis of a (very basic) spam filter using a simple form of ‘natural language processing’ (NLP). You process each email, counting up the particular words and characters, then feed the results into the rules. The class value the rules suggest then tells you whether the email is spam or not.

All that said, this dataset dates back to 1999, so it’s pretty long in the tooth. However, it does show, albeit in a fairly simplistic way, that machine learning can be applied to almost any problem, including the identification of spam email.
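
As a rough illustration of that spam-filter idea, the sketch below counts the relevant characters in a raw message and tests them against the first JRip rule quoted above. It assumes the character frequencies are percentages of all characters in the message and that a 'capital run' is an unbroken sequence of A-Z letters; the function names and the sample messages are invented for this example, and the other 16 rules are not included.

```python
import re
from typing import Dict


def extract_features(text: str) -> Dict[str, float]:
    """Compute the four attributes used by the first JRip rule."""
    total = len(text)

    def char_pct(ch: str) -> float:
        # Assumed convention: occurrences of ch as a percentage of all characters.
        return 100.0 * text.count(ch) / total if total else 0.0

    capital_runs = re.findall(r"[A-Z]+", text)
    return {
        "char_freq_%21": char_pct("!"),
        "char_freq_%24": char_pct("$"),
        "char_freq_%23": char_pct("#"),
        "capital_run_length_longest": max((len(run) for run in capital_runs), default=0),
    }


def first_rule_fires(features: Dict[str, float]) -> bool:
    """The first rule from the Weka output: all four conditions must hold."""
    return (
        features["char_freq_%21"] >= 0.079
        and features["char_freq_%24"] >= 0.013
        and features["capital_run_length_longest"] >= 43
        and features["char_freq_%23"] >= 0.008
    )


# Invented sample messages, purely for demonstration.
shouty = "A" * 50 + " WIN $$$ NOW!!! #1 deal"
polite = "Hi Bob, see you at lunch tomorrow."

for message in (shouty, polite):
    verdict = "spam (class=1)" if first_rule_fires(extract_features(message)) else "no match"
    print(verdict)
```

A fuller filter would compute the remaining attributes too and run every rule in turn, but the pattern stays the same: extract the counts, then check them against each rule's conditions.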

Click the Explorer button when the Weka GUI Chooser appears on screen.

The rules JRip generates follow the standard ‘if-this-then-that’ schema.
