APC Australia - September 2019


saying is that the rules produced by the
JRip algorithm are able to correctly
classify an average of just over 92
email records out of every 100, which is
actually pretty darn good.


WHAT ARE ‘DECISION RULES’?
Ever played around with the web
service ‘If This Then That’ (ifttt.com)?
It allows you to create conditions or
‘rules’ by which you can set activities to
occur, such as automatically lighting
the path for the pizza guy when he
comes to drop off your pizza. It uses the
simple ‘IF <this> THEN <that>’
rule scheme. Decision Rules work the
same way, except that the <this>
can be multiple attributes having
certain values or ranges of values and
the <that> is the appropriate class
value, which, in our case, is spam (1) or
not spam (0).
Scroll back up the Weka output
screen until you see the ‘JRIP rules:’
header. JRip found 17 rules for
determining the classification of the
4,601 email records. The first rule is:


(char_freq_%21 >= 0.079) and (char_freq_%24 >= 0.013) and
(capital_run_length_longest >= 43) and (char_freq_%23 >= 0.008)
=> class=1 (337.0/0.0)


This rule says that if the frequency
of character ‘%21’ (exclamation mark, !)
is greater than or equal to 0.079, the
frequency of character ‘%24’ (dollar
sign, $) is greater than or equal to
0.013, the longest run of capitalised
letters is 43 or more and the frequency
of character ‘%23’ (hash, #) is greater
than or equal to 0.008, then the email
record is considered ‘spam’ (class=1).
The (337.0/0.0) at the end indicates
there were 337 records that had this
combination of attributes and values,
with zero cases where the class was not
‘spam’. We don’t have the space to
go through the remaining 16 rules, but
each one can be applied to each record
in the same way and, in 92.393% of
records, these rules will get you the
right answer.
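
To make that first rule concrete, here is a minimal Python sketch of it written as a plain function. The attribute names follow the rule as printed in the Weka output. The function name `first_jrip_rule` and the sample record are made up for illustration, and only this one rule is encoded, not all 17.

```python
def first_jrip_rule(record):
    """Return 1 (spam) if the record matches JRip's first rule, else None.

    None means "this rule does not fire"; a fuller classifier would then
    move on and try the remaining rules.
    """
    if (record["char_freq_%21"] >= 0.079            # '!' frequency
            and record["char_freq_%24"] >= 0.013    # '$' frequency
            and record["capital_run_length_longest"] >= 43
            and record["char_freq_%23"] >= 0.008):  # '#' frequency
        return 1
    return None


# Hypothetical record: these values are invented for illustration only.
sample = {
    "char_freq_%21": 0.35,
    "char_freq_%24": 0.05,
    "capital_run_length_longest": 60,
    "char_freq_%23": 0.01,
}
print(first_jrip_rule(sample))  # prints 1
```

All four conditions have to hold for the rule to fire. Drop the capital run below 43, for example, and the function returns `None` instead of a class.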
From here, if you were into coding,
you could use these decision rules as the
basis of a (very basic) spam filter using
a simple form of ‘natural language
processing’ (NLP). You process each
email, counting up the particular words
and characters, then feed the results
into the rules. The class value the rules
suggest then gives you the answer to
whether the email was spam or not.
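
As a rough idea of what that might look like, the Python sketch below counts the relevant characters in a raw email and feeds the results into the first JRip rule only. It assumes the character frequencies are percentages of all characters in the email, which is how we understand the original dataset documentation to describe them. The function names `extract_features` and `looks_like_spam` are our own inventions, not part of Weka.

```python
import re


def extract_features(text):
    """Compute the four attributes the first rule needs from raw email text.

    Assumption: each char_freq value is 100 * occurrences / total characters.
    """
    total = len(text) or 1  # avoid dividing by zero on an empty email
    capital_runs = re.findall(r"[A-Z]+", text)  # unbroken runs of capitals
    return {
        "char_freq_%21": 100.0 * text.count("!") / total,
        "char_freq_%24": 100.0 * text.count("$") / total,
        "char_freq_%23": 100.0 * text.count("#") / total,
        "capital_run_length_longest": max(
            (len(run) for run in capital_runs), default=0
        ),
    }


def looks_like_spam(text):
    """Apply only the first JRip rule; True means that rule says 'spam'."""
    f = extract_features(text)
    return (f["char_freq_%21"] >= 0.079
            and f["char_freq_%24"] >= 0.013
            and f["capital_run_length_longest"] >= 43
            and f["char_freq_%23"] >= 0.008)


# Two made-up messages to try it on.
print(looks_like_spam("A" * 50 + " WIN $$$ NOW!!! #1"))
print(looks_like_spam("Hello, see you at lunch."))
```

A `False` here only means the first rule did not fire. A real filter would need all 17 rules, plus the word-frequency attributes that the other rules rely on.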
Now all that said, this dataset dates
back to 1999, so it’s pretty long in the
tooth. However, it does show, albeit in
a fairly simplistic way, that machine
learning can be applied to almost any
application – including identification
of spam email.

Click the Explorer
button when the Weka
GUI Chooser appears
on screen.

The rules JRip generates
follow the standard
‘if-this-then-that’ schema.