APC Australia - September 2019

nuts-and-bolts method of using
computational algorithms to find
patterns within a set of data or
‘dataset’. Data mining, however, is the
broader field that seeks information
and knowledge from data. In fact, the
area of ‘data science’ is also described
as ‘knowledge discovery in databases’
(KDD). In other words, the way you
mine data for knowledge is to use
machine learning.


NO MACHINE LEARNING
IS PERFECT
However, the important thing to
remember is that, despite the hype,
machine learning isn’t perfect. No
single machine learning algorithm can
identify every pattern in every
dataset with 100% accuracy. Start-up
Babylon Health’s ‘GP at Hand’ service
hasn’t debuted here yet, but there’s
been plenty of debate in the U.K.
during the last couple of years over its
use of ‘artificial intelligence’ to
perform medical triage on patients,
releasing ‘live’ doctors to see more
acutely-ill patients. The machine-
learning chatbot behind the ‘GP at
Hand’ app was reportedly tested by
Babylon Health on questions similar to
those used in the Royal College of
General Practitioners membership
exam. While human doctors are said to
score around 72% on average, the
chatbot claimed a score of 81% on its
first attempt (tinyurl.com/y36bxnht).
Depending on which side of the ‘AI
doctor’ divide you stand, that 81%
either excites or scares you. In our
context, the key point is the score
wasn’t 100%. Because life rarely is.


A WORKING EXAMPLE
The University of California, Irvine
(UCI) houses arguably the world’s most
well-known dataset archive, used in
countless academic research papers,
from machine-learning to health
research. You’ll find it at
https://archive.ics.uci.edu/ml/datasets.php.
One of the 450-plus datasets in the
archive is the ‘spambase’ dataset. This
dataset, created by Hewlett-Packard
back in 1999, contains 4,601 records of
emails, each with 57 features or
‘attributes’ defining some aspect of the
email and an overall ‘class’ attribute of
whether the email was spam (1) or not
(0). The first 48 attributes measure
word frequencies (how often a
particular word occurs in the email, as
a percentage of all its words), the next
six do the same for specific
characters, while the last three look at
the average, maximum and total length
of unbroken runs of capital letters.
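
To make that layout concrete, here is a
minimal Python sketch that splits one
record into the attribute groups
described above. It assumes each record
is a single comma-separated line with
the class value last. The names
split_record and example_line are ours,
for illustration only, and the sample
record is made up rather than taken
from the dataset.

```python
# Column layout of one spambase record, as described in the text:
# 48 word-frequency attributes, 6 character-frequency attributes,
# 3 capital-run-length attributes, then the class (1 = spam, 0 = not).
N_WORD, N_CHAR, N_CAPS = 48, 6, 3
N_FEATURES = N_WORD + N_CHAR + N_CAPS  # 57


def split_record(line):
    """Split one comma-separated record into its attribute groups."""
    values = [float(v) for v in line.strip().split(",")]
    if len(values) != N_FEATURES + 1:
        raise ValueError(
            f"expected {N_FEATURES + 1} values, got {len(values)}"
        )
    return {
        "word_freq": values[:N_WORD],
        "char_freq": values[N_WORD:N_WORD + N_CHAR],
        "capital_run": values[N_WORD + N_CHAR:N_FEATURES],
        "is_spam": int(values[-1]),
    }


# Made-up record for illustration only (not a real row from the
# dataset): all 54 frequencies are 0, the three capital-run values
# are 1, 1, 1 and the class is 0 (not spam).
example_line = ",".join(["0"] * 54 + ["1", "1", "1", "0"])
record = split_record(example_line)
print(len(record["word_freq"]), len(record["char_freq"]),
      len(record["capital_run"]), record["is_spam"])
```

The same function should work on lines
read from a downloaded copy of the data
file, provided it follows the layout
assumed here.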

What we’ll do is use machine
learning to see if we can find patterns
within those 57 attributes that help
determine whether an email is spam or
not.
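
As a rough idea of what ‘finding a
pattern’ can mean, the Python sketch
below searches for the single rule
‘spam if value > threshold’ that best
fits one attribute. This is a
deliberately simple stand-in, not the
method Weka uses, and the six data
points are invented for illustration
rather than taken from spambase.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def best_threshold_rule(values, labels):
    """Find the rule 'spam if value > t' that scores best on this data.

    values: one numeric attribute per email.
    labels: 1 (spam) or 0 (not spam) for each email.
    Returns (threshold, accuracy).
    """
    best_t, best_acc = None, -1.0
    for t in sorted(set(values)):
        predictions = [1 if v > t else 0 for v in values]
        acc = accuracy(predictions, labels)
        if acc > best_acc:  # keep the first threshold on a tie
            best_t, best_acc = t, acc
    return best_t, best_acc


# Invented toy data, NOT taken from spambase: the longest run of
# capitals in six hypothetical emails, and whether each was spam.
longest_caps_run = [2, 3, 40, 5, 60, 4]
is_spam = [0, 0, 1, 0, 1, 1]

threshold, score = best_threshold_rule(longest_caps_run, is_spam)
print(f"spam if longest capital run > {threshold} "
      f"(accuracy {score:.2f})")
```

Even the best rule here gets one of
the six toy emails wrong, which echoes
the earlier point that no rule is
perfect.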

RUNNING YOUR OWN ML
Now you don’t need a fancy system in
order to implement machine learning.
Any PC or laptop running Windows,
macOS or Linux will do. The only thing
is the older the system, the slower it
runs (no surprises there).
Start by heading to
http://www.cs.waikato.ac.nz/ml/weka/downloading.html
and download the latest ‘stable version’
(at time of writing, this was version
3.8.3) for your operating system. We’ll
work with the Windows version, but
the others should be similar. Weka
requires Java, and either Java 8 or 9 is
recommended. If you don’t have Java,
just choose a Weka download that
includes the Java 1.8 virtual machine.

IFTTT works on the
same principle as
decision rules in
machine learning.

OpenML houses ARFF
versions of many UCI
repository datasets.
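
ARFF is the plain-text dataset format
that Weka reads. As a hedged sketch of
what such a file looks like, the Python
below builds ARFF text for numeric
attributes plus a final 0/1 class
column. The function name to_arff, the
relation name, the attribute name and
the two rows are all hypothetical,
chosen only for illustration.

```python
def to_arff(relation, feature_names, rows):
    """Build ARFF text: numeric features plus a final 0/1 class."""
    lines = [f"@relation {relation}", ""]
    for name in feature_names:
        lines.append(f"@attribute {name} numeric")
    lines.append("@attribute class {0,1}")  # nominal class: 0 or 1
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical two-row dataset: one numeric attribute, then the class.
arff_text = to_arff(
    "toy_spam",
    ["capital_run_longest"],
    [[40, 1], [3, 0]],
)
print(arff_text)
```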