APC Australia - September 2019

nuts-and-bolts method of using computational algorithms to find patterns within a set of data or ‘dataset’. Data mining, however, is the broader field that seeks information and knowledge from data. In fact, the area of ‘data science’ is also described as ‘knowledge discovery in databases’ (KDD). In other words, the way you mine data for knowledge is to use machine learning.

NO MACHINE LEARNING IS PERFECT
However, the important thing to remember is that, despite the hype, machine learning isn’t perfect. There is no one machine learning algorithm that can perfectly identify every pattern with 100% accuracy in every dataset.

It hasn’t debuted here yet, but there’s been plenty of debate in the U.K. during the last couple of years over start-up Babylon Health’s ‘GP at Hand’ service using ‘artificial intelligence’ to perform medical triage on patients, releasing ‘live’ doctors to see more acutely-ill patients. The machine-learning chatbot behind the ‘GP at Hand’ app was reportedly tested by Babylon Health on questions similar to those used in the Royal College of General Practitioners membership exam. While human doctors are said to score around 72% on average, the chatbot claimed a score of 81% on its first attempt (tinyurl.com/y36bxnht). Depending on which side of the ‘AI doctor’ divide you stand, that 81% either excites or scares you. In our context, the key point is the score wasn’t 100%. Because life rarely is.

A WORKING EXAMPLE
The University of California, Irvine (UCI) houses arguably the world’s best-known dataset archive, used in countless academic research papers, from machine learning to health research. You’ll find it at https://archive.ics.uci.edu/ml/datasets.php. One of the 450-plus datasets in the archive is the ‘spambase’ dataset. This dataset, created by Hewlett-Packard back in 1999, contains 4,601 records of emails, each with 57 features or ‘attributes’ defining some aspect of the email and an overall ‘class’ attribute of whether the email was spam (1) or not (0). The first 48 attributes look at different word frequencies (how often a particular word occurs in the email, as a percentage of all its words), the next six attributes measure specific character frequencies in the same way, while the last three look at the average, maximum and total sequence length of capitalised letters.
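
To make that layout concrete, here is a minimal Python sketch of how one comma-separated record could be split into the attribute groups just described. The function name split_record and the sample row are made up for illustration (the row is not a real record from the dataset), and a comma-separated file layout is assumed; only the 48/6/3/1 column grouping comes from the description.

```python
# Column layout as described in the text: 48 word frequencies, 6 character
# frequencies, 3 capital-run statistics, then the class label (58 in total).
N_WORD, N_CHAR, N_CAPS = 48, 6, 3
N_COLUMNS = N_WORD + N_CHAR + N_CAPS + 1


def split_record(line):
    """Split one comma-separated record into its attribute groups."""
    values = [float(v) for v in line.strip().split(",")]
    if len(values) != N_COLUMNS:
        raise ValueError(f"expected {N_COLUMNS} values, got {len(values)}")
    return {
        "word_freq": values[:N_WORD],
        "char_freq": values[N_WORD:N_WORD + N_CHAR],
        "capital_runs": values[N_WORD + N_CHAR:-1],  # average, longest, total
        "is_spam": int(values[-1]) == 1,
    }


# A made-up row for illustration only (not taken from the real dataset).
sample = ",".join(["0.0"] * 48 + ["0.1"] * 6 + ["3.5", "40", "191", "1"])
record = split_record(sample)
print(len(record["word_freq"]), len(record["char_freq"]),
      record["capital_runs"], record["is_spam"])
```

Grouping the columns this way is only a convenience for reading a record; a machine learning tool simply sees 57 numeric inputs and one class to predict.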

What we’ll do is use machine learning to see if we can find patterns within those 57 attributes that help determine whether an email is spam or not.
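
As a rough idea of what ‘finding a pattern’ can mean, the Python sketch below learns the simplest possible rule: a single cut-off on one attribute. It is a toy illustration, not what any particular machine learning tool does internally. The function best_threshold and the eight (value, class) pairs are invented, with the value standing in for something like the longest run of capital letters.

```python
def best_threshold(examples):
    """Find the cut-off t for the rule 'value > t means spam' that
    classifies the most examples correctly. Returns (t, accuracy)."""
    best_t, best_correct = None, -1
    # Try each distinct attribute value as a candidate cut-off.
    for t in sorted({value for value, _ in examples}):
        correct = sum((value > t) == bool(label) for value, label in examples)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t, best_correct / len(examples)


# Invented toy data: (longest capital run, class) with 1 = spam, 0 = not spam.
toy = [(2, 0), (4, 0), (5, 0), (9, 1), (12, 0), (15, 1), (30, 1), (48, 1)]

threshold, accuracy = best_threshold(toy)
print(f"rule: value > {threshold} means spam, accuracy {accuracy:.1%}")
```

On this toy data, even the best single rule still gets one of the eight examples wrong, which echoes the earlier point that no learned rule is perfect. Real algorithms combine many such tests across all 57 attributes.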

RUNNING YOUR OWN ML
Now you don’t need a fancy system in order to implement machine learning. Any PC or laptop running Windows, macOS or Linux will do. The only thing is the older the system, the slower it runs (no surprises there). Start by heading to http://www.cs.waikato.ac.nz/ml/weka/downloading.html and downloading the latest ‘stable version’ (at time of writing, this was version 3.8.3) for your operating system. We’ll work with the Windows version, but the others should be similar. Weka requires Java, and either Java 8 or 9 is recommended. If you don’t have Java, just choose a Weka download that includes the Java 1.8 virtual machine.

IFTTT works on the same principle as decision rules in machine learning.

OpenML houses ARFF versions of many UCI repository datasets.
