APC Australia - September 2019


howto » machine learning masterclass


DATA PRE-PROCESSING
Once you’ve installed it, head to the UCI Machine Learning Repository website page for the Spambase dataset at https://archive.ics.uci.edu/ml/datasets/Spambase. Click on the ‘Data Folder’ link and download the files ‘spambase.data’ and ‘spambase.names’.
Now that we’ve got the data, we need to knock it into shape – this step is often called ‘data preprocessing’. Weka handles data in ARFF and CSV (comma-separated values) format. However, while the ‘spambase.data’ file is CSV-ready, it lacks a header row containing the attribute descriptors or ‘names’. That’s where the ‘spambase.names’ file comes in. What we’d normally do now is create the header row by copying the attribute names from the ‘spambase.names’ file into the ‘spambase.data’ file, separating each attribute name with a comma (,). However, for the sake of brevity, we’ll take a short-cut – head over to the OpenML website at https://www.openml.org/d/44 and you can download a ready-to-go Attribute-Relation File Format (ARFF) version instead.
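If you'd rather build the header row yourself than take the short-cut, the copy-and-paste job can be scripted. The Python sketch below is one possible approach, not part of the original workflow. It assumes the ‘spambase.names’ file declares each attribute on a line of the form ‘name: type.’ and marks comments with ‘|’. The function names and the ‘class’ column label are our own choices, and the sample text is a made-up two-attribute excerpt.

```python
import csv
import io


def attribute_names(names_text):
    """Extract attribute names from the text of a '.names' file.

    Assumes each attribute is declared as 'name: type.' and that
    anything after a '|' is a comment.
    """
    names = []
    for line in names_text.splitlines():
        line = line.split("|", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue  # blank lines and the class-values line have no colon
        names.append(line.split(":", 1)[0].strip())
    return names


def add_header(names_text, data_text, class_name="class"):
    """Return the CSV text with a comma-separated header row prepended."""
    header = attribute_names(names_text) + [class_name]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerow(header)
    return out.getvalue() + data_text


# Made-up excerpt in the assumed layout, just to exercise the functions.
sample_names = """| comment line
1, 0.    | spam, non-spam classes

word_freq_make:         continuous.
word_freq_address:      continuous.
"""
sample_data = "0,0.64,1\n0.21,0.28,1\n"
print(add_header(sample_names, sample_data))
```

To use it on the real files, read both into strings with open(...).read() and write the result to a new ‘.csv’ file, leaving the downloads untouched.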
Once you’ve downloaded the ARFF dataset, fire up Weka and you’ll get the Weka GUI Chooser panel. Click on the Explorer button to launch the Weka Explorer. Next, click the Open File button just under the Preprocess tab and load the ‘dataset_44_spambase.arff’ file. Shortly after, you’ll see the list of attributes on the left. Click on one and you’ll get brief details on the right-side panel. Scroll down to the last attribute, ‘class’ – this is the attribute that describes whether or not each email record was classified as ‘spam’ (1, red) or ‘not’ (0, blue). Of the 4,601 records, 1,813 were classified as spam and the remaining 2,788 were not.
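To put those class counts in proportion, here is a small Python sketch that tallies class labels and their shares. The function name is our own, and the label list is simply rebuilt from the counts quoted above rather than read from the dataset file.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: (count, share)} for a sequence of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


# Labels rebuilt from the counts quoted in the article (1 = spam, 0 = not).
labels = [1] * 1813 + [0] * 2788
for label, (n, share) in sorted(class_distribution(labels).items()):
    print(f"class {label}: {n} records ({share:.1%})")
```

Roughly three in five records are not spam, so a classifier that always answered ‘not’ would already be right about 60.6% of the time. That is a useful yardstick for any accuracy figure reported on this dataset.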

CLASSIFICATION
Because each record already has a class value, what we’re doing is called ‘supervised learning’ – we’re looking for the patterns in the attributes that match which email records are spam and which are not. The next step is to select the ‘Classify’ tab at the top of the screen, then press the ‘Choose’ button. Weka comes with a wide array of classification, clustering and association rule mining algorithms built-in. The new subpanel should open up with the ‘Rules’ sub-list showing. Select ‘JRip’. JRip is a Java language version of the ‘Repeated Incremental Pruning to Produce Error Reduction’ or RIPPER algorithm and we’ll use it to discover a list of ‘decision rules’ by which we can classify the 4,601 email records as ‘spam’ or ‘not spam’.
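To get a feel for what a list of decision rules looks like in practice, the Python sketch below applies an ordered rule list to one record: the first rule whose conditions all hold decides the class, and a default applies if none match. This illustrates the general idea only. The attribute names exist in Spambase, but the rules, thresholds and sample record are invented here and are not JRip's actual output.

```python
def classify(record, rules, default):
    """Return the label of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(test(record[attr]) for attr, test in conditions):
            return label
    return default


# Hypothetical rules: thresholds are invented for illustration only.
rules = [
    ([("word_freq_remove", lambda v: v >= 0.1)], "spam"),
    ([("word_freq_free", lambda v: v >= 0.5),
      ("capital_run_length_longest", lambda v: v >= 20)], "spam"),
]

# Made-up record: fails the first rule, satisfies the second.
record = {"word_freq_remove": 0.0, "word_freq_free": 0.8,
          "capital_run_length_longest": 35}
print(classify(record, rules, default="not spam"))
```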
When you’re ready, press the ‘Start’ button and Weka will get cracking. The first step – discovering the rules – should take just a couple of seconds. The second step – performing what’s called ‘ten-fold cross-validation’ – will take a little longer.
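Ten-fold cross-validation is why the second step takes longer: the records are split into ten parts, and the algorithm is trained ten times, each time holding back a different part for testing. The Python sketch below shows a plain version of that split. It is an illustration under our own assumptions (simple shuffling, a hypothetical function name and seed), not Weka's implementation, which among other things keeps the class proportions similar in each fold.

```python
import random


def k_fold_indices(n_records, k=10, seed=1):
    """Yield (train, test) index lists for plain k-fold cross-validation."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, like dealing cards.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Fold sizes for the 4,601 Spambase records.
sizes = [len(test) for _, test in k_fold_indices(4601)]
print(sizes)
```

Every record lands in exactly one test fold, so each of the 4,601 records is classified once by a model that never saw it during training.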
At the end, you’ll end up with a results summary that shows a percentage for ‘correctly classified instances’ of 92.393%. What this is

The JRip algorithm correctly classifies 92.393% of Spambase dataset records.
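The percentage can be turned back into record counts with a little arithmetic. The Python sketch below assumes the 92.393% is calculated over all 4,601 records. The resulting counts are inferred from that percentage and are not quoted in the article.

```python
def accuracy_percent(correct, total):
    """Percentage of records classified correctly."""
    return 100.0 * correct / total


total = 4601
# Inferred, not quoted in the article: the whole number of correct
# records that reproduces the reported 92.393%.
correct = round(0.92393 * total)
print(correct, total - correct, f"{accuracy_percent(correct, total):.3f}%")
```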

Use the Preprocess window to view class values of your dataset.