APC Australia - September 2019

how to » machine learning masterclass

DATA PRE-PROCESSING
Once you’ve installed it, head to the UCI Machine Learning Repository website page for the Spambase dataset at https://archive.ics.uci.edu/ml/datasets/Spambase. Click on the ‘Data Folder’ link and download the files ‘spambase.data’ and ‘spambase.names’.

Now that we’ve got the data, we need to knock it into shape – this step is often called ‘data preprocessing’. Weka handles data in ARFF and CSV (comma-separated values) format. However, while the ‘spambase.data’ file is CSV-ready, it lacks a header row containing the attribute descriptors or ‘names’. That’s where the ‘spambase.names’ file comes in. What we’d normally do now is create the header row by copying the attribute names from the ‘spambase.names’ file into the ‘spambase.data’ file, separating each attribute name with a comma (,). However, for the sake of brevity, we’ll take a short-cut – head over to the OpenML website at https://www.openml.org/d/44 and you can download a ready-to-go Attribute-Relation File Format (ARFF) version instead.

Once you’ve downloaded the ARFF dataset, fire up Weka and you’ll get the Weka GUI Chooser panel. Click on the Explorer button to launch the Weka Explorer. Next, click the Open File button just under the Preprocess tab and load the ‘dataset_44_spambase.arff’ file. Shortly after, you’ll see the list of attributes on the left. Click on one and you’ll get brief details on the right-side panel. Scroll down to the last attribute, ‘class’ – this is the attribute that describes whether or not each email record was classified as ‘spam’ (1, red) or ‘not’ (0, blue). Of the 4,601 records, 1,813 were classified as spam and the remaining 2,788 were not.
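If you’d rather not take the short-cut, the manual header-row step described above can be scripted. What follows is only a rough Python sketch, not part of Weka. It assumes the ‘.names’ file uses a C4.5-style layout, where each attribute line looks like ‘name: type.’ and anything after a ‘|’ is a comment. The function names, the tiny sample strings and the ‘class’ column label are our own illustrative choices.

```python
import csv
import io


def attribute_names(names_text):
    """Pull attribute names out of C4.5-style '.names' text.

    Assumption: attribute lines look like 'name: type.' and '|' starts
    a comment. Lines without a colon (such as the class list) are skipped.
    """
    names = []
    for line in names_text.splitlines():
        line = line.split("|", 1)[0].strip()  # drop any trailing comment
        if ":" in line:
            names.append(line.split(":", 1)[0].strip())
    return names


def add_header(names_text, data_text, class_name="class"):
    """Return the CSV data with a header row of attribute names on top."""
    header = attribute_names(names_text) + [class_name]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for row in csv.reader(io.StringIO(data_text)):
        if not row:
            continue  # ignore blank lines
        if len(row) != len(header):
            raise ValueError("row width does not match the header")
        writer.writerow(row)
    return out.getvalue()


# Tiny made-up sample in the assumed layout (not the real Spambase files).
names_sample = """| sample comment line
1, 0.    | spam, non-spam classes

word_freq_make:      continuous.
word_freq_address:   continuous.
"""
data_sample = "0,0.64,1\n0.21,0.28,0\n"

result = add_header(names_sample, data_sample)
print(result)
```

To try it on the real download, you would read ‘spambase.names’ and ‘spambase.data’ into strings, pass them to `add_header()` and save the result with a ‘.csv’ extension. Check the layout of your copy of the ‘.names’ file first, as the parsing rule here is an assumption.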

CLASSIFICATION
Because each record already has a class value, what we’re doing is called ‘supervised learning’ – we’re looking for the patterns in the attributes that match which email records are spam and which are not. The next step is to select the ‘Classify’ tab at the top of the screen, then press the ‘Choose’ button. Weka comes with a wide array of classification, clustering and association rule mining algorithms built-in. The new subpanel should open up with the ‘Rules’ sub-list showing. Select ‘JRip’. JRip is a Java language version of the ‘Repeated Incremental Pruning to Produce Error Reduction’ or RIPPER algorithm and we’ll use it to discover a list of ‘decision rules’ by which we can classify the 4,601 email records as ‘spam’ or ‘not spam’.

When you’re ready, press the ‘Start’ button and Weka will get cracking. The first step – discovering the rules – should take just a couple of seconds. The second step – performing what’s called ‘ten-fold cross-validation’ – will take a little longer.

At the end, you’ll end up with a results summary that shows a percentage for ‘correctly classified instances’ of 92.393%. What this is

The JRip algorithm correctly classifies 92.393% of Spambase dataset records.

Use the Preprocess window to view class values of your dataset.
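To get a feel for what ‘ten-fold cross-validation’ involves, here is a minimal Python sketch of the idea: shuffle the records, deal them into ten folds, then hold out each fold in turn for testing while ‘training’ on the other nine. This is not Weka’s own code and it does not run JRip. As a stand-in model we use a deliberately naive ‘always guess the most common class’ rule, and the function names and random seed are our own assumptions. The class counts are the ones shown in the Preprocess tab.

```python
import random
from collections import Counter


def k_fold_indices(n_records, k=10, seed=1):
    """Shuffle the record indices and deal them into k near-equal folds."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(labels, k=10, seed=1):
    """Overall accuracy of a majority-class guesser under k-fold CV."""
    folds = k_fold_indices(len(labels), k, seed)
    correct = 0
    for held_out in folds:
        held = set(held_out)
        # 'Train' on every record that is not in the held-out fold.
        train_labels = [lab for i, lab in enumerate(labels) if i not in held]
        # Stand-in model (not JRip): predict the most common training label.
        prediction = Counter(train_labels).most_common(1)[0][0]
        # Test on the held-out fold only.
        correct += sum(1 for i in held_out if labels[i] == prediction)
    return correct / len(labels)


# Class counts from the dataset: 1,813 spam (1) and 2,788 not spam (0).
labels = [1] * 1813 + [0] * 2788
accuracy = cross_validate(labels)
print(f"majority-class baseline: {accuracy:.1%}")  # prints 60.6%
```

Because ‘not spam’ is the majority in every training split, this baseline simply gets all 2,788 non-spam records right, which works out to about 60.6%. That gives some context for JRip’s 92.393%: the discovered rules do far better than blind guessing.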
