iris['target'] contains the output flower name
(encoded as 0, 1, or 2) for each flower row of the
input data matrix. Print iris['target'] and you can
see the first 50 entries are all 0s, so the first 50 rows
of the input data matrix are all the same flower type.
Each number corresponds to a flower name in
iris['target_names']: 0 is I. setosa, 1 is I. versicolor,
and 2 is I. virginica.
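
If you want to check this yourself, a short snippet
along these lines should do it (assuming the dataset
has already been loaded into a variable called iris, as
earlier in the tutorial; the import is repeated so the
snippet runs on its own):

from sklearn.datasets import load_iris

iris = load_iris()
print(iris['target'][:50])    # the first 50 labels are all 0
print(iris['target_names'])   # ['setosa' 'versicolor' 'virginica']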

You can get more information on the dataset by
printing out the DESCR of the object:

print(iris['DESCR'])

OK, so we have our data! The machine learning
algorithm that will learn the mapping from the
flower features to the flower name will be a decision
tree. Decision trees are models that try to build a
tree of questions that split the data into the separate
classes (flower types).
We’ll start by importing the algorithm from
scikit-learn.

from sklearn.tree import DecisionTreeClassifier

We want to train our model on some of the data,
and save a portion of our data for testing. It would not
be wise to test a student by just getting them to do
an exam they have been practising with and already
have the answers for, right? They could just memorise
the answers without learning the pattern. So, you give
them a different exam that they don’t already have the
answers to, and compare the answers they ‘predict’
with the real ‘target’ answers.
So, we split the data into training and test data, and
also shuffle the data rows so that there is a good mix
of each flower in both train and test data.
The train_test_split function in scikit-learn does
all of this for us and, by default, puts 25% of the
rows into the test set and 75% into the training set.

from sklearn.model_selection import train_test_split
X = iris['data']
y = iris['target']
X_train, X_test, y_train, y_test = train_test_split(X, y)
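
As a quick sanity check, you can print the shapes of
the resulting arrays; with the default split of the 150
iris rows you should see roughly a 75/25 division,
something like:

print(X_train.shape, X_test.shape)   # expect (112, 4) and (38, 4)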

Now we create a new decision tree model and
train/fit it to our training data. This is the training
phase, so it gets to look at the inputs (flower features
stored in X_train) alongside the outputs (flower
names stored in y_train).
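
A minimal sketch of this step looks like the following
(the variable name model is just this sketch's choice);
you can then get the model's 'predicted' answers for
the unseen test rows with model.predict(X_test) and
compare them with the real 'target' answers in y_test:

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

print(model.predict(X_test))   # the model's predicted flower labels
print(y_test)                  # the real target labels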

[Figure, left: example of a simple decision tree
model after learning the iris problem]

[Decision tree diagram: a root question on Petal
Length > 2.4 leads to Setosa on one branch; the other
branch asks Petal Length < 1.4 to separate Versicolor
from Virginica]