Once the training is finished, the trained model is used to classify the test samples:

y_pred = clf.predict(X_test)

We now have two sets of labels for the test samples. The first, y_test, represents the 'truth', while the second, y_pred, holds the values predicted by the algorithm. By comparing them we determine both the accuracy and the confusion_matrix:

accuracy = accuracy_score(y_test, y_pred)
cf = confusion_matrix(y_test, y_pred)
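Before interpreting the numbers, it is worth recalling how Scikit-Learn lays out the confusion matrix: row i contains the samples whose true class is i, and column j those predicted as class j, with the labels in sorted order (-1, the normal class, first; 1, the anomaly class, second). The short snippet below is not part of the original listing; it unpacks the four cells using the article's y_test and y_pred, and the tn/fp/fn/tp naming treats the anomaly class as the 'positive' one.

from sklearn.metrics import confusion_matrix

# ravel() flattens the 2x2 matrix row by row into its four cells:
# true -1 predicted -1, true -1 predicted 1,
# true 1 predicted -1, true 1 predicted 1
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('normal samples classified as normal: ', tn)
print('normal samples flagged as anomalies: ', fp)
print('anomalies missed (classified normal):', fn)
print('anomalies correctly detected:        ', tp)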
The example generates the following results:

Model accuracy on test set is: 0.9341825902335457
The confusion matrix of the model is:
[[440   0]
 [ 31   0]]

At around 93%, the accuracy looks very good at first glance. The confusion matrix, however, shows that the model classifies every sample of the predominant class correctly while misclassifying every sample of the minority class. A bias is evidently present, so we need a strategy to improve the situation. This can be done by upsampling the data belonging to the minority class so as to balance, at least partially, the dataset. To do this we use the resample function from Scikit-Learn's utils module:

normals = data[data['classvalue'] == -1]
anomalies = data[data['classvalue'] == 1]
anomalies_upsampled = resample(anomalies, replace=True, n_samples=len(normals))

This enlarges the dataset, bringing the number of abnormal samples closer to the number of normal samples. As a result we have to redefine both X and y as follows:

upsampled = pd.concat([normals, anomalies_upsampled])
X_upsampled = upsampled.drop('classvalue', axis=1)
y_upsampled = upsampled['classvalue']
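The split, normalization and training steps now have to be repeated on the rebalanced data. The original listing does not repeat them, so the following is only a minimal sketch of what this could look like; the specific scaler (MinMaxScaler), classifier (RandomForestClassifier) and split parameters are assumptions, not the article's confirmed configuration.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Split the rebalanced dataset into training and test sets
# (test_size and random_state are illustrative choices)
X_train, X_test, y_train, y_test = train_test_split(
    X_upsampled, y_upsampled, test_size=0.3, random_state=42)

# Fit the scaler on the training data only, then apply it to both sets
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the classifier again and evaluate it on the held-out test set
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print('Model accuracy on test set is:', accuracy_score(y_test, y_pred))
print('The confusion matrix of the model is:')
print(confusion_matrix(y_test, y_pred))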
By performing the training again (including repeating the split and normalization procedures) we obtain the following results for our example:

Model accuracy on test set is: 0.8631921824104235
The confusion matrix of the model is:
[[276  41]
 [ 43 254]]

We immediately notice that the accuracy of the model has decreased, presumably due to the greater heterogeneity introduced into the dataset. Looking at the confusion matrix, however, we see that the model has in fact improved its generalization capabilities: it now also succeeds in correctly classifying samples belonging to anomalous situations.
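To make that improvement explicit, it can help to compute the recall of each class, i.e. the diagonal of the confusion matrix divided by the row sums. The following snippet is not part of the original listing; it simply applies this calculation to the two matrices reported above.

import numpy as np

def per_class_recall(cm):
    # Recall per class: correct predictions divided by the true number
    # of samples in that class (the row sum)
    cm = np.asarray(cm)
    return np.diag(cm) / cm.sum(axis=1)

# Before upsampling: perfect on normals, zero recall on anomalies
print(per_class_recall([[440, 0], [31, 0]]))     # [1. 0.]
# After upsampling: both classes above 85%
print(per_class_recall([[276, 41], [43, 254]]))  # [0.87066246 0.85521886]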
Conclusions and references
In this article we introduced a pipeline for the analysis of data from real processes in Python. It is clear, however, that each of the topics covered is extremely broad, and theoretical and practical experience is essential if you want to engage seriously in data analysis. We also learned that you should not stop at the first result achieved, even in complex situations such as the one discussed here. Instead, the results must be interpreted from different points of view in order to tell the difference between a working model and one that is, more or less obviously, biased.
The take-home message is this: data analysis is not a mechanical discipline; it demands a critical, in-depth and varied analysis of the phenomenon under observation, guided by theoretical notions and practical skills. I can also highly recommend the references provided, through which you can deepen your knowledge of some of the aspects touched on in this article, along with the link to the GitLab repository where you can consult the code written for it.
200505-01
This article was first published in Italian by Elettronica Open Source
(https://it.emcelettronica.com). Elektor translated it with permission.
WEB LINKS
[1] MATLAB GPU computing support: https://uk.mathworks.com/solutions/gpu-computing.html
[2] Beginners guides for Python programming: https://wiki.python.org/moin/BeginnersGuide/Programmers
[3] Python: http://www.python.org/
[4] Best practices for optimisation in MATLAB:
https://uk.mathworks.com/videos/best-practices-for-optimisation-in-matlab-96756.html
[5] UCI SECOM dataset: http://www.kaggle.com/paresh2047/uci-semcom/kernels
[6] Scikit-Learn user guide: https://scikit-learn.org/stable/user_guide.html
[7] Overfitting vs. underfitting: a complete example:
https://towardsdatascience.com/overfitting-vs-underfitting-a-complete-example-d05dd7e19765
[8] Understanding random forest: https://towardsdatascience.com/understanding-random-forest-58381e0602d2
[9] Understanding confusion matrix: https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62
[10] GitLab repository for this article: https://gitlab.com/eos-acard/machine-learning-in-python