Among the cell types Jupyter offers are Markdown (for inserting comments and descriptions in the format used, for example, by GitHub READMEs), Raw NBConvert (plain text) and Heading (offering a shortcut to insert titles).

Importing and displaying data
Once we are familiar with the interface, we can move on to implementing our script. Here we import the libraries and modules that we will use:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from ipywidgets import interact
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.utils import resample
It is worth highlighting the instruction %matplotlib inline, which allows us to display the graphs produced by Matplotlib correctly inside the notebook. Next, the file containing the SECOM dataset is imported using the Pandas read_csv function. Note that, in this example, the relative path to the file is hardcoded for the sake of simplicity. However, it would be advisable to use the Python os package to allow our program to determine this path itself when required.
data = pd.read_csv('data/secom.csv')
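As a sketch of that advice: in a standalone script, the os package can build the path from the script's own location (in a notebook __file__ is not defined, so os.getcwd() would be used instead). The folder layout below is assumed to match the hardcoded example.

import os

# Determine the directory containing this script, then build the path
# to the CSV file from it, instead of relying on the current working
# directory (assumes the 'data' folder sits next to the script).
base_dir = os.path.dirname(os.path.abspath(__file__))
csv_path = os.path.join(base_dir, 'data', 'secom.csv')
data = pd.read_csv(csv_path)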
The read_csv instruction reads the data contained in the secom.csv file, organizing it in a dataframe named data. We can display the first five lines of the dataframe through the head() instruction, as shown in Figure 3.
data.head()
Figure 3: The first five lines of the SECOM dataset.
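As a quick complementary check (not part of the original listing), the dataframe's shape attribute reports how many rows and columns were loaded:

# Returns a (rows, columns) tuple, useful for confirming that the
# whole dataset was read in.
data.shape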
Visualizing the first lines of the dataframe is useful to get a first overview of the data to analyze. In this case we immediately notice the presence of some values equal to '?' that presumably represent null values. Moreover, it is evident that the ranges of the values vary greatly, a factor that we will have to keep in mind later on. We can also use the describe() function to get a quick overview of the statistical characteristics of each variable (Figure 4).
data.describe()
Figure 4: Short statistical description of the SECOM dataset.
Statistical analysis can, in general, highlight a lack of normality (i.e., data that do not follow a normal distribution) or the presence of anomalies. As an example, we note that the standard deviation (std) associated with the variables a116 and a118 is, proportionally, quite high, so we expect these variables to be highly significant in the analysis. On the other hand, variables such as a114 have a low std, so they are likely to be discarded, as they are not very explanatory with respect to the process being analyzed.
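Before moving on, note that the '?' values spotted earlier will eventually have to be turned into something the algorithms can digest. As a minimal sketch of one possible approach (not necessarily the one adopted later in this article), they can be converted to NaN and filled in with the SimpleImputer imported above:

# Replace the '?' placeholders with proper NaN values; this assumes the
# placeholders are the only non-numeric entries in the sensor columns.
features = data.drop(columns=['classvalue']).replace('?', np.nan).astype(float)

# Fill each missing value with the mean of its column; 'mean' is just
# one strategy among several (median, most_frequent, ...).
imputer = SimpleImputer(strategy='mean')
features_imputed = imputer.fit_transform(features)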
Once the loading and display of the dataframe is complete, we can move on to a fundamental part of the pipeline: preprocessing.

Preprocessing data
As a first step, we display the number of samples associated with each class. To do so, we will use the value_counts() function on the classvalue column, as it contains the labels associated with each sample.
data['classvalue'].value_counts()
We see that there are 1463 samples collected in the normal operating situation (class -1) and 104 in the failure situation (class 1). The dataset is therefore strongly imbalanced, and it would be appropriate to take steps to make the distribution of samples between the different classes more 'uniform'. This relates back to the intrinsic functioning of machine learning algorithms, which learn on the basis of the data available to them. In this specific case, the algorithm will learn to characterize the standard operating situation successfully, but will have 'uncertainties' in characterizing abnormal situations.
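Resampling is one common remedy, which is presumably why sklearn.utils.resample was imported at the start. The sketch below upsamples the minority class with replacement until it matches the majority class; downsampling the majority class, or using class weights, are equally valid alternatives:

# Separate the two classes using the labels in the classvalue column.
normal = data[data['classvalue'] == -1]
failure = data[data['classvalue'] == 1]

# Draw (with replacement) as many failure samples as there are normal
# ones; random_state makes the draw reproducible.
failure_upsampled = resample(failure,
                             replace=True,
                             n_samples=len(normal),
                             random_state=42)

# Recombine into a balanced dataframe and verify the new class counts.
data_balanced = pd.concat([normal, failure_upsampled])
data_balanced['classvalue'].value_counts()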