Among the cell types Jupyter offers are Markdown (for inserting comments and descriptions in the format used, for example, by GitHub READMEs), Raw NBConvert (plain text) and Heading (offering a shortcut to insert titles).

Importing and displaying data
Once we are familiar with the interface, we can move on to implementing our script. Here we import the libraries and modules that we will use:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from ipywidgets import interact
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.utils import resample
It is worth highlighting the instruction %matplotlib inline, which allows us to display the graphs produced by Matplotlib correctly inside the notebook. Next, the file containing the SECOM dataset is imported using the Pandas read_csv function. Note that, in this example, the relative path to the file is hardcoded for the sake of simplicity. However, it would be advisable to use the Python os package to allow our program to determine this path itself when required.
data = pd.read_csv('data/secom.csv')
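As a sketch of that advice: in a standalone script, the os package can build the path from the script's own location (in a notebook __file__ is not defined, so os.getcwd() would be used instead). The folder layout below is assumed to match the hardcoded example.

import os

# Determine the directory containing this script, then build the path
# to the CSV file from it, instead of relying on the current working
# directory (assumes the 'data' folder sits next to the script).
base_dir = os.path.dirname(os.path.abspath(__file__))
csv_path = os.path.join(base_dir, 'data', 'secom.csv')
data = pd.read_csv(csv_path)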
The read_csv instruction reads the data contained in the secom.csv file, organizing it in a dataframe named data. We can display the first five lines of the dataframe through the head() instruction, as shown in Figure 3.
data.head()
Figure 3: The first five lines of the SECOM dataset.
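As a quick complementary check (not part of the original listing), the dataframe's shape attribute reports how many rows and columns were loaded:

# Returns a (rows, columns) tuple, useful for confirming that the
# whole dataset was read in.
data.shape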
Visualizing the first lines of the dataframe is useful to get a first overview of the data to analyze. In this case we immediately notice the presence of some values equal to '?' that presumably represent null values. Moreover, it is evident that the ranges of the values vary greatly, a factor that we will have to keep in mind later on. We can also use the describe() function to get a quick overview of the statistical characteristics of each variable (Figure 4).
data.describe()
Figure 4: Short statistical description of the SECOM dataset.
Statistical analysis can, in general, highlight a lack of normality (i.e., data that do not follow a normal distribution) or the presence of anomalies. As an example, we note that the standard deviation (std) associated with the variables a116 and a118 is, proportionally, quite high, so we expect these variables to be highly significant in the analysis. On the other hand, variables such as a114 have a low std, so they are likely to be discarded, as they are not very explanatory with respect to the process being analyzed.
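Before moving on, note that the '?' values spotted earlier will eventually have to be turned into something the algorithms can digest. As a minimal sketch of one possible approach (not necessarily the one adopted later in this article), they can be converted to NaN and filled in with the SimpleImputer imported above:

# Replace the '?' placeholders with proper NaN values; this assumes the
# placeholders are the only non-numeric entries in the sensor columns.
features = data.drop(columns=['classvalue']).replace('?', np.nan).astype(float)

# Fill each missing value with the mean of its column; 'mean' is just
# one strategy among several (median, most_frequent, ...).
imputer = SimpleImputer(strategy='mean')
features_imputed = imputer.fit_transform(features)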
Once the loading and display of the dataframe is complete, we can move on to a fundamental part of the pipeline: preprocessing.

Preprocessing data
As a first step, we display the number of samples associated with each class. To do so, we will use the value_counts() function on the classvalue column, as it contains the labels associated with each sample.
data['classvalue'].value_counts()
We see that there are 1463 samples collected in the normal operating situation (class -1) and 104 in the failure situation (class 1). The dataset is therefore strongly imbalanced, and it would be appropriate to take steps to make the distribution of samples between the different classes more 'uniform'. This relates back to the intrinsic functioning of machine learning algorithms, which learn on the basis of the data available to them. In this specific case, the algorithm will learn to characterize the standard operating situation successfully, but will have 'uncertainties' in characterizing abnormal situations.
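Resampling is one common remedy, which is presumably why sklearn.utils.resample was imported at the start. The sketch below upsamples the minority class with replacement until it matches the majority class; downsampling the majority class, or using class weights, are equally valid alternatives:

# Separate the two classes using the labels in the classvalue column.
normal = data[data['classvalue'] == -1]
failure = data[data['classvalue'] == 1]

# Draw (with replacement) as many failure samples as there are normal
# ones; random_state makes the draw reproducible.
failure_upsampled = resample(failure,
                             replace=True,
                             n_samples=len(normal),
                             random_state=42)

# Recombine into a balanced dataframe and verify the new class counts.
data_balanced = pd.concat([normal, failure_upsampled])
data_balanced['classvalue'].value_counts()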