
sns.heatmap(secom.isnull(), cbar=False)

Figure 6: Map of null values in the SECOM dataset.
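For reference, the snippets on this page assume imports along the following lines (a sketch; secom and labels are the feature table and label vector loaded earlier in the article):

import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier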

It is evident that many features show a high percentage of null values; these should not be considered, in order to remove their bias effect on the data.

na_cols = [col for col in secom.columns if secom[col].isnull().sum() / len(secom) > 0.4]
secom_clean = secom.drop(na_cols, axis=1)
secom_clean.head()

Thanks to the previous commands, a list comprehension has isolated all features with more than 40% null values, allowing them to be removed from the dataset.
The features with less than 40% null values still need to be dealt with. Here we can use our first Scikit-Learn object, SimpleImputer, which assigns a value to every NaN based on a user-defined strategy. In this case, we will use the 'mean' strategy, associating with each NaN the average value assumed by its feature.

imputer = SimpleImputer(strategy='mean')
secom_imputed = pd.DataFrame(imputer.fit_transform(secom_clean))
secom_imputed.columns = secom_clean.columns

As an exercise, we can verify that no null values remain in the dataset through another heatmap (which, predictably, will assume a uniformly dark color). Then we can move on to the actual processing.
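For example, mirroring the earlier call:

sns.heatmap(secom_imputed.isnull(), cbar=False)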

Data processing
We will divide our dataset into two subsets: one for training and one for testing. This subdivision is necessary to mitigate the phenomenon of overfitting, which makes the algorithm 'adhere too closely' to the training data (see here for more background [7]), and ensures the model's applicability to cases different from the ones upon which it was trained. To do this, we use the train_test_split function:

X_train, X_test, y_train, y_test = train_test_split(secom_imputed, labels, test_size=0.3)

Using the test_size parameter we can specify the fraction of the data reserved for the test set; typical values for this parameter range between 0.2 and 0.3.
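Two optional arguments of train_test_split are worth mentioning (illustrative additions, not part of the listing above): random_state makes the split reproducible, and stratify keeps the class proportions identical in both subsets, which helps when abnormal samples are a small minority:

X_train, X_test, y_train, y_test = train_test_split(
    secom_imputed, labels, test_size=0.3, random_state=42, stratify=labels)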

It is also important to normalize the data. We have already noticed that some features assume values with much wider variation than others, and this would result in their being given more weight. Normalizing allows you to bring them all within a single range of values so that there are no imbalances due to initial offsets. To do this, we use StandardScaler:

scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), index=X_train.index, columns=X_train.columns)
# The test set is only transformed, with the statistics learned
# from the training set, so no test data leaks into the scaling.
X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)
It is interesting to note the usefulness of the common interface offered by Scikit-Learn: both the scaler and the imputer process data through the same fit_transform method, which, in complex pipelines, greatly simplifies both writing code and understanding the library.
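This shared interface is also what makes Scikit-Learn pipelines work; as a minimal sketch (not from the article, and redundant here since we have already preprocessed the data step by step), the same stages could be chained like this:

from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),    # fill NaNs with column means
    ('scale', StandardScaler()),                   # normalize each feature
    ('forest', RandomForestClassifier(n_estimators=500, max_depth=4))
])
pipe.fit(X_train, y_train)   # fits and applies each step in sequence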


We are now finally ready to classify the data. In particular, we will use a random forest [8], obtaining, after training, a model able to distinguish between normal and abnormal situations. We will verify the performance of the model in two ways. The first is the accuracy score: the percentage of samples belonging to the test set that are correctly classified by the algorithm. The second is the confusion matrix [9], which highlights the number of false positives and false negatives.

First, we create the classifier:

clf = RandomForestClassifier(n_estimators=500, max_depth=4)

This creates a random forest with 500 estimators whose maximum depth is 4 levels. Now we can train our model on the training data:

clf.fit(X_train, y_train)
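A minimal sketch of this check, using Scikit-Learn's standard metric functions (the article's own evaluation listing is not reproduced on this page):

from sklearn.metrics import accuracy_score, confusion_matrix
y_pred = clf.predict(X_test)                  # classify the test set
print(accuracy_score(y_test, y_pred))         # fraction classified correctly
print(confusion_matrix(y_test, y_pred))       # rows: true class, columns: predicted class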

