
sns.heatmap(secom.isnull(), cbar=False)

Figure 6: Map of null values in the SECOM dataset.
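For reference, the snippets on this page assume imports along the following lines (a sketch; secom and labels are the feature table and label vector loaded earlier in the article):

import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier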

It is evident that many features show a high percentage of null values; these should not be considered, in order to remove their bias effect on the data.

na_cols = [col for col in secom.columns if secom[col].isnull().sum() / len(secom) > 0.4]
secom_clean = secom.drop(na_cols, axis=1)
secom_clean.head()

Thanks to the previous commands, a list comprehension has isolated all features with more than 40% null values, allowing them to be removed from the dataset.
The features with less than 40% null values still need to be dealt with. Here we can use our first Scikit-Learn object, SimpleImputer, which assigns a value to every NaN based on a user-defined strategy. In this case, we will use the 'mean' strategy, associating with each NaN the average value assumed by its feature.

imputer = SimpleImputer(strategy='mean')
secom_imputed = pd.DataFrame(imputer.fit_transform(secom_clean))
secom_imputed.columns = secom_clean.columns

As an exercise, we can verify that no null values remain in the dataset through another heatmap (which, predictably, will assume a uniformly dark color). Then we can move on to the actual processing.
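For example, mirroring the earlier call:

sns.heatmap(secom_imputed.isnull(), cbar=False)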

Data processing
We will divide our dataset into two subsets: one for training and one for testing. This subdivision is necessary to mitigate the phenomenon of overfitting, which makes the algorithm 'adhere too closely' to the training data (see here for more background [7]), and ensures the model's applicability to cases different from the ones upon which it was trained. To do this, we use the train_test_split function:

X_train, X_test, y_train, y_test = train_test_split(secom_imputed, labels, test_size=0.3)

Using the test_size parameter we can specify the fraction of the data reserved for the test set; typical values for this parameter range between 0.2 and 0.3.
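Two optional arguments of train_test_split are worth mentioning (illustrative additions, not part of the listing above): random_state makes the split reproducible, and stratify keeps the class proportions identical in both subsets, which helps when abnormal samples are a small minority:

X_train, X_test, y_train, y_test = train_test_split(
    secom_imputed, labels, test_size=0.3, random_state=42, stratify=labels)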

It is also important to normalize the data. We have already noticed that some features assume values with much wider variation than others, and this would result in their being given more weight. Normalizing allows you to bring them all within a single range of values so that there are no imbalances due to initial offsets. To do this, we use StandardScaler:

scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), index=X_train.index, columns=X_train.columns)
# The test set is only transformed, with the statistics learned
# from the training set, so no test data leaks into the scaling.
X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)
It is interesting to note the usefulness of the common interface offered by Scikit-Learn: both the scaler and the imputer process data through the same fit_transform method, which, in complex pipelines, greatly simplifies both writing code and understanding the library.
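This shared interface is also what makes Scikit-Learn pipelines work; as a minimal sketch (not from the article, and redundant here since we have already preprocessed the data step by step), the same stages could be chained like this:

from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),    # fill NaNs with column means
    ('scale', StandardScaler()),                   # normalize each feature
    ('forest', RandomForestClassifier(n_estimators=500, max_depth=4))
])
pipe.fit(X_train, y_train)   # fits and applies each step in sequence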


We are now finally ready to classify the data. In particular, we will use a random forest [8], obtaining, after training, a model able to distinguish between normal and abnormal situations. We will verify the performance of the model in two ways. The first is the accuracy score: the percentage of samples belonging to the test set that are correctly classified by the algorithm. The second is the confusion matrix [9], which highlights the number of false positives and false negatives.

First, we create the classifier:

clf = RandomForestClassifier(n_estimators=500, max_depth=4)

This creates a random forest with 500 estimators whose maximum depth is 4 levels. Now we can train our model on the training data:

clf.fit(X_train, y_train)
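A minimal sketch of this check, using Scikit-Learn's standard metric functions (the article's own evaluation listing is not reproduced on this page):

from sklearn.metrics import accuracy_score, confusion_matrix
y_pred = clf.predict(X_test)                  # classify the test set
print(accuracy_score(y_test, y_pred))         # fraction classified correctly
print(confusion_matrix(y_test, y_pred))       # rows: true class, columns: predicted class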

