The imbalance is even more evident when looking at the scatterplot (shown in Figure 5):

sns.scatterplot(data.index, data['classvalue'], alpha=0.2)
plt.show()

Figure 5: Number of samples per class in the SECOM dataset.

With this imbalance in mind (we'll come back to it later), we proceed to 'separate' the labels from the data:

labels = data['classvalue']
data.drop('classvalue', axis=1, inplace=True)

Note the use of the axis parameter in the drop() function, which allows us to specify that the function must operate on the columns of the dataframe (by default, Pandas functions operate on the rows).
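To put a number on this imbalance, a quick check (a minimal sketch using the labels series just created, not part of the original listing) is to count the samples per class:

# Absolute and relative frequency of each class label.
print(labels.value_counts())
print(labels.value_counts(normalize=True))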
Another aspect that emerges from the dataset analysis is that, in this specific version of the SECOM data, many columns contain data of different types (i.e. both strings and numbers). As a result, Pandas is unable to uniquely determine the type of data with which each feature is represented and leaves this decision to the user. Therefore, to bring all data into numerical format, it is necessary to use three functions offered by Pandas.
The first function we will use is replace(), with which we can replace all question marks with the constant value numpy.nan, the placeholder used to handle null values in Numpy arrays:

data = data.replace('?', np.nan, regex=False)

The first parameter of the function is the value to replace, the second is the value to use for the replacement, and the third is a flag indicating whether or not the first parameter represents a regular expression. We could also use an alternative syntax with the inplace parameter set to True, as follows:

data.replace('?', np.nan, regex=False, inplace=True)
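As a quick check (a small sketch, not taken from the original listing), we can verify that no question marks survive the replacement:

# Count the remaining '?' entries; this should print 0 after the call to replace().
print((data == '?').sum().sum())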
The second and third functions that we can use to solve the problems highlighted above are the apply() and to_numeric() functions respectively. The first allows us to apply a given function to all columns (or rows) of a dataframe, while the second converts a single column into numeric values. By combining them, we obtain uniform numeric data and remove the values that cannot be handled by Numpy and Scikit-Learn:

data = data.apply(pd.to_numeric)
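If any stray non-numeric entry were to survive the previous step, to_numeric() would raise an error. A slightly more defensive variant (a sketch, not taken from the original listing) asks Pandas to coerce anything unparsable into np.nan instead:

# Convert every column to a numeric dtype, turning unparsable entries into NaN.
data = data.apply(pd.to_numeric, errors='coerce')
# Verify that only numeric dtypes remain.
print(data.dtypes.value_counts())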
We now need to evaluate which features among those contained in the dataset are actually useful. Feature selection techniques of varying complexity are typically used to reduce redundancy and the size of the problem to be treated, delivering obvious benefits in terms of processing time and performance of the algorithm. In our case, we rely on a less complex technique that involves the elimination of features with low variance (and therefore, as mentioned above, of low significance). We therefore create an interactive widget that allows us to visualize, in the form of a histogram, the distribution of data for each feature:

@interact(col=(0, len(data.columns) - 1))
def show_hist(col=1):
    data['a' + str(col)].value_counts().hist(grid=False, figsize=(8, 6))
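The @interact decorator used here comes from the ipywidgets package; if it has not already been imported earlier in the notebook (an assumption on our part), the following import is required:

from ipywidgets import interact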
Interactivity is ensured by the @interact decorator, whose reference value (i.e. col) varies between 0 and the number of features present in the dataset. Exploring the displayed data through the widget, we can determine how many features assume a single value, meaning they can simply be ignored in the analysis. We can then decide to eliminate them as follows:

single_val_cols = data.columns[data.nunique() < 2]
secom = data.drop(single_val_cols, axis=1)
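As a sanity check (a small sketch reusing the variable names above), comparing the shapes before and after the drop shows how many constant features were discarded:

# Number of columns removed because they hold a single value only.
print(data.shape[1] - secom.shape[1], 'constant features dropped')
print('remaining features:', secom.shape[1])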
Of course, there are more relevant and refined feature selection techniques using, for example, statistical parameters. For a complete overview, the Scikit-Learn documentation can be consulted [6].
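As an illustration of what such techniques look like, the same low-variance idea is available in Scikit-Learn as a ready-made transformer. A minimal sketch (assuming the missing values are first filled in, since VarianceThreshold does not accept NaN):

from sklearn.feature_selection import VarianceThreshold

# threshold=0.0 removes exactly the constant (zero-variance) features.
selector = VarianceThreshold(threshold=0.0)
reduced = selector.fit_transform(secom.fillna(secom.mean()))
print(reduced.shape)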
The last step is to deal with the null values (which we replaced previously with np.nan). We inspect the dataset to see how many there are; to do so, we use a heatmap, as shown in Figure 6, where the white points represent the null values.
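The heatmap itself can be produced with Seaborn; a minimal sketch (the exact plotting parameters of the original figure are an assumption):

# Plot a boolean mask of the dataframe: missing entries show up as light points.
plt.figure(figsize=(12, 6))
sns.heatmap(secom.isnull(), cbar=False)
plt.show()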