The imbalance is even more evident when looking at the scatterplot (shown in Figure 5):

sns.scatterplot(data.index, data['classvalue'], alpha=0.2)
plt.show()

Figure 5: Number of samples per class in the SECOM dataset.

With this imbalance in mind (we'll come back to it later), we proceed to 'separate' the labels from the data:

labels = data['classvalue']
data.drop('classvalue', axis=1, inplace=True)

Note the use of the axis parameter in the drop() function, which allows us to specify that the function must operate on the columns of the dataframe (by default, Pandas functions operate on the rows).
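To put a number on this imbalance, a quick check (a minimal sketch using the labels series just created, not part of the original listing) is to count the samples per class:

# Absolute and relative frequency of each class label.
print(labels.value_counts())
print(labels.value_counts(normalize=True))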
Another aspect that emerges from the dataset analysis is that, in this specific version of the SECOM data, many columns contain data of different types (i.e. both strings and numbers). As a result, Pandas is unable to uniquely determine the type of data with which each feature is represented and leaves this decision to the user. Therefore, to bring all data into numerical format, it is necessary to use three functions offered by Pandas.
The first function we will use is replace(), with which we can replace all question marks with the constant value numpy.nan, the placeholder used to handle null values in Numpy arrays:

data = data.replace('?', np.nan, regex=False)

The first parameter of the function is the value to replace, the second is the value to use for the replacement, and the third is a flag indicating whether or not the first parameter represents a regular expression. We could also use an alternative syntax with the inplace parameter set to True, as follows:

data.replace('?', np.nan, regex=False, inplace=True)
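As a quick check (a small sketch, not taken from the original listing), we can verify that no question marks survive the replacement:

# Count the remaining '?' entries; this should print 0 after the call to replace().
print((data == '?').sum().sum())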
The second and third functions that we can use to solve the problems highlighted above are the apply() and to_numeric() functions respectively. The first allows us to apply a given function to all columns (or rows) of a dataframe, while the second converts a single column into numeric values. By combining them, we obtain uniform numeric data and remove the values that cannot be handled by Numpy and Scikit-Learn:

data = data.apply(pd.to_numeric)
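If any stray non-numeric entry were to survive the previous step, to_numeric() would raise an error. A slightly more defensive variant (a sketch, not taken from the original listing) asks Pandas to coerce anything unparsable into np.nan instead:

# Convert every column to a numeric dtype, turning unparsable entries into NaN.
data = data.apply(pd.to_numeric, errors='coerce')
# Verify that only numeric dtypes remain.
print(data.dtypes.value_counts())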
We now need to evaluate which features among those contained in the dataset are actually useful. Feature selection techniques of varying complexity are typically used to reduce redundancy and the size of the problem to be treated, delivering obvious benefits in terms of processing time and performance of the algorithm. In our case, we rely on a less complex technique that involves the elimination of features with low variance (and therefore, as mentioned above, of low significance). We therefore create an interactive widget that allows us to visualize, in the form of a histogram, the distribution of data for each feature:

@interact(col=(0, len(data.columns) - 1))
def show_hist(col=1):
    data['a' + str(col)].value_counts().hist(grid=False, figsize=(8, 6))
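The @interact decorator used here comes from the ipywidgets package; if it has not already been imported earlier in the notebook (an assumption on our part), the following import is required:

from ipywidgets import interact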
Interactivity is ensured by the @interact decorator, whose reference value (i.e. col) varies between 0 and the number of features present in the dataset. Exploring the displayed data through the widget, we can determine how many features assume a single value, meaning they can simply be ignored in the analysis. We can then decide to eliminate them as follows:

single_val_cols = data.columns[data.nunique() < 2]
secom = data.drop(single_val_cols, axis=1)
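As a sanity check (a small sketch reusing the variable names above), comparing the shapes before and after the drop shows how many constant features were discarded:

# Number of columns removed because they hold a single value only.
print(data.shape[1] - secom.shape[1], 'constant features dropped')
print('remaining features:', secom.shape[1])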
Of course, there are more relevant and refined feature selection techniques using, for example, statistical parameters. For a complete overview, the Scikit-Learn documentation can be consulted [6].
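As an illustration of what such techniques look like, the same low-variance idea is available in Scikit-Learn as a ready-made transformer. A minimal sketch (assuming the missing values are first filled in, since VarianceThreshold does not accept NaN):

from sklearn.feature_selection import VarianceThreshold

# threshold=0.0 removes exactly the constant (zero-variance) features.
selector = VarianceThreshold(threshold=0.0)
reduced = selector.fit_transform(secom.fillna(secom.mean()))
print(reduced.shape)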
The last step is to deal with the null values (which we replaced previously with np.nan). We inspect the dataset to see how many there are; to do so, we use a heatmap, as shown in Figure 6, where the white points represent the null values.
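The heatmap itself can be produced with Seaborn; a minimal sketch (the exact plotting parameters of the original figure are an assumption):

# Plot a boolean mask of the dataframe: missing entries show up as light points.
plt.figure(figsize=(12, 6))
sns.heatmap(secom.isnull(), cbar=False)
plt.show()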