between minor releases of the interpreter, resulting in libraries (and, consequently, programs) that are incompatible because they were written for different Python versions. Having a deterministic environment, in which we know the version of every installed library, provides a sort of ‘guarantee’ that our programs will function: it is enough to replicate the configuration of the virtual environment precisely, and we can be sure that everything will work.

To manage our virtual environments we use a package called virtualenvwrapper. This can be installed from the shell by using pip:

$ pip install virtualenvwrapper
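Note that, on many systems, pip alone does not make the mkvirtualenv and workon commands available: the shell script shipped with the package must also be sourced, typically from ~/.bashrc. The exact path varies from system to system (the command which virtualenvwrapper.sh will reveal it); a typical setup, with the path assumed here, is:

$ export WORKON_HOME=~/.virtualenvs
$ source /usr/local/bin/virtualenvwrapper.sh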

Once the installation is complete we create a new virtual environment as follows:

$ mkvirtualenv ml-python

Note that ml-python is simply the name chosen for our example scenario; the developer is free to pick any other appropriate name. We proceed by activating the virtual environment:

$ workon ml-python

We are now ready to install the elements necessary to follow the rest of the article.
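With the environment active, everything needed below can be installed in a single step. One suggested command (the package names are those published on PyPI):

$ pip install numpy pandas scikit-learn matplotlib seaborn jupyter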
Libraries
The libraries that we present and use here are the five most widely used for data analysis in Python.

The first, and perhaps most famous, is Numpy, which can be considered a sort of port of MATLAB for Python. Numpy is a library for algebraic and matrix calculations; as a result, those who regularly use MATLAB will find many similarities, both in terms of syntax and optimization. Using algebraic calculation in Numpy is, in fact, more efficient than writing nested loops, just as it is in MATLAB (to learn more, see the article at [4]). Predictably, the type of data at the core of Numpy’s functionality is the array.

This is not to be confused with the corresponding computer-science vector; it should instead be understood in the algebraic and geometric sense, as a matrix. Since data analysis is based on algebraic and matrix operations, Numpy is also the basis for two of the most widely used frameworks: Scikit-Learn (covered shortly) and TensorFlow.
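As a minimal sketch of this matrix-oriented style (the values are arbitrary):

import numpy as np

# A Numpy array is best read as a matrix: here, 2 rows by 3 columns.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Algebraic operations are written directly, without explicit loops.
B = A.T @ A                 # matrix product: a 3-by-3 matrix
v = np.array([1.0, 0.0, -1.0])
print(B @ v)                # matrix-vector product
print(A.mean(axis=0))       # column-wise mean, computed in optimized C code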

A natural complement to Numpy is Pandas, a library that reads and manages data from heterogeneous sources, including Excel spreadsheets, CSV files, JSON, and SQL databases. Pandas is extremely flexible and powerful, allowing you to organize data into structures called dataframes that can be manipulated as required and exported with ease directly into Numpy arrays.
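As a minimal sketch, assuming a hypothetical CSV file measurements.csv with a temperature column:

import pandas as pd

# Read a CSV file into a dataframe (file and column names are hypothetical).
df = pd.read_csv("measurements.csv")

print(df.head())            # first five rows
print(df.describe())        # summary statistics for each numeric column

# Filter the data as required...
subset = df[df["temperature"] > 20.0]

# ...and export it directly into a Numpy array.
X = subset.to_numpy()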

The third library we will use is Scikit-Learn. Born from an academic project, Scikit-Learn is a framework that implements most of the machine learning algorithms used nowadays behind a common interface. This common interface is object-oriented programming at work: virtually every algorithm offered by Scikit-Learn is driven through the same few methods, such as fit, which takes at least the data under analysis and, for supervised algorithms, the labels associated with it; preprocessing steps additionally offer fit_transform.
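As a minimal sketch of this common interface, here is a classifier trained on the small iris dataset bundled with Scikit-Learn (the choice of algorithm and parameters is arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small example dataset: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Every estimator follows the same pattern: construct, fit, predict/score.
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)           # the data and its associated labels
print(clf.score(X_test, y_test))    # accuracy on unseen data

Swapping KNeighborsClassifier for any other classifier leaves the rest of the code unchanged; that is precisely the value of the common interface.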

The last two libraries we will use are Matplotlib and Jupyter. The first, together with its complement Seaborn, is needed to visualize the results of our experiments in the form of graphs. The second offers us notebooks: interactive environments, simple and immediate to use, that allow the data analyst to write and execute parts of the code independently of the others.
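A notebook server is started from the shell with:

$ jupyter notebook

Inside a notebook cell, a minimal (and entirely arbitrary) plot such as the following appears inline, directly below the cell that produces it:

import numpy as np
import matplotlib.pyplot as plt

# Plot one period of a sine wave.
x = np.linspace(0.0, 2.0 * np.pi, 200)
plt.plot(x, np.sin(x), label="sin(x)")
plt.xlabel("x")
plt.ylabel("amplitude")
plt.legend()
plt.show()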

Before proceeding further, however, we will introduce some theoretical concepts that are needed to build a ‘common base’ for the discussion.

The concepts
The first concept required is that of the dataset, something that is often simply taken for granted. A dataset is a set of samples, each characterized by a certain number of variables, or features, that describe the phenomenon under observation. For simplicity
