between minor releases of the interpreter, resulting in libraries (and, consequently, programs) that are incompatible because they were written for different Python versions. Having a deterministic environment, in which we know the version of every installed library, provides a sort of ‘guarantee’ that our programs will function: it is enough to replicate the configuration of the virtual environment precisely, and we can be sure that everything will work.

To manage our virtual environments we use a package called virtualenvwrapper. This can be installed from the shell by using pip:

$ pip install virtualenvwrapper
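Note that, on many systems, pip alone does not make the mkvirtualenv and workon commands available: the shell script shipped with the package must also be sourced, typically from ~/.bashrc. The exact path varies from system to system (the command which virtualenvwrapper.sh will reveal it); a typical setup, with the path assumed here, is:

$ export WORKON_HOME=~/.virtualenvs
$ source /usr/local/bin/virtualenvwrapper.sh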

Once the installation is complete we create a new virtual environment as follows:

$ mkvirtualenv ml-python

Note that ml-python is simply the name chosen for our example scenario; the developer is free to pick any other appropriate name. We proceed by activating the virtual environment:

$ workon ml-python

We are now ready to install the elements necessary to follow the rest of the article.
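With the environment active, everything needed below can be installed in a single step. One suggested command (the package names are those published on PyPI):

$ pip install numpy pandas scikit-learn matplotlib seaborn jupyter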
Libraries
The libraries that we present and use here are the five most widely used for data analysis in Python.

The first, and perhaps most famous, is Numpy, which can be considered a sort of port of MATLAB for Python. Numpy is a library for algebraic and matrix calculations; as a result, those who regularly use MATLAB will find many similarities, both in terms of syntax and optimization. Using algebraic calculation in Numpy is, in fact, more efficient than writing nested loops, just as it is in MATLAB (to learn more, see the article at [4]). Predictably, the type of data at the core of Numpy’s functionality is the array.

This is not to be confused with the corresponding computer-science vector; it should instead be understood in the algebraic and geometric sense, as a matrix. Since data analysis is based on algebraic and matrix operations, Numpy is also the basis for two of the most widely used frameworks: Scikit-Learn (covered shortly) and TensorFlow.
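As a minimal sketch of this matrix-oriented style (the values are arbitrary):

import numpy as np

# A Numpy array is best read as a matrix: here, 2 rows by 3 columns.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Algebraic operations are written directly, without explicit loops.
B = A.T @ A                 # matrix product: a 3-by-3 matrix
v = np.array([1.0, 0.0, -1.0])
print(B @ v)                # matrix-vector product
print(A.mean(axis=0))       # column-wise mean, computed in optimized C code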

A natural complement to Numpy is Pandas, a library that reads and manages data from heterogeneous sources, including Excel spreadsheets, CSV files, JSON, and SQL databases. Pandas is extremely flexible and powerful, allowing you to organize data into structures called dataframes that can be manipulated as required and exported with ease directly into Numpy arrays.
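As a minimal sketch, assuming a hypothetical CSV file measurements.csv with a temperature column:

import pandas as pd

# Read a CSV file into a dataframe (file and column names are hypothetical).
df = pd.read_csv("measurements.csv")

print(df.head())            # first five rows
print(df.describe())        # summary statistics for each numeric column

# Filter the data as required...
subset = df[df["temperature"] > 20.0]

# ...and export it directly into a Numpy array.
X = subset.to_numpy()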

The third library we will use is Scikit-Learn. Born from an academic project, Scikit-Learn is a framework that implements most of the machine learning algorithms used nowadays behind a common interface. This common interface is object-oriented programming at work: virtually every algorithm offered by Scikit-Learn is driven through the same few methods, such as fit, which takes at least the data under analysis and, for supervised algorithms, the labels associated with it; preprocessing steps additionally offer fit_transform.
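As a minimal sketch of this common interface, here is a classifier trained on the small iris dataset bundled with Scikit-Learn (the choice of algorithm and parameters is arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small example dataset: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Every estimator follows the same pattern: construct, fit, predict/score.
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)           # the data and its associated labels
print(clf.score(X_test, y_test))    # accuracy on unseen data

Swapping KNeighborsClassifier for any other classifier leaves the rest of the code unchanged; that is precisely the value of the common interface.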

The last two libraries we will use are Matplotlib and Jupyter. The first, together with its complement Seaborn, is needed to visualize the results of our experiments in the form of graphs. The second offers us notebooks: interactive environments, simple and immediate to use, that allow the data analyst to write and execute parts of the code independently of the others.
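A notebook server is started from the shell with:

$ jupyter notebook

Inside a notebook cell, a minimal (and entirely arbitrary) plot such as the following appears inline, directly below the cell that produces it:

import numpy as np
import matplotlib.pyplot as plt

# Plot one period of a sine wave.
x = np.linspace(0.0, 2.0 * np.pi, 200)
plt.plot(x, np.sin(x), label="sin(x)")
plt.xlabel("x")
plt.ylabel("amplitude")
plt.legend()
plt.show()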

Before proceeding further, however, we will introduce some theoretical concepts that are needed to build a ‘common base’ for the discussion.

The concepts
The first concept required is that of the dataset, something that is often simply taken for granted. A dataset is a set of samples, each characterized by a certain number of variables, or features, that describe the phenomenon under observation. For simplicity
