Open Source For You — December 2017

Let’s Try Admin


The world is being flooded with data from all sources.
The hottest trend in technology is related to Big Data
and the evolving field of data science is a way to cope
with this data deluge. Machine learning is at the heart of data
science. The need of the hour is to have efficient machine
learning frameworks and platforms to process Big Data.
Apache Spark is one of the most powerful platforms for
analysing Big Data. MLlib is its machine learning library,
and is potent enough to process Big Data and apply a wide range
of machine learning algorithms to it efficiently.

Apache Spark
Apache Spark is a cluster computing framework that builds on
the MapReduce model popularised by Hadoop. Spark keeps
intermediate data in memory across the cluster, which speeds up
computation by reducing disk I/O. It is widely used to deal with
Big Data problems because of its distributed architectural
support and parallel processing capabilities. Many users prefer it
to Hadoop MapReduce on account of its stream processing and
interactive query features. To provide a wide range of services, it
has built-in libraries like GraphX, Spark SQL and MLlib.
Spark supports Python, Scala, Java and R as programming
languages, of which Scala, the language Spark itself is written in,
is the most preferred.
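
The MapReduce model mentioned above can be illustrated with a small word count. This is only a plain-Python sketch of the map, shuffle and reduce steps running on one machine; it does not use Spark or Hadoop, and the function names (map_phase, shuffle_phase, reduce_phase) are hypothetical names chosen for this illustration.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs):
    """Group the emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values of each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Illustrative input, not taken from the article.
lines = ["big data needs big tools", "spark processes big data"]
counts = word_count(lines)
print(counts)
```

In a cluster framework, the map and reduce steps would run in parallel on different nodes, and keeping the intermediate pairs in memory rather than on disk is where Spark saves I/O time.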

Designated as Spark’s scalable machine learning library,
MLlib consists of common algorithms and utilities as well as
underlying optimisation primitives.

MLlib
MLlib is Spark’s machine learning library. It is used predominantly
from Scala, but it offers Python and Java APIs as well.
MLlib was initially contributed by AMPLab at UC Berkeley. It
makes machine learning scalable, which provides an advantage
when handling large volumes of incoming data.
The main features of MLlib are listed below.
Machine learning algorithms: regression, classification, collaborative filtering, clustering, etc.
Featurisation: selection, dimensionality reduction, transformation, feature extraction, etc.
Pipelines: construction, evaluation and tuning of ML pipelines.
Persistence: saving/loading of algorithms, models and pipelines.
Utilities: statistics, linear algebra, probability, data handling, etc.
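
The pipeline idea in the list above rests on two kinds of stages: estimators, which are fitted on data to produce a model, and transformers, which simply convert data. The sketch below imitates that pattern in plain Python on lists of numbers. It is not the MLlib API (real MLlib pipelines operate on DataFrames), and every class name here (ToyScaler, ToyScalerModel, ToyRounder, ToyPipeline, ToyPipelineModel) is a hypothetical stand-in.

```python
class ToyScaler:
    """Estimator: learns the minimum and maximum of the training values."""

    def fit(self, values):
        return ToyScalerModel(min(values), max(values))


class ToyScalerModel:
    """Transformer produced by fitting ToyScaler."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def transform(self, values):
        span = (self.hi - self.lo) or 1.0  # avoid dividing by zero
        return [(v - self.lo) / span for v in values]


class ToyRounder:
    """Plain transformer: needs no fitting."""

    def __init__(self, digits):
        self.digits = digits

    def transform(self, values):
        return [round(v, self.digits) for v in values]


class ToyPipeline:
    """Fits each estimator in order, feeding its output to the next stage."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        fitted = []
        for stage in self.stages:
            model = stage.fit(values) if hasattr(stage, "fit") else stage
            values = model.transform(values)
            fitted.append(model)
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    """The fitted pipeline: a chain of transformers."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


train = [10.0, 20.0, 30.0, 40.0]  # illustrative data
model = ToyPipeline([ToyScaler(), ToyRounder(2)]).fit(train)
scaled = model.transform([15.0, 40.0])
print(scaled)
```

Bundling the stages this way means the same fitted chain can be applied unchanged to new data, which is also what makes a whole pipeline convenient to save and reload.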
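Some lower-level machine learning primitives, such as a
generic gradient descent optimisation algorithm, are also
present in MLlib. Since Spark 2.0, the primary MLlib API (the
spark.ml package) has been based on DataFrames rather than RDDs,
and the older RDD-based API (spark.mllib) is in maintenance mode.
DataFrames give a more uniform API across languages and let MLlib
benefit from Spark SQL's query optimisations.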
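
To make the gradient descent primitive mentioned above concrete, here is a minimal sketch of plain batch gradient descent fitting a straight line by minimising the mean squared error. It is written in plain Python and is not MLlib's implementation; the function name, learning rate, iteration count and sample data are all assumptions chosen for illustration.

```python
def gradient_descent(xs, ys, learning_rate=0.05, iterations=2000):
    """Fit y = w * x + b by repeatedly stepping against the MSE gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Prediction error for every sample under the current parameters.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Move a small step in the direction that reduces the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Illustrative data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))
```

The gradient is a sum over samples, so on a cluster each node can compute the partial sum for its share of the data, which is why this kind of optimiser suits a distributed library.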

The advantages of MLlib
The true power of Spark lies in its libraries, which cover a
wide range of data analysis tasks. MLlib is central to this
functionality, and it has several advantages.

Spark’s MLlib: Scalable Support for Machine Learning
