Open Source For You — December 2017


Admin Let’s Try


56 | DECEMBER 2017 | OPEN SOURCE FOR YOU | http://www.OpenSourceForU.com

Ease of use: MLlib integrates well with four
languages: Java, R, Python and Scala. The APIs for
all four let programmers work in a language they
already know, so they don’t need to learn a new one.
Easy to deploy: No preinstallation or conversion
is required to use a Hadoop-based data source such as
HBase or HDFS. Spark can also run standalone or on
an EC2 cluster.
Scalability: The same code can work on small or
large volumes of data without being changed to suit
the volume. As businesses grow, it is easy to scale
vertically or horizontally without splitting the code
into modules for performance.
Performance: The ML algorithms can run up to
100X faster than MapReduce because the framework
supports iterative computation. MLlib’s algorithms
make repeated passes over the same data, and Spark
keeps that data in memory between passes instead of
writing it to disk after each step. This in-memory
computing, a speciality of Spark, accounts for the
performance gain.
Algorithms: The main ML algorithms included
in the MLlib module cover classification, regression,
decision trees, recommendation, clustering, topic
modelling, frequent item sets and association rules.
The ML workflow utilities included are feature
transformation, pipeline construction and ML
persistence. Singular value decomposition (SVD),
principal component analysis (PCA) and hypothesis
testing are also possible with this library.
Community: Spark is open source software under
the Apache Software Foundation. It is tested and
updated by a large contributing community. MLlib is
a rapidly expanding component, and new features are
added regularly. People submit their own algorithms,
and a wide range of resources is available.
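
The performance point above rests on algorithms that make many passes over the same data. The following is a minimal sketch in plain Python, not Spark’s API; the function name fit_line and the toy data are made up for illustration. It shows the access pattern that in-memory computing helps with: the data set is loaded once and then scanned again on every gradient-descent step.

```python
def fit_line(points, learning_rate=0.05, iterations=2000):
    """Fit y = w * x + b by batch gradient descent over an in-memory list."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(iterations):
        # Every iteration scans the same in-memory data set again.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical values for illustration).
data = [(x, 2 * x + 1) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]
w, b = fit_line(data)
print(round(w, 3), round(b, 3))  # values close to 2 and 1
```

If each of those 2,000 passes had to reread the data from disk, the input cost would dominate the arithmetic, which is the situation that keeping the data in memory avoids.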

Other basic ML libraries
SciKit-Learn: This is a separate Python library rather
than a module of MLlib. It contains many basic ML
algorithms that perform the various tasks listed below.
Classification: Random forest, nearest neighbour,
SVM, logistic regression, etc
Regression: Ridge regression, support vector
regression, lasso, etc
Clustering: Spectral clustering, k-means clustering, etc
Decomposition: PCA, non-negative matrix
factorisation, independent component analysis, etc

Mahout: Apache Mahout is likewise a separate Apache
project rather than a module of MLlib. It contains many
basic ML algorithms that perform the tasks listed below.
Classification: Random forest, logistic regression,
naive Bayes, etc
Collaborative filtering: ALS, etc

Clustering: k-means, fuzzy k-means, etc
Decomposition: SVD, randomised SVD, etc
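
k-means appears in the clustering lists above, and its two alternating steps are easy to show. This is a rough sketch in plain Python, not the scikit-learn, Mahout or MLlib implementation. The function name kmeans, the toy points and the choice to seed the centroids with the first k points are assumptions made for illustration; real libraries use better initialisation.

```python
import math


def kmeans(points, k, iterations=10):
    """Lloyd's algorithm on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids


# Two well-separated groups of hypothetical points.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids = kmeans(points, k=2)
print(centroids)
```

With this toy data the centroids settle near the middle of each group after the first pass; later passes leave them unchanged.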

Spark MLlib use cases
Spark’s MLlib is used frequently in marketing optimisation,
security monitoring, fraud detection, risk assessment,
operational optimisation, preventative maintenance, etc.

Here are some popular use cases.
NBC Universal: This international cable TV company
holds huge volumes of media data. To reduce costs, NBC
takes media files offline when they are not in use.
Spark’s MLlib is used to implement an SVM that predicts
which files should be taken down.
ING: MLlib is used in its data analytics pipeline
to detect anomalies. Decision trees and k-means are
implemented with MLlib to enable this.
Toyota: Toyota’s Customer 360 insights platform uses
social media data in real time to prioritise customer
reviews and categorise them for business insights.
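
One common way to use k-means for anomaly detection is to flag records that lie far from every cluster centre. The sketch below is plain Python and is not taken from any of the companies above; the function name flag_anomalies, the centroids, the sample records and the threshold are all hypothetical.

```python
import math


def flag_anomalies(points, centroids, threshold):
    """Return the points whose distance to the nearest centroid exceeds threshold."""
    flagged = []
    for p in points:
        nearest_distance = min(math.dist(p, c) for c in centroids)
        if nearest_distance > threshold:
            flagged.append(p)
    return flagged


# Hypothetical centroids, as if produced by an earlier k-means run.
centroids = [(1.0, 1.0), (9.0, 9.0)]
records = [(1.2, 0.9), (8.8, 9.1), (5.0, 5.0), (9.3, 8.7)]
anomalies = flag_anomalies(records, centroids, threshold=2.0)
print(anomalies)  # [(5.0, 5.0)]
```

The threshold decides how strict the check is, and in practice it would be tuned on historical data rather than fixed by hand.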

ML vs MLlib
There are two main machine learning packages:
spark.mllib and spark.ml. The former is the original
version and has its API built on top of RDDs. The latter
has a newer, higher-level API built on top of DataFrames
for constructing ML pipelines. The newer package is
recommended because DataFrames make it more versatile
and flexible. Newer releases still support the older
package for backward compatibility. MLlib, being older,
has more features as it was in development longer. Spark
ML allows you to create pipelines that chain data
transformations and machine learning steps. In short, ML
is new, has pipelines and DataFrames, and is easier to
work with. MLlib is old, is built on RDDs and has more
features.
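
The pipeline idea can be sketched without Spark. The classes below are hypothetical stand-ins written in plain Python, not spark.ml’s actual classes; the column name amount and the toy rows are also made up. Each stage learns something in fit and applies it in transform, and the pipeline runs the stages in order.

```python
class Scaler:
    """Stage that learns a column's min and max, then rescales it to [0, 1]."""

    def fit(self, rows):
        values = [r["amount"] for r in rows]
        self.low, self.high = min(values), max(values)
        return self

    def transform(self, rows):
        span = (self.high - self.low) or 1.0  # avoid dividing by zero
        return [{**r, "scaled": (r["amount"] - self.low) / span} for r in rows]


class ThresholdClassifier:
    """Stage that labels a row 1 when its scaled value exceeds the training mean."""

    def fit(self, rows):
        self.cutoff = sum(r["scaled"] for r in rows) / len(rows)
        return self

    def transform(self, rows):
        return [{**r, "prediction": int(r["scaled"] > self.cutoff)} for r in rows]


class Pipeline:
    """Runs stages in order: fit each stage, then feed its output to the next."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [{"amount": 10.0}, {"amount": 20.0}, {"amount": 30.0}, {"amount": 100.0}]
model = Pipeline([Scaler(), ThresholdClassifier()]).fit(train)
new_rows = [{"amount": 15.0}, {"amount": 90.0}]
predictions = [r["prediction"] for r in model.transform(new_rows)]
print(predictions)  # [0, 1]
```

Bundling the stages this way means the same preprocessing is applied at training time and at prediction time, which is the main convenience a pipeline API offers.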
MLlib is a major reason for the popularity and the
widespread use of Apache Spark in the Big Data world. Its
compatibility, scalability, ease of use and rich
functionality have led to its success. It provides many
inbuilt functions and capabilities, which make life easier
for machine learning programmers. Most widely used machine
learning algorithms can be implemented using either
version of the library. In this era of data deluge, such
libraries certainly are a boon to data science.

By: Preet Gandhi
The author is an avid Big Data and data science enthusiast.
She can be reached at [email protected].