Open Source For You — December 2017


Ease of use: MLlib offers APIs in four languages: Java, R, Python and Scala. Programmers can work in the language they already know instead of learning a new one.

Easy to deploy: No pre-installation or conversion is required to use a Hadoop-based data source such as HBase or HDFS. Spark can also run standalone or on an EC2 cluster.

Scalability: The same code works on small or large volumes of data without being changed to suit the volume. As a business grows, it can scale vertically or horizontally without breaking the code down into modules for performance.

Performance: The ML algorithms run up to 100X faster than MapReduce because the framework supports iterative computation. Most of the gain comes from in-memory computing, a speciality of Spark: data that an algorithm revisits on every iteration stays in memory instead of being reread from disk.

Algorithms: The main ML algorithms in the MLlib module cover classification, regression, decision trees, recommendation, clustering, topic modelling, frequent item sets and association rules, among others. The ML workflow utilities include feature transformation, pipeline construction and ML persistence. Singular value decomposition, principal component analysis and hypothesis testing are also available in this library.

Community: Spark is open source software under the Apache Software Foundation. It is tested and updated by a large contributing community. MLlib is the most rapidly expanding component, and new features are added regularly. Contributors submit their own algorithms, and plenty of learning resources are available.
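To see why iterative algorithms gain from keeping data in memory, here is a minimal plain-Python sketch of k-means. It is not MLlib's implementation; the function name kmeans, the sample points and the starting centroids are made up for illustration. The point to notice is that every iteration scans the complete data set again, so a framework that rereads the data from disk on each pass pays that cost repeatedly.

```python
def kmeans(points, centroids, iterations=5):
    """Plain-Python k-means on 2-D points (illustrative only, not MLlib code)."""
    scans = 0  # counts full passes over the data set
    for _ in range(iterations):
        # Assignment step: every iteration walks the whole data set again.
        scans += 1
        clusters = [[] for _ in centroids]
        for x, y in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(px for px, _ in c) / len(c), sum(py for _, py in c) / len(c)) if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, scans


# Hypothetical sample data: two well-separated groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, scans = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids, scans)
```

With five iterations the data is scanned five times. On a cluster, holding those points in memory between scans is the kind of saving the performance claim refers to.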

Basic modules of MLlib

SciKit-Learn: This module contains many basic ML algorithms that perform the various tasks listed below.
Classification: Random forest, nearest neighbour, SVM, etc.
Regression: Ridge regression, support vector regression, lasso, logistic regression, etc.
Clustering: Spectral clustering, k-means clustering, etc.
Decomposition: PCA, non-negative matrix factorisation, independent component analysis, etc.

Mahout: This module contains many basic ML algorithms that perform the tasks listed below.
Classification: Random forest, logistic regression, naive Bayes, etc.
Collaborative filtering: ALS, etc.


Clustering: k-means, fuzzy k-means, etc.
Decomposition: SVD, randomised SVD, etc.

Spark MLlib use cases

Spark’s MLlib is used frequently in marketing optimisation, security monitoring, fraud detection, risk assessment, operational optimisation, preventative maintenance, etc.

Here are some popular use cases.

NBC Universal: International cable TV generates a huge amount of data. To reduce costs, NBC takes its media offline when it is not in use. Spark’s MLlib is used to implement an SVM that predicts which files should be taken down.

ING: MLlib is used in the data analytics pipeline to detect anomalies. Decision trees and k-means, as implemented in MLlib, enable this.

Toyota: Toyota’s Customer 360 insights platform uses social media data in real time to prioritise customer reviews and categorise them for business insights.

ML vs MLlib

There are two main machine learning packages: spark.mllib and spark.ml. The former is the original version, with its API built on top of RDDs. The latter has a newer, higher-level API built on top of DataFrames for constructing ML pipelines. The newer package is recommended because DataFrames make it more versatile and flexible; as of Spark 2.0, the RDD-based package is in maintenance mode. Newer releases still support the older package for backward compatibility. spark.mllib, having been in development longer, has more features. spark.ml lets you create machine learning pipelines that transform the data. In short, spark.ml is new, uses DataFrames, has pipelines and is easier to work with, while spark.mllib is old, uses RDDs and has more features.

MLlib is a major reason for the popularity and widespread use of Apache Spark in the Big Data world. Its compatibility, scalability, ease of use and rich functionality have led to its success. It provides many inbuilt functions and capabilities, which makes life easier for machine learning programmers. Most machine learning algorithms in common use can be implemented with either version of MLlib. In this era of data deluge, such libraries are certainly a boon to data science.
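The pipeline idea behind spark.ml can be sketched in plain Python. This is a toy illustration of the pattern, not Spark code: the classes below are hypothetical stand-ins named after the spark.ml concepts of transformer, estimator, Pipeline and PipelineModel, and rows are plain dictionaries instead of DataFrame rows. A transformer adds a column to the data, an estimator learns something from the data and returns a model (itself a transformer), and a pipeline chains the stages so that fitting and applying happen in one call each.

```python
class Tokenizer:
    """Transformer: adds a 'words' column derived from the 'text' column."""
    def transform(self, rows):
        return [{**r, "words": r["text"].lower().split()} for r in rows]


class WordCounter:
    """Transformer: adds an 'n_words' column derived from 'words'."""
    def transform(self, rows):
        return [{**r, "n_words": len(r["words"])} for r in rows]


class ThresholdModel:
    """Fitted model: flags rows whose word count exceeds the learned threshold."""
    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [{**r, "is_long": r["n_words"] > self.threshold} for r in rows]


class MeanThresholdEstimator:
    """Estimator: learns the mean word count and returns a ThresholdModel."""
    def fit(self, rows):
        mean = sum(r["n_words"] for r in rows) / len(rows)
        return ThresholdModel(mean)


class PipelineModel:
    """Fitted pipeline: applies each fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains stages; estimators are fitted on the output of earlier stages."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimator becomes a model
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


# Hypothetical training and test rows.
train = [
    {"text": "spark is fast"},
    {"text": "mllib runs machine learning on spark clusters"},
]
model = Pipeline([Tokenizer(), WordCounter(), MeanThresholdEstimator()]).fit(train)
result = model.transform([
    {"text": "pipelines chain stages"},
    {"text": "a much longer sentence with many more words in it"},
])
print([r["is_long"] for r in result])
```

The design choice worth noting is that the fitted pipeline is a single object, so the same sequence of steps used in training is applied unchanged to new data.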

By: Preet Gandhi
The author is an avid Big Data and data science enthusiast. She can be reached at [email protected].
