Open Source For You — December 2017

Let’s Try Admin

http://www.OpenSourceForU.com | OPEN SOURCE FOR YOU | DECEMBER 2017 | 55

The world is being flooded with data from all sources. The hottest trend in technology is Big Data, and the evolving field of data science is a way to cope with this data deluge. Machine learning is at the heart of data science. The need of the hour is efficient machine learning frameworks and platforms to process Big Data. Apache Spark is one of the most powerful platforms for analysing Big Data. MLlib is its machine learning library, and is potent enough to process Big Data and apply a wide range of machine learning algorithms to it efficiently.

Apache Spark

Apache Spark is a cluster computing framework that generalises the MapReduce programming model popularised by Hadoop. Spark performs in-memory cluster computing, which speeds up computation by reducing the time spent on disk I/O between processing stages. It is widely used for Big Data problems because of its distributed architecture and parallel processing capabilities. Many users prefer it to Hadoop MapReduce on account of its stream processing and interactive query features. To provide a wide range of services, it has built-in libraries like GraphX, Spark SQL and MLlib. Spark supports Python, Scala, Java and R as programming languages, of which Scala is the most preferred.
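To make the MapReduce model mentioned above more concrete, here is a minimal single-process sketch in plain Python. It is not Spark or Hadoop code; the function names map_phase, shuffle_phase and reduce_phase and the two sample lines are hypothetical, chosen only for illustration. Every intermediate result here stays in memory as an ordinary Python object, which loosely mirrors the idea of keeping data in memory between stages rather than writing it to disk.

```python
from collections import defaultdict
from functools import reduce
from typing import Dict, Iterable, List, Tuple


def map_phase(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the emitted values by key, as a framework would between phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Combine the values collected for each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


# Hypothetical input; a real cluster would split this across many machines.
lines = ["Spark processes Big Data", "MLlib runs on Spark"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["spark"])
```

In a real cluster, the map and reduce steps would run in parallel on partitions of the data, and the grouping step would move records between machines.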

Designated as Spark’s scalable machine learning library, MLlib consists of common algorithms and utilities as well as underlying optimisation primitives.

MLlib

MLlib is Spark’s machine learning library. It is predominantly used from Scala, but it is compatible with Python and Java as well. MLlib was initially contributed by AMPLab at UC Berkeley. It makes machine learning scalable, which is an advantage when handling large volumes of incoming data. The main features of MLlib are listed below.

Machine learning algorithms: Regression, classification, collaborative filtering, clustering, etc.

Featurisation: Selection, dimensionality reduction, transformation, feature extraction, etc.

Pipelines: Construction, evaluation and tuning of ML pipelines

Persistence: Saving/loading of algorithms, models and pipelines

Utilities: Statistics, linear algebra, probability, data handling, etc.

Some lower level machine learning primitives, like the generic gradient descent optimisation algorithm, are also present in MLlib. In recent releases, the primary MLlib API is based on DataFrames rather than RDDs, which gives a more uniform API across languages and lets MLlib benefit from Spark SQL’s optimisations.
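As an illustration of the gradient descent primitive mentioned above, the following is a minimal sketch in plain Python that fits a straight line by minimising mean squared error. It is not MLlib's implementation and does not use Spark; the function name gradient_descent, the learning rate, the iteration count and the toy data are all assumptions made for this example.

```python
from typing import List, Tuple


def gradient_descent(
    points: List[Tuple[float, float]],
    learning_rate: float = 0.05,
    iterations: int = 2000,
) -> Tuple[float, float]:
    """Fit y = w * x + b by minimising the mean squared error."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(iterations):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        # Take a small step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical values for illustration).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
w, b = gradient_descent(data)
print(round(w, 3), round(b, 3))
```

On this toy data, the fitted slope and intercept should come out very close to 2 and 1. A distributed library would compute the gradient sums in parallel over partitions of the data; the update rule itself stays the same.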

The advantages of MLlib

Much of Spark’s power lies in its extensive libraries, which cover a wide range of data analysis tasks. MLlib is at the core of this functionality. It has several advantages.

Spark’s MLlib: Scalable Support for Machine Learning
