Open Source For You — December 2017


For U & Me Insight


the test. All the users in the group exposed to Variation
B are referred to as the treatment group. This technique
is used to optimise a conversion rate by measuring the
performance of the treatment against that of the control,
using standard statistical calculations.
This testing methodology removes the guesswork from the
website optimisation process, and hence enables data-informed
decisions that shift business conversations from what
'we think' to what 'we know'. By measuring the impact that
each change has on our metrics, we can make sure it
produces positive results.
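The comparison of treatment against control described above is usually a two-proportion significance test. The sketch below (a minimal illustration, with made-up conversion counts) checks whether Variation B's conversion rate differs significantly from the control's:

```python
import math

def ab_test_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: does Variation B's conversion rate
    differ significantly from Variation A's (the control)?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled rate under the null hypothesis
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical data: control converts 200 of 5000 users, treatment 260 of 5000
z, p = ab_test_z(200, 5000, 260, 5000)
print(round(z, 2), p < 0.05)  # a small p-value means the lift is unlikely to be chance
```

If the p-value falls below the chosen significance level (commonly 0.05), the observed lift is treated as real rather than noise.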


  1. Natural language processing: This area of computational
    linguistics deals with the interactions between
    computers and human languages. In particular, it is
    concerned with programming computers to process
    large natural language corpora. The main challenges
    in natural language processing are natural language
    generation, natural language understanding, connecting
    machine and language perception, or some combination
    of these. Natural language processing research has mostly
    relied on machine learning. Initially, many
    language-processing tasks involved the direct hand-coding
    of rules. Nowadays, the machine learning
    paradigm instead uses statistical inference to
    automatically learn such rules by analysing
    large sets of real-life examples. Many different
    classes of machine learning algorithms have been applied to
    NLP tasks. These algorithms take large sets of 'features'
    as inputs; the features are developed from the input data
    set. Recent research has focused more on statistical models,
    which take probabilistic decisions based on attaching
    real-valued weights to each input feature. Such models
    have the edge because they can express the
    relative certainty of many different possible
    answers rather than committing to only one, producing more
    reliable results when such a model is included
    as one component of a larger system.


How can Big Data benefit your business?
Big Data may seem out of reach for non-profit and
government agencies that do not have the funds to buy
into this new trend. We all assume that 'big'
usually means expensive, but Big Data is not really about
using more resources; rather, it is about the effective use
of the resources at hand. Hence, organisations with limited
financial resources can also stay competitive and grow. For

How is Big Data analysed?
Big Data cannot be analysed manually; doing so would be
a highly challenging and tedious task. To make this
task easier, several techniques help us analyse
large data sets. Let us look at some of the
popular techniques used for data analysis.


  1. Association rule learning: This is a rule-based Big
    Data analysis technique used to discover
    interesting relations between variables
    in large databases. It is intended to identify strong
    rules discovered in databases using different
    measures of what is considered 'interesting'.
    A variety of algorithms is used to
    generate and then test candidate rules. One of the
    most common applications is market basket analysis,
    which helps a retailer determine which products are
    frequently bought together and use that information for
    more focused marketing (like the famous discovery that many
    supermarket shoppers who buy diapers also buy beer).
    Association rules are widely used today in
    continuous production, Web usage mining, bioinformatics
    and intrusion detection. These rules do not take into
    consideration the order of items, either within the
    same transaction or across different transactions.
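The support and confidence measures behind association rules can be sketched in a few lines. The brute-force miner below (a minimal illustration with invented baskets, not a production Apriori implementation) finds single-antecedent rules A -> B whose support and confidence clear given thresholds:

```python
from itertools import combinations

def association_rules(transactions, min_support=0.4, min_confidence=0.6):
    """Brute-force rules A -> B, scored by the two classic
    'interestingness' measures: support and confidence."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        # fraction of transactions containing the whole itemset
        return sum(1 for t in transactions if itemset <= t) / n

    rules = []
    for a, b in combinations(items, 2):
        for ant, con in ((a, b), (b, a)):
            s = support({ant, con})
            if s >= min_support:
                conf = s / support({ant})  # P(con | ant)
                if conf >= min_confidence:
                    rules.append((ant, con, round(s, 2), round(conf, 2)))
    return rules

# Toy market baskets (the classic diapers-and-beer pattern)
baskets = [{"diapers", "beer", "milk"}, {"diapers", "beer"},
           {"diapers", "bread"}, {"milk", "bread"},
           {"diapers", "beer", "bread"}]
for rule in association_rules(baskets):
    print(rule)  # (antecedent, consequent, support, confidence)
```

Note that the sets used for transactions deliberately discard item order, mirroring the point above that association rules ignore ordering within and across transactions.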

  2. A/B testing: This is a technique that compares two
    versions of an application to determine which
    one performs better. It is also called split testing or
    bucket testing. It refers to a specific type of
    randomised experiment in which a set of users
    is presented with two variations of the same product
    (advertisements, emails, Web pages, etc); let us call
    them Variation A and Variation B. All the users exposed
    to Variation A are referred to as the control group,
    since its performance is the baseline against
    which any improvement observed from
    presenting Variation B is measured. At times,
    Variation A is the original version of the
    product, tested against what existed before


Figure 3: Different types of Big Data (Image source: googleimages.com)


Figure 4: Different processes involved in a Big Data system
(Image source: googleimages.com)