Open Source For You — December 2017

Guest Column CodeSport

http://www.OpenSourceForU.com | OPEN SOURCE FOR YOU | DECEMBER 2017 | 15

By: Sandya Mannarswamy
The author is an expert in systems software and is currently
working as a research scientist at Conduent Labs India
(formerly Xerox India Research Centre). Her interests include
compilers, programming languages, file systems and natural
language processing. If you are preparing for systems
software interviews, you may find it useful to visit Sandya’s
LinkedIn group ‘Computer Science Interview Training (India)’
at http://www.linkedin.com/groups?home=&gid=2339182.

dissimilar. Given that one of the popular techniques for word
similarity measures these days is the use of word-embedding
techniques such as Word2Vec, it is highly probable that
‘Paris’ and ‘Athens’ end up getting mapped as reasonably
similar by the word-embedding techniques since they are both
European capital cities and often appear in similar contexts.
Let us consider another example.
Q1: What’s the fastest way to get from Los Angeles to
New York?
Q2: How do I get from Los Angeles to New York in the
least amount of time?
While there may not be good word-based text similarity
between the above two questions, the information needs of
both the questions are satisfied by a common answer and
hence this question pair needs to be marked as a duplicate.
Let us consider yet another example.
Q1: How do I invest in the share market?
Q2: How do I invest in the share market in India?
Though Q1 and Q2 have considerable text similarity, they
are not duplicates since Q2 is a more specific form of question
and, hence, cannot share the same answer as Q1.
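To make the contrast between the last two examples concrete, here is a minimal sketch that measures plain word overlap (Jaccard similarity) on both question pairs. The tokeniser and the helper names `tokens` and `jaccard` are my own assumptions for illustration, not something prescribed by the data set or by any library.

```python
import re


def tokens(question):
    """Lower-case a question and return its set of distinct word tokens."""
    return set(re.findall(r"[a-z0-9']+", question.lower()))


def jaccard(q1, q2):
    """Jaccard overlap: shared distinct words / all distinct words."""
    a, b = tokens(q1), tokens(q2)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# The two question pairs discussed in the text.
travel_pair = (
    "What's the fastest way to get from Los Angeles to New York?",
    "How do I get from Los Angeles to New York in the least amount of time?",
)
shares_pair = (
    "How do I invest in the share market?",
    "How do I invest in the share market in India?",
)

travel_overlap = jaccard(*travel_pair)  # this pair IS a duplicate
shares_overlap = jaccard(*shares_pair)  # this pair is NOT a duplicate
print(f"travel (duplicate):     {travel_overlap:.2f}")
print(f"shares (not duplicate): {shares_overlap:.2f}")
```

The non-duplicate share market pair scores higher on word overlap than the duplicate travel pair, which is exactly why surface similarity alone is not enough for this task.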
These examples are meant to illustrate the challenges
involved in identifying duplicate questions. Having chosen our
task and defined it, now let us decide what would be our data
set. Last year, the CQA forum Quora released a data set
for the duplicate question detection task. This data set was also
used in a Kaggle competition involving the same task. Hence let
us use this data set for our exploration. It is available at
https://www.kaggle.com/c/quora-question-pairs. So please download
the train.csv and test.csv files for your exploratory data analysis.
Given that this was run as a Kaggle competition, there are
a lot of forum discussions on Kaggle regarding the various
solutions to this task. While I would encourage readers to go
through them to enrich their knowledge, we are not going to
use any non-text features as we attempt to solve this problem.
For instance, many of the winners have used question ID
as a feature in their solution. Some others have used graph
features, such as learning the number of neighbours that
a duplicate question pair would have compared to a non-duplicate
question pair. However, we felt that these features are extraneous
to the text and are quite dependent on the data. Hence, in order to
arrive at a reliable solution, we will only look at text-based
features in our approaches.
As with any ML/NLP task, let us begin with some
exploratory data analysis. Here are a few questions for our
readers. (Note: Most of these tasks are quite easy and can be
done with simple commands in Python using Pandas, so I
urge you to try them out.)



  1. Find out how many entries there are in train.csv.

  2. What are the columns present in train.csv?

  3. Can you find out whether this is a balanced data set or not?
    How many of the question pairs are duplicates?

  4. Are there any NaNs present in the entries for Question 1
    and Question 2 columns?

  5. Create a Bag of Words classifier and report the accuracy.

I suggest that our readers (specifically those who have
just started exploring ML and NLP) try these experiments and
share the results in a Python Jupyter notebook. Please do send
me the pointer to your notebook and we can discuss it in this
column. Another exercise that is usually recommended is to
go over the actual data and see what types of questions are
marked as duplicate and which are not.
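As a starting point for the first four questions, here is a minimal Pandas sketch. It assumes the file has the columns id, qid1, qid2, question1, question2 and is_duplicate; please check this against your own download. So that the sketch runs on its own, it reads a tiny in-memory sample (the two example pairs from this column plus one made-up row with a missing question) instead of the real file. The helper name `explore` is hypothetical.

```python
import io

import pandas as pd

# Tiny stand-in for train.csv so the sketch runs on its own. The column
# names are an assumption about the Kaggle file; the rows are illustrative
# and are not taken from the real data set.
SAMPLE_CSV = """id,qid1,qid2,question1,question2,is_duplicate
0,1,2,How do I invest in the share market?,How do I invest in the share market in India?,0
1,3,4,What's the fastest way to get from Los Angeles to New York?,How do I get from Los Angeles to New York in the least amount of time?,1
2,5,6,How do I learn Python?,,0
"""


def explore(df):
    """Answer exploratory questions 1 to 4 for a question-pair DataFrame."""
    question_cols = df[["question1", "question2"]]
    return {
        # Question 1: number of entries (rows).
        "num_entries": len(df),
        # Question 2: column names.
        "columns": list(df.columns),
        # Question 3: how many pairs are duplicates, and what fraction.
        "duplicate_count": int(df["is_duplicate"].sum()),
        "duplicate_fraction": float(df["is_duplicate"].mean()),
        # Question 4: NaN count in each question column.
        "nan_counts": question_cols.isna().sum().to_dict(),
    }


# For the real data, replace the next line with: df = pd.read_csv("train.csv")
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
summary = explore(df)
for key, value in summary.items():
    print(key, "->", value)
```

A duplicate fraction far from 0.5 would indicate an unbalanced data set, and any non-zero NaN count means those rows need to be dropped or filled before building the Bag of Words classifier.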
It would also be good to do some initial text exploration of
the data set. I suggest that readers use the Stanford CoreNLP
tool kit for this purpose because it is more advanced in its text
analysis compared to NLTK. Since Stanford CoreNLP is Java
based, you need to run this as a server and use a client Python
package such as https://pypi.python.org/pypi/stanford-corenlp/.
Please try the following experiments on the Quora data set.

  1. Identify the different Named Entities present in the
    Quora train and test data sets. Can you cluster
    these entities?

  2. Stanford CoreNLP can generate parse trees. Can you
    generate them for different types of questions such as
    ‘what’, ‘where’, ‘when’ and ‘how’ questions?
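Before generating parse trees, it can help to bucket the questions by their question word so that each type can be examined separately. The sketch below is only a rough first pass using a regular expression. It does not call Stanford CoreNLP and is not a substitute for a parse; the helper names `question_type` and `group_by_type` are hypothetical.

```python
import re
from collections import defaultdict

# The question words named in the exercise above.
QUESTION_WORDS = ("what", "where", "when", "how")


def question_type(question):
    """Return the first question word found in the text, or 'other'."""
    for word in re.findall(r"[a-z]+", question.lower()):
        if word in QUESTION_WORDS:
            return word
    return "other"


def group_by_type(questions):
    """Group a list of questions by their detected question word."""
    groups = defaultdict(list)
    for q in questions:
        groups[question_type(q)].append(q)
    return dict(groups)


sample = [
    "What's the fastest way to get from Los Angeles to New York?",
    "How do I invest in the share market?",
    "How do I get from Los Angeles to New York in the least amount of time?",
]
groups = group_by_type(sample)
for qtype, members in groups.items():
    print(qtype, len(members))
```

Each bucket could then be sent to the CoreNLP server to see whether questions of the same type share a similar parse structure.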
While we can apply many of the classical machine learning
techniques after identifying the appropriate features, I thought
it would be more interesting to focus on some of the neural
network based approaches since the data set is sufficiently
large (Quora actually used a random forest classifier initially).
Next month, we will focus on some of the simple neural
network based techniques to attack this problem.
I also wanted to point out a couple of NLP problems related
to this task. One is the task of textual entailment recognition
where, given a premise statement and a hypothesis statement, the
task is to recognise whether the hypothesis follows from the
premise, contradicts the premise or is neutral to the premise.
Note that textual entailment is a 3-class classification problem.
Another closely related task is that of paraphrase identification.
Given two statements S1 and S2, the task is to identify whether
S1 and S2 are paraphrases. Some of the techniques that have
been applied for paraphrase identification and textual entailment
recognition can be leveraged for our task of duplicate question
identification. I’ll discuss more on this in next month’s column.
If you have any favourite programming questions/software
topics that you would like to discuss on this forum,
please send them to me, along with your solutions and
feedback, at sandyasm_AT_yahoo_DOT_com.


