Open Source For You — December 2017


CodeSport


Sandya Mannarswamy

14 | DECEMBER 2017 | OPEN SOURCE FOR YOU | http://www.OpenSourceForU.com


While we have been discussing many
questions in machine learning (ML) and
natural language processing (NLP), I have had a
number of requests from our readers to take up a
real-life ML/NLP problem with a sufficiently large data
set, discuss the issues related to this specific problem
and then go into designing a solution. I think it is
a very good suggestion. Hence, over the next few
columns, we will be focusing on one specific real-life
NLP problem, which is detecting duplicate questions
in community question-answering (CQA) forums.
There are a number of popular CQA forums
such as Yahoo Answers, Quora and StackExchange
where netizens post their questions and get answers
from domain experts. CQA forums serve as a
common means of distilling crowd intelligence and
sharing it with millions of people. From a developer's
perspective, sites such as StackOverflow fill an
important need by providing guidance and help
to users across the world, 24x7. Given the enormous number
of people who use such forums, and their varied skill
levels, many questions get asked again and again.
Since many users have similar information
needs, answers to new questions can typically be
found, in whole or in part, in the existing
question-answer archive of these forums. Hence,
given a new incoming question, these forums
typically display a list of similar or related questions,
which could immediately satisfy the information
needs of users, without them having to wait for
their new question to be answered by other users.
Many of these forums use simple keyword/tag based
techniques for detecting duplicate questions.
However, the automated lists these forums return
are often inaccurate, which frustrates users
looking for answers. Given the challenges in
identifying duplicate questions, some forums put in
manual effort to tag duplicate questions. However,
this does not scale, given the rate at which new
questions are generated and the domain expertise
needed to tag a question as a duplicate.
Hence, there is a strong requirement for automated
techniques that can help in identifying questions that
are duplicates of an incoming question.
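To make the keyword-based approach mentioned above concrete, here is a minimal sketch of a keyword-overlap ranker. It is not how any particular forum works; the stopword list, the archive of questions and the function names are all invented for illustration.

```python
import re

# Small, hand-picked stopword list; an assumption made for this sketch only.
STOPWORDS = {"a", "an", "the", "is", "in", "to", "how", "do", "i",
             "what", "of", "for", "my", "can"}


def keywords(text):
    """Return the set of lowercase, non-stopword tokens in the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def rank_by_keyword_overlap(new_question, archive):
    """Rank archived questions by how many keywords they share with the new one."""
    query = keywords(new_question)
    scored = [(len(query & keywords(q)), q) for q in archive]
    # Keep only questions that share at least one keyword, best match first.
    matches = [pair for pair in scored if pair[0] > 0]
    return sorted(matches, key=lambda pair: pair[0], reverse=True)


# Hypothetical archive, invented for illustration.
archive = [
    "How do I reverse a list in Python?",
    "How do I sort a list in Python?",
    "What is the best way to reverse a string in Java?",
    "How can I install Python on Ubuntu?",
]

ranked = rank_by_keyword_overlap("How to reverse a list in Python?", archive)
for score, question in ranked:
    print(score, question)
```

The true duplicate comes out on top, but the question about sorting a list ranks right behind it even though its answer would not help. Shared keywords signal relatedness, not duplication.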
Note that identifying duplicate questions is
different from identifying ‘similar/related’ questions.
Identifying similar questions is somewhat easier, as
it only requires considerable similarity between
the two questions in a pair. On the other
hand, two questions are duplicates only if the answer
to one question can also serve as the answer to the
second question. Identifying duplicates therefore
requires stricter and more rigorous analysis.
At first glance, it appears that we can use
various text similarity measures in NLP to identify
duplicate questions. However, given that people express
their information needs in widely different forms,
it is a big challenge to identify exact duplicate
questions automatically. For example, let us consider
the following two questions:
Q1: I am interested in trying out local cuisine.
Can you please recommend some local cuisine
restaurants that are wallet-friendly in Paris?
Q2: I like to try local cuisine whenever I travel.
I would like some recommendations for restaurants
which are not too costly, but serve authentic local
cuisine in Athens?
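For readers who want to experiment with this pair, here is a minimal sketch of one simple lexical measure: cosine similarity over raw word counts. This is just one of many possible measures, chosen for brevity. The variant of Q2 that asks about Paris is constructed here only for comparison, and the helper names are assumptions of this sketch.

```python
import math
import re
from collections import Counter

Q1 = ("I am interested in trying out local cuisine. Can you please recommend "
      "some local cuisine restaurants that are wallet-friendly in Paris?")
Q2 = ("I like to try local cuisine whenever I travel. I would like some "
      "recommendations for restaurants which are not too costly, but serve "
      "authentic local cuisine in Athens?")


def word_counts(text):
    """Count lowercase alphabetic tokens in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    a, b = word_counts(text_a), word_counts(text_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Q2 as asked (Athens) versus a constructed variant about the same city as Q1.
score_different_city = cosine_similarity(Q1, Q2)
score_same_city = cosine_similarity(Q1, Q2.replace("Athens", "Paris"))

print(f"Q1 vs Q2 (Athens): {score_different_city:.3f}")
print(f"Q1 vs Q2 (Paris):  {score_same_city:.3f}")
```

The absolute value depends heavily on the measure and on preprocessing. What matters here is that changing the city, the one word that decides whether the questions are duplicates, moves the score only slightly.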
Now consider applying different text similarity
measures to this pair. The two questions
score very high on lexical, syntactic and
semantic similarity. While
it is quite easy for humans to focus on the one
dissimilarity, which is that the locations discussed
in the two questions are different, it is not easy
to teach machines that ‘some dissimilarities are
more important than other dissimilarities.’ It also
raises the question of whether the two words ‘Paris’
and ‘Athens’ would be considered as extremely

In this month’s column, we discuss a real life NLP problem, namely,
detecting duplicate questions in community question-answering forums.