Language and the Internet

The language of the Web

Brother-in-Law O’Toole would be normalized in different ways by
different IR systems. And it gets worse, if O’Toole turns out to be the
author of a particular version of a software program, as in
Brother-in-Law O’Toole’s ‘Q & A’ System/Version 1.0. Few of us would
know what to expect of any software system processing this search
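
The divergence can be made concrete with a minimal sketch. The three functions below are hypothetical normalization schemes invented for illustration; they are not taken from any actual IR system, and real systems apply many further rules. They simply show how the same string yields different index terms depending on whether punctuation is treated as a separator, deleted, or kept.

```python
import re

# The search string discussed above (ASCII apostrophes used for simplicity).
QUERY = "Brother-in-Law O'Toole's 'Q & A' System/Version 1.0"


def normalize_split(text):
    """Hypothetical scheme A: lowercase, then treat every run of
    non-alphanumeric characters as a separator."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def normalize_join(text):
    """Hypothetical scheme B: lowercase, split on whitespace only, then
    delete punctuation inside each chunk."""
    tokens = []
    for chunk in text.lower().split():
        cleaned = re.sub(r"[^a-z0-9]", "", chunk)
        if cleaned:  # a chunk such as "&" disappears entirely
            tokens.append(cleaned)
    return tokens


def normalize_whitespace(text):
    """Hypothetical scheme C: split on whitespace, keeping case and
    punctuation exactly as typed."""
    return text.split()


if __name__ == "__main__":
    for scheme in (normalize_split, normalize_join, normalize_whitespace):
        print(scheme.__name__, scheme(QUERY))
```

Under scheme A the surname is indexed as the separate terms o, toole, and s; under scheme B it becomes the single term otooles; under scheme C it remains O'Toole's. A query normalized by one scheme would therefore fail to match a document indexed by another.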
The stop words recognized by different systems pose a special
problem. These usually comprise a list of the grammatical words
which are so frequent and contain so little semantic content that the
search mechanism ignores them. The trouble is that these words often
form an obligatory part of something which does have semantic
content (such as the title of a novel or film) or are homographic
with content words – in which case they become irretrievable. For
example, the Dutch firm for which the ALFIE project (see fn. 23)
was undertaken was called AND (the initials of its founders); as
and would be on any stop-list, a search engine which is not
case-sensitive would make this string virtually impossible to find among
the welter of hits in which the word and is prominent. The AND
case is not unique, as anyone knows who has tried searching for
the discipline of IT – let alone for the Stephen King novel, It.
Several forms which are grammatical in one context become content
items in another, such as a in Vitamin A, A-team, and the Andy
Warhol novel a, or who in Doctor Who, as well as the polysemy
involved in such words as will and may (cf. May). Finding US
states by abbreviation, under these circumstances, can be tricky:
there is no problem with such states as KY (Kentucky) and TX
(Texas), but it would be unwise to try searching for Indiana
(IN), Maine (ME), or Oregon (OR), or even for Ohio (OH) and
Oklahoma (OK). Cross-linguistic differences add further
complications: those computers which block an and or in English exclude
the words for ‘year’ and ‘gold’ in French (as well as a significant part
of English heraldry, where the term or is crucial). C. L. Borgman

