Language and the Internet




the more mundane task of matching query term to index term. In
an IR system hosting unrestricted text, the task of matching one
string of characters to another string of characters would be very
difficult unless there was a normalizing algorithm that processed
both the document text and the query text.
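
The idea of passing both the document text and the query text through one normalizing step can be sketched as follows. This is a minimal illustration only, not a description of any particular system: the function names `normalize` and `matches`, and the specific rules (lowercasing, replacing punctuation with spaces, collapsing blank space), are assumptions made for the example.

```python
import re


def normalize(text: str) -> str:
    """Hypothetical normalizer: lowercase, replace punctuation with spaces,
    and collapse runs of whitespace into single spaces."""
    lowered = text.lower()
    # Anything that is not a word character (letter, digit, underscore)
    # or whitespace is replaced by a space.
    no_punct = re.sub(r"[^\w\s]", " ", lowered)
    return " ".join(no_punct.split())


def matches(query: str, document: str) -> bool:
    """True if the normalized query occurs as a run of whole tokens
    in the normalized document."""
    q = normalize(query).split()
    d = normalize(document).split()
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


# The raw strings differ in case, punctuation, and spacing,
# but they match once both sides are normalized the same way.
print(matches("x-ray", "An X-Ray   image."))
print(matches("xray", "An X-Ray   image."))
```

Under these assumed rules the first call succeeds and the second fails, because the hyphen is turned into a space rather than deleted. Each such rule is a design decision about the orthography, not a neutral technical detail.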

But for every normalization decision that has negligible consequences
for linguistic meaning (such as standardizing the amount of blank
space between paragraphs), there are several which result in the loss
of important linguistic detail. If careful attention is not paid to
punctuation, hyphenation, capitalization, and special symbols (such
as &, /, *, $), valuable discriminating information can be lost. When
contrasts from these areas are ignored in searching, as is often the
case, all kinds of anomalies appear, and it is extremely difficult to
obtain consistency. Software designers underestimate the amount of
variation there is in the orthographic system, the pervasive nature
of language change, and the influence context has in deciding whether
an orthographic feature is obligatory or optional.
For example, there are contexts where the ignoring of an apostrophe
in a search is inconsequential (e.g. in St Paul’s Cathedral, where
the apostrophe is often omitted in general usage anyway), but in
other contexts it can be highly confusing. Proper names can be
disrupted – John O’Reilly is not John Oreilly or John O Reilly (a
major problem for such languages as French and Italian, where forms
such as d’ and l’ are common). Hyphens can be critical unifiers, as
in CD-ROM and X-ray. Similar problems arise when slashes and dashes
are used to separate words or parts of words within an expression,
as in many chemical names. Disallowing the ampersand makes it hard
to find such firms as AT&T or P&O, whether solid or spaced; no hits
may be returned, or the P...O string is swamped by other PO hits,
where the ampersand has nothing to
do with their identity. When more than one of these conventions
are involved in the same search, the extent to which the search-
engines simplify the true complexity of a language’s orthography
is quickly appreciated. Brookes^31 points out that a string such as


(^31) Brookes (1998).
