this purpose, academic and trade journals, newsletters,
and various online publications dedicated to this technol-
ogy are invaluable.Document Code
Advances beyond standard HTML, such as Extensible Markup Language (XML), embed machine-readable structure within the document itself, thus improving search precision. XML, a simplified subset of SGML, lets authors define a range of discrete document sections. XML enables generic SGML to be served, received, and processed on the Web much as HTML is, thus making available a number of discrete document sections, or fields, that are not available in standard HTML. The potential growth of XML holds great promise for researchers dissatisfied with the imprecise structure of standard HTML documents.
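A brief illustration of the difference: where HTML tags convey mostly presentation, XML element names (hypothetical ones below, since each document community defines its own) mark each field of a document explicitly, so a search can target, say, the author field alone:

    <!-- HTML: tags describe presentation, not meaning -->
    <p><b>Smith, J.</b> (2001). <i>Searching the Web</i>.</p>

    <!-- XML: each field is a discrete, searchable document section -->
    <citation>
      <author>Smith, J.</author>
      <year>2001</year>
      <title>Searching the Web</title>
    </citation>

The Invisible Web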
The Invisible Web refers to information on the Internet
that has value but is not typically indexed by search en-
gines. There are two general categories of material that go
undetected.
The first category consists of information not detected
by the various crawlers and spiders search engines use
to index Internet content. This lack of detection is sometimes due to Web pages that have no hyperlinks pointing to them; pages of this type are considered disconnected. If a crawler cannot reach content, it cannot index it, and since most crawlers rely on hyperlinks as their conduit, pages with no links leading in cannot be found.
Site depth also accounts for invisibility. Some search engines do not index entire sites but instead index only a certain percentage or number of a site's pages. Others index pages only to a certain depth, which means that information several levels below the home page may not be indexed. Information on site coverage is typically found in the "About" pages of a search engine site, and it is well worth investigating when evaluating search engines for research potential.
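Both failure modes can be seen in a toy crawler. The sketch below (the link graph, page names, and depth limit are invented for the example) follows hyperlinks breadth-first from a seed page: the orphan page, which nothing links to, is never discovered, and pages deeper than the crawl limit are skipped.

    from collections import deque

    # Hypothetical link graph: page -> pages it links to.
    LINKS = {
        "home": ["news", "about"],
        "news": ["story1"],
        "about": [],
        "story1": ["archive"],
        "archive": [],
        "orphan": [],   # no page links here: disconnected, hence invisible
    }

    def crawl(seed: str, max_depth: int) -> set[str]:
        """Breadth-first crawl from seed, ignoring pages beyond max_depth."""
        indexed = {seed}
        queue = deque([(seed, 0)])
        while queue:
            page, depth = queue.popleft()
            if depth == max_depth:
                continue  # depth limit reached: deeper pages stay unindexed
            for target in LINKS[page]:
                if target not in indexed:
                    indexed.add(target)
                    queue.append((target, depth + 1))
        return indexed

    print(crawl("home", max_depth=2))  # {'home', 'news', 'about', 'story1'}
    # "archive" lies too deep; "orphan" has no inbound links. Neither is indexed.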
Still other information remains hidden from search engines because its content cannot be interpreted by a crawler. Among the thousands of file types on the Web, there are only a handful that crawlers can reliably identify. Some crawlers are more adept than others, and a few search engines allow the user to specify file type, retrieving files with extensions such as doc, pdf, ppt, gif, jpg, mpg, aud, and others.
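Where file-type restriction is supported, it typically appears as a query operator; for example, Google's filetype: operator limits results to a single format:

    "invisible web" filetype:pdf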
Online forms are another obstruction to crawler technology. Forms act like gates that must be passed before content becomes available. Crawler technology continues to evolve, however, and will soon be capable of passing through at least some forms.
Dynamically generated pages are yet another impediment for crawlers, since these are pages created by a scripting language in response to a database query, form entry, or other user interaction. Such pages vary with each user request, and it is impossible to index them all. The best a crawler can do is index the interface page that fronts them, which may be adequate if the page author provides enough descriptive terminology.

There are also many specialized databases of journal articles, reports, newsletters, corporate and organizational information, and so forth, whose content is hidden from crawlers. To search these databases effectively, one needs to identify the database and search it independently. An Internet search engine will not return Medline or ERIC results; these remain invisible to crawlers.
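To make the dynamic-page obstacle concrete, here is a minimal sketch, with an invented in-memory article store standing in for a real database, of how such a page is assembled per request. Each distinct query yields a distinct page, none of which exists as a static document for a crawler to fetch:

    from urllib.parse import parse_qs

    # Hypothetical article store standing in for a real database.
    ARTICLES = {"101": "XML and Search Precision", "102": "The Invisible Web"}

    def render_page(query_string: str) -> str:
        """Build an HTML page on the fly from a query string such as 'id=101'."""
        article_id = parse_qs(query_string).get("id", [""])[0]
        title = ARTICLES.get(article_id, "Unknown article")
        # The page below exists only for the duration of this request;
        # a crawler following static links never encounters it.
        return f"<html><head><title>{title}</title></head><body><h1>{title}</h1></body></html>"

    print(render_page("id=101"))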
The other primary category of invisible information is hidden information, which goes undetected by crawlers because the author or webmaster wants it that way.
Webmasters use various methods to block access to their sites (Sherman & Price, 2001). Blocking protocols, such as the robots exclusion protocol (a list, kept on the server, of files that are not to be crawled) or "noindex" metatags, will keep many crawlers out. Passwords are a more effective method of thwarting crawlers and keeping information private.
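Both blocking mechanisms are brief in practice: a robots.txt file at the server root lists paths that cooperating crawlers should skip, and a metatag asks that an individual page not be indexed (the /private/ path below is purely illustrative):

    # robots.txt at the server root
    User-agent: *
    Disallow: /private/

    <!-- in the <head> of an individual HTML page -->
    <meta name="robots" content="noindex">

Both are advisory rather than enforced; a crawler must choose to honor them, which is why passwords are the more reliable barrier.

INTERNET COMMUNICATION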
Electronic Mail
Electronic mail was originally created as a convenience or entertainment but quickly became a vital component of the research process, giving researchers instant global access to people and information. A major breakthrough came with the ability to transmit documents, graphics, presentations, and other files by attaching them to e-mail messages. This quickly became an invaluable tool for researchers who were jointly writing and editing. In addition, e-mail grew in popularity as an alert service for articles, publications, products, conferences, and other events of interest to scholars and researchers.

Mailing Lists and Newsgroups
An evolution of e-mail that has become de rigueur for many researchers is the mailing list and the newsgroup. Mailing lists employ software that disseminates or makes available information, commentaries, and questions simultaneously to all subscribers of the list. The first mailing list designed specifically for research, THEORYNET, originated at the University of Wisconsin in 1977. In this, as in all other mailing lists on ARPAnet and the early BITNET, human intervention was required to add subscribers to the list and to distribute the e-mail. By 1985, BITNET had replaced ARPAnet as the academic and research network. Its mailing list, called LISTSERV, was also person-moderated and experienced enormous delays in both subscriptions and mail delivery.
LISTSERV software, invented in 1986, automated this process, and with automation mailing lists have flourished. CataList, the primary directory of LISTSERV lists, listed 210,949 of them at the time of this writing. While the term LISTSERV has become genericized to represent any similar software or group, numerous other push mailing list technologies, such as Majordomo and GNU Mailman, are in existence, along with other directories that index them.
LISTSERVs can be moderated or unmoderated. Moderated lists are mailing lists that employ a human editor to filter submitted messages before they are distributed. Unmoderated lists disseminate messages to all members without such review.
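The mechanics can be sketched in a few lines. The toy reflector below is purely illustrative (real packages such as LISTSERV or Majordomo add subscription management, digests, and archiving); an optional approval function stands in for the human editor of a moderated list:

    class MailingList:
        """Toy reflector: one posted message fans out to every subscriber."""

        def __init__(self, moderator=None):
            self.subscribers = []
            self.moderator = moderator  # approval function; None = unmoderated

        def subscribe(self, address: str) -> None:
            self.subscribers.append(address)

        def post(self, sender: str, message: str) -> None:
            # On a moderated list, an editor filters each submission first.
            if self.moderator is not None and not self.moderator(message):
                return  # rejected: never distributed
            for address in self.subscribers:
                self.deliver(address, sender, message)

        def deliver(self, address: str, sender: str, message: str) -> None:
            print(f"To: {address} | From: {sender} | {message}")

    unmoderated = MailingList()
    unmoderated.subscribe("researcher1@example.edu")
    unmoderated.subscribe("researcher2@example.edu")
    unmoderated.post("author@example.edu", "CFP: Internet research workshop")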