Wisman WL040/Bidgoli-Vol III-Ch-59 August 14, 2003 18:3 Char Count= 0
HOW SEARCH WORKS: VIEWS FROM THE SEARCH ENGINE

Automated Search Engines
Automated Web search engines have two main tasks: indexing Web information and
answering search queries from the index. First, an indexing program visits a
Web site much as you would with a browser, normally starting at the default
homepage, visiting connected pages, and indexing the site information (see
Figure 6).

Figure 6: How the HTML of Figure 7 would appear in a browser.

Which ranking approach produces the best result
depends upon the user’s search needs. Comprehensive search is the natural
outcome of search based on word match ranking alone but yields no organization
of the results. Existence and exploratory search can benefit from the
reference-based ranking methods of popularity and importance. Popularity
ranking anticipates that the information that many others reference represents
common knowledge of a subject. Importance attempts to refine popularity ranking
by organizing references into supporting groups. Grouping together documents
that have common references will generally provide more homogeneous results and
is best suited for exploratory search where the subject is recognized.

What Search Engines Search
Web pages can be richer sources of search information than traditional
documents such as books and journals because of the natural connections formed
to related pages and the characteristics of the hypertext markup language
(HTML) used for writing Web pages. Web search engines seek to improve upon
traditional retrieval systems by extracting added information from the title,
description, and keyword HTML tags and by analyzing the connecting links to and
from a page.
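
As a rough illustration of the extraction described above, the following Python sketch pulls the title, the description and keyword meta tags, and the outgoing links from a page using the standard library's html.parser module. The class name PageFields, the helper extract_fields, and the SAMPLE page are hypothetical names invented for this example; they do not reproduce Figure 7 or any particular search engine's indexer.

```python
from html.parser import HTMLParser


class PageFields(HTMLParser):
    """Collect the title, description, keywords, and outgoing links of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.keywords = []
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attribute names arrive lowercased
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = attrs.get("content") or ""
            if name == "description":
                self.description = content.strip()
            elif name == "keywords":
                # Keywords are conventionally a comma-separated list.
                self.keywords = [k.strip() for k in content.split(",") if k.strip()]
        elif tag == "a" and attrs.get("href"):
            # Record the hypertext reference a spider would follow.
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_fields(html):
    """Parse an HTML string and return the populated PageFields object."""
    parser = PageFields()
    parser.feed(html)
    parser.close()
    return parser


# Hypothetical sample page, made up for this sketch.
SAMPLE = """<html><head>
<title>Handheld Computers</title>
<meta name="description" content="A short guide to PDAs.">
<meta name="keywords" content="PDA, personal digital assistant">
</head><body>
<h1>Choosing a PDA</h1>
<a href="Figure6.html">Figure 6</a>
</body></html>"""

page = extract_fields(SAMPLE)
print(page.title)
print(page.description)
print(page.keywords)
print(page.links)
```

The sketch records only the links leading from a page. Links to a page can be known only after many pages have been visited, so a real indexer would have to accumulate them across the whole crawl.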
Recognizing the parts of a Web page that attract the attention of indexing
spiders is critical to Web site designers attempting to raise the visibility of
the Web site. Ideally, a Web site designer could give instructions to visiting
spiders on precisely how best to index the page to produce high quality search
results. Unfortunately, self-promoting Web sites generally have a history of
hijacking spider indexing rules for their own benefit. In response to blatant
self-promotion, few spiders observe a strict protocol as to which page to index
or which parts of the page are considered important. However, most spiders do
observe the following common guidelines (Sonnenreich & Macinta, 1998):

Content
The result of a search is the page content that the searcher sees and reads.
The readable text, as displayed by a browser in Figure 6, provides the bulk of
the words indexed by the spider. As noted, stop words are worthless in
distinguishing one page from another and are ignored. Less common words
increase the page rank but are valuable only if a searcher uses them in a
query. Using many different words in a page improves search breadth, but the
words must be obvious for a searcher to use in a query. Including important
keywords in the title, increasing the frequency of a keyword in the text, and
placing keywords near the beginning of the page content can improve page rank
on most search engines. Be aware that repeating a keyword multiple times in the
title may gain a higher ranking, but many search engines ban blatantly bogus
attempts at manipulation and may reject the page or site entirely. The
challenge to the page writer is to find the right keywords: rare enough to
stand out, descriptive of the content subject, and familiar to the searcher.
Bear in mind that most indexing spiders examine only the first few hundred
words of content, so it is important to provide descriptive keywords early in
the content text.

Tags
HTML tags are not generally visible to the reader but do contain information
important to the spider. Along with content keywords, spiders also extract the
page location and may examine HTML tags when indexing a page. The Web site
designer can influence the page rank and provide more descriptive results to
the searcher through the tags. Figure 7 gives the source for an HTML page to
illustrate the page content and use of the following tags.

Keyword: The HTML keyword meta tag contains human-defined keywords to augment
the automated indexing of the page content. One use of the tag is to provide
alternative words or phrases for those in the content, for example, using
“PDA” in the content and “personal digital assistant” as keywords.
Unfortunately, promoters have so often abused the keyword tag that Web search
engines generally ignore it. When search is limited to a trustworthy site, such
as a university Web site, keywords can be valuable to the designer and
searcher.
Description: The description meta tag provides a short content summary for
display when the search engine retrieves the page. Figure 8 illustrates how a
search engine would display the description tag with other page information.

Title: Indexing the title tag independently allows explicit searches on the
title; the search engine can also display the title as part of the page
information, as in Figure 8. As previously mentioned, keywords placed in the
title can also improve page rank.

Heading: The large print of headings catches the attention of the reader and is
also important to an indexing spider. The influence of headings on rank
generally follows the scale of the heading number, so that the weight of the
words in a level 1 heading is greater than the weight of the words in a level 2
heading.

Links: The spider follows link connections to other documents through the
anchor tag and its hypertext reference (href) attribute; for example,
“<a href = ‘Figure6.html’>” directs the spider to follow the link to the index
page “Figure6.html.” The popularity and importance methods would
generally rank a page with many links from other pages