The Internet Encyclopedia (Volume 3)


Wisman WL040/Bidgoli-Vol III-Ch-59 August 14, 2003 18:3 Char Count= 0


WEB SEARCH FUNDAMENTALS

Table 2 Server Access Log Fields in Common Logfile Format

Access Log Field               Example
Client IP address              24.10.2.3
Client identity (unreliable)   -
Authenticated client userid    -
Time request completed         [01/May/2002:17:57:03 -0400]
Client request line            "GET /mainpage.htm HTTP/1.1"
Server status code             404
Referring site                 "http://www.food.com/search.html"
Client browser                 "Mozilla/4.08 [en] (WinNT; I;Nav)"

part of the Web and search that is restricted to a single
Web site.

Web Site Discovery
How is a Web site discovered by a search engine? Given
that any one search engine indexes only a small fraction
of the Web (Lawrence & Giles, 1999), the answer is of crit-
ical importance to the Web site designer hoping to attract
visitors. Most search engines accept free submissions for
indexing all or part of a Web site, and paid submission to
multiple search engines is available through service
companies. Links from other sites will also widen visibility
and speed the discovery of a Web site; sites with few links
have a lower probability of being indexed. The most cer-
tain and direct approach is to purchase keywords on a
search engine; a query with a site’s keyword is guaran-
teed to return the site, normally before those listed by the
merit of rank. Once a Web site is discovered by search
engines, the methods examined earlier to influence auto-
mated search become important, though often the best
strategy is to develop and maintain high-quality content
to attract and cultivate loyal visitors. Where content qual-
ity or time is in shorter supply than money, the paid listing
will guarantee that a site is highly ranked by at least one
search engine.

Measuring Success
How can a site’s owner determine if efforts to attract
search engine attention have been a success? Search en-
gines represent the most obvious and direct means to
check if and to what extent a specific search engine has in-
dexed a site. Table 1 contains many of the controls needed
to limit search to a specific Web site. These same controls
used by searchers can also provide feedback to point out
search problems with the Web site. Although tests with individual
search engines will determine if and how a Web site has been
indexed, they will not tell if, why, or how anyone visits.
The site server holds the primary information
on Web site success in the server access log file. The log
holds details about every attempted or successful visit;
Table 2 lists the information retained in the Web server ac-
cess log following the Common Logfile Format. Free and
commercial analysis software can produce detailed sum-
maries and graphs of the log; however, the most telling
information about search success is contained in the fol-
lowing three fields:

Client Request Line: Contains the page on the server the
visitor requested. For visitors arriving from a search
engine, this contains the link to the page indexed by
the spider.
Server Status Code: Status codes starting with a 2
indicate success; those starting with 4 indicate that the
visitor probably encountered a mistake. In Table 2, the
“mainpage.htm” page does not exist, earning the visitor
a “404 Not Found” response from the Web site.
Referring Site: The visitor reports the site that referred
them to this site. In Table 2, the visitor arrived via
a reference to “mainpage.htm” made by the
“www.food.com” search engine.
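
As an illustration, the three fields can be extracted from a raw log line with a short script. The following Python sketch is only one possible approach. It assumes the Table 2 fields appear in order on a single line, with a bytes-sent field (not listed in Table 2) following the status code, as many servers write it. The names LOG_PATTERN, parse_line, and SAMPLE are hypothetical, and the sample line is assembled from the Table 2 values.

```python
import re

# Assumed layout: the Table 2 fields in order, plus a bytes-sent field
# after the status code (an assumption; Table 2 does not list it).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<identity>\S+) (?P<userid>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<browser>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The request line reads "METHOD path PROTOCOL"; keep the path separately.
    parts = fields["request"].split()
    fields["page"] = parts[1] if len(parts) >= 2 else None
    fields["status"] = int(fields["status"])
    return fields

# Sample line built from the Table 2 example values.
SAMPLE = (
    '24.10.2.3 - - [01/May/2002:17:57:03 -0400] '
    '"GET /mainpage.htm HTTP/1.1" 404 - '
    '"http://www.food.com/search.html" "Mozilla/4.08 [en] (WinNT; I;Nav)"'
)

entry = parse_line(SAMPLE)
print(entry["page"], entry["status"], entry["referrer"])
```

For the sample line, the script reports the requested page, the 404 status, and the referring search page, which are the three values interpreted in the list above.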

Of what value is the access log? Examining the log en-
tries points out errors and successes. Counting the num-
ber of visitor requests (i.e., client request line) to each
page immediately grades pages on success in attracting
visitors and, by their absence, identifies those pages that
failed. Investigating the referrer list will show how visi-
tors arrive at a Web site; search engines missing from the
list have not indexed the Web site or rank its pages below
others. A table of visitor page requests with the referring
site will clearly show which search engines successfully
found specific pages and can flag pages that create index-
ing problems for particular search engines. As discussed
earlier, some spiders are stopped by frames, only index the
first few content lines, or crawl a limited number of links
on each site. Pages that are never accessed can indicate in-
dexing problems for the spider or navigation problems for
the visitor. Examining the access log file is a good starting
point for finding these and other potential search engine
and link problems.
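
The tallies described above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for log analysis software. It assumes each log entry has already been reduced to a (page, status, referrer) tuple and that the site owner can supply a list of the site's pages. The function names, the sample entries, the page names other than "mainpage.htm", and the "search.example" host are all hypothetical.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

def referring_host(referrer):
    """Reduce a referrer URL to its host name; '-' or empty means none given."""
    if referrer in ("", "-"):
        return "(none)"
    return urlparse(referrer).netloc

def summarize(entries, site_pages):
    """Tally requests per page, referrals per page, 4xx requests, unvisited pages."""
    page_hits = Counter()
    referrals = defaultdict(Counter)   # page -> Counter of referring hosts
    broken = []                        # (page, referring host) for 4xx requests
    for page, status, referrer in entries:
        host = referring_host(referrer)
        page_hits[page] += 1
        referrals[page][host] += 1
        if 400 <= status < 500:
            broken.append((page, host))
    # Pages on the site that never appear in the log.
    never_requested = sorted(set(site_pages) - set(page_hits))
    return page_hits, referrals, broken, never_requested

# Hypothetical entries for illustration only.
entries = [
    ("/mainpage.htm", 404, "http://www.food.com/search.html"),
    ("/index.htm", 200, "http://www.food.com/search.html"),
    ("/index.htm", 200, "http://search.example/q"),
    ("/recipes.htm", 200, "-"),
]
site_pages = ["/index.htm", "/recipes.htm", "/contact.htm"]

page_hits, referrals, broken, never_requested = summarize(entries, site_pages)
for page, count in page_hits.most_common():
    print(page, count, dict(referrals[page]))
print("4xx requests:", broken)
print("never requested:", never_requested)
```

Grouping referrals by page gives the page-by-referrer table described above, and the list of pages that were never requested marks the candidates for indexing or navigation problems.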
Can the log tell us when a complete site, or at least
part of a site, is broken? Interpreting the three fields for
client request line, server status code, and referring site
from the Table 2 example tells us that the “food.com”
search engine referred a visitor via an indexed page link
to “mainpage.htm” and that link is now broken, as re-
ported by a “404 Not Found” message. A likely reason the
link broke is that the Web site was reorganized since the
last visit by the “food.com” spider or the page location on
the site otherwise changed. Should we inform “food.com”
they now have a problem and need to fix their link to
point to the new page location? Undoing the reorganization of an active
Web site is no solution because other search engines may
have already indexed the new site organization. A limited