Wisman WL040/Bidgoli-Vol III-Ch-59 August 14, 2003 18:3 Char Count= 0
HTML pages for the spider to index and eventually
guide the searcher to program-generated pages.
Forms: Collecting visitor information is one of the most
important functions of many Web sites, but spiders
do not know how to fill out forms; a site login form
automatically stops a spider. Spiders that do index
the content and links of the page containing the form
create potential problems by leading visitors directly
to the form page from a search engine rather than
through pages intended to precede the form. An exam-
ple would be an airline reservation system with forms
for itinerary and payment. Visitors arriving directly
at the payment form would obviously have problems;
consider adding links back to a starting page.
Robot Exclusion: Forms represent one good reason to
exclude spiders from indexing certain pages. Two stan-
dards exist that instruct well-behaved spiders on ex-
cluding specified pages. The recognized standard is the
“robots.txt” file that lists acceptable and unacceptable
directories, pages, and robots (i.e., spiders). A single
robots.txt file exists for the entire Web site, which only
the site administrator can access, creating a mainte-
nance bottleneck as multiple designers make changes
to the site. A better but less accepted solution defines a
special “robots” meta tag to specify how to index each
individual page. Its options allow everything or nothing
(all, none), whether the page itself should be indexed
(index, noindex), and whether its links should be
followed (follow, nofollow).
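The robots.txt mechanism can be exercised from code. The sketch below uses Python's standard urllib.robotparser; the file contents, site name, and excluded paths are illustrative, not taken from any real site.

```python
from urllib import robotparser

# Hypothetical robots.txt: every spider ("User-agent: *") is
# asked to skip a payment form page and a private directory.
rules = """\
User-agent: *
Disallow: /reservations/payment.html
Disallow: /private/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A well-behaved spider checks each URL before fetching it.
rp.can_fetch("*", "http://example.com/index.html")          # allowed
rp.can_fetch("*", "http://example.com/private/notes.html")  # excluded
```

The page-level alternative is a meta tag in the page head, for example <meta name="robots" content="noindex,nofollow">, which any designer can edit without touching the sitewide file.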
Images: Spiders may index the image location, image
title, and alternate text but that is probably all. For
searchers to find the image, include some additional
information about the image in the page content and
the HTML tags.
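Those few attributes are easy to enumerate. The sketch below, using Python's standard html.parser module, plays the part of a spider and records only the location, title, and alternate text of each image; the markup it parses is invented for illustration.

```python
from html.parser import HTMLParser

class ImageIndexer(HTMLParser):
    """Collects the only image attributes a spider can usually index."""
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            self.images.append(
                {k: attrs.get(k) for k in ("src", "title", "alt")})

# Hypothetical page: the caption sentence in the surrounding
# content is what makes the image findable by a text query.
page = ('<p>Our best-selling home refrigerator:</p>'
        '<img src="/img/fridge.jpg" title="Model X100"'
        ' alt="White two-door home refrigerator">')
indexer = ImageIndexer()
indexer.feed(page)
```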
Deeply Linked Pages: Most spiders completely index
only small sites; on larger sites they index only a
limited number of pages. Spiders limit indexing to several con-
necting links deep, ignoring pages linked beyond that
depth. As a rule, keep pages important in attracting
visitors linked directly from the site home page.
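The depth cutoff is easy to picture as a breadth-first walk of the site's link graph. In the sketch below the site map and the two-click limit are hypothetical; the point is that deep.html, three links from the home page, never enters the index.

```python
from collections import deque

def reachable_within(links, start, max_depth):
    """Breadth-first walk of a site's link graph, ignoring pages
    more than max_depth clicks away from the start page."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # too deep; do not follow this page's links
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen

# Hypothetical site map: deep.html sits three clicks from home.
links = {
    "index.html": ["products.html"],
    "products.html": ["fridges.html"],
    "fridges.html": ["deep.html"],
}
indexed = reachable_within(links, "index.html", 2)
```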
Reorganization and Broken Links: Resist the urge to
reorganize the site. Until all the spiders come again,
the new greeting for visitors arriving from search en-
gines to pages that have been renamed or otherwise
permanently hidden may be “404 Not Found.” Adding
new pages to the site is fine; just leave existing page
locations and names alone.

Metasearch
Individual search engines produce results biased in ways
that are unpredictable and invisible to a searcher. Send-
ing the same query to several search engines will return
widely different results that clearly illustrate the bias. One
study on search engine bias (Mowshowitz & Kawaguchi,
2002) demonstrated that querying nine popular search
engines for information on “home refrigerators” pro-
duced 14 different brand names in the cumulative top 50
results of each engine. Reporting of the brands was un-
predictable and uneven across search engines consulted;
several brands were found by only a single search engine
and no search engine found a majority of the 14 brands.
And obviously there was no clue as to the brands not found
at all.
Given that individual search engines index only a small
fraction of the Web and the degree of index overlap among
a group of search engines may be small, it makes sense
to consult multiple search engines. Metasearch engines
automate multiple searches by sending the query to sev-
eral standard search engines and organizing the fusion
of the results uniformly. Unfortunately, metasearch
engines cannot broadcast a query to every search engine;
instead they attempt to minimize the use of limited resources
such as network bandwidth while maximizing informa-
tion quality. Although increasing the number of infor-
mation sources will generally improve recall, it is also
likely that precision will suffer correspondingly. Balanc-
ing these conflicting goals is the key challenge in designing
a metasearch engine.
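The fusion step can be sketched concretely. In the sketch below the engines and their ranked result lists are invented, and the scoring rule, summed reciprocal rank, is just one simple way to reward pages that several engines agree on once duplicates are merged.

```python
def merge_results(result_lists):
    """Fuse ranked result lists from several engines: duplicates
    collapse into one entry, and each URL is scored by the sum of
    the reciprocals of its ranks, so agreement lifts a page."""
    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top results from three engines for one query.
engine_a = ["a.com", "b.com", "c.com"]
engine_b = ["b.com", "a.com", "d.com"]
engine_c = ["b.com", "e.com"]
merged = merge_results([engine_a, engine_b, engine_c])
```

Here b.com, ranked highly by all three engines, rises to the top of the fused list even though two engines put a.com first or second.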
The essential architecture of a metasearch engine
(Dreilinger & Howe, 1997) consists of a dispatch mecha-
nism to determine which search engines receive a specific
query; interface agents to contact and adapt the query
and result formats of each search engine; and a display
mechanism that creates and displays a uniform ranking
of results after removing duplicates. By depending upon
the direct results returned from regular search engines,
metasearch engines cannot expand or improve upon the
information sources. However, to a searcher, metase-
arch represents an obvious improvement over a single
search engine by the simple increase in the number of
search engines consulted and the corresponding increase
in the fraction of the Web examined.

HOW TO BE SEARCHED—VIEWS FROM
THE WEB SITE
The purpose of building a Web site is to attract visitors;
information is the lure. Visitors most often find a new
site via a search engine, so building an easily found and
searched Web site is critical. A study of search success
(Users Don’t Learn to Search Better, 2001) illustrates the
challenges of designing a Web site for search. After watch-
ing 30 searchers search different sites for content that
was on the sites, the study concluded: “The more times
the users searched, the less likely they were to find what
they wanted.” Single searches found the content 55% of
the time, those searching twice found the content only
38% of the time, and those searching more than twice
never found the content. Nearly 23% of the searchers
received a “no results” response on their first searches,
causing most to give up immediately. For those who con-
tinued to search, results only grew worse. Further com-
pounding search problems is the prevalence of invalid
links to pages that are no longer accessible; one study
(Lawrence et al., 2000) gives the percentage of invalid
links ranging from 23% of 1999 pages to 53% of 1993
pages. The collective message seems clear: design the site
and pages for search and continually test that search
works.
Designing a Web site for search is the subject of this
section; the details of page search were covered in the
previous section. This section divides search of a Web site
into two main parts: search that includes the Web site as