The Internet Encyclopedia (Volume 3)


Wisman WL040/Bidgoli-Vol III-Ch-59 August 14, 2003 18:3 Char Count= 0

732 WEB SEARCH FUNDAMENTALS

<html> <head> <meta name="description" content="Human lists and automated search engines."> <meta name="keywords" content="search engine, indexing"> <title>How Search Engines Work</title> </head> <body>

<h1>Automated Search Engines</h1>

Automated Web search engines have two main tasks; one of indexing the Web information, the second of answering search queries from the index. First, an indexing program visits a website much as you would with a browser, normally starting at the default homepage, visiting connected pages and indexing the site information (<a href="Figure6.html">see Figure 6</a>). </body> </html>

Figure 7: An HTML page contains visible parts displayed by the browser and hidden parts that can help spiders index the page more accurately, provide descriptive information to the searcher, and link to other HTML pages.
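The parts of the page named in Figure 7 can be separated with a short program. The following is a minimal sketch in Python, not the method of any particular search engine: the class name PageIndexer and its fields are hypothetical, and the body text of the sample page is abbreviated from Figure 7.

```python
from html.parser import HTMLParser

# Sample page abbreviated from Figure 7.
PAGE = """<html> <head>
<meta name="description" content="Human lists and automated search engines.">
<meta name="keywords" content="search engine, indexing">
<title>How Search Engines Work</title> </head> <body>
<h1>Automated Search Engines</h1>
First, an indexing program visits a website
(<a href="Figure6.html">see Figure 6</a>). </body> </html>"""


class PageIndexer(HTMLParser):
    """Hypothetical collector for the hidden and visible parts of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}    # hidden meta tags, e.g. description and keywords
        self.links = []   # href values of <a> tags, the pages to visit next
        self.text = []    # visible text fragments found in the body
        self._in_title = False
        self._in_body = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._in_body = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_body and data.strip():
            self.text.append(data.strip())


indexer = PageIndexer()
indexer.feed(PAGE)
indexer.close()

print(indexer.title)   # How Search Engines Work
print(indexer.meta)    # description and keywords from the hidden meta tags
print(indexer.links)   # ['Figure6.html']
```

A spider built along these lines would add the collected links to its list of pages to visit and hand the title, meta tags, and visible text to the indexing step.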

relatively high. Persuading other sites to link to a site’s pages is not easy, because it depends on good content to attract recognition. Cross-listing agreements, in which other sites agree to link to a site, are one technique that can provide a quicker start to that recognition.

What Search Engines Ignore

Can designers make their Web sites invisible? Are all spiders the same? Not every word on a Web page will lead the searcher to the page. Most spiders purposely ignore or cannot see large parts of a page that human readers might see and use; carelessness in page design can drive both spiders and human searchers away. The most common spider problems and solutions are the following:

Frames: The purpose of frames is to visibly divide a browser screen into several parts, but unfortunately frames can stop an indexing spider and create confusion for visitors arriving from a search engine. At least three separate pages are needed for frames: a hidden frameset page that defines the frames and links visible pages into the frames, a second page for visible content, and a third that is often for navigation. A spider normally arrives at the hidden frameset page but must understand how to handle frames in order to follow links to the other, visible pages. Spiders that do not understand frames simply stop and never index further. For those spiders and browsers that do not understand

How Search Engines Work
Human lists and automated search engines.
http://www.insearchof.org/how.htm

Figure 8: An example of how a search engine might respond to a query. The word “indexing” is part of the keyword meta tag and embedded in the text content. The title is “How Search Engines Work” and the description meta tag is “Human lists and automated search engines.” The document URL “www.insearchof.org/how.htm” and the title provide links to the complete document.
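The response in Figure 8 can be sketched as a lookup in a small inverted index followed by formatting of the stored title, description, and URL. This is a minimal illustration in Python under stated assumptions: the function names, the record layout, and the choice to index the title, keywords, and text fields are hypothetical, and the single page record reuses the values shown in Figures 7 and 8.

```python
import re
from collections import defaultdict

# Hypothetical record for one indexed page, using values from Figures 7 and 8.
PAGES = [
    {
        "url": "http://www.insearchof.org/how.htm",
        "title": "How Search Engines Work",
        "description": "Human lists and automated search engines.",
        "keywords": "search engine, indexing",
        "text": "First, an indexing program visits a website",
    }
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return [word for word in re.split(r"[^a-z0-9]+", text.lower()) if word]


def build_index(pages):
    """Map each word to the set of URLs whose indexed fields contain it."""
    index = defaultdict(set)
    for page in pages:
        # Assumption: this sketch indexes the title, keywords, and text.
        for field in ("title", "keywords", "text"):
            for word in tokenize(page[field]):
                index[word].add(page["url"])
    return index


def format_result(page):
    """Return title, description, and URL on three lines, as in Figure 8."""
    return f'{page["title"]}\n{page["description"]}\n{page["url"]}'


def search(index, pages_by_url, query):
    """Answer a single-word query from the index."""
    urls = sorted(index.get(query.lower(), set()))
    return [format_result(pages_by_url[url]) for url in urls]


index = build_index(PAGES)
pages_by_url = {page["url"]: page for page in PAGES}

for result in search(index, pages_by_url, "indexing"):
    print(result)
```

The query "indexing" matches here because the word appears both in the keyword meta tag and in the page text; a word that appears in no indexed field returns an empty list.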

frames, the remaining site pages may be unreachable. Because frames cause problems for some spiders, the obvious solution is to avoid the use of frames entirely. However, when the Web site designer is forced to use frames, including the “noframes” tag exposes alternative text for navigation and content to frame-ignorant spiders and browsers. The “noframes” tag designates text to be displayed in place of the framed pages, effectively duplicating the page content and design effort. Spiders that understand frames create a different problem. Visitors can now arrive at a content page directly from the search engine rather than through the main frameset page as the Web site designer intended. Without the main frameset page there is no navigation page either; visitors can become stuck on a dead-end page and, without any navigation, be forced to leave the site. One solution is to place a link to the main frameset or site home page in every navigation and content page to help keep the visitor on the site. Of course, another solution is to avoid frames altogether.

Scripts: Most spiders ignore script programs written in JavaScript or other scripting languages; others simply index the script program text. Spiders that index the script may also index only the first few hundred words of the document and possibly never reach the content. Place important content and keywords before scripts and, for pages that are mostly scripts, include title, keyword, and description tags.

Java Applets and Plug-Ins: To a spider, a Java applet, plug-in, or other browser-executed program is invisible. For indexing purposes, include descriptive content and tags within the page that contains the program.
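The frame and script behaviors described in the items above can be sketched with a small parser. This is an illustration in Python, not the behavior of any actual spider: the class SpiderSketch, its three flags, the helper crawl, and both sample pages are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical frameset page that links a navigation page and a content page.
FRAMESET_PAGE = (
    '<html><frameset cols="20%,80%">'
    '<frame src="nav.html"><frame src="content.html">'
    "</frameset></html>"
)

# Hypothetical page whose script comes before its content.
SCRIPT_FIRST_PAGE = (
    "<html><body><script>var a = 1; var b = 2;</script>"
    "<p>search engine indexing</p></body></html>"
)


class SpiderSketch(HTMLParser):
    """Hypothetical spider; the flags model the behaviors described above."""

    def __init__(self, understands_frames=True, indexes_scripts=False,
                 word_limit=None):
        super().__init__()
        self.understands_frames = understands_frames
        self.indexes_scripts = indexes_scripts
        self.word_limit = word_limit   # None means no limit
        self.links = []                # pages the spider would visit next
        self.words = []                # words the spider would index
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script":
            self._in_script = True
        elif tag == "frame" and self.understands_frames and attrs.get("src"):
            self.links.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script and not self.indexes_scripts:
            return  # most spiders ignore script text
        for word in data.split():
            if self.word_limit is not None and len(self.words) >= self.word_limit:
                return  # stop once the word limit is reached
            self.words.append(word)


def crawl(page, **flags):
    """Parse one page with the given spider flags and return the spider."""
    spider = SpiderSketch(**flags)
    spider.feed(page)
    spider.close()
    return spider


print(crawl(FRAMESET_PAGE, understands_frames=False).links)  # []
print(crawl(FRAMESET_PAGE).links)
print(crawl(SCRIPT_FIRST_PAGE, indexes_scripts=True, word_limit=4).words)
```

With understands_frames set to False the spider finds no links on the frameset page and stops there. With indexes_scripts set to True and a small word_limit, the script text uses up the limit before the content is reached, which is why important content and keywords belong before scripts.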
Server-Generated Pages: Spiders may ignore any unusual link references, such as ones that do not end in “HTM” or “HTML.” For example, a spider will follow the connecting link “<a href="mainpage.HTML">” but may not follow the link to the Web server program of “<a href="mainpage.ASP">” due to the ending “ASP.” Generating the Web site main page with a server program could mean that some spiders ignore the complete site. One solution is to provide some
