MaximumPC 2004 04





Like tiny ants bringing food to their queen, thousands of
Googlebots scour every corner of the web so that the search
engine can rapidly deliver accurate results to you. What makes
Google different from a simple index of pages? How can Google
search billions of web pages and determine accurate results for
any query in two-tenths of a second? Much like an anthill, many
smaller processes work together to tremendous effect.

Thousands of Googlebots scour the web every day. When they visit a
site, they fetch the text from every page and transmit it back to Google.

Like ants, Googlebots scour the web.
They copy individual pages to the
Repository, where they are indexed for
Google searching. Googlebots fetch
more than 100 pages a second from
the web!

The URL Server uses URLs from the index to make
sure the Googlebots visit every site on the web.
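Taken together, the URL Server and the Googlebots form a classic crawl loop: a frontier of URLs feeds the fetchers, and every fetched page lands in the repository. Here's a minimal Python sketch of that loop; the function name and the toy `FAKE_WEB` pages are invented for illustration, and a real Googlebot would of course issue HTTP requests rather than read from a dictionary.

```python
from collections import deque

# Toy "web": page text and outgoing links keyed by URL
# (a stand-in for real HTTP fetches).
FAKE_WEB = {
    "http://a.example": ("Page A", ["http://b.example"]),
    "http://b.example": ("Page B", ["http://a.example", "http://c.example"]),
    "http://c.example": ("Page C", []),
}

def crawl(seed_urls):
    """URL Server + Googlebot in miniature: a frontier queue feeds
    the fetchers, and every fetched page lands in the repository."""
    frontier = deque(seed_urls)      # the URL Server's to-visit list
    seen = set(seed_urls)
    repository = {}                  # URL -> raw page text
    while frontier:
        url = frontier.popleft()
        text, links = FAKE_WEB[url]  # a real bot would do an HTTP GET here
        repository[url] = text
        for link in links:           # newly discovered URLs join the frontier
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository
```

Starting from a single seed, the loop discovers and fetches every reachable page, which is exactly how the URL Server keeps the bots from missing sites.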

The Document Index keeps track of
every document in the Google cache.
It’s sorted by the Google docID, but
also stores the URL and title of each
document.
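In data-structure terms, the Document Index is a docID-keyed map with a reverse URL lookup bolted on. A small sketch (the class and method names are illustrative, not Google's):

```python
class DocumentIndex:
    """Tracks every cached document by its docID, along with
    the URL and title of each one."""
    def __init__(self):
        self.by_docid = {}   # docID -> (url, title), keyed/sorted by docID
        self.by_url = {}     # reverse lookup: url -> docID

    def add(self, docid, url, title):
        self.by_docid[docid] = (url, title)
        self.by_url[url] = docid

    def lookup(self, docid):
        return self.by_docid[docid]
```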

The Lexicon is the full list of all the
keywords that Google keeps in its index.
The search engine uses the Lexicon to
determine whether a web page contains
terms relevant to your search.
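One way to picture the Lexicon is as a table that interns every distinct keyword to a compact wordID, so the rest of the index can shuffle small integers instead of strings. A hedged sketch, with invented names:

```python
class Lexicon:
    """The full keyword list: each distinct term gets a stable
    wordID, so the index stores small integers, not strings."""
    def __init__(self):
        self.word_to_id = {}

    def intern(self, word):
        word = word.lower()
        if word not in self.word_to_id:
            self.word_to_id[word] = len(self.word_to_id)
        return self.word_to_id[word]

    def __contains__(self, word):
        # Quick check: does this term appear anywhere in the index?
        return word.lower() in self.word_to_id
```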

When you press Search, the search engine
creates a list of all the pages that include
your keywords. Then Google sorts those
pages according to their PageRank scores.
Excerpts from the relevant pages are
retrieved from the Barrels, and the results
page is written, all in two-tenths of a second.
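The two steps above, intersect the hit lists and sort by PageRank, compress into a few lines of Python. This is a simplified sketch: `postings` maps each keyword to the set of docIDs containing it, and `pagerank` maps docIDs to scores; both names are placeholders.

```python
def search(query_terms, postings, pagerank):
    """Hit the index for every term, intersect the hit lists,
    then order the surviving docIDs by PageRank, highest first."""
    hit_sets = [set(postings.get(term, ())) for term in query_terms]
    matches = set.intersection(*hit_sets) if hit_sets else set()
    return sorted(matches, key=lambda d: pagerank.get(d, 0.0), reverse=True)
```

Real Google weighs many more signals (where terms appear on the page, anchor text, proximity), but intersect-then-rank is the backbone.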

After being indexed, processed pages are
stored in Barrels, a large collection area
where pages can be quickly accessed by the
search server whenever they’re needed.
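One simple way to make that quick access work is to shard the stored pages into fixed buckets so the server can jump straight to the right one. This is an assumption-laden sketch (Google's actual barrel layout differs; the docID-modulo scheme here is purely illustrative):

```python
class Barrels:
    """Indexed pages sharded into fixed buckets ("barrels"), so the
    search server only has to open the barrel a docID falls in."""
    def __init__(self, num_barrels=4):
        self.barrels = [dict() for _ in range(num_barrels)]

    def store(self, docid, page):
        self.barrels[docid % len(self.barrels)][docid] = page

    def fetch(self, docid):
        return self.barrels[docid % len(self.barrels)][docid]
```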

PageRank is the not-so-secret secret of
Google’s success. PageRank analyzes the
links between pages and assumes that
pages with more links pointing to them
have better information.
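The idea generalizes naturally to a "power iteration": every page repeatedly splits its current score among the pages it links to, and scores settle toward a fixed point. The sketch below is the textbook form of the algorithm, not Google's production code; the 0.85 damping factor is the conventional choice from the PageRank literature.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank: each page splits its score evenly among
    its outlinks; pages collecting more inbound score rank higher."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline, plus shares from linkers.
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue  # dangling pages pass nothing on (simplification)
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new[target] += share
        rank = new
    return rank
```

In a toy graph where pages "a" and "c" both link to "b", page "b" ends up with the top score, exactly the "more links pointing to them" intuition above.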

The Indexer pulls the important
keywords out of every page in
the Repository. Then it adds a
link to that page into its massive
keyword index.
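That keyword index is what search engineers call an inverted index: instead of mapping pages to words, it maps each word to the pages containing it. A minimal sketch (whitespace tokenization here is a stand-in for Google's real parsing):

```python
def build_index(repository):
    """The Indexer's core move: walk every page in the repository
    and record, for each keyword, which docIDs contain it."""
    index = {}
    for docid, text in repository.items():
        for word in set(text.lower().split()):  # dedupe words per page
            index.setdefault(word, set()).add(docid)
    return index
```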

Once a page gets indexed, all of its URL links
are extracted and stored in the Anchor File.
The URL Resolver associates URLs from the
anchor file with specific docIDs in the index. The
information the URL Resolver generates is used to
compute the PageRank of all documents.
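The URL Resolver's job, translating anchor-file URL pairs into a docID link graph, can be sketched in a few lines. Names and the tuple format of the anchor file are assumptions for illustration; the output is exactly the `links` graph a PageRank computation consumes.

```python
def resolve_anchors(anchor_file, url_to_docid):
    """URL Resolver sketch: turn (source URL, target URL) anchor
    pairs into a docID -> [docID, ...] link graph for PageRank."""
    graph = {}
    for source_url, target_url in anchor_file:
        src = url_to_docid[source_url]
        dst = url_to_docid[target_url]
        graph.setdefault(src, []).append(dst)
    return graph
```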

The Repository is the first stop for a page once it’s been
processed by a Googlebot. Here the page is assigned a docID and
stored along with billions of pages that a Google search covers.
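Assigning docIDs and storing billions of pages compactly might look like the sketch below: a simple counter hands out docIDs, and pages are compressed on the way in. The zlib compression is an illustrative choice, and the class is an invention for this article, not Google's storage format.

```python
import zlib

class Repository:
    """First stop for a fetched page: hand out the next docID and
    keep the compressed page text for the Indexer to process."""
    def __init__(self):
        self.pages = {}       # docID -> compressed page bytes
        self.next_docid = 0

    def add(self, text):
        docid = self.next_docid
        self.next_docid += 1
        self.pages[docid] = zlib.compress(text.encode())
        return docid

    def get(self, docid):
        return zlib.decompress(self.pages[docid]).decode()
```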


GOOGLE BELOW THE SURFACE


Speed, Precision, PageRank

