MaximumPC 2004 04





Like tiny ants bringing food to their queen, thousands of
Googlebots scour every corner of the web so that the search
engine can rapidly deliver accurate results to you. What makes
Google different from a simple index of pages? How can Google
search billions of web pages and determine accurate results for
any query in two-tenths of a second? Much like an anthill, many
smaller processes work together to tremendous effect.

Thousands of Googlebots scour the web every day. When they visit a
site, they fetch the text from every page and transmit it back to Google.

Like ants, Googlebots scour the web.
They copy individual pages to the
Repository, where they are indexed for
Google searching. Googlebots fetch
more than 100 pages a second from
the web!

The URL Server uses URLs from the index to make
sure the Googlebots visit every site on the web.
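Taken together, the URL Server and the Googlebots form a classic crawl loop: a frontier of URLs feeds the fetchers, and every fetched page lands in the repository. Here's a minimal Python sketch of that loop; the function name and the toy `FAKE_WEB` pages are invented for illustration, and a real Googlebot would of course issue HTTP requests rather than read from a dictionary.

```python
from collections import deque

# Toy "web": page text and outgoing links keyed by URL
# (a stand-in for real HTTP fetches).
FAKE_WEB = {
    "http://a.example": ("Page A", ["http://b.example"]),
    "http://b.example": ("Page B", ["http://a.example", "http://c.example"]),
    "http://c.example": ("Page C", []),
}

def crawl(seed_urls):
    """URL Server + Googlebot in miniature: a frontier queue feeds
    the fetchers, and every fetched page lands in the repository."""
    frontier = deque(seed_urls)      # the URL Server's to-visit list
    seen = set(seed_urls)
    repository = {}                  # URL -> raw page text
    while frontier:
        url = frontier.popleft()
        text, links = FAKE_WEB[url]  # a real bot would do an HTTP GET here
        repository[url] = text
        for link in links:           # newly discovered URLs join the frontier
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository
```

Starting from a single seed, the loop discovers and fetches every reachable page, which is exactly how the URL Server keeps the bots from missing sites.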

The Document Index keeps track of
every document in the Google cache.
It’s sorted by the Google docID, but
also stores the URL and title of each
document.
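In data-structure terms, the Document Index is a docID-keyed map with a reverse URL lookup bolted on. A small sketch (the class and method names are illustrative, not Google's):

```python
class DocumentIndex:
    """Tracks every cached document by its docID, along with
    the URL and title of each one."""
    def __init__(self):
        self.by_docid = {}   # docID -> (url, title), keyed/sorted by docID
        self.by_url = {}     # reverse lookup: url -> docID

    def add(self, docid, url, title):
        self.by_docid[docid] = (url, title)
        self.by_url[url] = docid

    def lookup(self, docid):
        return self.by_docid[docid]
```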

The Lexicon is the full list of all the
keywords that Google keeps in its index.
The search engine uses the Lexicon to
determine whether a web page contains
terms relevant to your search.
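One way to picture the Lexicon is as a table that interns every distinct keyword to a compact wordID, so the rest of the index can shuffle small integers instead of strings. A hedged sketch, with invented names:

```python
class Lexicon:
    """The full keyword list: each distinct term gets a stable
    wordID, so the index stores small integers, not strings."""
    def __init__(self):
        self.word_to_id = {}

    def intern(self, word):
        word = word.lower()
        if word not in self.word_to_id:
            self.word_to_id[word] = len(self.word_to_id)
        return self.word_to_id[word]

    def __contains__(self, word):
        # Quick check: does this term appear anywhere in the index?
        return word.lower() in self.word_to_id
```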

When you press Search, the search engine
creates a list of all the pages that include
your keywords. Then Google sorts those
pages according to their PageRank scores.
Excerpts from the relevant pages are
retrieved from the Barrels, and the results
page is written, all in two-tenths of a second.
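The two steps above, intersect the hit lists and sort by PageRank, compress into a few lines of Python. This is a simplified sketch: `postings` maps each keyword to the set of docIDs containing it, and `pagerank` maps docIDs to scores; both names are placeholders.

```python
def search(query_terms, postings, pagerank):
    """Hit the index for every term, intersect the hit lists,
    then order the surviving docIDs by PageRank, highest first."""
    hit_sets = [set(postings.get(term, ())) for term in query_terms]
    matches = set.intersection(*hit_sets) if hit_sets else set()
    return sorted(matches, key=lambda d: pagerank.get(d, 0.0), reverse=True)
```

Real Google weighs many more signals (where terms appear on the page, anchor text, proximity), but intersect-then-rank is the backbone.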

After being indexed, processed pages are
stored in Barrels, a large collection area
where pages can be quickly accessed by the
search server whenever they’re needed.
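One simple way to make that quick access work is to shard the stored pages into fixed buckets so the server can jump straight to the right one. This is an assumption-laden sketch (Google's actual barrel layout differs; the docID-modulo scheme here is purely illustrative):

```python
class Barrels:
    """Indexed pages sharded into fixed buckets ("barrels"), so the
    search server only has to open the barrel a docID falls in."""
    def __init__(self, num_barrels=4):
        self.barrels = [dict() for _ in range(num_barrels)]

    def store(self, docid, page):
        self.barrels[docid % len(self.barrels)][docid] = page

    def fetch(self, docid):
        return self.barrels[docid % len(self.barrels)][docid]
```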

PageRank is the not-so-secret secret of
Google’s success. PageRank analyzes the
links between pages and assumes that
pages with more links pointing to them
have better information.
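The idea generalizes naturally to a "power iteration": every page repeatedly splits its current score among the pages it links to, and scores settle toward a fixed point. The sketch below is the textbook form of the algorithm, not Google's production code; the 0.85 damping factor is the conventional choice from the PageRank literature.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank: each page splits its score evenly among
    its outlinks; pages collecting more inbound score rank higher."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline, plus shares from linkers.
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue  # dangling pages pass nothing on (simplification)
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new[target] += share
        rank = new
    return rank
```

In a toy graph where pages "a" and "c" both link to "b", page "b" ends up with the top score, exactly the "more links pointing to them" intuition above.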

The Indexer pulls the important
keywords out of every page in
the Repository. Then it adds a
link to that page into its massive
keyword index.
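That keyword index is what search engineers call an inverted index: instead of mapping pages to words, it maps each word to the pages containing it. A minimal sketch (whitespace tokenization here is a stand-in for Google's real parsing):

```python
def build_index(repository):
    """The Indexer's core move: walk every page in the repository
    and record, for each keyword, which docIDs contain it."""
    index = {}
    for docid, text in repository.items():
        for word in set(text.lower().split()):  # dedupe words per page
            index.setdefault(word, set()).add(docid)
    return index
```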

Once a page gets indexed, all of its URL links
are extracted and stored in the Anchor File.
The URL Resolver associates URLs from the
anchor file with specific docIDs in the index. The
information the URL Resolver generates is used to
compute the PageRank of all documents.
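The URL Resolver's job, translating anchor-file URL pairs into a docID link graph, can be sketched in a few lines. Names and the tuple format of the anchor file are assumptions for illustration; the output is exactly the `links` graph a PageRank computation consumes.

```python
def resolve_anchors(anchor_file, url_to_docid):
    """URL Resolver sketch: turn (source URL, target URL) anchor
    pairs into a docID -> [docID, ...] link graph for PageRank."""
    graph = {}
    for source_url, target_url in anchor_file:
        src = url_to_docid[source_url]
        dst = url_to_docid[target_url]
        graph.setdefault(src, []).append(dst)
    return graph
```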

The Repository is the first stop for a page once it’s been
processed by a Googlebot. Here the page is assigned a docID and
stored along with billions of pages that a Google search covers.
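Assigning docIDs and storing billions of pages compactly might look like the sketch below: a simple counter hands out docIDs, and pages are compressed on the way in. The zlib compression is an illustrative choice, and the class is an invention for this article, not Google's storage format.

```python
import zlib

class Repository:
    """First stop for a fetched page: hand out the next docID and
    keep the compressed page text for the Indexer to process."""
    def __init__(self):
        self.pages = {}       # docID -> compressed page bytes
        self.next_docid = 0

    def add(self, text):
        docid = self.next_docid
        self.next_docid += 1
        self.pages[docid] = zlib.compress(text.encode())
        return docid

    def get(self, docid):
        return zlib.decompress(self.pages[docid]).decode()
```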


GOOGLE BELOW THE SURFACE


Speed, Precision, PageRank

