Frequently updated sites like news
sites and blogs get crawled
more often. This essentially means that
the more frequently a page is updated,
the more frequently it gets crawled by
Googlebots, and the more likely it is to
appear higher up in the search results.
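
The crawl-scheduling idea is easy to sketch: revisit a page sooner if its content changed since the last crawl, and back off if it didn’t. A toy Python illustration follows; the interval bounds and the halving/doubling rules are invented for the example, not Google’s actual policy.

import hashlib

class RecrawlScheduler:
    """Toy adaptive scheduler: pages that change often get
    shorter revisit intervals (illustrative values only)."""

    MIN_HOURS = 1
    MAX_HOURS = 24 * 30  # assumed bounds, not Google's

    def __init__(self):
        self.state = {}  # url -> (content_hash, interval_hours)

    def record_visit(self, url, content):
        digest = hashlib.sha1(content.encode()).hexdigest()
        previous = self.state.get(url)
        if previous is None:
            interval = 24  # first visit: try again in a day
        elif previous[0] != digest:
            # content changed: come back sooner
            interval = max(self.MIN_HOURS, previous[1] // 2)
        else:
            # unchanged: back off
            interval = min(self.MAX_HOURS, previous[1] * 2)
        self.state[url] = (digest, interval)
        return interval

sched = RecrawlScheduler()
sched.record_visit("http://example.com/news", "version 1")          # -> 24
print(sched.record_visit("http://example.com/news", "version 2"))   # changed -> 12
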
Before any page can become a search
result, however, Google must index all
the information the page contains. If
a page contains the word “CPU”, then
Google will add a link to this page in
its comprehensive index of all sites that
mention “CPU”. As you might imagine,
indexing is extremely processor-intensive
because every word that appears on your
web page gets added to a gargantuan
petabyte-sized Google index. (One
petabyte is one million gigabytes.) Hence
the giganti-normous server configuration.
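
In miniature, that index is an inverted index: a mapping from each word to the set of pages that contain it. A rough sketch, with a naive whitespace tokenizer standing in for Google’s real parser:

from collections import defaultdict

def build_index(pages):
    """pages: dict of url -> page text. Returns word -> set of urls."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

pages = {
    "http://example.com/a": "the fastest CPU on the market",
    "http://example.com/b": "CPU and GPU benchmarks",
}
index = build_index(pages)
print(index["cpu"])  # both pages mention "CPU"
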
Interestingly, while the indexer
performs this process, it not only
converts each word on a page into a
list of “hits,” it also takes note of font
size and the use of bold and underline
effects. Larger text and underlined fonts
are deemed more important. A copy
of the actual page is then saved in a
doc server repository, and the page is
given a unique DocID that helps the
index quickly retrieve this web page and
generate the snippet description that is
your search result.
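
The hit-list idea can be mocked up as (DocID, word, weight) records, with emphasized text scoring higher, alongside a repository that maps each DocID to a cached copy of the page. The weights and record layout below are invented for illustration; Google’s real hit encoding was a compact binary format.

from collections import namedtuple

Hit = namedtuple("Hit", ["docid", "word", "weight"])

# Illustrative weights: plain text scores 1, emphasized text more.
WEIGHTS = {"plain": 1, "bold": 2, "underline": 2, "heading": 4}

def hits_for(docid, tagged_words):
    """tagged_words: (word, style) pairs from a parsed page."""
    return [Hit(docid, word.lower(), WEIGHTS.get(style, 1))
            for word, style in tagged_words]

doc_repository = {42: "<html>cached copy of the page...</html>"}
print(hits_for(42, [("CPU", "heading"), ("benchmarks", "plain")]))
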
Finally, a URL Resolver scans each
page and takes special note of each
link. Google stores the link, the URL
the link points to, and the page the link
was on. Then it generates a database of
links that’s used by Google’s remarkably
democratic PageRank technology to
determine relative importance. From
there the data gets dumped into a large
collection area known as “barrels”
that are accessed by the actual search
server. To ensure that storing and
retrieving these bits of information is
lightning fast, data is compressed and
decompressed as it moves along.
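
One way to picture the URL Resolver’s output is a table of (source page, target URL, anchor text) records, compressed on the way into a barrel and decompressed on the way out. The zlib round-trip below is only a stand-in for whatever compression scheme Google actually used:

import json
import zlib

links = [
    {"from": "http://example.com/a",
     "to": "http://example.com/b",
     "anchor": "CPU benchmarks"},
]

# Compress the link records before storing them in a "barrel"...
barrel = zlib.compress(json.dumps(links).encode())

# ...and decompress them again when the search server needs them.
restored = json.loads(zlib.decompress(barrel).decode())
assert restored == links
print(len(barrel), "compressed bytes")
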
When you consider that the average
web page has more than one thousand
words, building an up-to-date index
with both the terms and links to the
relevant pages on the web is easily the
most storage- and processor-intensive
activity a search engine must perform. This
is why the Google architecture was
designed to scale: instead of a single
indexing machine churning through
an ever-expanding index queue, many
slower indexing machines run in parallel
to churn through the queue faster than
any single machine could. As the number
of web pages grows, Google simply
throws more cheap PCs into the mix; the
number of bots and indexers grows right
along with the Internet.
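
The scaling argument is easy to demonstrate: partition the queue of pages across many workers and merge their partial indexes afterward. A minimal sketch using Python’s multiprocessing pool, with the per-page work reduced to trivial tokenizing; adding machines amounts to raising the worker count.

from collections import defaultdict
from multiprocessing import Pool

def index_page(page):
    """Worker: build a partial index for one (url, text) pair."""
    url, text = page
    return {word: {url} for word in text.lower().split()}

def merge(partials):
    """Combine the workers' partial indexes into one."""
    index = defaultdict(set)
    for partial in partials:
        for word, urls in partial.items():
            index[word] |= urls
    return index

if __name__ == "__main__":
    queue = [("http://example.com/%d" % i, "cpu review %d" % i)
             for i in range(1000)]
    with Pool(4) as pool:  # more machines = a bigger pool
        index = merge(pool.map(index_page, queue))
    print(len(index["cpu"]))  # all 1,000 pages mention "cpu"
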

The Power of PageRank
The secret to delivering incredible search
results lies in a proprietary technology
Google calls PageRank. PageRank is a score
composed of several different desirability
measurements. One such measurement is
based on the concept of a random surfer
who follows trails of links from page
to page. An algorithm devised by Brin
and Page calculates the probability that
“random visitor X” will be enticed into
clicking certain links, thereby raising the
PageRank of a page likely to be clicked,
and lowering the score of a page that’s
statistically unlikely to be clicked.
Another measurement takes into
account the number of sites pointing to a
specific URL. This means that a page linked
to by multiple sites is given a higher score
than pages with just a few links.

Yet another PageRank measurement
considers the caliber of sites that point
to a URL. For example, a single link from
Yahoo News to your site is worth more
than several links from smaller sites.
Either kind of link will increase your
site’s overall score, but because Yahoo
News is a big site that is itself linked to
by lots of other sites, PageRank assigns a
higher value to a link from it.
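
All of these measurements fall out of the classic PageRank recurrence: a page’s score approximates the chance that the random surfer lands on it, and that chance rises both with the number of inbound links and with the rank of the pages those links come from. A minimal power-iteration sketch (0.85 is the damping factor from Brin and Page’s original paper; the link graph is made up, and dangling pages simply leak rank in this toy version):

def pagerank(links, damping=0.85, iterations=50):
    """links: dict of page -> list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new[target] += share
        rank = new
    return rank

links = {
    "yahoo-news": ["your-site"],  # one link from a big site
    "small-1": ["other-site"],    # several links from small sites
    "small-2": ["other-site"],
    "big-a": ["yahoo-news"],      # the big site is itself well linked
    "big-b": ["yahoo-news"],
}
ranks = pagerank(links)
print(ranks["your-site"] > ranks["other-site"])  # True: quality beats quantity
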
All these factors get rolled into the
PageRank system and then passed
through one more anchor filter that
examines font size and link proximity.
The result is a numerical ranking that’s
amazingly accurate. Also, because Google
combs through and catalogues every link
on a page, it can capture information
in non-HTML formats, such as links to
PowerPoint presentations or PDFs. This is
why Google results are often more than
simple HTML web pages. It’s all in the
name of giving users access to the very
best information regardless of the format.

Deconstructing Search
So how does all this translate into useful
results every time you enter a search
query into the Google home page?
While Google may seem like an ATM
that dispenses answers only after you type
in a request, the search engine is actually
working behind the scenes long before and
even after you type in your query. URLs
are continuously located and added to the
master list that Googlebots dutifully crawl
each day. Words are continually located
and added to the massive index. Web
pages, PDFs, and images are repeatedly
cached and catalogued in the doc server.
Day after day, as the web continues its
exponential growth, Google maintains its
Internet vigilance.
When you fire off a search, it’s the
raw power of PageRank that allows the
server to quickly comb the index and
concoct a custom list of answers that best
fit your query. These answers are then
quickly paired with summary snippets
clipped from the cached document
copy, which explains why a summary
occasionally references a lead item that’s
a few days older than the live page itself.
If you update your site minutes after the
Googlebot has crawled by, a snippet of
the last known version is what shows up
in the summary, even though the actual
rank of your site in the results is based on
its PageRank score. Remember, PageRank
is primarily based on the number and
quality of sites linking to your site.
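
Snippet generation from the cached copy can be sketched as clipping a window of text around the first query match in whatever version of the page the doc server last saved; if the live page changed after the crawl, the snippet is naturally a little stale. The window size here is arbitrary:

def snippet(cached_text, query, width=60):
    """Clip a summary around the first query hit in the cached copy."""
    pos = cached_text.lower().find(query.lower())
    if pos < 0:
        return cached_text[:width] + "..."
    start = max(0, pos - width // 2)
    return "..." + cached_text[start:start + width] + "..."

cache = "Our lab tested the new CPU against last year's flagship chips."
print(snippet(cache, "cpu"))  # stale until the Googlebot crawls again
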
Here’s a real-world example: If 1,000

Google Labs’ Search for the Next Big Thing

If Google the search engine is built
around the concept of constant
scalability, then Google the company
is built around the concept of constant
innovation. Since day one, the company
has been obsessed with drumming up
new ways to make finding information
online easier and easier.
The best place to keep tabs on
the latest developments is Google
Labs. Bookmark this URL and you
won’t go wrong: http://labs.google.com.
Google is serious about listening to
the feedback of the technical community
that samples the new ideas
presented there. As the site puts it:
“Google engineers and researchers
kept looking for a way to show off
their pet projects. This seemed like a
great way for them to get feedback
without forcing every new feature on
all of Google’s users. So, please, send
them a note and let them know if the
technology is useful or not. And be
frank. It doesn’t help anyone if a bad
idea is encouraged to spread like a
noxious weed.”
Amen!

continued on page 48