Frequently updated sites like news
sites and blogs get crawled
more often. This essentially means that
the more frequently a page is updated,
the more frequently it gets crawled by
Googlebots, and the more likely it is to
appear higher up in the search results.
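
The crawl-scheduling idea is easy to sketch: revisit a page sooner if its content changed since the last crawl, and back off if it didn’t. A toy Python illustration follows; the interval bounds and the halving/doubling rules are invented for the example, not Google’s actual policy.

import hashlib

class RecrawlScheduler:
    """Toy adaptive scheduler: pages that change often get
    shorter revisit intervals (illustrative values only)."""

    MIN_HOURS = 1
    MAX_HOURS = 24 * 30  # assumed bounds, not Google's

    def __init__(self):
        self.state = {}  # url -> (content_hash, interval_hours)

    def record_visit(self, url, content):
        digest = hashlib.sha1(content.encode()).hexdigest()
        previous = self.state.get(url)
        if previous is None:
            interval = 24  # first visit: try again in a day
        elif previous[0] != digest:
            # content changed: come back sooner
            interval = max(self.MIN_HOURS, previous[1] // 2)
        else:
            # unchanged: back off
            interval = min(self.MAX_HOURS, previous[1] * 2)
        self.state[url] = (digest, interval)
        return interval

sched = RecrawlScheduler()
sched.record_visit("http://example.com/news", "version 1")          # -> 24
print(sched.record_visit("http://example.com/news", "version 2"))   # changed -> 12
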
Before any page can become a search
result, however, Google must index all
the information the page contains. If
a page contains the word “CPU”, then
Google will add a link to this page in
its comprehensive index of all sites that
mention “CPU”. As you might imagine,
indexing is extremely processor-intensive
because every word that appears on your
web page gets added to a gargantuan
petabyte-sized Google index. (One
petabyte is one million gigabytes.) Hence
the giganti-normous server configuration.
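
In miniature, that index is an inverted index: a mapping from each word to the set of pages that contain it. A rough sketch, with a naive whitespace tokenizer standing in for Google’s real parser:

from collections import defaultdict

def build_index(pages):
    """pages: dict of url -> page text. Returns word -> set of urls."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

pages = {
    "http://example.com/a": "the fastest CPU on the market",
    "http://example.com/b": "CPU and GPU benchmarks",
}
index = build_index(pages)
print(index["cpu"])  # both pages mention "CPU"
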
Interestingly, while the indexer
performs this process, it not only
converts each word on a page into a
list of “hits,” it also takes note of font
size and the use of bold and underline
effects. Larger text and underlined fonts
are deemed more important. A copy
of the actual page is then saved in a
doc server repository, and the page is
given a unique DocID that helps the
index quickly retrieve this web page and
generate the snippet description that is
your search result.
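
The hit-list idea can be mocked up as (DocID, word, weight) records, with emphasized text scoring higher, alongside a repository that maps each DocID to a cached copy of the page. The weights and record layout below are invented for illustration; Google’s real hit encoding was a compact binary format.

from collections import namedtuple

Hit = namedtuple("Hit", ["docid", "word", "weight"])

# Illustrative weights: plain text scores 1, emphasized text more.
WEIGHTS = {"plain": 1, "bold": 2, "underline": 2, "heading": 4}

def hits_for(docid, tagged_words):
    """tagged_words: (word, style) pairs from a parsed page."""
    return [Hit(docid, word.lower(), WEIGHTS.get(style, 1))
            for word, style in tagged_words]

doc_repository = {42: "<html>cached copy of the page...</html>"}
print(hits_for(42, [("CPU", "heading"), ("benchmarks", "plain")]))
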
Finally, a URL Resolver scans each
page and takes special note of each
link. Google stores the link, the URL
the link points to, and the page the link
was on. Then it generates a database of
links that’s used by Google’s remarkably
democratic PageRank technology to
determine relative importance. From
there the data gets dumped into a large
collection area known as “barrels”
that are accessed by the actual search
server. To ensure that storing and
retrieving these bits of information is
lightning fast, data is compressed and
decompressed as it moves along.
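
One way to picture the URL Resolver’s output is a table of (source page, target URL, anchor text) records, compressed on the way into a barrel and decompressed on the way out. The zlib round-trip below is only a stand-in for whatever compression scheme Google actually used:

import json
import zlib

links = [
    {"from": "http://example.com/a",
     "to": "http://example.com/b",
     "anchor": "CPU benchmarks"},
]

# Compress the link records before storing them in a "barrel"...
barrel = zlib.compress(json.dumps(links).encode())

# ...and decompress them again when the search server needs them.
restored = json.loads(zlib.decompress(barrel).decode())
assert restored == links
print(len(barrel), "compressed bytes")
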
When you consider that the average
web page has more than one thousand
words, building an up-to-date index
with both the terms and links to the
relevant pages on the web is easily the
most storage- and processor-intensive
activity a search engine must perform. This
is why the Google architecture was
designed to scale: instead of a single
indexing machine churning through
an ever-expanding index queue, many
slower indexing machines run in parallel
to churn through the queue faster than
any single machine could. As the number
of web pages grows, Google simply
throws more cheap PCs into the mix; the
number of bots and indexers grows right
along with the Internet.
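
The scaling argument is easy to demonstrate: partition the queue of pages across many workers and merge their partial indexes afterward. A minimal sketch using Python’s multiprocessing pool, with the per-page work reduced to trivial tokenizing; adding machines amounts to raising the worker count.

from collections import defaultdict
from multiprocessing import Pool

def index_page(page):
    """Worker: build a partial index for one (url, text) pair."""
    url, text = page
    return {word: {url} for word in text.lower().split()}

def merge(partials):
    """Combine the workers' partial indexes into one."""
    index = defaultdict(set)
    for partial in partials:
        for word, urls in partial.items():
            index[word] |= urls
    return index

if __name__ == "__main__":
    queue = [("http://example.com/%d" % i, "cpu review %d" % i)
             for i in range(1000)]
    with Pool(4) as pool:  # more machines = a bigger pool
        index = merge(pool.map(index_page, queue))
    print(len(index["cpu"]))  # all 1,000 pages mention "cpu"
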

The Power of PageRank
The secret to delivering incredible search
results lies in a proprietary technology
Google calls PageRank. PageRank is a score
composed of several different desirability
measurements. One such measurement is
based on the concept of a random surfer
who follows trails of links from page
to page. An algorithm devised by Brin
and Page calculates the probability that
“random visitor X” will be enticed into
clicking certain links, thereby raising the
PageRank of a page likely to be clicked,
and lowering the score of a page that’s
statistically unlikely to be clicked.
Another measurement takes into
account the number of sites pointing to a
specific URL. This means that a page linked
to by multiple sites is given a higher score
than pages with just a few links.

Yet another PageRank measurement
considers the caliber of sites that point
to a URL. For example, a single link from
Yahoo News to your site is worth more
than several links from smaller sites.
Either kind of link will increase your
site’s overall score, but because Yahoo
News is a big site that is itself linked to
by lots of other sites, PageRank assigns a
higher value to a link from it.
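
All of these measurements fall out of the classic PageRank recurrence: a page’s score approximates the chance that the random surfer lands on it, and that chance rises both with the number of inbound links and with the rank of the pages those links come from. A minimal power-iteration sketch (0.85 is the damping factor from Brin and Page’s original paper; the link graph is made up, and dangling pages simply leak rank in this toy version):

def pagerank(links, damping=0.85, iterations=50):
    """links: dict of page -> list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new[target] += share
        rank = new
    return rank

links = {
    "yahoo-news": ["your-site"],  # one link from a big site
    "small-1": ["other-site"],    # several links from small sites
    "small-2": ["other-site"],
    "big-a": ["yahoo-news"],      # the big site is itself well linked
    "big-b": ["yahoo-news"],
}
ranks = pagerank(links)
print(ranks["your-site"] > ranks["other-site"])  # True: quality beats quantity
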
All these factors get rolled into the
PageRank system and then passed
through one more anchor filter that
examines font size and link proximity.
The result is a numerical ranking that’s
amazingly accurate. Also, because Google
combs through and catalogues every link
on a page, it can capture information
in non-HTML formats, such as links to
PowerPoint presentations or PDFs. This is
why Google results are often more than
simple HTML web pages. It’s all in the
name of giving users access to the very
best information regardless of the format.

Deconstructing Search
So how does all this translate into useful
results every time you enter a search
query into the Google home page?
While Google may seem like an ATM
that dispenses answers only after you type
in a request, the search engine is actually
working behind the scenes long before and
even after you type in your query. URLs
are continuously located and added to the
master list that Googlebots dutifully crawl
each day. Words are continually located
and added to the massive index. Web
pages, PDFs, and images are repeatedly
cached and catalogued in the doc server.
Day after day, as the web continues its
exponential growth, Google maintains its
Internet vigilance.
When you fire off a search, it’s the
raw power of PageRank that allows the
server to quickly comb the index and
concoct a custom list of answers that best
fit your query. These answers are then
quickly paired with summary snippets
clipped from the cached document
copy, which explains why a summary
occasionally references a lead item that’s
a few days older than the live page itself.
If you update your site minutes after the
Googlebot has crawled by, a snippet of
the last known version is what shows up
in the summary, even though the actual
rank of your site in the results is based on
its PageRank score. Remember, PageRank
is primarily based on the number and
quality of sites linking to your site.
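
Snippet generation from the cached copy can be sketched as clipping a window of text around the first query match in whatever version of the page the doc server last saved; if the live page changed after the crawl, the snippet is naturally a little stale. The window size here is arbitrary:

def snippet(cached_text, query, width=60):
    """Clip a summary around the first query hit in the cached copy."""
    pos = cached_text.lower().find(query.lower())
    if pos < 0:
        return cached_text[:width] + "..."
    start = max(0, pos - width // 2)
    return "..." + cached_text[start:start + width] + "..."

cache = "Our lab tested the new CPU against last year's flagship chips."
print(snippet(cache, "cpu"))  # stale until the Googlebot crawls again
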
Here’s a real-world example: If 1,000

Google Labs’ Search for the Next Big Thing

If Google the search engine is built
around the concept of constant
scalability, then Google the company
is built around the concept of constant
innovation. Since day one, the company
has been obsessed with drumming up
new ways to make finding information
online easier and easier.
The best place to keep tabs on
the latest developments is Google
Labs. Bookmark this URL and you
won’t go wrong: http://labs.google.com.
Google is serious about listening to
the feedback of the technical community
that samples the new ideas
presented there. As the site puts it:
“Google engineers and researchers
kept looking for a way to show off
their pet projects. This seemed like a
great way for them to get feedback
without forcing every new feature on
all of Google’s users. So, please, send
them a note and let them know if the
technology is useful or not. And be
frank. It doesn’t help anyone if a bad
idea is encouraged to spread like a
noxious weed.”
Amen!

continued on page 48