AJAX - The Complete Reference

Part III: Advanced Topics

and then inspect the query string:

http://www.google.com/search?hl=en&q=Screen+Scraping&btnG=Google+Search

It is clear from this that we can easily change the query to the more technically appropriate term “Web Scraping,” like so:

http://www.google.com/search?hl=en&q=Web+Scraping&btnG=Search

Since that is all it takes to alter a search, it would seem we could trigger a Google search programmatically quite easily. For example, in PHP we might simply write:

$query = "screen+scraping"; // change to whatever
$url = "http://www.google.com/search?hl=en&q=$query&btnG=Google+Search";

$result = file_get_contents($url);
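The query term above is URL-encoded by hand ("screen+scraping"). As a minimal sketch, assuming we want to accept arbitrary search text, PHP's built-in http_build_query() can do the encoding for us. The helper name build_search_url() is hypothetical and only for illustration; the parameter names hl, q, and btnG are taken from the URL shown above.

```php
<?php
// Hypothetical helper (not from the original text): builds the search URL
// shown above from free-form search text.
function build_search_url(string $terms): string
{
    // http_build_query() URL-encodes each value; with the default encoding
    // type, spaces become "+". The separator is passed explicitly so the
    // result does not depend on the arg_separator.output ini setting.
    $params = http_build_query([
        'hl'   => 'en',
        'q'    => $terms,
        'btnG' => 'Google Search',
    ], '', '&');

    return 'http://www.google.com/search?' . $params;
}

echo build_search_url('Web Scraping'), "\n";
// prints: http://www.google.com/search?hl=en&q=Web+Scraping&btnG=Google+Search
```

The returned string can be passed to file_get_contents() exactly as $url is above.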

Now in $result we are going to get a whole mess of HTML, like so:

<html><head><meta http-equiv=content-type content="text/html; charset=UTF-8"> <title>Screen Scraping - Google Search</title><style>div,td,.n a,.n a:visited{color:#000}.ts ...snip...

<div class=g><link rel="prefetch" href="http://en.wikipedia.org/wiki/Screen_scraping"><h2 class=r><a href="http://en.wikipedia.org/wiki/Screen_scraping" class=l onmousedown="return clk(0,'','','res','1','')"><b>Screen scraping</b> - Wikipedia, the free encyclopedia</a></h2><table border=0 cellpadding=0 cellspacing=0><tr><td class="j"><font size=-1><b>Screen scraping</b> is a technique in which a computer program extracts data from the display output of another program. The program doing the <b>scraping</b> is <b>...</b><br><span class=a>en.wikipedia.org/wiki/<b>Screen</b>_<b>scraping</b> - 34k - </span><nobr>

...snip...

We could try to write some regular expressions or other ad hoc string matching to rip out the pieces we are interested in, or we could rely on the DOM and the various XML capabilities available on the server. Most server-side environments offer something better than brute-force methods, so we instead load the URL and build a DOM tree.

$dom = new DOMDocument();
/* fetch and parse the result */
$url = 'http://www.google.com/search?hl=en&q=screen+scraping&btnG=Google+Search';
@$dom->loadHTMLFile($url);

Then we take the DOM tree and run an XPath query on the results to rip out what we are interested in, in this case some links. After inspecting the result page, it appears that the good organic results have a class of “l” (at least at this point in time), so we pull out only those nodes from the result.

/* use xpath to slice out some tags */
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a[@class="l"]');
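To show what might be done with the matched nodes, here is a minimal, self-contained sketch that runs the same XPath query against an inline sample string instead of a live fetch. The sample markup is a shortened stand-in for the result fragment shown earlier, the example.com link and the $links array are assumptions added for illustration, and libxml_use_internal_errors() is used in place of the @ operator to quiet parser warnings about the non-standard HTML.

```php
<?php
// Stand-in for a fetched result page (assumption: real markup changes over time).
$html = '<html><body>'
      . '<div class=g><h2 class=r>'
      . '<a href="http://en.wikipedia.org/wiki/Screen_scraping" class=l>'
      . '<b>Screen scraping</b> - Wikipedia, the free encyclopedia</a></h2></div>'
      . '<a href="http://example.com/other">not a result link</a>' // hypothetical non-result link
      . '</body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

// Same query as in the text: anchors whose class attribute is exactly "l".
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a[@class="l"]');

// Pull the target URL and the visible link text out of each matched node.
$links = [];
foreach ($nodes as $node) {
    $links[] = [
        'href'  => $node->getAttribute('href'),
        'title' => trim($node->textContent),
    ];
}

print_r($links);
```

Only the anchor carrying class "l" is collected; the second link is skipped because it has no class attribute. If the markup of the result page changes, the query string is the only piece that should need adjusting.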
