AJAX - The Complete Reference


506 Part III: Advanced Topics


and then inspect the query string:

http://www.google.com/search?hl=en&q=Screen+Scraping&btnG=Google+Search

It is clear from this that we can easily change the query to the more technically
appropriate term “Web Scraping,” like so:

http://www.google.com/search?hl=en&q=Web+Scraping&btnG=Search

Since that is all we need to do to alter a search, it would seem we could automate the
trigger of a Google search quite easily. For example, in PHP we might simply do:

$query = "screen+scraping"; // change to whatever
$url = "http://www.google.com/search?hl=en&q=$query&btnG=Google+Search";

$result = file_get_contents($url);
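
Hand-encoding the query with "+" signs works for simple terms but breaks as soon as the search text contains characters such as "&" or "#". As a minimal sketch, assuming a hypothetical helper named buildSearchUrl() (not part of the original example), PHP's http_build_query() can do the encoding for us:

```php
<?php
// Hypothetical helper (an illustration, not from the original text):
// builds the search URL from plain-text terms so nothing is encoded by hand.
function buildSearchUrl(string $terms): string
{
    $params = [
        'hl'   => 'en',
        'q'    => $terms,          // raw, unencoded search terms
        'btnG' => 'Google Search',
    ];
    // http_build_query() URL-encodes each value; spaces become "+".
    return 'http://www.google.com/search?' . http_build_query($params, '', '&');
}

$url = buildSearchUrl('Web Scraping');
echo $url, "\n";
// prints http://www.google.com/search?hl=en&q=Web+Scraping&btnG=Google+Search
```

The resulting $url can be passed to file_get_contents() exactly as before.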

Now in $result we are going to get a whole mess of HTML, like so:

<html><head><meta http-equiv=content-type content="text/html; charset=UTF-8">
<title>Screen Scraping - Google Search</title><style>div,td,.n a,.n a:
visited{color:#000}.ts
... snip ...

<div class=g><!--m--><link rel="prefetch" href="http://en.wikipedia.org/
wiki/Screen_scraping"><h2 class=r><a href="http://en.wikipedia.org/wiki/
Screen_scraping" class=l onmousedown="return clk(0,'','','res','1','')"><b>
Screen scraping</b> - Wikipedia, the free encyclopedia</a></h2><table
border=0 cellpadding=0 cellspacing=0><tr><td class="j"><font size=-1><b>Screen
scraping</b> is a technique in which a computer program extracts data from
the display output of another program. The program doing the <b>scraping
</b> is <b>...</b><br><span class=a>en.wikipedia.org/wiki/<b>Screen</b>_
<b>scraping</b> - 34k - </span><nobr>

...snip...

We could try to write some regular expressions or other string-matching code to rip out
the pieces we are interested in, or we could rely on the DOM and the XML capabilities
available. Most server-side environments offer better tools than brute-force pattern
matching, so we instead load the URL and build a DOM tree.

$dom = new DOMDocument();
/* fetch and parse the result; @ suppresses the parser warnings raised by sloppy HTML */
$url = 'http://www.google.com/search?hl=en&q=screen+scraping&btnG=Google+Search';
@$dom->loadHTMLFile($url);
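
The @ operator in the snippet above silences every warning, including a failed fetch. As a hedged alternative, the sketch below collects parser warnings with libxml_use_internal_errors() and checks the return value of the load call. It parses an inline sample string (an assumption made so the example runs without network access) with loadHTML(); loadHTMLFile() is the call to use for a file or URL.

```php
<?php
// Sample markup standing in for a fetched result page (an assumption for
// illustration; a real page is far larger and messier).
$html = '<html><body><div class=g><h2 class=r>'
      . '<a href="http://en.wikipedia.org/wiki/Screen_scraping" class=l>Screen scraping</a>'
      . '</h2></div></body></html>';

// Collect libxml warnings in memory instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

$warnings = libxml_get_errors();        // array of LibXMLError objects
libxml_clear_errors();
libxml_use_internal_errors($previous);  // restore the earlier setting

if (!$loaded) {
    exit("Could not parse the document\n");
}

printf(
    "Parsed with %d libxml warning(s); found %d <a> element(s)\n",
    count($warnings),
    $dom->getElementsByTagName('a')->length
);
```

Keeping the warnings in $warnings rather than discarding them makes it possible to log them when a scrape starts returning nothing.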

Then we take the DOM tree and run an XPath query on the results to rip out what we
are interested in, in this case some links. After inspecting the result page, it appears
that the links for the organic results have a class of “l” (at least at this point in time),
so we pull out only those nodes from the result.

/* use xpath to slice out some tags */
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a[@class="l"]');
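
The query returns a DOMNodeList, which still has to be walked to get at the link data. The following sketch runs the same XPath expression against a small inline fragment adapted from the sample output shown earlier, plus one made-up non-matching link, and collects each match's href and text. The fragment and the $links array are assumptions for illustration only.

```php
<?php
// Inline fragment adapted from the sample output; the second link is a
// made-up non-matching anchor to show that the class filter excludes it.
$html = '<div class=g><h2 class=r>'
      . '<a href="http://en.wikipedia.org/wiki/Screen_scraping" class=l>'
      . '<b>Screen scraping</b> - Wikipedia, the free encyclopedia</a></h2></div>'
      . '<p><a href="http://example.com/other" class=x>Not an organic result</a></p>';

libxml_use_internal_errors(true);   // keep parser warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

/* use xpath to slice out the anchors whose class is "l" */
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a[@class="l"]');

$links = [];
foreach ($nodes as $node) {
    $links[] = [
        'href' => $node->getAttribute('href'),
        'text' => trim($node->textContent),   // text with inner tags such as <b> removed
    ];
}

print_r($links);
```

Because the selection depends on a class name chosen by the page's author, a check such as count($links) === 0 is a cheap way to notice when the markup has changed.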