Hacking Gmail

Part III: Conquering Gmail

Introducing Basic Scraping

Every page on the web can be scraped: a script can download it, mine its content, and use that content as the input for a program. How hard this is depends on the way the page itself is coded. One of the key reasons XHTML is so encouraged is that, to be correct, XHTML also has to be well-formed XML, and well-formed XML can be processed with a whole raft of useful tools that make the job a simple one. Badly formed markup, like that of Gmail, is different. This “tag soup” requires a more complicated processing model. There are a few such models, but you’re going to use the one provided by the Perl module HTML::TokeParser: token parsing.

HTML::TokeParser

Imagine the web page is a stream of tags. With HTML::TokeParser, you leap from tag to tag, first to last, until you reach the one you want, whereupon you can grab the content and move on. Because you start at the top of the page and specify exactly how many times you jump, and to which tags, an HTML::TokeParser script can look a little complicated, but in reality it’s pretty easy to follow. You can find the HTML::TokeParser module at http://search.cpan.org/~gaas/HTML-Parser-3.45/lib/HTML/TokeParser.pm.
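To make the tag-to-tag idea concrete, here is a minimal sketch rather than the book's own listing. The sample markup and the helper name nth_paragraph_text are invented for illustration; only new, get_tag, and get_trimmed_text come from HTML::TokeParser itself.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup, invented for this illustration.
my $html = <<'HTML';
<html><body>
<h1>Inbox</h1>
<p>First paragraph</p>
<p>Second <b>paragraph</b></p>
</body></html>
HTML

# Hypothetical helper: return the trimmed text of the Nth <p> element
# (counting from 1), or nothing if the page has fewer than N paragraphs.
sub nth_paragraph_text {
    my ($source, $n) = @_;

    # Passing a scalar reference makes the parser read from the string.
    my $parser = HTML::TokeParser->new(\$source)
        or die "Cannot create parser";

    # Each get_tag call jumps forward to the next <p> start tag.
    for (1 .. $n) {
        $parser->get_tag('p') or return;
    }

    # Collect all text up to the closing </p>, skipping inline tags.
    return $parser->get_trimmed_text('/p');
}

print nth_paragraph_text($html, 2), "\n";
```

The loop is the "specify exactly how many times you jump" part: the script counts jumps to a named tag and reads text only once it has landed on the right one.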

If you flip to Appendix A, Listing A-4 shows the HTML code of the Gmail Inbox you want to walk through.

As you can see from the listing, the page is made up of lots of tables. The first displays the yellow banner advertising the JavaScript-enhanced version. The second holds the search section. The third holds the left-hand menu, the fourth the labels, and so on, and so on. It is not until you get to the table that starts with the following code that you reach the Inbox itself:

<table width=100% cellpadding=2 cellspacing=0 border=0 bgcolor=#e8eef7 class=th>
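As a rough sketch of reaching that table, the following walks through every table start tag and stops at the one whose class attribute is th. It is an assumption-laden illustration: the page fragment, the sender and subject text, and the helper name first_cell_of_table_with_class are all hypothetical, and matching on the class attribute is used here in place of counting jumps.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical, heavily cut-down stand-in for the Inbox page.
# Only the third <table> tag is taken from the text above.
my $page = <<'HTML';
<table class=banner><tr><td>Try the JavaScript version</td></tr></table>
<table class=search><tr><td>Search Mail</td></tr></table>
<table width=100% cellpadding=2 cellspacing=0 border=0 bgcolor=#e8eef7 class=th>
<tr><td>Example Sender</td><td>Example subject</td></tr>
</table>
HTML

# Hypothetical helper: find the first table with the given class and
# return the trimmed text of its first cell, or nothing if no match.
sub first_cell_of_table_with_class {
    my ($source, $class) = @_;
    my $parser = HTML::TokeParser->new(\$source)
        or die "Cannot create parser";

    # get_tag returns [tag, attribute hash, ...] for each start tag.
    while (my $tag = $parser->get_tag('table')) {
        my $attr = $tag->[1];
        next unless defined $attr->{class} && $attr->{class} eq $class;

        # Now inside the wanted table: jump to its first cell.
        $parser->get_tag('td') or return;
        return $parser->get_trimmed_text('/td');
    }
    return;
}

print first_cell_of_table_with_class($page, 'th'), "\n";
```

Checking an attribute keeps the script working if an extra table appears earlier in the page, whereas a fixed jump count would land on the wrong table.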

But looking at this section of the code brings you hope and joy. Listing 13-2 shows the code that displays the first and last messages in the Inbox shown in Figure 13-1.
