186 Part III — Conquering Gmail
Introducing Basic Scraping
Every page on the web can be scraped: it can be downloaded by a script and have
its content mined and used as the input for a program. The complexity of this task
depends on the way the page itself is coded. One of the key reasons why
XHTML is so encouraged is that, to be correct, XHTML also has to be
well-formed XML. Well-formed XML can be processed with a whole raft of useful
tools that make the job a simple one. Badly formed markup, like that of Gmail, is
different. This “tag soup” requires a more complicated processing model. There are
a few such models, but you’re going to use the one provided by the Perl module
HTML::TokeParser: token parsing.
HTML::TokeParser
Imagine the web page is a stream of tags. With HTML::TokeParser, you leap
from tag to tag, first to last, until you reach the one you want, whereupon you can
grab the content and move on. Because you start at the top of the page and
specify exactly how many times you jump, and to which tags, an HTML::TokeParser
script can look a little complicated, but in reality it’s pretty easy to follow. You can
find the HTML::TokeParser module at
http://search.cpan.org/~gaas/HTML-Parser-3.45/lib/HTML/TokeParser.pm.
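To make the tag-to-tag idea concrete, here is a minimal sketch of the jumping technique. The three-table page is a made-up stand-in, not the real Gmail markup, and the variable names ($html, $stream, $first, $second) are chosen purely for illustration. It assumes the HTML::Parser distribution, which includes HTML::TokeParser, is installed.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical stand-in page (an assumption for this sketch);
# the real Gmail markup is the one shown in Listing A-4.
my $html = <<'HTML';
<html><body>
<table><tr><td>Banner</td></tr></table>
<table><tr><td>Search</td></tr></table>
<table><tr><td>First message</td><td>Second message</td></tr></table>
</body></html>
HTML

# Passing a reference to a scalar makes the parser read from that string.
my $stream = HTML::TokeParser->new(\$html)
    or die "Could not create parser";

# Jump three times from <table> to <table>, landing on the third one.
$stream->get_tag('table') for 1 .. 3;

# Jump to the next cell and grab its text up to the closing </td>.
$stream->get_tag('td');
my $first = $stream->get_trimmed_text('/td');

# Move on: the next jump lands on the following cell.
$stream->get_tag('td');
my $second = $stream->get_trimmed_text('/td');

print "$first\n$second\n";
```

The script never looks backward. Each call to get_tag moves the position forward through the stream, which is why the number and order of the jumps matter so much.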
If you flip to Appendix A, Listing A-4 shows the HTML code of the Gmail Inbox
you want to walk through.
As you can see from the listing, the page is made up of lots of tables. The first
displays the yellow banner advertising the JavaScript-enhanced version. The second
holds the search section. The third holds the left-hand menu, the fourth the
labels, and so on. It is not until you get to the table that starts with the
following code that you reach the Inbox itself:
<table width=100% cellpadding=2 cellspacing=0 border=0 bgcolor=#e8eef7
class=th>
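As a rough sketch, instead of counting jumps by hand you could let a script hop from table to table and check the attributes of each one until it meets this tag. The page below is a cut-down, hypothetical stand-in: only the final table tag is copied from the markup above, and the names $stream, $tables_seen, and $found are illustrative assumptions.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical cut-down page; only the last <table> tag comes from the Inbox markup.
my $html = <<'HTML';
<table><tr><td>Banner</td></tr></table>
<table><tr><td>Search</td></tr></table>
<table width=100% cellpadding=2 cellspacing=0 border=0 bgcolor=#e8eef7
class=th><tr><td>Placeholder row</td></tr></table>
HTML

my $stream = HTML::TokeParser->new(\$html)
    or die "Could not create parser";

my $tables_seen = 0;
my $found       = 0;

# get_tag returns an array reference for each <table> start tag,
# and undef once the stream runs out.
while (my $tag = $stream->get_tag('table')) {
    $tables_seen++;
    my $attr = $tag->[1];    # hash reference of the tag's attributes

    # Unquoted values such as class=th are still reported as plain strings.
    if (($attr->{class} // '') eq 'th'
        && ($attr->{bgcolor} // '') eq '#e8eef7') {
        $found = 1;
        last;
    }
}

print $found
    ? "Inbox table is table number $tables_seen\n"
    : "Inbox table not found\n";
```

Once the loop stops, the parser is positioned just inside the opening tag of the Inbox table shown above, ready to move through its rows.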
But looking at this section of the code brings you hope and joy. Listing 13-2
shows the code that displays the first and last messages in the Inbox shown in
Figure 13-1.