Extracting plain text from HTML (revisited)
Now that you understand the basic principles of the HTML parser class in Python’s
standard library, the plain text extraction module used by Chapter 14’s PyMailGUI
(Example 14-8) should make more sense. That example was an unavoidable forward
reference, which we’re finally able to close.
Rather than repeating its code here, I’ll simply refer you back to that example, as well
as its self-test and test input files, for another example of HTML parsing in Python to
study on your own. It’s essentially a minor elaboration on the examples here: its
parser callback methods simply detect more types of tags.
Because of space concerns, we have to cut short our treatment of HTML parsing here;
as usual, knowing that it exists is enough to get started. For more details on the API,
consult the Python library manual. And for additional HTML support, check the Web
for the 3.X status of third-party HTML parser packages like those mentioned in
Chapter 14.
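As a quick reminder of the general technique, here is a minimal sketch of plain text extraction with the standard library's html.parser module. It is not the code of Example 14-8; the class name TextExtractor, the function html_to_text, and the particular sets of tags treated as line breaks or skipped are assumptions made for illustration only. It also relies on the parser's default conversion of character references in recent Python 3.X releases.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text, dropping tags and script/style bodies."""
    skip_tags = {'script', 'style'}                        # content to ignore
    break_tags = {'p', 'br', 'div', 'li', 'tr', 'h1', 'h2', 'h3'}

    def __init__(self):
        super().__init__()           # entities converted by default in 3.5+
        self.parts = []              # text fragments in document order
        self.skipping = 0            # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skipping += 1
        elif tag in self.break_tags:
            self.parts.append('\n')  # approximate block layout with newlines

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        if not self.skipping:
            self.parts.append(data)

def html_to_text(html):
    """Feed an HTML string to the parser and return the collected text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()                   # flush any text still buffered
    return ''.join(parser.parts)

if __name__ == '__main__':
    sample = '<h1>Title</h1><p>Hello <b>world</b> &amp; more</p>'
    print(html_to_text(sample))
```

A more complete extractor would recognize more tags in its callbacks, which is roughly the direction the book's example takes.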
Advanced Language Tools
If you have a background in parsing theory, you may know that neither regular
expressions nor string splitting is powerful enough to handle more complex language
grammars. Roughly, regular expressions don’t have the stack “memory” required by
true language grammars, and so cannot support arbitrary nesting of language
constructs, such as nested if statements in a programming language. In fact, this is
why the XML and HTML parsers of the prior section are required at all: both are
languages of potentially arbitrary nesting, which are beyond the scope of regular
expressions in general.
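To make the nesting limitation concrete, here is a small sketch under stated assumptions. The names FLAT, TOKEN, and parse_nested are hypothetical, and the parenthesized input format is invented for the demonstration. The FLAT pattern recognizes only a single level of parentheses, while parse_nested uses a pattern just to split the text into tokens and then relies on an explicit stack to follow nesting of any depth.

```python
import re

# One pair of parentheses with no parentheses inside: a fixed nesting depth.
FLAT = re.compile(r'\([^()]*\)')

# Token pattern: a single parenthesis, or a run of other non-space characters.
TOKEN = re.compile(r'[()]|[^\s()]+')

def parse_nested(text):
    """Parse parenthesized text into nested lists using an explicit stack."""
    stack = [[]]                           # stack[-1] is the list being filled
    for tok in TOKEN.findall(text):
        if tok == '(':
            stack.append([])               # go one level deeper
        elif tok == ')':
            if len(stack) == 1:
                raise SyntaxError('unbalanced ")"')
            inner = stack.pop()            # finished level goes into its parent
            stack[-1].append(inner)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise SyntaxError('missing ")"')
    return stack[0]

if __name__ == '__main__':
    print(FLAT.fullmatch('(a b)') is not None)      # flat text matches
    print(FLAT.fullmatch('(a (b) c)') is not None)  # nested text does not
    print(parse_nested('(a (b (c d)) e)'))
```

The pattern could be extended by hand to cover two or three levels, but each extra level needs a longer pattern; the stack version needs no change as the depth grows.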
From a theoretical perspective, regular expressions are really intended to handle just
the first stage of parsing—separating text into components, otherwise known as lexical
analysis. Though patterns can often be used to extract data from text, true language
parsing requires more. There are a number of ways to fill this gap with Python:
Python as language tool
In most applications, the Python language itself can replace custom languages and
parsers—user-entered code can be passed to Python for evaluation with tools such
as eval and exec. By augmenting the system with custom modules, user code in
this scenario has access to both the full Python language and any application-
specific extensions required. In a sense, such systems embed Python in Python.
Since this is a common Python role, we’ll revisit this approach later in this chapter.
Custom language parsers: manual or toolkit
For some sophisticated language analysis tasks, though, a full-blown parser may
still be required. Such parsers can always be written by hand, but since Python is
built for integrating C tools, we can write integrations to traditional parser generator
systems such as yacc and bison, tools that create parsers from language