Programming Python, 4th Edition, Mark Lutz


Parsers: markup
XML and HTML text parsing


Parsers: grammars
Custom language parsers, both handcoded and generated


Embedding
Running Python code with eval and exec built-ins


And more
Natural language processing


For simpler tasks, Python’s built-in string object is often all we really need. Python
strings can be indexed, concatenated, sliced, and processed with both string method
calls and built-in functions. Our emphasis in this chapter is mostly on higher-level
tools and techniques for analyzing textual information and language, but we’ll
briefly explore each of these techniques in turn. Let’s get started.


Some readers may have come to this chapter seeking coverage of Unicode
text, too, but this topic is not presented here. For a look at Python’s
Unicode support, see Chapter 2’s discussion of string tools, Chapter 4’s
discussion of text and binary file distinctions and encodings, and
Chapter 9’s coverage of text in tkinter GUIs. Unicode also appears in
various Internet and database topics throughout this book (e.g., email
encodings).
Because Unicode is a core language topic, all these chapters will also
refer you to the fuller coverage of Unicode in Learning Python, Fourth
Edition. Most of the topics in this chapter, including string methods and
pattern matching, apply to Unicode automatically simply because the
Python 3.X str string type is Unicode, whether ASCII or wider.
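
As a quick illustration of that last point, here is a minimal sketch that applies ordinary string methods and a pattern match to non-ASCII text. The sample string and the variable names are made up for this demonstration; the only assumption is Python 3.X, where str is always Unicode text.

```python
import re

# Hypothetical sample text containing non-ASCII characters.
text = 'café naïve señor'

shouted = text.upper()             # case conversion is Unicode-aware
words = text.split()               # whitespace splitting works as usual
found = re.findall(r'\w+', text)   # \w matches Unicode word characters in str patterns
encoded = text.encode('utf-8')     # bytes appear only when we encode explicitly

print(shouted)
print(words)
print(found)
print(len(text), len(encoded))     # character count versus encoded byte count
```

Nothing here is Unicode-specific code: the same method calls and pattern would be used for plain ASCII text. The only place the wider characters show through is in the encoded byte string, which is longer than the character count because each accented letter takes two bytes in UTF-8.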

String Method Utilities


The first stop on our text and language tour is the most basic: Python’s string objects
come with an array of text processing tools, and serve as your first line of defense in
this domain. As you undoubtedly know by now, concatenation, slicing, formatting,
and other string expressions are workhorses of most programs (I’m including the newer
format method in this category, as it’s really just an alternative to the % expression):


>>> 'spam eggs ham'[5:10] # slicing: substring extraction
'eggs '
>>> 'spam ' + 'eggs ham' # concatenation (and *, len(), [ix])
'spam eggs ham'
>>> 'spam %s %s' % ('eggs', 'ham') # formatting expression: substitution
'spam eggs ham'
>>> 'spam {} {}'.format('eggs', 'ham') # formatting method: % alternative
'spam eggs ham'

>>> 'spam = "%-5s", %+06d' % ('ham', 99) # more complex formatting
'spam = "ham  ", +00099'
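
Since the format method is billed above as an alternative to the % expression, it may help to see the more complex case coded both ways. The following is a sketch only; the variable names are invented for illustration, and the format specifiers shown are one reasonable translation of the % codes (left-justify in a 5-character field, then a signed, zero-padded 6-character integer).

```python
values = ('ham', 99)

# Formatting expression: %-5s left-justifies, %+06d adds a sign and zero padding.
by_expression = 'spam = "%-5s", %+06d' % values

# Formatting method: {:<5} left-justifies, {:+06d} adds a sign and zero padding.
by_method = 'spam = "{:<5}", {:+06d}'.format(*values)

print(by_expression)
print(by_method)
print(by_expression == by_method)
```

Both forms build the same string here, which is why the two are treated as a single category of tool in this chapter.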

1406 | Chapter 19: Text and Language
