Chapter 6
#
239 222 205 Almond
205 149 117 Antique Brass
We can parse a text file using regular expressions. We need to use a filter to read
(and parse) header rows. We also want to return an iterable sequence of data rows.
This rather complex two-part parsing is based entirely on the two-part – head and
tail – file structure.
The following is a low-level parser that handles both head and tail:
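import re

def row_iter_gpl(file_obj):
    # Matches the four header lines: magic string, palette name, column count, "#".
    header_pat = re.compile(
        r"GIMP Palette\nName:\s*(.*?)\nColumns:\s*(.*?)\n#\n", re.M)

    def read_head(file_obj):
        # Join the first four lines so that one regex can parse the whole header.
        match = header_pat.match(
            "".join(file_obj.readline() for _ in range(4)))
        return (match.group(1), match.group(2)), file_obj

    def read_tail(headers, file_obj):
        # Lazily split each remaining line on whitespace.
        return headers, (next_line.split() for next_line in file_obj)

    return read_tail(*read_head(file_obj))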
We've defined a regular expression that parses all four lines of the header, and
assigned this to the header_pat variable. There are two internal functions for parsing
different parts of the file. The read_head() function parses the header lines. It does
this by reading four lines and merging them into a single long string. This is then
parsed with the regular expression. The results include the two data items from the
header plus an iterator ready to process additional lines.
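To make the header step concrete, here is a minimal sketch that applies a pattern of this shape to four header lines joined into one string. It assumes the capture groups are written as \s*(.*?), and the palette name and column count in the sample text are made up for illustration.

```python
import re

# Pattern of the same shape as header_pat; the \s*(.*?) groups are an assumption.
header_pat = re.compile(
    r"GIMP Palette\nName:\s*(.*?)\nColumns:\s*(.*?)\n#\n", re.M
)

# Hypothetical header lines, as readline() would return them (newline included).
sample_header_lines = [
    "GIMP Palette\n",
    "Name: Crayola\n",
    "Columns: 16\n",
    "#\n",
]

# Merge the four lines into a single string, then parse it in one step.
header_text = "".join(sample_header_lines)
match = header_pat.match(header_text)
name, columns = match.group(1), match.group(2)
```

Both captured values are strings at this point; converting the column count to an integer is left to a later stage.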
The read_tail() function accepts the output from the read_head() function and
parses the iterator over the remaining lines. The parsed information from the header
rows forms a two-tuple that is given to the read_tail() function along with the
iterator over the remaining lines. The remaining lines are merely split on whitespace,
since that fits the description of the GPL file format.
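The tail step can be tried in isolation. The sketch below feeds the two sample data rows through the same kind of generator expression, using io.StringIO as a stand-in for an open file; the variable names are illustrative only.

```python
import io

# In-memory stand-in for the part of the file that follows the header.
tail = io.StringIO("239 222 205 Almond\n205 149 117 Antique Brass\n")

# One list of string tokens per line, produced lazily.
rows = (next_line.split() for next_line in tail)

first = next(rows)
second = next(rows)
```

Note that a multi-word color name such as Antique Brass comes back as two separate tokens, so a later stage has to rejoin everything after the third token.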
For more information, visit the following link:
https://code.google.com/p/grafx2/issues/detail?id=518.
Once we've transformed each line of the file into a canonical tuple-of-strings format,
we can apply the higher level of parsing to this data. This involves conversion and
(if necessary) validation.
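As a rough sketch of what this higher level of parsing might look like, the following converts rows of string tokens into typed values. The Color class, the convert_rows() function, and the 0 to 255 range check are hypothetical names and assumptions introduced for illustration; they are not taken from the parser above.

```python
from typing import Iterable, Iterator, NamedTuple, Sequence


class Color(NamedTuple):
    """Hypothetical container for one converted palette row."""
    red: int
    green: int
    blue: int
    name: str


def convert_rows(rows: Iterable[Sequence[str]]) -> Iterator[Color]:
    """Convert tuple-of-strings rows to Color objects, validating as we go."""
    for row in rows:
        # Validation (assumed rule): at least three numeric components.
        if len(row) < 3:
            raise ValueError(f"expected R G B [name], got {row!r}")
        r, g, b, *name_parts = row
        # Conversion: int() raises ValueError on non-numeric text.
        red, green, blue = int(r), int(g), int(b)
        if not all(0 <= value <= 255 for value in (red, green, blue)):
            raise ValueError(f"component out of range in {row!r}")
        # Rejoin a multi-word name that was split on whitespace.
        yield Color(red, green, blue, " ".join(name_parts))


colors = list(convert_rows([
    ["239", "222", "205", "Almond"],
    ["205", "149", "117", "Antique", "Brass"],
]))
```

Keeping conversion separate from the low-level split means the row iterator stays format-specific while this step only depends on the canonical tuple-of-strings shape.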