If and when I need more details about how two reported files actually differ, I either
edit the files or run the file-comparison command on the host platform (e.g., fc on
Windows/DOS, diff or cmp on Unix and Linux). This last step is not portable, but for
my purposes, just finding the differences in a 1,400-file tree was much more critical
than reporting which lines differ in the files flagged in the report.
Of course, since we can always run shell commands in Python, this last step could be
automated by spawning a diff or fc command with os.popen as differences are
encountered (or after the traversal, by scanning the report summary). The output of these
system calls could be displayed verbatim, or parsed for relevant parts.
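As a rough sketch of that automation, something like the following could be called for each differing pair of files. The helper names compare_command and report_difference are hypothetical, the choice between fc and diff is keyed only on sys.platform, and the simple double-quoting of paths is an assumption that will not cover every filename:

```python
import os
import sys

def compare_command(path1, path2):
    """Build a host-specific file-comparison command line (hypothetical helper)."""
    # Assumption: fc is available on Windows, diff everywhere else.
    tool = 'fc' if sys.platform.startswith('win') else 'diff'
    # Double quotes guard against spaces in paths on both kinds of shell.
    return '%s "%s" "%s"' % (tool, path1, path2)

def report_difference(path1, path2):
    """Spawn the comparison command and return its output verbatim."""
    pipe = os.popen(compare_command(path1, path2))
    output = pipe.read()
    pipe.close()    # the exit status is typically nonzero when the files differ
    return output

if __name__ == '__main__':
    # Usage: python thisscript.py file1 file2
    if len(sys.argv) == 3:
        print(report_difference(sys.argv[1], sys.argv[2]))
```

The returned text can be printed as is or scanned for the lines of interest; parsing is left out here because fc and diff format their reports differently.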
We also might try to do a bit better here by opening true text files in text mode to ignore
line-terminator differences caused by transferring across platforms, but it’s not clear
that such differences should be ignored (what if the caller wants to know whether line-
end markers have been changed?). For example, after downloading a website with an
FTP script we’ll meet in Chapter 13, the diffall script detected a discrepancy between
the local copy of a file and the one at the remote server. To probe further, I simply ran
some interactive Python code:
>>> a = open('lp2e-updates.html', 'rb').read()
>>> b = open(r'C:\Mark\WEBSITE\public_html\lp2e-updates.html', 'rb').read()
>>> a == b
False
This verifies that there really is a binary difference between the downloaded and local
versions of the file; to see whether it’s because a Unix or DOS line end snuck into the
file, try again in text mode so that line ends are all mapped to the standard \n character:
>>> a = open('lp2e-updates.html', 'r').read()
>>> b = open(r'C:\Mark\WEBSITE\public_html\lp2e-updates.html', 'r').read()
>>> a == b
True
Sure enough, the files match once line ends are normalized. To find where the difference
is, the following code checks byte by byte until the first mismatch is found (in binary
mode, so that the difference is retained):
>>> a = open('lp2e-updates.html', 'rb').read()
>>> b = open(r'C:\Mark\WEBSITE\public_html\lp2e-updates.html', 'rb').read()
>>> for (i, (ac, bc)) in enumerate(zip(a, b)):
...     if ac != bc:
...         print(i, repr(chr(ac)), repr(chr(bc)))    # bytes yield ints in 3.X
...         break
...
37966 '\r' '\n'
This means that at byte offset 37,966, there is a \r in the downloaded file, but a \n in
the local copy. This line has a DOS line end in one and a Unix line end in the other. To
see more, print text around the mismatch:
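One possible way to do this is sketched below as a reusable helper rather than as the session's actual code. The function names, the context width, and the sample byte strings are all assumptions made for illustration; real use would pass in the two results of open(..., 'rb').read():

```python
def first_mismatch(a, b, context=20):
    """Return (offset, a_slice, b_slice) around the first differing byte
    of two bytes objects, or None if they are identical."""
    for i, (ac, bc) in enumerate(zip(a, b)):
        if ac != bc:
            break
    else:
        if len(a) == len(b):
            return None                 # same length, no mismatch found
        i = min(len(a), len(b))         # one is a prefix of the other
    start = max(0, i - context)
    return i, a[start:i + context], b[start:i + context]

def differs_only_in_line_ends(a, b):
    """True if a and b differ, but match once DOS line ends become \\n."""
    return a != b and a.replace(b'\r\n', b'\n') == b.replace(b'\r\n', b'\n')

if __name__ == '__main__':
    # Stand-ins for the downloaded and local file contents
    dos = b'line one\r\nline two\r\n'
    unix = b'line one\nline two\n'
    print(first_mismatch(dos, unix, context=4))
    print(differs_only_in_line_ends(dos, unix))
```

Slicing the bytes objects, instead of indexing them one item at a time, keeps the escapes such as \r and \n visible in the printed context.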