Chapter 6 ■ TLS/SSL
What would an observer know—where an “observer” could be anyone else connected to the coffee shop’s wireless
network or who has control of one of the routers between it and the rest of the Internet? The observer will first see
your machine make a DNS query for pypi.python.org, and unless there are many other web sites hosted at the IP
address that comes back, they will guess that your subsequent conversations with that IP address at port 443 are for
the purpose of viewing https://pypi.python.org web pages. They will know the difference between your HTTP
requests and the server’s responses because HTTP is a lock-step protocol where each request gets written out in
its entirety before a response is then written back. Furthermore, they will know roughly the size of each returned
document, as well as the order in which the documents were fetched.
Think of what the observer could learn! Different pages at https://pypi.python.org will have different sizes,
which the observer could catalog by scanning the site with a web scraper (see Chapter 11). Different genres of pages
will involve different constellations of images and other resources that are referenced in the HTML and need to be
downloaded on first viewing or if they have expired from your browser’s cache. While the outside observer might not
know exactly the searches that you enter and the packages that you eventually visit or download, they will often be
able to make a good guess based on the rough sizes of the files that they see you fetch.
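To make the size-matching idea concrete, here is a minimal sketch of how an observer might compare one observed transfer size against a catalog built by scraping the site. The function name guess_pages, the 5 percent tolerance, and every path and byte count in the catalog are hypothetical values chosen for illustration and are not measurements of the real site.

```python
def guess_pages(observed_size, catalog, tolerance=0.05):
    """Return the catalog paths whose size is close to observed_size.

    catalog maps a page path to its size in bytes; tolerance is the
    fraction of a page's size by which the observation may differ.
    """
    matches = []
    for path, size in catalog.items():
        if abs(size - observed_size) <= tolerance * size:
            matches.append(path)
    return sorted(matches)


# Hypothetical catalog: invented paths and sizes, for illustration only.
catalog = {
    '/pypi': 15300,
    '/pypi/Django': 61750,
    '/pypi/Flask': 47900,
    '/pypi/requests': 48200,
}

if __name__ == '__main__':
    print(guess_pages(61000, catalog))   # one candidate
    print(guess_pages(48000, catalog))   # two candidates of similar size
```

When two pages have nearly the same size, the observer is left with several candidates, which is why the sizes of the images and other resources fetched alongside a page can help narrow the guess.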
The big question about how to keep your browsing habits secret, or to conceal any other personal data that
travels across the public Internet, is far beyond the scope of this book and will involve research into mechanisms
such as online anonymity networks (Tor has been in the news lately, for example) and anonymous remailers. Even
when such mechanisms are employed, your machine is still likely to send and receive blocks of data whose size might
be used to guess what you are doing. A powerful enough adversary might even note that your pattern of requests
corresponds with payloads exiting the anonymity network elsewhere to reach a particular destination.
The rest of this chapter will focus instead on the narrower question of what TLS can achieve and how your Python
code can effectively use it.
What Could Possibly Go Wrong?
To tour the essential features of TLS, you will consider the series of challenges that the protocol itself faces when
establishing a connection and learn how each hurdle is overcome.
Let’s presume you want to open a TCP conversation with a particular hostname and port number somewhere on
the Internet and that you have reluctantly accepted that your DNS lookup of the hostname will be public knowledge,
as will the port number to which you are connecting (which will reveal the protocol you are speaking, unless you
are connecting to a service whose owner has bound it to a nonstandard or misleading port number). You would go
ahead and make a standard TCP connection to the IP address and port. If the protocol you are speaking requires
an introduction before turning on encryption, those first few bytes would pass in the clear for everyone to see.
(Protocols vary in this detail—HTTPS does not send anything before turning on encryption, but SMTP exchanges
several lines of text. You will learn the behavior of several major protocols later in this chapter.)
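The two styles just described can be sketched with the Standard Library ssl module. This is only an outline under stated assumptions: the function names are invented for this example, and greeting_exchange is a hypothetical callback standing in for whatever cleartext dialogue a protocol like SMTP requires before encryption begins.

```python
import socket
import ssl


def make_client_context():
    """Build a client context that verifies certificates and hostnames."""
    return ssl.create_default_context()


def open_immediate_tls(host, port=443):
    """HTTPS style: start the TLS handshake as soon as TCP connects."""
    context = make_client_context()
    raw_sock = socket.create_connection((host, port))
    try:
        # No application bytes are sent in the clear before this call.
        return context.wrap_socket(raw_sock, server_hostname=host)
    except Exception:
        raw_sock.close()
        raise


def open_tls_after_greeting(host, port, greeting_exchange):
    """SMTP style: exchange some plain text first, then turn on TLS.

    greeting_exchange is a hypothetical callback that receives the plain
    socket and performs the protocol's cleartext introduction.
    """
    context = make_client_context()
    raw_sock = socket.create_connection((host, port))
    try:
        greeting_exchange(raw_sock)  # these bytes are visible to observers
        return context.wrap_socket(raw_sock, server_hostname=host)
    except Exception:
        raw_sock.close()
        raise
```

In both functions the hostname and port are already public by the time wrap_socket() runs. The only difference is how many application bytes cross the wire before encryption starts.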
Once you have the socket up and running and have exchanged whatever pleasantries your protocol dictates
to prepare the way for encryption, it is time for TLS to take over and begin to build strong guarantees about both
whom you are talking to and how you and the peer (the other party to the conversation) will protect data from
prying eyes.
The first demand of your TLS client will be that the remote server provide a binary document called a certificate,
which includes what cryptologists call a public key—an integer that can be used to encrypt data, such that only the
possessor of the corresponding private key integer can decrypt the information and understand it. If the remote server
is correctly configured and has never been compromised, then it will both possess a copy of the private key and be the
only server on the Internet (with the possible exception of the other machines in its cluster) that holds such a copy.
How can your TLS implementation verify that the remote server actually holds the private key? Simple! Your TLS
library sends some information across the wire that has been encrypted using the public key, and it demands that the
remote server provide a checksum demonstrating that the data was decrypted successfully with the private key.
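The challenge described above can be illustrated with a toy sketch. It uses textbook RSA with a deliberately tiny key pair (the primes 61 and 53), which offers no security at all, and the function names and the choice of SHA-256 for the checksum are assumptions made for this example. A real TLS handshake uses much larger keys, adds padding, and varies its exact steps with the negotiated cipher suite.

```python
import hashlib
import secrets

# Toy key pair: far too small for real use, chosen so the math is visible.
N = 61 * 53   # modulus (3233), part of both the public and private key
E = 17        # public exponent, published in the "certificate"
D = 2753      # private exponent, known only to the genuine server


def client_make_challenge():
    """Pick a random secret and encrypt it with the public key (E, N)."""
    secret = secrets.randbelow(N - 2) + 2   # an integer from 2 to N - 1
    ciphertext = pow(secret, E, N)
    return secret, ciphertext


def server_answer(ciphertext):
    """Decrypt with the private key and return a checksum of the result."""
    recovered = pow(ciphertext, D, N)
    return hashlib.sha256(str(recovered).encode('ascii')).hexdigest()


def client_verify(secret, answer):
    """Check that the server's checksum matches the secret that was sent."""
    expected = hashlib.sha256(str(secret).encode('ascii')).hexdigest()
    return secrets.compare_digest(expected, answer)


if __name__ == '__main__':
    secret, ciphertext = client_make_challenge()
    print(client_verify(secret, server_answer(ciphertext)))
```

Only a party that knows D can turn the ciphertext back into the secret, so a correct checksum is evidence that the peer holds the private key that matches the public key it presented.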
Your TLS stack must also turn its attention to the question of whether the remote certificate has been forged. After
all, anyone with access to the openssl command-line tool (or any of a number of other tools) can create a certificate
whose common name is cn=www.google.com or cn=pypi.python.org or anything else. Why would you trust such a
claim? The solution is for your TLS session to keep a list of certificate authorities (CAs) that it trusts to verify Internet