Learning Python Network Programming

(Sean Pound) #1
Chapter 3

The lxml library's ElementTree implementation has been designed to be largely
compatible with the standard library's ElementTree API, so we can start exploring the
document in the same way as we did with XML:





>>> [e.tag for e in root]
['head', 'body']
>>> root.find('head').find('title').text
'Debian -- Debian \u201cjessie\u201d Release Information'
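The same exploration can be sketched as a self-contained script. This is only an illustration under stated assumptions: `SAMPLE` is a hypothetical, hand-written stand-in for the downloaded page, and the standard library's ElementTree is used in place of lxml, since iteration, `tag`, `find()` and `text` behave the same way in both for a well-formed document like this one.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the downloaded page (not the real Debian HTML).
# It is well-formed, so the standard library parser accepts it.
SAMPLE = """<html>
  <head><title>Debian -- Debian \u201cjessie\u201d Release Information</title></head>
  <body><p>placeholder</p></body>
</html>"""

root = ET.fromstring(SAMPLE)

# Iterating over an element yields its direct children.
child_tags = [e.tag for e in root]
print(child_tags)  # ['head', 'body']

# Chained find() calls walk down one level at a time.
title_text = root.find('head').find('title').text
print(title_text)

# findtext() with a path reaches the same text in a single call.
assert root.findtext('head/title') == title_text
```

Real-world HTML is often not well-formed, which is why the chapter parses the page with lxml; the stand-in here only serves to show the element calls themselves.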


In the preceding code, we have printed out the text content of the document's &lt;title&gt;
element, which is the text that appears in the tab in the preceding screenshot. We can already see that it contains the codename that we want.
<h3>Zeroing in</h3>
<p>Screen scraping is the art of finding a way to unambiguously address the elements in the HTML that contain the information that we want, and extracting the information from only those elements.</p>
<p>However, we also want the selection criteria to be as simple as possible. The less we rely on the contents of the document, the lower the chance that our selection breaks if the page's HTML changes.</p>
<p>Let's inspect the HTML source of the page and see what we're dealing with. For this, either use View Source in a web browser, or save the HTML to a file and open it in a text editor. The page's source code is also included in the source code download for this book. Search for the text Debian 8.0, so that we are taken straight to the information we want. For me, it looks like the following block of code:</p>
<pre><code>&lt;body&gt;
...
&lt;div id="content"&gt;
&lt;h1&gt;Debian “jessie” Release Information&lt;/h1&gt;
&lt;p&gt;Debian 8.0 was
released October 18th, 2014.
The release included many major
changes, described in
...</code></pre>
<p>I've skipped the HTML between the &lt;body&gt; and the &lt;div&gt; to show that the &lt;div&gt; is a direct child of the &lt;body&gt; element.
From the above, we can see that we want the contents of the &lt;p&gt; tag child of the &lt;div&gt; element.</p> </div> </div> </div> </div> </div> </body> </html>