Scrape HTML Files with lxml

GREAT lxml example links

HTML Node vs Element

W3C DOM node types, each followed by the node types it may have as children (an Element is only one kind of Node):

  • Document — Element (maximum of one), ProcessingInstruction, Comment, DocumentType
  • DocumentFragment — Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
  • DocumentType — no children
  • EntityReference — Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
  • Element — Element, Text, Comment, ProcessingInstruction, CDATASection, EntityReference
  • Attr — Text, EntityReference
  • ProcessingInstruction — no children
  • Comment — no children
  • Text — no children
  • CDATASection — no children
  • Entity — Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
  • Notation — no children

My Scrape HTML File Example
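As a brief aside before the example, the node list above maps onto lxml only loosely. The sketch below is written for Python 3 and uses a made-up fragment (an assumption for illustration, not the page scraped in this post). It suggests that lxml exposes Element and Comment nodes as objects, while Text nodes show up as the .text and .tail strings of neighbouring elements, or as plain strings from the XPath node() test.

```python
import lxml.html

# Hypothetical fragment, for illustration only.
fragment = '<p>Hello <b>bold</b> tail<!-- a note --></p>'
p = lxml.html.fragment_fromstring(fragment)

# '*' selects child Elements only.
elements = p.xpath('*')

# 'node()' selects every child node: text, elements and comments.
nodes = p.xpath('node()')

# lxml has no separate Text node class; text hangs off elements as strings.
leading_text = p.text              # text before the first child element
trailing_text = elements[0].tail   # text that follows </b>

print(len(elements), len(nodes))
print(repr(leading_text), repr(trailing_text))
```

This is why the scraping code later in this post can reach the links with element-only paths such as 'a' without worrying about the whitespace text between tags.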

Background

We will scrape http://joecodeswell.org/examples/dlwebfiles/index.html with lxml to find the files to download, then download them with urllib. Here is the content of that URL.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
  <head>
    <meta http-equiv="content-type" content="text/html; charset=windows-1252">
    <title>dlwebfiles</title>
  </head>
  <body>
    <h1>Index of Joe Codeswell examples - dlwebfiles</h1>
    <ul>
      <li><a href="http://joecodeswell.org/examples/dlwebfiles/aveverum.mid">aveverum.mid</a></li>
      <li><a href="http://joecodeswell.org/examples/dlwebfiles/carol.mid">carol.mid</a></li>
      <li><a href="http://joecodeswell.org/examples/dlwebfiles/steiner.mid">steiner.mid</a></li>
    </ul>
  </body>
</html>
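The markup above can be explored without touching the network. The following is a minimal sketch, assuming Python 3 and a trimmed copy of the page held in a string (the head section is left out); it uses an XPath attribute selection to pull out the three href values.

```python
import lxml.html

# Trimmed copy of the index page shown above (head section omitted).
page = """<html>
  <body>
    <h1>Index of Joe Codeswell examples - dlwebfiles</h1>
    <ul>
      <li><a href="http://joecodeswell.org/examples/dlwebfiles/aveverum.mid">aveverum.mid</a></li>
      <li><a href="http://joecodeswell.org/examples/dlwebfiles/carol.mid">carol.mid</a></li>
      <li><a href="http://joecodeswell.org/examples/dlwebfiles/steiner.mid">steiner.mid</a></li>
    </ul>
  </body>
</html>"""

root = lxml.html.fromstring(page)

# Selecting an attribute in XPath returns the href values as strings.
hrefs = root.xpath('/html/body/ul/li/a/@href')
for href in hrefs:
    print(href)
```

Trying a path against a local string like this is a quick way to check an XPath expression before pointing it at the live URL.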

Step-1

Create the folder structure “example_folder/mid” on your local machine.
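If you would rather create the folders from Python than by hand, something like the sketch below should do. It assumes Python 3.2 or later (for exist_ok), and the helper name make_example_folders is made up for this illustration.

```python
import os

def make_example_folders(root='.'):
    """Create root/example_folder/mid and return its path.

    make_example_folders is a hypothetical helper, not part of the
    original example.
    """
    target = os.path.join(root, 'example_folder', 'mid')
    # exist_ok=True makes the call safe to repeat.
    os.makedirs(target, exist_ok=True)
    return target
```

Calling make_example_folders() with no argument creates the structure under the current directory.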

Step-2

Put the following Python code into a file in example_folder, naming it “retrieveMidis.py”.

# -*- coding: UTF-8 -*-
# retrieveMidis.py (Python 3)

import os
import urllib.request

import lxml.html

inScrapeUrl = 'http://joecodeswell.org/examples/dlwebfiles/index.html'
outDataFolderPath = 'mid'  # relative to example_folder, created in Step-1

# fetch the page with urllib and let lxml parse the open response
htmltree = lxml.html.parse(urllib.request.urlopen(inScrapeUrl))

# retrieve the midi files to the ./mid dir
# each <li> under /html/body/ul holds one <a href="..."> link
theLiList = htmltree.xpath('/html/body/ul/li')
for li in theLiList:
    # see http://www.w3schools.com/xpath/xpath_syntax.asp
    theHref = li.xpath('a')[0].attrib.get('href')
    theBasename = os.path.basename(theHref)
    theExtension = os.path.splitext(theBasename)[1]
    # skip links that end in '/' and so have no file name
    if len(theBasename) != 0:
        print("theHref = %s" % theHref)
        print("theBasename = %s" % theBasename)
        print("len(theBasename) = %s" % len(theBasename))
        print("theExtension = %s" % theExtension)
        print("os.path.join(outDataFolderPath,theBasenme) = %s" % os.path.join(outDataFolderPath, theBasename))
        print()
        print()
        # download the file into the mid folder
        urllib.request.urlretrieve(theHref, os.path.join(outDataFolderPath, theBasename))
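The script takes the file name straight from the href with os.path.basename, which works for the plain links on this page. For links that carry a query string, a slightly more defensive variant might look like the sketch below. It assumes Python 3, and local_path_for is a hypothetical helper name, not something from the original script.

```python
import os
import posixpath
from urllib.parse import urlsplit

def local_path_for(href, folder='mid'):
    """Return folder/<file name from href>, or None if the URL has no file name.

    local_path_for is a made-up helper for illustration.
    """
    # Use only the path part of the URL, so '?x=1' or '#top' is ignored.
    url_path = urlsplit(href).path
    # URL paths always use '/', so split them with posixpath.
    name = posixpath.basename(url_path)
    if not name:
        return None
    return os.path.join(folder, name)

print(local_path_for('http://joecodeswell.org/examples/dlwebfiles/carol.mid'))
```

The returned path could then be passed as the second argument of the download call in place of the inline os.path.join.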

Step-3

Run retrieveMidis.py.

Here’s what the output looks like on Win XP.

>retrieveMidis.py
theHref = http://joecodeswell.org/examples/dlwebfiles/aveverum.mid
theBasename = aveverum.mid
len(theBasename) = 12
theExtension = .mid
os.path.join(outDataFolderPath,theBasenme) = mid\aveverum.mid


theHref = http://joecodeswell.org/examples/dlwebfiles/carol.mid
theBasename = carol.mid
len(theBasename) = 9
theExtension = .mid
os.path.join(outDataFolderPath,theBasenme) = mid\carol.mid


theHref = http://joecodeswell.org/examples/dlwebfiles/steiner.mid
theBasename = steiner.mid
len(theBasename) = 11
theExtension = .mid
os.path.join(outDataFolderPath,theBasenme) = mid\steiner.mid
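The last line of each block comes from os.path.join, whose separator depends on the platform. As a small sketch (Python 3, runnable on any system), the standard ntpath and posixpath modules show what the same join yields under Windows rules and under POSIX rules.

```python
import ntpath
import posixpath

# os.path is ntpath on Windows and posixpath on Linux or macOS.
windows_path = ntpath.join('mid', 'aveverum.mid')
posix_path = posixpath.join('mid', 'aveverum.mid')

print(windows_path)
print(posix_path)
```

On Windows the joined path uses a backslash between mid and the file name; on Linux or macOS the same script would print a forward slash instead.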

