GREAT lxml example links
- Get element tree for scraping
- Syntax help
- Parsing html
- Parsing html – 2
- Parsing html – 3
- Select next following sibling
- Get all td’s
- Selecting attribute values from lxml
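Three of the topics above (next following sibling, all td's, attribute values) can be sketched with lxml and XPath. The markup below is made up for illustration and is not taken from any of the linked pages.

```python
import lxml.html

# Hypothetical markup, invented for this sketch.
doc = lxml.html.fromstring("""
<html><body>
<h2 id="prices">Prices</h2>
<table>
  <tr><td>apple</td><td>1.20</td></tr>
  <tr><td>pear</td><td>0.90</td></tr>
</table>
<a href="more.html" class="nav">more</a>
</body></html>
""")

# Next following sibling: the first element after the h2 at the same level.
next_sibling = doc.xpath('//h2[@id="prices"]/following-sibling::*[1]')[0]

# All td elements, in document order.
cells = [td.text_content() for td in doc.xpath('//td')]

# Attribute values: select them directly in XPath ...
hrefs = doc.xpath('//a/@href')
# ... or read them from the element.
link_class = doc.xpath('//a')[0].get('class')

print(next_sibling.tag)
print(cells)
print(hrefs, link_class)
```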
HTML Node vs Element
W3C DOM node types, and the child node types each may contain:
- Document — Element (maximum of one), ProcessingInstruction, Comment, DocumentType
- DocumentFragment — Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
- DocumentType — no children
- EntityReference — Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
- Element — Element, Text, Comment, ProcessingInstruction, CDATASection, EntityReference
- Attr — Text, EntityReference
- ProcessingInstruction — no children
- Comment — no children
- Text — no children
- CDATASection — no children
- Entity — Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
- Notation — no children
# My Scrape HTML File Example
Background
The URL http://joecodeswell.org/examples/dlwebfiles/index.html is the page we will scrape with lxml to find files to download; we then use urllib to download them. Here is what the page content looks like.
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
    <html>
    <head>
      <meta http-equiv="content-type" content="text/html; charset=windows-1252">
      <title>dlwebfiles</title>
    </head>
    <body>
      <h1>Index of Joe Codeswell examples - dlwebfiles</h1>
      <ul>
        <li><a href="http://joecodeswell.org/examples/dlwebfiles/aveverum.mid">aveverum.mid</a></li>
        <li><a href="http://joecodeswell.org/examples/dlwebfiles/carol.mid">carol.mid</a></li>
        <li><a href="http://joecodeswell.org/examples/dlwebfiles/steiner.mid">steiner.mid</a></li>
      </ul>
    </body>
    </html>
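As a sketch of how this page maps onto lxml, the snippet below parses a trimmed copy of the HTML from a string (so no network access is needed) and selects each link's href and text with XPath. The variable names are illustrative.

```python
import lxml.html

# Trimmed copy of the page content shown above.
page = """<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
<head><title>dlwebfiles</title></head>
<body>
<h1>Index of Joe Codeswell examples - dlwebfiles</h1>
<ul>
<li><a href="http://joecodeswell.org/examples/dlwebfiles/aveverum.mid">aveverum.mid</a></li>
<li><a href="http://joecodeswell.org/examples/dlwebfiles/carol.mid">carol.mid</a></li>
<li><a href="http://joecodeswell.org/examples/dlwebfiles/steiner.mid">steiner.mid</a></li>
</ul>
</body>
</html>"""

root = lxml.html.fromstring(page)

# One href per <li><a> in the list, in document order.
hrefs = root.xpath('/html/body/ul/li/a/@href')
# The visible link text of the same anchors.
names = [a.text_content() for a in root.xpath('/html/body/ul/li/a')]

for name, href in zip(names, hrefs):
    print("%s -> %s" % (name, href))
```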
Step-1
Create a folder structure on your local machine: “example_folder/mid”.
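The folders can be created by hand, or from Python. A minimal sketch, assuming Python 3 and that it is run from the directory that should contain example_folder:

```python
import os

# Create example_folder and its mid subfolder in one call.
# exist_ok=True makes the call safe to repeat.
target = os.path.join("example_folder", "mid")
os.makedirs(target, exist_ok=True)

print(os.path.isdir(target))
```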
Step-2
Put the following Python code into a file in example_folder, naming it “retrieveMidis.py”.
    # -*- coding: UTF-8 -*-
    # retrieveMidis.py
    # Written for Python 2 (print statement, urllib.URLopener).
    import os, lxml.html, urllib

    inScrapeUrl = 'http://joecodeswell.org/examples/dlwebfiles/index.html'
    outDataFolderPath = os.path.join('mid')

    # parse the html
    htmltree = lxml.html.parse(inScrapeUrl)

    # retrieve the midi files to the ./mid dir
    theLiList = htmltree.xpath('/html/body/ul/li')
    opener = urllib.URLopener()
    for li in theLiList:
        # see http://www.w3schools.com/xpath/xpath_syntax.asp
        theHref = li.xpath('a')[0].attrib.get('href')
        theBasename = os.path.basename(theHref)
        theExtension = os.path.splitext(theBasename)[1]
        if len(theBasename) != 0:
            print "theHref = %s"%(theHref)
            print "theBasename = %s"%(theBasename)
            print "len(theBasename) = %s"%(len(theBasename))
            print "theExtension = %s"%(theExtension)
            print "os.path.join(outDataFolderPath,theBasenme) = %s"%(os.path.join(outDataFolderPath,theBasename))
            print
            print
            opener.retrieve(theHref, os.path.join(outDataFolderPath,theBasename))
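The script above targets Python 2. A rough Python 3 equivalent follows, as a sketch rather than the author's code: it fetches the page with urllib.request, parses it with lxml.html, and saves each linked file. The function names are made up here, and the sketch assumes the page is still reachable at the URL above.

```python
import os
import urllib.request

import lxml.html

SCRAPE_URL = 'http://joecodeswell.org/examples/dlwebfiles/index.html'
OUT_FOLDER = 'mid'


def find_hrefs(tree):
    """Return the href of every link inside the page's list items."""
    return [str(h) for h in tree.xpath('/html/body/ul/li/a/@href')]


def local_path(href, folder=OUT_FOLDER):
    """Map a link URL to a path inside folder; None if it has no file name."""
    basename = os.path.basename(href)
    return os.path.join(folder, basename) if basename else None


def retrieve_midis(url=SCRAPE_URL, folder=OUT_FOLDER):
    """Download every file linked from url into folder."""
    os.makedirs(folder, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        tree = lxml.html.parse(response)
    for href in find_hrefs(tree):
        path = local_path(href, folder)
        if path is None:
            continue  # link without a file name, nothing to save
        print("retrieving %s -> %s" % (href, path))
        urllib.request.urlretrieve(href, path)
```

Calling `retrieve_midis()` from inside example_folder performs the download; nothing touches the network until that call is made.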
Step-3
Run retrieveMidis.py.
Here’s what the output looks like on Win XP.
    >retrieveMidis.py
    theHref = http://joecodeswell.org/examples/dlwebfiles/aveverum.mid
    theBasename = aveverum.mid
    len(theBasename) = 12
    theExtension = .mid
    os.path.join(outDataFolderPath,theBasenme) = mid\aveverum.mid


    theHref = http://joecodeswell.org/examples/dlwebfiles/carol.mid
    theBasename = carol.mid
    len(theBasename) = 9
    theExtension = .mid
    os.path.join(outDataFolderPath,theBasenme) = mid\carol.mid


    theHref = http://joecodeswell.org/examples/dlwebfiles/steiner.mid
    theBasename = steiner.mid
    len(theBasename) = 11
    theExtension = .mid
    os.path.join(outDataFolderPath,theBasenme) = mid\steiner.mid
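The last line of each entry is the path the file is saved to. os.path.join places the platform's separator between the folder and the file name, so on Windows the result is `mid\aveverum.mid`. A small sketch using the standard library's ntpath and posixpath modules shows both forms from any machine:

```python
import ntpath
import posixpath

# ntpath and posixpath are the Windows and POSIX flavours of os.path;
# both can be imported on any platform.
windows_path = ntpath.join('mid', 'aveverum.mid')
posix_path = posixpath.join('mid', 'aveverum.mid')

print(windows_path)  # mid\aveverum.mid
print(posix_path)    # mid/aveverum.mid
```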