Joe 7:18 pm on March 10, 2015
Tags: HTML ( 2 ), lxml ( 5 ), screen-scraping ( 3 )

lxml.html basic parse methods for [url, file, string]

lxml.html examples of parsing from

URLs
Files
Strings

URLs

import lxml.html
htmltree = lxml.html.parse('http://joecodeswell.com')

htmltree.xpath("//title")[0].text

'''
OUTPUT:
'JoeCodeswell.com'
'''

Files

N.B. Save ‘http://joecodeswell.com’ as a file named ‘JoeCodeswell.com.htm’.
Make sure to cd to the dir containing the file before running the following.

import lxml.html
htmltree = lxml.html.parse('JoeCodeswell.com.htm')

htmltree.xpath("//title")[0].text

'''
OUTPUT:
'JoeCodeswell.com'
'''

Strings

N.B. Save ‘http://joecodeswell.com’ as a file named ‘JoeCodeswell.com.htm’.
Make sure to cd to the dir containing the file before running the following.

import lxml.html

f = open('JoeCodeswell.com.htm', 'r'); the_string = f.read(); f.close()
htmltree = lxml.html.fromstring(the_string)

htmltree.xpath("//title")[0].text

'''
OUTPUT:
'JoeCodeswell.com'
'''

#html, #lxml, #screen-scraping

s: search
c: compose new post
r: reply
e: edit
t: go to top
j: go to the next post or comment
k: go to the previous post or comment
o: toggle comment visibility
esc: cancel edit post or comment

Joe Codeswell – Notes to Myself and Others

Joe Dorocak's Programing Blog (We have a new look – Automatic's P2 Theme)

Monthly Archives: March 2015