lxml.html basic parse methods for [url, file, string]

lxml.html examples of parsing from

  • URLs
  • Files
  • Strings

URLs

import lxml.html
htmltree = lxml.html.parse('http://joecodeswell.com')

htmltree.xpath("//title")[0].text

'''
OUTPUT:
'JoeCodeswell.com'
'''

Files

N.B. Save ‘http://joecodeswell.com’ as a file named ‘JoeCodeswell.com.htm’.
Make sure to cd to the dir containing the file before running the following.

import lxml.html
htmltree = lxml.html.parse('JoeCodeswell.com.htm')

htmltree.xpath("//title")[0].text

'''
OUTPUT:
'JoeCodeswell.com'
'''

Strings

N.B. Save ‘http://joecodeswell.com’ as a file named ‘JoeCodeswell.com.htm’.
Make sure to cd to the dir containing the file before running the following.

import lxml.html

f = open('JoeCodeswell.com.htm', 'r'); the_string = f.read(); f.close()
htmltree = lxml.html.fromstring(the_string)

htmltree.xpath("//title")[0].text

'''
OUTPUT:
'JoeCodeswell.com'
'''

#html, #lxml, #screen-scraping