Scrape HTML Files with lxml

GREAT lxml example links

HTML Node vs Element

W3C DOM node types, and the child node types each may contain:

  • Document — Element (maximum of one), ProcessingInstruction, Comment, DocumentType
  • DocumentFragment — Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
  • DocumentType — no children
  • EntityReference — Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
  • Element — Element, Text, Comment, ProcessingInstruction, CDATASection, EntityReference
  • Attr — Text, EntityReference
  • ProcessingInstruction — no children
  • Comment — no children
  • Text — no children
  • CDATASection — no children
  • Entity — Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
  • Notation — no children

My Scrape HTML File Example

Background

We will scrape the page at http://joecodeswell.org/examples/dlwebfiles/index.html using lxml to find files to download, then use urllib to download them. Here is what the page content looks like.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
  <head>
    <meta http-equiv="content-type" content="text/html; charset=windows-1252">
    <title>dlwebfiles</title>
  </head>
  <body>
    <h1>Index of Joe Codeswell examples - dlwebfiles</h1>
    <ul>
      <li><a href="http://joecodeswell.org/examples/dlwebfiles/aveverum.mid">aveverum.mid</a></li>
      <li><a href="http://joecodeswell.org/examples/dlwebfiles/carol.mid">carol.mid</a></li>
      <li><a href="http://joecodeswell.org/examples/dlwebfiles/steiner.mid">steiner.mid</a></li>
    </ul>
  </body>
</html>
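Before bringing in lxml, it can help to confirm offline which links a page like this contains. The following is only a sketch using the standard library's html.parser (not lxml, which Step-2 uses). The names LinkCollector, collect_hrefs and SAMPLE are mine, and SAMPLE is a trimmed copy of the page above.

```python
from html.parser import HTMLParser

# Trimmed copy of the index page shown above (assumption: same structure).
SAMPLE = """<html><body><ul>
<li><a href="http://joecodeswell.org/examples/dlwebfiles/aveverum.mid">aveverum.mid</a></li>
<li><a href="http://joecodeswell.org/examples/dlwebfiles/carol.mid">carol.mid</a></li>
</ul></body></html>"""


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that sits inside an <li>."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0   # how many <li> elements we are currently inside
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "a" and self.li_depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1


def collect_hrefs(html_text):
    """Return the list of hrefs found inside list items, in page order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


if __name__ == "__main__":
    for href in collect_hrefs(SAMPLE):
        print(href)
```

This only mirrors the "links inside list items" idea. The XPath approach in Step-2 is more precise about where in the tree the list must sit.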

Step-1

Create this folder structure on your local machine: "example_folder/mid".

Step-2

Put the following Python code into a file in example_folder, naming it "retrieveMidis.py".

# -*- coding: UTF-8 -*-
# retrieveMidis.py

import os, lxml.html, urllib

inScrapeUrl = 'http://joecodeswell.org/examples/dlwebfiles/index.html'
outDataFolderPath = os.path.join('mid')

# parse the html
htmltree = lxml.html.parse(inScrapeUrl)

# retrieve the midi files to the ./mid dir
theLiList = htmltree.xpath('/html/body/ul/li')  
opener = urllib.URLopener()
for li in theLiList:
    # see http://www.w3schools.com/xpath/xpath_syntax.asp
    theHref = li.xpath('a')[0].attrib.get('href')
    theBasename = os.path.basename(theHref)
    theExtension = os.path.splitext(theBasename)[1]
    if len(theBasename) != 0:
        print "theHref = %s"%(theHref)
        print "theBasename = %s"%(theBasename)
        print "len(theBasename) = %s"%(len(theBasename))
        print "theExtension = %s"%(theExtension)
        print "os.path.join(outDataFolderPath,theBasenme) = %s"%(os.path.join(outDataFolderPath,theBasename))
        print
        print
        opener.retrieve(theHref, os.path.join(outDataFolderPath,theBasename))

Step-3

Run retrieveMidis.py.

Here’s what the output looks like on Win XP.

>retrieveMidis.py
theHref = http://joecodeswell.org/examples/dlwebfiles/aveverum.mid
theBasename = aveverum.mid
len(theBasename) = 12
theExtension = .mid
os.path.join(outDataFolderPath,theBasenme) = mid\aveverum.mid


theHref = http://joecodeswell.org/examples/dlwebfiles/carol.mid
theBasename = carol.mid
len(theBasename) = 9
theExtension = .mid
os.path.join(outDataFolderPath,theBasenme) = mid\carol.mid


theHref = http://joecodeswell.org/examples/dlwebfiles/steiner.mid
theBasename = steiner.mid
len(theBasename) = 11
theExtension = .mid
os.path.join(outDataFolderPath,theBasenme) = mid\steiner.mid
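The script above is Python 2 (print statements and urllib.URLopener). On Python 3, the step that turns each href into a local file path and downloads it could be sketched roughly as below. The helper names local_path_for and download_all are mine, not part of the original script, and download_all touches the network, so only the path logic is exercised in the demo.

```python
import os
import posixpath
import urllib.request
from urllib.parse import urlsplit


def local_path_for(href, out_folder="mid"):
    """Return the local path a linked file would be saved to, or None."""
    # Take the last segment of the URL path, ignoring any query string.
    basename = posixpath.basename(urlsplit(href).path)
    if not basename:
        return None  # e.g. a link to a directory such as ".../dlwebfiles/"
    return os.path.join(out_folder, basename)


def download_all(hrefs, out_folder="mid"):
    """Download each href into out_folder (performs network access)."""
    os.makedirs(out_folder, exist_ok=True)
    for href in hrefs:
        target = local_path_for(href, out_folder)
        if target is not None:
            urllib.request.urlretrieve(href, target)


if __name__ == "__main__":
    href = "http://joecodeswell.org/examples/dlwebfiles/carol.mid"
    print(local_path_for(href))
```

Splitting the URL first means a query string would not end up in the file name, which plain os.path.basename on the whole href would not guarantee.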


KVLang Series – 6

Class Rules

Content:
– .kv file
– .py file
– screenshot of output

0006_classRules.kv

Class Rules allow you to make your own widget from previously defined widgets.
A Class Rule defines how any instance of that widget class will appear.
See:
Rule context
Kivy Language
Referencing Widgets
Accessing Widgets defined inside Kv lang in your python code

<LblTxtBtn>:
    orientation: 'horizontal'

    Label:
        text: root.l_text
    TextInput:
        text: root.t_text
    Button:
        text: root.b_text


BoxLayout:
    orientation: 'vertical'    

    LblTxtBtn:
        l_text: 'init_lbl-1'
        t_text: 'init_text-1'
        b_text: 'init_btn-1'

    LblTxtBtn:
        l_text: 'init_lbl-2'
        t_text: 'init_text-2'
        b_text: 'init_btn-2'

    LblTxtBtn:
        l_text: 'init_lbl-3'
        t_text: 'init_text-3'
        b_text: 'init_btn-3'

0006_classRules.py

This is adapted from the previous step. For Class Rules, as opposed to the upcoming Dynamic Classes, we need MAJOR ADDITIONS.
We need to ADD objects to allow us to use Class Rules and their properties in our .kv file.
Note how we need to look back and forth between the .kv file and the .py file to see what’s going on with the properties.
This will NOT BE NECESSARY when we do the upcoming Dynamic Classes. Then the info will be contained in the .kv file.

  1. EDIT

    CHANGE: The filename passed to self.root = Builder.load_file() FROM the previous step's .kv file TO '0006_classRules.kv'

  2. ADD – imports for LblTxtBtn classRule

    AFTER: from kivy.config import Config

    INSERT:

    from kivy.uix.boxlayout import BoxLayout

    from kivy.properties import StringProperty

  3. ADD – Class Code for LblTxtBtn classRule

    AFTER: Config.set('graphics', 'height', '90')

    INSERT:

    class LblTxtBtn(BoxLayout):
        l_text = StringProperty('')
        t_text = StringProperty('')
        b_text = StringProperty('')
    
''' 0006_classRules.py
Adds class LblTxtBtn(BoxLayout), etc. to display 0006_classRules.kv
'''

import kivy
kivy.require('1.8.0') # replace with your current kivy version !

from kivy.app import App
from kivy.lang import Builder
from kivy.config import Config

from kivy.uix.boxlayout import BoxLayout
from kivy.properties import StringProperty 


Config.set('graphics', 'width',  '323')
Config.set('graphics', 'height', '90')

class LblTxtBtn(BoxLayout):

    l_text = StringProperty('')
    t_text = StringProperty('')
    b_text = StringProperty('')

class MyApp(App):

    def build(self):
        self.root = Builder.load_file('0006_classRules.kv')
        return self.root

if __name__ == '__main__':
    MyApp().run()

0006_classRules ScreenShot

Here is what this looks like when run on Windows XP. In pixels, as set with Config.set in the .py file, the window is:
– width: 323
– height: 90

[Screenshot: 0006_classRules.png]

KVLang Series – 5

Nested Layouts

Content:
– .kv file
– .py file
– screenshot of output

0005_nestedLayouts.kv

In this example we nest 3 horizontally oriented BoxLayouts inside a single vertical BoxLayout, which we use as the root rule.

BoxLayout:
    orientation: 'vertical'
    BoxLayout:
        orientation: 'horizontal'
        Label:
            text: 'Input1:'
        TextInput:
            text: 'Default Text1'
        Button:
            text: 'Press Me1'
    BoxLayout:
        orientation: 'horizontal'
        Label:
            text: 'Input2:'
        TextInput:
            text: 'Default Text2'
        Button:
            text: 'Press Me2'
    BoxLayout:
        orientation: 'horizontal'
        Label:
            text: 'Input3:'
        TextInput:
            text: 'Default Text3'
        Button:
            text: 'Press Me3'

0005_nestedLayouts.py

Just one minor change to update the .kv filename. By the way, if you read How to load KV, you can see an alternate way to associate the .py file with the .kv file.

  1. EDIT

    CHANGE: The filename passed to self.root = Builder.load_file() FROM the previous step's .kv file TO '0005_nestedLayouts.kv'

''' 0005_nestedLayouts.py
Python file to display 0005_nestedLayouts.kv. 
'''

import kivy
kivy.require('1.8.0') # replace with your current kivy version !

from kivy.app import App
from kivy.lang import Builder
from kivy.config import Config

Config.set('graphics', 'width',  '323')
Config.set('graphics', 'height', '90')

class MyApp(App):

    def build(self):
        self.root = Builder.load_file('0005_nestedLayouts.kv')
        return self.root

if __name__ == '__main__':
    MyApp().run()
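About the alternate way mentioned above (How to load KV): as I understand it, it is a naming convention. Kivy looks for a .kv file named after your App subclass, lowercased and with a trailing "App" removed. The helper below is a plain-Python illustration of that convention only. The name default_kv_filename is hypothetical and is not part of Kivy.

```python
def default_kv_filename(app_class_name):
    """Hypothetical helper: the .kv name Kivy's convention would look for."""
    name = app_class_name.lower()
    # A trailing "App" is dropped, so MyApp maps to my.kv.
    if name.endswith("app"):
        name = name[:-3]
    return name + ".kv"


if __name__ == "__main__":
    for cls_name in ("MyApp", "NestedLayoutsApp", "Demo"):
        print(cls_name, "->", default_kv_filename(cls_name))
```

Since a Python class name cannot start with a digit, no App subclass could map to a name like 0005_nestedLayouts.kv, which is presumably why this series keeps calling Builder.load_file() explicitly.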

0005_nestedLayouts ScreenShot

Here is what this looks like when run on Windows XP. In pixels, as set with Config.set in the .py file, the window is:
– width: 323
– height: 90

[Screenshot: 0005_nestedLayouts.png]

KVLang Series – 4

Vertical BoxLayout

Content:
– .kv file
– .py file
– screenshot of output

0004_verticalBoxLayout.kv

Just add the orientation: 'vertical' property to the BoxLayout. This arranges its contents vertically instead of the default 'horizontal'.

BoxLayout:
    orientation: 'vertical'
    Label:
        text: 'Input:'
    TextInput:
        text: 'Default Text'
    Button:
        text: 'Press Me'

0004_verticalBoxLayout.py

This is adapted from the previous Python file. Again we make minor changes to update the .kv filename and give a new window height.

  1. EDIT

    CHANGE: The filename passed to self.root = Builder.load_file() FROM the previous step's .kv file TO '0004_verticalBoxLayout.kv'

  2. EDIT

    WAS: Config.set('graphics', 'height', '30')

    NOW: Config.set('graphics', 'height', '90')

''' 0004_verticalBoxLayout.py
Used to display 0004_verticalBoxLayout.kv - Again, no new concepts in here.
'''
import kivy
kivy.require('1.8.0') # replace with your current kivy version !

from kivy.app import App
from kivy.lang import Builder
from kivy.config import Config

Config.set('graphics', 'width',  '323')
Config.set('graphics', 'height', '90')

class MyApp(App):

    def build(self):
        self.root = Builder.load_file('0004_verticalBoxLayout.kv')
        return self.root

if __name__ == '__main__':
    MyApp().run()

0004_verticalBoxLayout ScreenShot

Here is what this looks like when run on Windows XP. In pixels, as set with Config.set in the .py file, the window is:
– width: 323
– height: 90

[Screenshot: 0004_verticalBoxLayout.png]

KVLang Series – 3

More Widgets in a BoxLayout

Content:
– .kv file
– .py file
– screenshot of output

0003_moreWidgets.kv

In the .kv file we add some more widgets under a BoxLayout.
If we added the widgets directly to the file, rather than underneath our layout, it would be an ERROR because they would all be considered root rules and “you can have one root rule, and any number of class or template rules.”
The BoxLayout has a default orientation: 'horizontal'.

### 0003_moreWidgets.kv

BoxLayout:
    Label:
        text: 'Input:'
    TextInput:
        text: 'Default Text'
    Button:
        text: 'Press Me'
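To make the one-root-rule point concrete, here is a toy check in plain Python. It is NOT Kivy's parser, just a rough sketch that counts unindented widget lines in a kv string. The names count_root_rules, GOOD_KV and BAD_KV are mine, and the check assumes class rules look like <Name>: and templates like [Name@Base]:.

```python
GOOD_KV = """BoxLayout:
    Label:
        text: 'Input:'
    Button:
        text: 'Press Me'
"""

# Same widgets without the enclosing layout: each becomes its own root rule.
BAD_KV = """Label:
    text: 'Input:'
Button:
    text: 'Press Me'
"""


def count_root_rules(kv_text):
    """Count top-level widget rules in a kv string (toy check only)."""
    count = 0
    for line in kv_text.splitlines():
        stripped = line.strip()
        # Skip blank lines, comments/directives, and indented (nested) lines.
        if not stripped or stripped.startswith("#") or line[0] in " \t":
            continue
        # Class rules and templates are allowed in any number.
        if stripped.startswith("<") or stripped.startswith("["):
            continue
        if stripped.endswith(":"):
            count += 1
    return count


if __name__ == "__main__":
    print("good:", count_root_rules(GOOD_KV))
    print("bad:", count_root_rules(BAD_KV))
```

A count above 1 corresponds to the ERROR case described above. Kivy itself reports this when it loads the file, so a check like this is only useful for building intuition.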

0003_moreWidgets.py

This is adapted from the previous Python file. We make minor changes to update the .kv filename and give a new window height.

  1. EDIT

    CHANGE: The filename passed to self.root = Builder.load_file() FROM the previous step's .kv file TO '0003_moreWidgets.kv'

  2. EDIT

    WAS: Config.set('graphics', 'height', '200')

    NOW: Config.set('graphics', 'height', '30')

''' 0003_moreWidgets.py
Used to display 0003_moreWidgets.kv - No new concepts in here.
'''
import kivy
kivy.require('1.8.0') # replace with your current kivy version !

from kivy.app import App
from kivy.lang import Builder
from kivy.config import Config

Config.set('graphics', 'width',  '323')
Config.set('graphics', 'height', '30')

class MyApp(App):

    def build(self):
        self.root = Builder.load_file('0003_moreWidgets.kv')
        return self.root

if __name__ == '__main__':
    MyApp().run()

0003_moreWidgets ScreenShot

Here is what this looks like when run on Windows XP. In pixels, as set with Config.set in the .py file, the window is:
– width: 323
– height: 30

[Screenshot: 0003_moreWidgets.png]

KVLang Series – 2

Window Size – Hello World Label Widget

Content:
– .kv file
– .py file
– screenshot of output

0002_windowSize.kv

This is PRETTY MUCH the same as before. Unfortunately (from my perspective), we must alter the .py file to change the window size. If anyone knows a way to do this in a .kv file please post it to the comments. Thanks.

Label:
    text: 'A smaller hello, world.'

0002_windowSize.py

As I said earlier, we must alter the .py file to change the window size.

This is adapted from 0001_helloWorld.py. Here are the changes:

  1. EDIT – Builder.load_file()

    WAS: self.root = Builder.load_file('0001_helloWorld.kv')

    NOW: self.root = Builder.load_file('0002_windowSize.kv')

  2. ADD – Window Sizing Info

    AFTER: from kivy.lang import Builder

    BEFORE: class MyApp(App):

    INSERT:

    from kivy.config import Config

    Config.set('graphics', 'width', '323')

    Config.set('graphics', 'height', '200')

''' 0002_windowSize.py
Demonstrates Window Sizing - kivy.config - Config.set('graphics',
'''
import kivy
kivy.require('1.8.0') # replace with your current kivy version !

from kivy.app import App
from kivy.lang import Builder
from kivy.config import Config

Config.set('graphics', 'width',  '323')
Config.set('graphics', 'height', '200')

class MyApp(App):

    def build(self):
        self.root = Builder.load_file('0002_windowSize.kv')
        return self.root

if __name__ == '__main__':
    MyApp().run()

0002_windowSize ScreenShot

Here is what this looks like when run on Windows XP. In pixels, the window is:
– width: 323
– height: 200

[Screenshot: 0002_windowSize.png]