Example 2 – reading HTML documents using lxml.html

In this example, we will use the lxml.html module to traverse the elements of the page at http://httpbin.org/forms/post:

from lxml import html
from urllib.request import urlopen

# Fetch and parse the page once, then take the root element from the tree
tree = html.parse(urlopen('http://httpbin.org/forms/post'))
root = tree.getroot()

print(type(root))  # <class 'lxml.html.HtmlElement'>
print(type(tree))  # <class 'lxml.etree._ElementTree'>

We use parse() from lxml.html to load the content of the given URL. It behaves like parse() from lxml.etree, except that the root it yields is an HTML element. The getroot() method returns the document root. As the preceding code shows, root is an lxml.html.HtmlElement, whereas tree is an lxml.etree._ElementTree. We are interested in root, the HtmlElement, for this example. The content parsed as root is shown in the following screenshot:

Page source: http://httpbin.org/forms/post
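To compare the tree and root objects without network access, here is a minimal sketch that parses a small in-memory page. The markup in page is a hypothetical stand-in modeled on the form, not the real response from httpbin.org:

```python
from io import BytesIO
from lxml import html

# Hypothetical stand-in for the downloaded page (assumption, not the real response)
page = (b'<html><head><title>Order</title></head><body>'
        b'<form method="post" action="/post">'
        b'<p><label>Customer name: <input name="custname"></label></p>'
        b'</form></body></html>')

tree = html.parse(BytesIO(page))  # parse() also accepts a file-like object
root = tree.getroot()             # the document root element

print(type(tree).__name__)  # _ElementTree
print(type(root).__name__)  # HtmlElement
print(root.tag)             # html
```

The tree wraps the whole document, while root is the <html> element that the rest of this example works with.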

The HtmlElement root exposes various attributes and methods, as listed here:

print(dir(root)) 

[...'addnext', 'addprevious', 'append', 'attrib', 'base', 'base_url', 'body', 'clear', 'cssselect', 'drop_tag', 'drop_tree', 'extend', 'find', 'find_class', 'find_rel_links', 'findall', 'findtext', 'forms', 'get', 'get_element_by_id', 'getchildren', 'getiterator', 'getnext', 'getparent', 'getprevious', 'getroottree', 'head', 'index', 'insert', 'items', 'iter', 'iterancestors', 'iterchildren', 'iterdescendants', 'iterfind', 'iterlinks', 'itersiblings', 'itertext', 'keys', 'label', 'make_links_absolute', 'makeelement', 'nsmap', 'prefix', 'remove', 'replace', 'resolve_base_href', 'rewrite_links', 'set', 'sourceline', 'tag', 'tail', 'text', 'text_content', 'values', 'xpath']

Let's find the first <p> element in root. find() locates the first element that matches a path, and text_content() returns the text of that element together with the text of its descendants. findtext() returns the text of the first matching element directly, as shown here:

p = root.find('.//p') #find first <p> from root

print(p.text_content()) # Customer name:
print(root.findtext('.//p/label')) #Customer name:
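The difference between an element's own text and text_content() can be seen in the following sketch. It assumes a hypothetical snippet modeled on the form rather than the live page:

```python
from lxml import html

# Hypothetical markup modeled on the form (assumption)
snippet = ('<html><body><form>'
           '<p><label>Customer name: <input name="custname"></label></p>'
           '<p><label>Telephone: <input type="tel" name="custtel"></label></p>'
           '</form></body></html>')
root = html.fromstring(snippet)

p = root.find('.//p')               # first <p> only
print(p.text)                       # None: <p> has no text of its own before <label>
print(p.text_content())             # text of <p> and all of its descendants
print(root.findtext('.//p/label'))  # .text of the first matching <label>

# When nothing matches, find() returns None and findtext() returns its default
print(root.find('.//table'))                     # None
print(root.findtext('.//table', default='n/a'))  # n/a
```

Checking for None before calling text_content() avoids an AttributeError on pages where the element is missing.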

The findall() method returns every element in root that matches the path, which we can then iterate over:

elemP = root.findall('.//p')  # find all <p> elements in root
for p in elemP:
    print(p.text_content())

The preceding code prints the text of every <p> element, as seen here:

Customer name: 
Telephone:
E-mail address:
Small
Medium
Large
Bacon
Extra Cheese
Onion
Mushroom
Preferred delivery time:
Delivery instructions:
Submit order
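Because findall() returns a plain list, the texts can also be collected and cleaned in one step. This is a minimal sketch on a hypothetical snippet, with strip() added to remove surrounding whitespace:

```python
from lxml import html

# Hypothetical markup modeled on the form (assumption)
snippet = ('<html><body><form>'
           '<p><label>Customer name: <input name="custname"></label></p>'
           '<p><label>Telephone: <input type="tel" name="custtel"></label></p>'
           '<p><button>Submit order</button></p>'
           '</form></body></html>')
root = html.fromstring(snippet)

# Collect the stripped text of every <p> into a list
texts = [p.text_content().strip() for p in root.findall('.//p')]
print(texts)  # ['Customer name:', 'Telephone:', 'Submit order']

# findall() returns an empty list when nothing matches
print(root.findall('.//table'))  # []
```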

The HtmlElement root also supports XPath and CSS selectors:

print(root.xpath('//p/label/input/@value'))
print(root.xpath('//legend/text()'))

This will result in the output seen here:

['small', 'medium', 'large', 'bacon', 'cheese', 'onion', 'mushroom']
['Pizza Size', 'Pizza Toppings']
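The same two queries can be tried on a small in-memory document. The sketch below assumes a hypothetical snippet with two fieldsets, and also passes a value into the expression as an XPath variable (field is an illustrative name):

```python
from lxml import html

# Hypothetical markup modeled on the form (assumption)
snippet = ('<html><body><form>'
           '<fieldset><legend>Pizza Size</legend>'
           '<p><label><input type="radio" name="size" value="small"> Small</label></p>'
           '<p><label><input type="radio" name="size" value="medium"> Medium</label></p>'
           '</fieldset>'
           '<fieldset><legend>Pizza Toppings</legend>'
           '<p><label><input type="checkbox" name="topping" value="bacon"> Bacon</label></p>'
           '</fieldset>'
           '</form></body></html>')
root = html.fromstring(snippet)

values = root.xpath('//p/label/input/@value')  # attribute values
legends = root.xpath('//legend/text()')        # text nodes
# Keyword arguments become XPath variables, referenced as $field
sizes = root.xpath('//input[@name=$field]/@value', field='size')

print(values)   # ['small', 'medium', 'bacon']
print(legends)  # ['Pizza Size', 'Pizza Toppings']
print(sizes)    # ['small', 'medium']
```

Passing values as variables avoids building the expression through string concatenation.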

The cssselect() method translates a CSS selector into an XPath expression and evaluates it against the element it is called on:

# print text_content() for each <label> inside a <p>
for e in root.cssselect('p label'):
    print(e.text_content())

Customer name:
Telephone:
E-mail address:
Small
......
Mushroom
Preferred delivery time:
Delivery instructions:

# print text_content() for each <p> that is a direct child of <form>
for e in root.cssselect('form > p'):
    print(e.text_content())

Customer name:
Telephone:
E-mail address:
Preferred delivery time:
Delivery instructions:
Submit order
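The two selectors differ in that 'p label' matches descendants at any depth, while 'form > p' matches direct children only. Since cssselect() relies on the separate cssselect package, the sketch below expresses the same two selections as hand-written XPath on a hypothetical snippet. These expressions select the same elements here, but they are not necessarily the exact XPath that cssselect generates:

```python
from lxml import html

# Hypothetical markup modeled on the form (assumption)
snippet = ('<html><body><form>'
           '<p><label>Customer name: <input name="custname"></label></p>'
           '<fieldset><legend>Pizza Size</legend>'
           '<p><label><input type="radio" name="size" value="small"> Small</label></p>'
           '</fieldset>'
           '<p><button>Submit order</button></p>'
           '</form></body></html>')
root = html.fromstring(snippet)

# 'p label': any <label> that is a descendant of a <p>
labels = root.xpath('//p//label')
# 'form > p': only <p> elements that are direct children of <form>
direct = root.xpath('//form/p')

print([e.text_content().strip() for e in labels])  # ['Customer name:', 'Small']
print([e.text_content().strip() for e in direct])  # ['Customer name:', 'Submit order']
```

The <p> nested inside <fieldset> is skipped by the second query, which is why the pizza options are absent from the 'form > p' output.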

The following code demonstrates the HTML <form> element being explored for its attributes and properties. We are targeting the <form> element first, which is found in root, that is, <form method="post" action="/post">:

print(root.forms[0].action)  #http://httpbin.org/post
print(root.forms[0].keys()) #['method', 'action']
print(root.forms[0].items()) #[('method', 'post'), ('action', '/post')]
print(root.forms[0].method) # POST

The outputs of the preceding code are shown as inline comments:

  • action returns the URL from the action attribute, resolved against the document's base URL. This URL is the link that will process the information submitted or the options chosen.
  • items() returns a list of tuples, each containing an attribute's key and value.
  • keys() returns the list of the element's attribute keys.
  • method returns the value of the method attribute in uppercase, that is, the HTTP method used to submit the form. For more information on HTTP methods, please refer to Chapter 1, Web Scraping Fundamentals, the Understanding web development and technologies section.
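The difference between the raw attribute values and the action and method properties can be checked offline. This minimal sketch assumes a hypothetical snippet and supplies the page address through base_url, so that action can be resolved without fetching anything:

```python
from lxml import html

# Hypothetical markup modeled on the form (assumption)
snippet = ('<html><body>'
           '<form method="post" action="/post">'
           '<p><label>Customer name: <input name="custname"></label></p>'
           '</form></body></html>')
# base_url tells lxml where the document came from, so relative links can be resolved
root = html.fromstring(snippet, base_url='http://httpbin.org/forms/post')

form = root.forms[0]
print(form.get('action'))  # /post  (raw attribute value)
print(form.action)         # http://httpbin.org/post  (joined with base_url)
print(form.get('method'))  # post  (raw attribute value)
print(form.method)         # POST  (uppercased)
print(form.keys())         # ['method', 'action']
```

Without a base URL, action simply returns the attribute value as written in the markup.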