Example 3 – using lxml.cssselect to scrape content from a page

CSS selectors offer a broad range of query options, as described in the Introduction to XPath and CSS selector section, and are often used as an easier alternative to XPath. In the two preceding examples, we used XPath to collect the desired information. In this example, we will use cssselect from lxml to collect relevant data from a single page available at https://developer.ibm.com/announcements/category/data-science/?fa=date%3ADESC&fb=.

To identify a CSS query, we can browse through the page source or use DevTools. For more detail on using DevTools, refer to the XPath and CSS selectors using DevTools section. In this case, we identify and collect the CSS query using DevTools, as shown in the following screenshot:

Using DevTools and selecting selector from https://developer.ibm.com/announcements

From the preceding screenshot, we can see that each individual announcement is a block identified by a.ibm--card__block_link found inside div.ibm--card, which contains HTML elements with classes such as ibm--card__body and ibm--card__type. Copying the CSS selector using the described process results in the following selectors for a.ibm--card__block_link and div.ibm--card__body, respectively:

  • #content > div > div.code_main > div > div.cpt-content > div > div.bx--grid.no-pad.cpt--item__row > div:nth-child(1) > div:nth-child(1) > div > a
  • #content > div > div.code_main > div > div.cpt-content > div > div.bx--grid.no-pad.cpt--item__row > div:nth-child(1) > div:nth-child(1) > div > a > div.ibm--card__body

Let's implement the preceding concept using Python code, as shown in the following snippet:

from lxml import html
import requests
from lxml.cssselect import CSSSelector
url = 'https://developer.ibm.com/announcements/category/data-science/?fa=date%3ADESC&fb='
url_get = requests.get(url)
tree = html.document_fromstring(url_get.content)

The required Python libraries are imported, the URL is declared, and the page content from url_get is parsed with lxml.html. With the lxml.html.HtmlElement obtained, we can now select and navigate to the desired element in the tree with an XPath or CSS selector:

announcements=[]
articles = tree.cssselect('.ibm--card > a.ibm--card__block_link')

for article in articles:
    link = article.get('href')
    atype = article.cssselect('div.ibm--card__body > h5')[0].text.strip()
    adate = article.cssselect('div.ibm--card__body > h5 > .ibm--card__date')[0].text
    title = article.cssselect('div.ibm--card__body > h3.ibm--card__title')[0].text_content()
    excerpt = article.cssselect('div.ibm--card__body > p.ibm--card__excerpt')[0].text
    category = article.cssselect('div.ibm--card__bottom > p.cpt-byline__categories span')

    # only two available on block: except '+'
    # announcements.append([link,atype,adate,title,excerpt,[category[0].text,category[1].text]])

    announcements.append([link, atype, adate, title, excerpt, [span.text for span in category if span.text != '+']])

print(announcements)

articles holds the result of the main CSS query, and the loop visits each announcement found on the page as article. Each article has different elements for type, date, title, category, and so on. Element data and attributes are collected using text, text_content(), and get(). cssselect returns a Python list, so indexing, such as [0], is used to pick a particular element from the result.

category in the preceding code is not indexed, because it contains multiple <span> elements. Their values are extracted with a list comprehension while appending to announcements; the commented-out line shows the alternative of using indexing. The output obtained for the code is shown in the following screenshot. Minor cleaning of the data has been attempted, but the final list still contains the raw data as obtained:

Output from list announcements obtained using lxml.cssselect
Note that the CSS selector queries copied from DevTools differ in expression and length from the ones used in the example code. DevTools-provided queries spell out the full chain of parent elements leading to the single element that was chosen. In the code, we have used a shorter CSS query that names only the particular elements we want, so it matches every announcement on the page.