Scraping using lxml, a Python library

lxml is an XML toolkit with a rich set of libraries for processing XML and HTML documents. It is often preferred over other XML libraries in Python for its high speed and efficient memory management, and it provides features for handling both small and large XML files. For more detailed information on lxml and its library support, please visit https://lxml.de/.

lxml provides native support for XPath and XSLT and is built on two powerful C libraries: libxml2 and libxslt. Its libraries are normally used with XML or HTML for XPath queries, parsing, validating, serializing, and transforming documents, and they extend the features of ElementTree (http://effbot.org/zone/element-index.htm#documentation). Parsing, ElementTree traversal, XPath, and CSS selector-like features make lxml well suited to a task such as web scraping. lxml can also be used as the parser engine in Beautiful Soup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and pandas (https://pandas.pydata.org/).
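As a rough illustration of how parsing and XPath fit together in a scraping task, the following minimal sketch parses an HTML string and extracts link text and URLs. The markup in PAGE, including the books id and the title class, is a made-up stand-in for a downloaded page and is not taken from any real site.

```python
from lxml import html

# Hypothetical HTML standing in for a page that has already been downloaded.
PAGE = """
<html><body>
  <div id="books">
    <a class="title" href="/book/1">Learning XML</a>
    <a class="title" href="/book/2">Web Scraping Basics</a>
  </div>
</body></html>
"""

# Parse the markup into an element tree.
tree = html.fromstring(PAGE)

# XPath expressions select the text and the href attribute of each matching link.
titles = tree.xpath('//a[@class="title"]/text()')
links = tree.xpath('//a[@class="title"]/@href')

for title, link in zip(titles, links):
    print(title, link)
```

In a real scraper, the string would typically come from an HTTP response rather than a constant, and the XPath expressions would be adapted to the target page's structure.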

Elements of a markup language such as XML or HTML have start and close tags; tags can also have attributes and contain other elements. ElementTree is a wrapper that loads XML files as trees of elements. The Python built-in library, ElementTree (etree), is used to search, parse elements, and build a document tree. Element objects also behave in ways familiar from Python lists and dictionaries: child elements can be indexed and iterated like list items, and attributes can be read like dictionary entries.
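The list-like and dictionary-like behavior of elements can be sketched as follows. This example uses lxml.etree, which follows the ElementTree API; the catalog document and its tag and attribute names are invented for illustration.

```python
from lxml import etree

# Hypothetical XML document used only for illustration.
root = etree.fromstring(
    '<catalog>'
    '<book id="b1" lang="en"><title>Learning XML</title><price>39.95</price></book>'
    '</catalog>'
)

# List-like access: children can be indexed, counted, and iterated.
book = root[0]
child_tags = [child.tag for child in book]
print(book.tag, len(book), child_tags)

# Dictionary-like access: attributes are read by name.
book_id = book.get("id")
attributes = dict(book.attrib)
print(book_id, attributes)

# Searching: find() returns the first matching child element.
title_text = book.find("title").text
print(title_text)
```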

XSLT is a language for transforming an XML document into HTML, XHTML, text, and so on. It is template-based: a stylesheet defines templates that match parts of the source XML document and describe the new document to produce, and it uses XPath to navigate the source document.
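A minimal sketch of such a transformation with lxml is shown here, assuming a small made-up books document and a stylesheet written for this example. The stylesheet's template matches the root books element with an XPath pattern and emits an HTML list.

```python
from lxml import etree

# Hypothetical source document.
xml_doc = etree.fromstring(
    "<books>"
    "<book><title>Learning XML</title></book>"
    "<book><title>Web Scraping Basics</title></book>"
    "</books>"
)

# Illustrative stylesheet: turns each book title into an HTML list item.
xslt_doc = etree.fromstring(
    '<xsl:stylesheet version="1.0" '
    'xmlns:xsl="http://www.w3.org/1999/XSL/Transform">'
    '<xsl:output method="html"/>'
    '<xsl:template match="/books">'
    "<ul>"
    '<xsl:for-each select="book">'
    '<li><xsl:value-of select="title"/></li>'
    "</xsl:for-each>"
    "</ul>"
    "</xsl:template>"
    "</xsl:stylesheet>"
)

# Compile the stylesheet into a callable, then apply it to the source tree.
transform = etree.XSLT(xslt_doc)
result = transform(xml_doc)

# The result is a new tree; collect the generated list items and print the output.
items = [li.text for li in result.getroot()]
print(str(result))
```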

The lxml library contains important modules, as listed here:
