XPath

The XML Path (XPath) language is one of the XML-based technologies (alongside XML, XSLT, and XQuery). It is used to navigate through DOM elements and to locate nodes in XML (or HTML) documents using expressions, known as XPath expressions. An XPath expression is normally a path that identifies one or more nodes in a document. XPath is also a World Wide Web Consortium (W3C) recommendation (https://www.w3.org/TR/xpath/all/).

XPath expressions are classified as either absolute or relative:

  • The absolute path is an expression that represents the complete path from the root element to the desired element. For an HTML document, it begins with /html and looks like /html/body/div[1]/div/div[1]/div/div[1]/div[2]/div[2]/div/span/b[1]. Individual elements are identified by their position, which is represented by an index number.
  • The relative path is an expression that starts from a selected element, rather than from the root, and leads to the desired element. Relative paths are shorter and more readable than absolute paths and look like //*[@id="answer"]/div/span/b[@class="text"]. A relative path is often preferred over an absolute path because element indexes, attributes, logical expressions, and so on can be combined in a single expression.
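
To make the difference concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module, which supports only a subset of XPath. The page fragment, its id values, and its text are hypothetical and exist only for illustration. ElementTree evaluates paths relative to an element, so the absolute path is written from the root element with a leading dot.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html>
  <body>
    <div id="question"><span><b class="text">What is XPath?</b></span></div>
    <div id="answer"><span><b class="text">A path language for XML.</b></span></div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Full path from the root, using positional indexes (they start at 1).
# The absolute XPath /html/body/div[2]/span/b[1] is written from the
# root element as ./body/div[2]/span/b[1] in ElementTree.
by_full_path = root.find("./body/div[2]/span/b[1]")

# Relative path anchored on an attribute value instead of on positions.
by_relative_path = root.find(".//*[@id='answer']/span/b[@class='text']")

print(by_full_path.text)
print(by_relative_path.text)
print(by_full_path is by_relative_path)  # both paths reach the same element
```

If another div were added before the answer block, the positional path would have to change, while the attribute-based path would probably keep working, which is one reason relative paths are often preferred.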

With XPath expressions, we can navigate hierarchically through elements and reach the targeted one. XPath is also implemented by various programming languages, such as JavaScript, Java, PHP, Python, and C++. Web applications and browsers also have built-in support for XPath.

Expressions can be built using a number of built-in functions available for various data types. Arithmetic operators (+, -, *, div, and mod), comparison operators (<, >, =, !=, >=, <=), and Boolean operators (and, or) can also be used to build expressions. Note that division is written as div, because / is the path separator. XPath is also a core building block for XML technologies such as XQuery and eXtensible Stylesheet Language Transformations (XSLT).

XML Query (XQuery) is a query language that uses XPath expressions to extract data from XML documents.
XSLT is used to transform XML into other formats, for example, to render it in a more readable form.

Let's explore a few XPath expressions using the XML content of the food.xml file, shown here:

XML content

In the following example, we will be using XPath-Tester from Code Beautify (https://codebeautify.org/Xpath-Tester). Use the XML source URL provided earlier to fetch the XML content and use it with the Code Beautify XPath-Tester.

You can use https://codebeautify.org/Xpath-Tester, https://www.freeformatter.com/xpath-tester.htm, or any other XPath tester tool that is freely available on the web.

Everything is a node in an XML document, for example, menus, food, and price. An XML node can be an element itself (elements are types or entities that have start and end tags).

The preceding XML document can also be read as nested element blocks. The parent node, menus, contains multiple food child nodes, each of which holds child elements with the appropriate values and data types. The XPath expression //food, as shown in the following screenshot, returns the selected food nodes. Selecting a node also retrieves the child nodes within it, as seen in the following screenshot:

Result for XPath //food (using https://codebeautify.org/Xpath-Tester)

The XPath expression in the following screenshot selects the price child node found inside every food node. There are six food nodes available, each of them containing price, name, description, feedback, and rating:

Result for XPath //food/price (using https://codebeautify.org/Xpath-Tester)
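
The same two selections can be sketched in Python with the standard xml.etree.ElementTree module. The document below is a stand-in with the same shape as the one described above (menus containing food nodes), but its values are illustrative assumptions and not the contents of food.xml. ElementTree expects the leading dot, so //food is written as .//food.

```python
import xml.etree.ElementTree as ET

# Stand-in document: same shape as described in the text, but the values
# are illustrative assumptions, not the contents of food.xml.
MENU = """
<menus>
  <food>
    <name>Sample Dish A</name>
    <price>$1.00</price>
    <description>First sample dish</description>
    <rating>4.0</rating>
    <feedback>5</feedback>
  </food>
  <food>
    <name>Sample Dish B</name>
    <price>$2.00</price>
    <description>Second sample dish</description>
    <rating>4.5</rating>
    <feedback>9</feedback>
  </food>
</menus>
"""

root = ET.fromstring(MENU)

# //food: every food element; each result still carries its child nodes.
foods = root.findall(".//food")
child_tags = [child.tag for child in foods[0]]

# //food/price: the price child of every food element.
prices = [price.text for price in root.findall(".//food/price")]

print(len(foods))   # 2
print(child_tags)   # ['name', 'price', 'description', 'rating', 'feedback']
print(prices)       # ['$1.00', '$2.00']
```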

As we can see from the two preceding XPath expressions, they are written much like the filesystem paths that we use on the command line or in a Terminal on various operating systems. XPath expressions can contain code patterns, functions, and conditional statements, and they support the use of predicates.

Predicates are used to identify a specific node or element. Predicate expressions are written using square brackets that are similar to Python lists or array expressions.
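
As a minimal sketch of predicates in practice, the following uses Python's standard xml.etree.ElementTree module, which implements positional predicates, last(), and child-text equality. The menu content here is a hypothetical stand-in, not the food.xml data.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in data; only the structure mirrors the text.
MENU = """
<menus>
  <food><name>Sample Dish A</name><price>$1.00</price></food>
  <food><name>Sample Dish B</name><price>$2.00</price></food>
  <food><name>Sample Dish C</name><price>$3.00</price></food>
</menus>
"""

root = ET.fromstring(MENU)

# [1] picks the first food child of its parent (indexes start at 1).
first_name = root.find(".//food[1]/name").text

# [last()] picks the last food child of its parent.
last_name = root.find(".//food[last()]/name").text

# [name='...'] keeps only food elements whose name child has that text.
price_of_b = root.find(".//food[name='Sample Dish B']/price").text

print(first_name)   # Sample Dish A
print(last_name)    # Sample Dish C
print(price_of_b)   # $2.00
```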

A brief explanation of the XPath expression given in the preceding XML is listed in the following table:

XPath expression

Description

//

Selects nodes in the document, no matter where they are located

//*

Selects all elements in the document

//food

Selects all food elements

*

Selects all child elements of the current node

//food/name | //food/price

Selects the name and price elements found in the food node:

<name>Butter Milk with Vanilla</name>
<name>Fish and Chips</name>
<price>$5.50</price>
<price>$2.99</price>

//food/name

Selects all the name elements inside food:

<name>Butter Milk with Vanilla</name>
<name>Eggs and Bacon</name>
<name>Orange Juice</name>

//food/name/text()

Selects the text only for all food/name elements:

Butter Milk with Vanilla Orange Juice

//food/name | //rating

Selects all name elements from food and all rating elements found in the document:

<name>Butter Milk with Vanilla</name>
<name>Fish and Chips</name>
<rating>4.5</rating>
<rating>4.9</rating>

//food[1]/name

Selects the name element for the first food node:

<name>Butter Milk with Vanilla</name>

//food[feedback<9]

Selects the food nodes, together with all of their child elements, for which the predicate condition feedback<9 is true:

<food>
<name>Butter Milk with Vanilla</name>
<name>Egg Roll</name>
<name>Eggs and Bacon</name>
</food>

//food[feedback<9]/name

Selects the name element of each food node that matches the condition:

<name>Butter Milk with Vanilla</name>
<name>Egg Roll</name>
<name>Eggs and Bacon</name>

//food[last()]/name

Selects the name element from the last food node:

<name>Orange Juice</name>

//food[last()]/name/text()

Selects text for the name element from the last food node:

Orange Juice

sum(//food/feedback)

Provides the sum of feedback found in all food nodes:

47.0

//food[rating>3 and rating<5]/name

Selects the name of food that fulfills the predicate condition:

<name>Egg Roll</name>
<name>Eggs and Bacon</name>
<name>Orange Juice</name>

//food/name[contains(.,"Juice")]

Selects the name of food that contains the Juice string:

<name>Orange Juice</name>

//food/description[starts-with(.,"Fresh")]/text()

Selects the text of the description nodes that start with Fresh:

Fresh egg rolls filled with ground chicken, ... cabbage
Fresh Orange juice served

//food/description[starts-with(.,"Fresh")]

Selects the description nodes that start with Fresh:

<description>Fresh egg rolls filled with.. cabbage</description>
<description>Fresh Orange juice served</description>

//food[position()<3]

Selects the first and second food nodes according to their position:

<food>
<name>Butter Milk with Vanilla</name>
<price>$3.99</price>
...
<rating>5.0</rating>
<feedback>10</feedback>
</food>

XPath predicates can contain a numeric index that starts from 1 (not 0) and conditional statements, for example, //food[1] or //food[last()]/price.
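
Python's standard xml.etree.ElementTree module does not implement comparison operators or functions such as sum() and contains(), so the sketch below emulates a few of the table's expressions by selecting nodes with a simple path and filtering in plain Python. This is a substitute technique, not XPath evaluation itself; a full XPath 1.0 engine, such as the xpath() method in lxml, should accept the original expressions directly. The menu data is a hypothetical stand-in, so the results differ from those in the table.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in data; values are assumptions for illustration.
MENU = """
<menus>
  <food><name>Sample Juice</name><rating>4.5</rating><feedback>6</feedback></food>
  <food><name>Sample Roll</name><rating>5.0</rating><feedback>10</feedback></food>
  <food><name>Sample Toast</name><rating>3.5</rating><feedback>8</feedback></food>
</menus>
"""

root = ET.fromstring(MENU)
foods = root.findall(".//food")

# Emulates //food[feedback<9]/name
low_feedback = [f.findtext("name") for f in foods
                if float(f.findtext("feedback")) < 9]

# Emulates sum(//food/feedback)
feedback_total = sum(float(fb.text) for fb in root.findall(".//food/feedback"))

# Emulates //food[rating>3 and rating<5]/name
mid_rating = [f.findtext("name") for f in foods
              if 3 < float(f.findtext("rating")) < 5]

# Emulates //food/name[contains(.,"Juice")]
juices = [n.text for n in root.findall(".//food/name") if "Juice" in n.text]

# Emulates //food[position()<3]; slicing matches here because every
# food element shares the same parent.
first_two = [f.findtext("name") for f in foods[:2]]

print(low_feedback)    # ['Sample Juice', 'Sample Toast']
print(feedback_total)  # 24.0
print(mid_rating)      # ['Sample Juice', 'Sample Toast']
print(juices)          # ['Sample Juice']
print(first_two)       # ['Sample Juice', 'Sample Roll']
```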

Now that we have tested the preceding XML with various XPath expressions, let's consider a simple XML document with some attributes. Attributes are extra properties that describe certain parameters of a given node or element. Each element can carry its own set of attributes. Attributes found in XML nodes or HTML elements help to identify a unique element by the values they contain. As we can see in the following XML, attributes are written as key=value pairs, for example, id="1491946008":

<?xml version="1.0" encoding="UTF-8"?>
<books>
<book id="1491946008" price='47.49'>
<author>Luciano Ramalho</author>
<title>
Fluent Python: Clear, Concise, and Effective Programming
</title>
</book>
<book id="1491939362" price='29.83'>
<author>Allen B. Downey</author>
<title>
Think Python: How to Think Like a Computer Scientist
</title>
</book>
</books>

XPath expressions select attributes by adding the @ character in front of the attribute name. The following table lists a few examples of XPath expressions that use attributes, each with a brief description.

XPath expression

Description

//book/@price

Selects the price attribute of each book:

price="47.49"
price="29.83"

//book

Selects the book elements and their child elements:

<book id="1491946008" price="47.49">
<author>Luciano Ramalho</author>
<title>Fluent Python: Clear, Concise, and Effective Programming</title>
</book>
<book id="1491939362" price="29.83">
<author>Allen B. Downey</author>
<title>Think Python: How to Think Like a Computer Scientist</title>
</book>

//book[@price>30]

Selects the book elements whose price attribute is greater than 30:

<book id="1491946008" price="47.49">
<author>Luciano Ramalho</author>
<title>Fluent Python: Clear, Concise, and Effective Programming</title>
</book>

//book[@price<30]/title

Selects title from books where the price attribute is less than 30:

<title>Think Python: How to Think Like a Computer Scientist</title>

//book/@id

Selects the id attribute and its value. The //@id expression also results in the same output:

id="1491946008"
id="1491939362"

//book[@id=1491939362]/author

Selects author from book where id=1491939362:

<author>Allen B. Downey</author>
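
A few of these attribute expressions can be sketched with Python's standard xml.etree.ElementTree module, using the books XML shown earlier (the XML declaration is left out of the string for simplicity). ElementTree supports [@attr='value'] but not numeric comparisons on attributes, so the price filter is done in plain Python here; that part is a substitute for the XPath predicate, not an evaluation of it.

```python
import xml.etree.ElementTree as ET

# The books document from the text, without the XML declaration.
BOOKS = """
<books>
<book id="1491946008" price='47.49'>
<author>Luciano Ramalho</author>
<title>
Fluent Python: Clear, Concise, and Effective Programming
</title>
</book>
<book id="1491939362" price='29.83'>
<author>Allen B. Downey</author>
<title>
Think Python: How to Think Like a Computer Scientist
</title>
</book>
</books>
"""

root = ET.fromstring(BOOKS)
books = root.findall(".//book")

# Equivalent of //book/@price: read the attribute from each book element.
prices = [book.get("price") for book in books]

# //book[@id=1491939362]/author (ElementTree needs the value quoted).
author = root.find(".//book[@id='1491939362']/author").text

# Emulates //book[@price<30]/title with a numeric comparison in Python.
cheap_titles = [book.findtext("title").strip() for book in books
                if float(book.get("price")) < 30]

# Equivalent of //@id: any element in the document that has an id attribute.
ids = [el.get("id") for el in root.iter() if el.get("id") is not None]

print(prices)        # ['47.49', '29.83']
print(author)        # Allen B. Downey
print(cheap_titles)  # ['Think Python: How to Think Like a Computer Scientist']
print(ids)           # ['1491946008', '1491939362']
```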

 

We have explored a few basic features of XPath and written expressions to retrieve the desired content. In the Scraping using lxml - a Python library section, we will use Python libraries to explore XPath further by writing code that scrapes the provided documents (XML or HTML), and we will learn to generate XPath expressions using browser tools. For more information on XPath, please refer to the links in the Further reading section.
