XPath

The XML Path (XPath) language is part of the family of XML-based technologies (XML, XSLT, and XQuery). It is used to navigate through DOM elements or to locate nodes in XML (or HTML) documents using expressions known as XPath expressions. An XPath expression is normally a path that identifies one or more nodes in a document. XPath is also a World Wide Web Consortium (W3C) recommendation (https://www.w3.org/TR/xpath/all/).

XPath expressions are commonly described as either absolute or relative:

The absolute path is an expression that represents the complete path from the root element to the desired element. In an HTML document, it begins with /html and looks like /html/body/div[1]/div/div[1]/div/div[1]/div[2]/div[2]/div/span/b[1]. Individual elements are identified by their position, which is represented by an index number.
The relative path is an expression that starts from a selected element, found anywhere in the document, and leads to the desired element. Relative paths are shorter and more readable than absolute paths and look like //*[@id="answer"]/div/span/b[@class="text"]. A relative path is often preferred over an absolute path because element indexes, attributes, logical expressions, and so on can be combined in a single expression.
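The difference between the two styles can be sketched with Python's standard-library xml.etree.ElementTree module, which supports a limited subset of XPath. The markup below is hypothetical and invented for this illustration. ElementTree evaluates a path relative to the element it is called on, so the absolute path is written starting from the root html element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup invented for this sketch.
PAGE = """
<html>
  <body>
    <div><p>Header</p></div>
    <div id="answer">
      <div>
        <span>
          <b class="label">Answer:</b>
          <b class="text">42</b>
        </span>
      </div>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Absolute style: every step from the root, identified by position.
# /html/body/div[2]/div/span/b[2] is written from <html> as:
absolute = root.find("./body/div[2]/div/span/b[2]")

# Relative style: jump to an element by attribute, then walk down.
# Corresponds to //*[@id="answer"]/div/span/b[@class="text"]
relative = root.find(".//*[@id='answer']/div/span/b[@class='text']")

print(absolute.text)         # 42
print(relative.text)         # 42
print(absolute is relative)  # True: both paths reach the same element
```

Both paths reach the same node, but the positional path breaks as soon as another div is added before the target, while the attribute-based path keeps working.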

With XPath expressions, we can navigate hierarchically through elements and reach the targeted one. XPath implementations are available for various programming languages, such as JavaScript, Java, PHP, Python, and C++. Web applications and browsers also have built-in support for XPath.

Expressions can be built using a number of built-in functions available for various data types. Arithmetic operators (+, -, *, div, and mod), comparison operators (<, >, =, !=, >=, <=), and logical operators (and, or) can also be used to build expressions. Note that / is the path separator in XPath, so division is written as div. XPath is also a core building block for XML technologies such as XQuery and eXtensible Stylesheet Language Transformations (XSLT).

XML Query (XQuery) is a query language that uses XPath expressions to extract data from XML documents.
XSLT is used to transform XML documents into other formats, such as HTML or plain text, often to present them in a more readable form.
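To make the role of XPath operators and functions more concrete, here is a minimal sketch using Python's standard-library xml.etree.ElementTree. This module does not evaluate XPath functions or comparison operators, so the sketch selects nodes with simple paths and reproduces the logic in plain Python. Each comment names the XPath expression being imitated. The menu data is hypothetical and is not the content of food.xml. A full XPath 1.0 engine should be able to evaluate these expressions directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data invented for this sketch.
MENU = """
<menus>
  <food><name>Egg Roll</name><feedback>8</feedback><rating>4.5</rating></food>
  <food><name>Orange Juice</name><feedback>10</feedback><rating>4.9</rating></food>
  <food><name>Plain Toast</name><feedback>5</feedback><rating>2.5</rating></food>
</menus>
"""

root = ET.fromstring(MENU)
foods = root.findall("./food")

# Imitates: sum(//food/feedback)
total_feedback = sum(float(f.findtext("feedback")) for f in foods)

# Imitates: //food[rating>3 and rating<5]/name
well_rated = [
    f.findtext("name")
    for f in foods
    if 3 < float(f.findtext("rating")) < 5
]

# Imitates: //food/name[contains(.,"Juice")]
juices = [n.text for n in root.findall("./food/name") if "Juice" in n.text]

print(total_feedback)  # 23.0
print(well_rated)      # ['Egg Roll', 'Orange Juice']
print(juices)          # ['Orange Juice']
```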

Let's explore a few XPath expressions using the XML content of the food.xml file, shown in the following:

XML content

In the following example, we will be using XPath-Tester from Code Beautify (https://codebeautify.org/Xpath-Tester). Use the XML source URL provided earlier to fetch the XML content and use it with the Code Beautify XPath-Tester.

You can use https://codebeautify.org/Xpath-Tester, https://www.freeformatter.com/xpath-tester.htm, or any other XPath tester tool that is freely available on the web.

Everything in an XML document is a node, for example, menus, food, and price. A node can be an element (a named entity that has start and end tags), as well as an attribute or a piece of text.

The preceding XML document can also be read as nested element blocks. The parent node, menus, contains multiple food child nodes, each of which holds child elements with the appropriate values and data types. The XPath expression //food, as shown in the following screenshot, returns the selected food nodes. Selecting a node also retrieves the child nodes it contains, as seen in the following screenshot:

Result for XPath //food (using https://codebeautify.org/Xpath-Tester)

The XPath expression in the following screenshot selects the price child node found inside every food parent node. There are six food nodes available, each of them containing price, name, description, feedback, and rating:

Result for XPath //food/price (using https://codebeautify.org/Xpath-Tester)
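The same two selections can be reproduced in code. The following is a minimal sketch using Python's standard-library xml.etree.ElementTree. The two-item menu is a hypothetical stand-in with the same menus/food/price nesting, not the actual content of food.xml. In ElementTree, a leading // is written as .// because paths are evaluated relative to an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the menu document; values are invented.
MENU = """
<menus>
  <food>
    <name>Egg Roll</name>
    <price>$2.99</price>
  </food>
  <food>
    <name>Orange Juice</name>
    <price>$3.49</price>
  </food>
</menus>
"""

root = ET.fromstring(MENU)

# Corresponds to //food: each selected node carries its children with it.
foods = root.findall(".//food")
for food in foods:
    print(food.tag, [child.tag for child in food])

# Corresponds to //food/price: only the price children of food nodes.
prices = [p.text for p in root.findall(".//food/price")]
print(prices)  # ['$2.99', '$3.49']
```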

As the two preceding expressions show, XPath expressions are written much like the filesystem paths we use on the command line or in a Terminal on various operating systems. XPath expressions can also contain patterns, functions, and conditional statements, and they support the use of predicates.

Predicates are used to identify a specific node or element. Predicate expressions are written inside square brackets, similar to index expressions on Python lists or arrays.

A brief explanation of XPath expressions applied to the preceding XML is listed in the following table:

XPath expression	Description
`//`	Selects nodes in the document, no matter where they are located
`//*`	Selects all elements in the document
`//food`	Selects all `food` elements in the document
`*`	Selects all child elements of the current node
`//food/name | //food/price`	Selects the `name` and `price` elements found in the `food` node: <name>Butter Milk with Vanilla</name> <name>Fish and Chips</name> <price>$5.50</price> <price>$2.99</price>
`//food/name`	Selects all the `name` elements inside `food`: <name>Butter Milk with Vanilla</name> <name>Eggs and Bacon</name> <name>Orange Juice</name>
`//food/name/text()`	Selects the `text` only for all `food/name` elements: Butter Milk with Vanilla Orange Juice
`//food/name | //rating`	Selects all `name` elements from `food` and all `rating` elements found in the document: <name>Butter Milk with Vanilla</name> <name>Fish and Chips</name> <rating>4.5</rating> <rating>4.9</rating>
`//food[1]/name`	Selects the `name` element for the first `food` node: <name>Butter Milk with Vanilla</name>
`//food[feedback<9]`	Selects the `food` nodes, along with all of their elements, where the predicate condition, `feedback<9`, is true: <food> <name>Butter Milk with Vanilla</name> <name>Egg Roll</name> <name>Eggs and Bacon</name> </food>
`//food[feedback<9]/name`	Selects the `name` element of each `food` node that matches the condition: <name>Butter Milk with Vanilla</name> <name>Egg Roll</name> <name>Eggs and Bacon</name>
`//food[last()]/name`	Selects the `name` element from the last `food` node: <name>Orange Juice</name>
`//food[last()]/name/text()`	Selects `text` for the `name` element from the last `food` node: Orange Juice
`sum(//food/feedback)`	Provides the sum of the `feedback` values found in all `food` nodes: 47.0
`//food[rating>3 and rating<5]/name`	Selects the `name` of `food` that fulfills the predicate condition: <name>Egg Roll</name> <name>Eggs and Bacon</name> <name>Orange Juice</name>
`//food/name[contains(.,"Juice")]`	Selects the `name` of `food` that contains the `Juice` string: <name>Orange Juice</name>
`//food/description[starts-with(.,"Fresh")]/text()`	Selects the `text` from each `description` node that starts with `Fresh`: Fresh egg rolls filled with ground chicken, ... cabbage Fresh Orange juice served
`//food/description[starts-with(.,"Fresh")]`	Selects the `description` nodes that start with `Fresh`: <description>Fresh egg rolls filled with.. cabbage</description> <description>Fresh Orange juice served</description>
`//food[position()<3]`	Selects the first and second `food` nodes according to their position: <food> <name>Butter Milk with Vanilla</name> <price>$3.99</price> ... <rating>5.0</rating> <feedback>10</feedback> </food>

XPath predicates can contain a numeric index that starts from 1 (not 0) and conditional statements, for example, //food[1] or //food[last()]/price.
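The 1-based indexing of predicates is easy to confuse with Python's 0-based lists. The following minimal sketch uses Python's standard-library xml.etree.ElementTree, which supports positional predicates, last(), and simple child-value predicates. The menu data is hypothetical and invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data invented for this sketch.
MENU = """
<menus>
  <food><name>Egg Roll</name><price>$2.99</price></food>
  <food><name>Fish and Chips</name><price>$5.50</price></food>
  <food><name>Orange Juice</name><price>$3.49</price></food>
</menus>
"""

root = ET.fromstring(MENU)
foods = root.findall("./food")  # a Python list, indexed from 0

# XPath positions start at 1, so food[1] is the first food node.
first_name = root.find("./food[1]/name").text
last_name = root.find("./food[last()]/name").text
second_to_last = root.find("./food[last()-1]/name").text

# A conditional predicate: the food whose name child has a given text.
price = root.find("./food[name='Fish and Chips']/price").text

print(first_name)      # Egg Roll
print(last_name)       # Orange Juice
print(second_to_last)  # Fish and Chips
print(price)           # $5.50
print(foods[0] is root.find("./food[1]"))  # True: list index 0, XPath index 1
```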

Now that we have tested the preceding XML with various XPath expressions, let's consider a simple XML document with some attributes. Attributes are extra properties that describe certain parameters of a given node or element. A single element can contain a unique set of attributes. Attributes found in XML nodes or HTML elements help to identify a unique element by the values they contain. As we can see in the following XML, attributes are written as key=value pairs, for example, id="1491946008":

<?xml version="1.0" encoding="UTF-8"?>
<books>
    <book id="1491946008" price='47.49'>
        <author>Luciano Ramalho</author>
        <title>
            Fluent Python: Clear, Concise, and Effective Programming
        </title>
    </book>
    <book id="1491939362" price='29.83'>
        <author>Allen B. Downey</author>
        <title>
            Think Python: How to Think Like a Computer Scientist
        </title>
    </book>
</books>

XPath expressions select attributes by prefixing the attribute name with the @ character. The following table lists a few examples of XPath expressions that use attributes, with a brief description of each:

XPath expression	Description
`//book/@price`	Selects the `price` attribute for a `book`: price="47.49" price="29.83"
`//book`	Selects the `book` elements and their child elements: <book id="1491946008" price="47.49"> <author>Luciano Ramalho</author> <title>Fluent Python: Clear, Concise, and Effective Programming</title> </book> <book id="1491939362" price="29.83"> <author>Allen B. Downey</author> <title>Think Python: How to Think Like a Computer Scientist</title> </book>
`//book[@price>30]`	Selects the `book` elements whose `price` attribute is greater than `30`: <book id="1491946008" price="47.49"> <author>Luciano Ramalho</author> <title>Fluent Python: Clear, Concise, and Effective Programming </title> </book>
`//book[@price<30]/title`	Selects `title` from books where the `price` attribute is less than `30`: <title>Think Python: How to Think Like a Computer Scientist</title>
`//book/@id`	Selects the `id` attribute and its value. The `//@id` expression also results in the same output: id="1491946008" id="1491939362"
`//book[@id=1491939362]/author`	Selects `author` from `book` where `id=1491939362`: <author>Allen B. Downey</author>
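A few of the attribute expressions above can be sketched in Python with the standard-library xml.etree.ElementTree. This module supports [@attribute='value'] predicates, but it cannot return attribute nodes or compare attribute values numerically, so those parts are handled in plain Python here. The XML declaration is left out of the embedded string for simplicity.

```python
import xml.etree.ElementTree as ET

# The books document from above, embedded as a string.
BOOKS = """
<books>
    <book id="1491946008" price='47.49'>
        <author>Luciano Ramalho</author>
        <title>
            Fluent Python: Clear, Concise, and Effective Programming
        </title>
    </book>
    <book id="1491939362" price='29.83'>
        <author>Allen B. Downey</author>
        <title>
            Think Python: How to Think Like a Computer Scientist
        </title>
    </book>
</books>
"""

root = ET.fromstring(BOOKS)
books = root.findall("./book")

# Imitates //book/@price: read the attribute from each selected element.
prices = [book.get("price") for book in books]

# Corresponds to //book[@id=1491939362]/author; ElementTree requires
# the attribute value to be quoted.
author = root.find("./book[@id='1491939362']/author").text

# Imitates //book[@price<30]/title: the numeric test is done in Python.
cheap_titles = [
    book.findtext("title").strip()
    for book in books
    if float(book.get("price")) < 30
]

print(prices)        # ['47.49', '29.83']
print(author)        # Allen B. Downey
print(cheap_titles)  # ['Think Python: How to Think Like a Computer Scientist']
```

The strip() call removes the line breaks and indentation that surround the title text in the source document.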

We have explored a few basic features of XPath and written expressions to retrieve the desired content. In the Scraping using lxml - a Python library section, we will use Python libraries to further explore XPath in code that scrapes provided documents (XML or HTML), and we will learn to generate XPath expressions using browser tools. For more information on XPath, please refer to the links in the Further reading section.
