HTML elements and DOM navigation

We will be using http://books.toscrape.com/, which is provided by http://toscrape.com/. toscrape offers resources that beginners and developers can use to learn web scraping and to implement and test scrapers.

Let's open the http://books.toscrape.com URL using the web browser, Google Chrome, as shown here:

Inspect view of books.toscrape.com

Once the page content has loaded successfully, we can open DevTools by right-clicking on the page and choosing Inspect, or by pressing Ctrl + Shift + I. From the Chrome menu, click More Tools and then Developer Tools. The browser should look similar to the preceding screenshot.

As the preceding screenshot shows, inspect mode displays the following:

  • The Elements panel is shown by default on the left-hand side.
  • The CSS styles for the selected element are shown on the right-hand side.
  • The DOM navigation, or element path, appears in the bottom left-hand corner, for example, html.no-js body .... div.page_inner div.row.

We covered a basic overview of these panels in Chapter 1, Web Scraping Fundamentals, in the Developer Tools section. Once the developer tools are loaded, the first item from the left is a pointer icon, which is used for selecting elements from the page, as shown in the following screenshot. This element selector (inspector) can be turned ON/OFF using Ctrl + Shift + C:

Element selector (inspector) on inspect bar

After turning ON the element selector, we can move the mouse over the loaded page. In effect, we are looking up the exact HTML element that the mouse is pointing at:

Using element selector on the book image

As seen in the preceding screenshot, moving the mouse over the first book picture selects the underlying element, with the following results:

  • The div.image_container element is highlighted and selected on the page itself.
  • Inside the Elements panel, the corresponding HTML code, <div class="image_container">, is highlighted too. The same information (where the book picture is located) can also be found by right-clicking and viewing the page source, or by pressing Ctrl + U, and then searching for the specific content.

The same action can be repeated for various sections of HTML content that we wish to scrape, as in the following examples:

  • The price for a listed book is found inside the div.product_price element.
  • The star-rating is found inside p.star-rating.
  • The book title is found inside <h3>, found before div.product_price or after p.star-rating.
  • The book detail link is found inside <a>, which exists inside <h3>.
  • As the following screenshot shows, the previously listed elements are all found inside article.product_pod. At the bottom of the screenshot, the DOM path can also be identified as article.product_pod:

 

Element selection under inspect mode
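The element path that the inspector shows can be reproduced in code. The following is a minimal sketch using only Python's standard html.parser module. The SAMPLE_HTML markup, the PathRecorder class, and the placeholder book values are hypothetical and are only modeled on the structure described above; the container class is written as product_pod on the assumption that this is the spelling used by the live site:

```python
from html.parser import HTMLParser

# Hypothetical, simplified markup modeled on the structure described above.
# The real page carries more attributes; the book values are placeholders.
SAMPLE_HTML = """
<article class="product_pod">
  <div class="image_container">
    <a href="catalogue/sample-book_1/index.html"><img src="cover.jpg" alt="Sample Book"></a>
  </div>
  <p class="star-rating Three"></p>
  <h3><a href="catalogue/sample-book_1/index.html" title="Sample Book">Sample Book</a></h3>
  <div class="product_price">
    <p class="price_color">10.00</p>
  </div>
</article>
"""

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class PathRecorder(HTMLParser):
    """Record an inspector-style path (tag.class chain) for every start tag."""

    def __init__(self):
        super().__init__()
        self.stack = []  # labels of the currently open elements
        self.paths = []  # one path per element, in document order

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        label = ".".join([tag] + classes)
        self.paths.append(" ".join(self.stack + [label]))
        if tag not in VOID_TAGS:
            self.stack.append(label)

    def handle_endtag(self, tag):
        # Pop back to the most recent open element with this tag name.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break


recorder = PathRecorder()
recorder.feed(SAMPLE_HTML)
for path in recorder.paths:
    print(path)
```

Every printed path starts with the article element, which mirrors the observation that the image, rating, title, link, and price all live inside a single container element.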

DOM navigation, as seen in the preceding screenshots, is helpful when writing XPath expressions. We can also use the page source to verify whether the path or element displayed by the element inspector actually exists there.
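As an illustration of how a DOM path translates into XPath expressions, here is a small sketch that uses the limited XPath subset of the standard library's xml.etree.ElementTree. The SAMPLE markup is a hypothetical, well-formed stand-in for one book entry, not a copy of the live page. The price_color class, the title attribute on the link, the rating word stored as a second class name, and the product_pod spelling of the container class are all assumptions made for the example:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a single book entry.
# ElementTree parses XML, so real-world HTML may need a lenient parser instead.
SAMPLE = """<article class="product_pod">
  <div class="image_container">
    <a href="catalogue/sample-book_1/index.html"><img src="cover.jpg" alt="Sample Book" /></a>
  </div>
  <p class="star-rating Three"></p>
  <h3><a href="catalogue/sample-book_1/index.html" title="Sample Book">Sample Book</a></h3>
  <div class="product_price">
    <p class="price_color">10.00</p>
  </div>
</article>"""

article = ET.fromstring(SAMPLE)

# The detail link sits inside <h3>, as observed with the inspector.
link = article.find("./h3/a")

# The price sits inside div.product_price (price_color is an assumed class).
price = article.find("./div[@class='product_price']/p[@class='price_color']")

# ElementTree has no contains(), so multi-valued class attributes
# such as "star-rating Three" are filtered in Python instead.
rating = next(
    p for p in article.iterfind("./p")
    if "star-rating" in p.get("class", "").split()
)

book = {
    "title": link.get("title"),
    "detail_link": link.get("href"),
    "price": price.text,
    "rating": rating.get("class").split()[-1],
}
print(book)
```

Each expression is simply the inspector path rewritten step by step, which is why reading the path in DevTools first tends to make the XPath easier to write.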

To be sure, DOM elements, navigation paths, and elements found using the element inspector or selectors should be cross-verified against the page source or against the resources listed in the Network panel.
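This cross-check can be automated. The sketch below, again limited to the standard library, reports which expected tag and class pairs are absent from a raw page source string. The ClassCollector class, the missing_from_source helper, and the page_source sample are hypothetical names and data introduced for illustration only:

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every (tag, class) pair that appears in the parsed markup."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def handle_starttag(self, tag, attrs):
        for css_class in (dict(attrs).get("class") or "").split():
            self.seen.add((tag, css_class))


def missing_from_source(page_source, expected):
    """Return the expected (tag, class) pairs not found in page_source."""
    collector = ClassCollector()
    collector.feed(page_source)
    return [pair for pair in expected if pair not in collector.seen]


# Hypothetical raw source that lacks the price container on purpose.
page_source = (
    '<article class="product_pod">'
    '<div class="image_container"></div>'
    '<p class="star-rating Three"></p>'
    "</article>"
)

expected = [
    ("article", "product_pod"),
    ("div", "image_container"),
    ("p", "star-rating"),
    ("div", "product_price"),
]

print(missing_from_source(page_source, expected))
```

In practice, page_source would be the text of the response body rather than an inline string. A non-empty result suggests that the element seen in the inspector is added after the initial load, in which case the Network panel is the next place to look.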