Chapter 10. Scraping JavaScript

Client-side scripting languages are languages that are run in the browser itself, rather than on a web server. The success of a client-side language depends on your browser’s ability to interpret and execute the language correctly. (This is why it’s so easy to disable JavaScript in your browser.)

Partly due to the difficulty of getting every browser manufacturer to agree on a standard, there are far fewer client-side languages than there are server-side languages. This is a good thing when it comes to web scraping: the fewer languages there are to deal with the better. 

For the most part, there are only two languages you’ll frequently encounter online: ActionScript (which is used by Flash applications) and JavaScript. ActionScript is used far less frequently today than it was 10 years ago, and is often used to stream multimedia files, as a platform for online games, or to display “intro” pages for websites that haven’t gotten the hint that no one wants to watch an intro page. At any rate, because there isn’t much demand for scraping Flash pages, I will instead focus on the client-side language that’s ubiquitous in modern web pages: JavaScript.

JavaScript is, by far, the most common and most well-supported client-side scripting language on the Web today. It can be used to collect information for user tracking, submit forms without reloading the page, embed multimedia, and even power entire online games. Even deceptively simple-looking pages can often contain multiple pieces of JavaScript. You can find it embedded between <script> tags in the page’s source code:

<script>
    alert("This creates a pop-up using JavaScript");
</script>

A Brief Introduction to JavaScript

Having at least some idea of what is going on in the code you are scraping can be immensely helpful. With that in mind, it’s a good idea to familiarize yourself with JavaScript.

JavaScript is a weakly typed language, with a syntax that is often compared to C++ and Java. Although certain elements of the syntax, such as operators, loops, and arrays, might be similar, the weak typing and script-like nature of the language can make it a difficult beast to deal with for some programmers.

For example, the following recursively calculates values in the Fibonacci sequence, and prints them out to the browser’s developer console: 

<script> 
function fibonacci(a, b){ 
    var nextNum = a + b; 
    console.log(nextNum+" is in the Fibonacci sequence"); 
    if(nextNum < 100){ 
        fibonacci(b, nextNum); 
    } 
} 
fibonacci(1, 1);
</script>

Notice that all variables are demarcated by a preceding var. This is similar to the $ sign in PHP, or the type declaration (int, String, List, etc.) in Java or C++. Python is unusual in that it doesn’t have this sort of explicit variable declaration.

JavaScript is also extremely good at passing around functions just like variables:

<script>
var fibonacci = function() { 
    var a = 1; 
    var b = 1; 
    return function () { 
        var temp = b; 
        b = a + b; 
        a = temp; 
        return b; 
    } 
}
var fibInstance = fibonacci();
console.log(fibInstance()+" is in the Fibonacci sequence"); 
console.log(fibInstance()+" is in the Fibonacci sequence"); 
console.log(fibInstance()+" is in the Fibonacci sequence");
</script>

This might seem daunting at first, but it becomes simple if you think in terms of lambda expressions (covered in Chapter 2). The variable fibonacci is defined as a function. The value of that function is itself a function, which is stored in fibInstance. Each time fibInstance is called, the inner function advances the values held in the enclosing scope and returns the next, increasingly large value in the Fibonacci sequence. 

Although it might seem convoluted at first glance, some problems, such as calculating Fibonacci values, tend to lend themselves to patterns like this. Passing around functions as variables is also extremely useful when it comes to handling user actions and callbacks, and it is worth getting comfortable with this style of programming when it comes to reading JavaScript.
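
For comparison, here is roughly the same pattern sketched in Python, using a closure that keeps its state between calls; the names fibonacci and fib_instance are chosen here to mirror the JavaScript example:

# A closure-based Fibonacci generator, mirroring the JavaScript example above
def fibonacci():
    a, b = 1, 1
    def next_num():
        nonlocal a, b
        a, b = b, a + b   # advance the sequence held in the enclosing scope
        return b
    return next_num

fib_instance = fibonacci()
print(fib_instance(), 'is in the Fibonacci sequence')
print(fib_instance(), 'is in the Fibonacci sequence')
print(fib_instance(), 'is in the Fibonacci sequence')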

Common JavaScript Libraries

Although the core JavaScript language is important to know, you can’t get very far on the modern Web without using at least one of the language’s many third-party libraries. You might see one or more of these commonly used libraries when looking at page source code.

Executing JavaScript using Python can be extremely time consuming and processor intensive, especially if you’re doing it on a large scale. Knowing your way around JavaScript and being able to parse it directly (without needing to execute it to acquire the information) can be extremely useful and save you a lot of headaches. 

jQuery

jQuery is an extremely common library, used by 70% of the most popular Internet sites and about 30% of the rest of the Internet.1 A site using jQuery is readily identifiable because it will contain an import to jQuery somewhere in its code, such as:

<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
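
As a rough illustration, a scraper can check for that import directly in the page source. The following sketch assumes BeautifulSoup and urllib; the URL is just a placeholder for whatever site you are inspecting:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.example.com')            # placeholder URL
soup = BeautifulSoup(html, 'html.parser')
scripts = soup.find_all('script', src=True)
# True if any <script> import looks like a jQuery file
print(any('jquery' in script['src'].lower() for script in scripts))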

If you find jQuery on a site, you must be careful when scraping it. jQuery is adept at dynamically creating HTML content that appears only after the JavaScript is executed. If you scrape the page’s content using traditional methods, you will retrieve only the preloaded page that appears before the JavaScript has created the content (we’ll cover this scraping problem in more detail in “Ajax and Dynamic HTML”).

In addition, these pages are more likely to contain animations, interactive content, and embedded media that might make scraping challenging.

Google Analytics

Google Analytics is used by about 50% of all websites,2 making it perhaps the most common JavaScript library and the most popular user tracking tool on the Internet. In fact, both http://pythonscraping.com and http://www.oreilly.com/ use Google Analytics.

It’s easy to tell whether a page is using Google Analytics. It will have JavaScript at the bottom similar to the following (taken from the O’Reilly Media site):

<!-- Google Analytics -->
<script type="text/javascript">

var _gaq = _gaq || []; 
_gaq.push(['_setAccount', 'UA-4591498-1']);
_gaq.push(['_setDomainName', 'oreilly.com']);
_gaq.push(['_addIgnoredRef', 'oreilly.com']);
_gaq.push(['_setSiteSpeedSampleRate', 50]);
_gaq.push(['_trackPageview']);

(function() {
    var ga = document.createElement('script');
    ga.type = 'text/javascript';
    ga.async = true;
    ga.src = ('https:' == document.location.protocol ?
        'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0];
    s.parentNode.insertBefore(ga, s);
})();

</script>

This script handles Google Analytics-specific cookies used to track your visit from page to page. This can sometimes be a problem for web scrapers that are designed to execute JavaScript and handle cookies (such as those that use Selenium, discussed later in this chapter). 

If a site uses Google Analytics or a similar web analytics system and you do not want the site to know that it’s being crawled or scraped, make sure to discard any cookies used for analytics or discard cookies altogether.
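
As a preview of Selenium (covered later in this chapter), one minimal way to do this is simply to clear the browser’s cookies between page loads; the PhantomJS path below is a placeholder:

from selenium import webdriver

driver = webdriver.PhantomJS(executable_path='<Path to Phantom JS>')
driver.get('http://pythonscraping.com')
# Drop everything, including any analytics cookies, before the next request
driver.delete_all_cookies()
driver.close()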

Google Maps

If you’ve spent any time on the Internet, you’ve almost certainly seen Google Maps embedded in a website. Its API makes it extremely easy to embed maps with custom information on any site. 

If you’re scraping any sort of location data, understanding how Google Maps works makes it easy to obtain well-formatted latitude/longitude coordinates and even addresses. One of the most common ways to denote a location in Google Maps is through a marker (also known as a pin).

Markers can be inserted into any Google Map using code such as the following:

var marker = new google.maps.Marker({
      position: new google.maps.LatLng(-25.363882,131.044922),
      map: map,
      title: 'Some marker text'
  });

Python makes it easy to extract all instances of coordinates that occur between google.maps.LatLng( and ) to obtain a list of latitude/longitude coordinates.
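
For example, a minimal sketch using a regular expression might look like the following; the URL is a placeholder for whatever map-bearing page you are scraping:

import re
from urllib.request import urlopen

html = urlopen('http://www.example.com/map-page').read().decode('utf-8')
# Capture every "lat, lng" pair passed to google.maps.LatLng(...)
coordinates = re.findall(
    r'google\.maps\.LatLng\(\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\)', html)
for latitude, longitude in coordinates:
    print(latitude, longitude)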

Using Google’s reverse geocoding API, you can resolve these coordinate pairs to addresses that are well formatted for storage and analysis.
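
A minimal sketch of calling that API is shown below; it assumes you have a Google API key, and the coordinate pair is taken from the marker example above:

import json
from urllib.request import urlopen
from urllib.parse import urlencode

params = urlencode({'latlng': '-25.363882,131.044922', 'key': 'YOUR_API_KEY'})
response = urlopen('https://maps.googleapis.com/maps/api/geocode/json?' + params)
data = json.loads(response.read().decode('utf-8'))
if data.get('results'):
    # The first result's formatted_address is usually the most specific match
    print(data['results'][0]['formatted_address'])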

Ajax and Dynamic HTML

Until now, the only way we’ve had of communicating with a web server is to send it some sort of HTTP request via the retrieval of a new page. If you’ve ever submitted a form or retrieved information from a server without reloading the page, you’ve likely used a website that uses Ajax.

Contrary to what some believe, Ajax is not a language but a group of technologies used to accomplish a certain task (much like web scraping, come to think of it). Ajax stands for Asynchronous JavaScript and XML, and is used to send information to and receive information from a web server without making a separate page request. Note: you should never say, “This website will be written in Ajax.” It would be correct to say, “This form will use Ajax to communicate with the web server.”

Like Ajax, dynamic HTML or DHTML is a collection of technologies used for a common purpose. DHTML is HTML code, CSS language, or both that change due to client-side scripts changing HTML elements on the page. A button might appear only after the user moves the cursor, a background color might change on a click, or an Ajax request might trigger a new block of content to load. 

Note that although the word “dynamic” is generally associated with words like “moving,” or “changing,” the presence of interactive HTML components, moving images, or embedded media does not necessarily make a page DHTML, even though it might look dynamic. In addition, some of the most boring, static-looking pages on the Internet can have DHTML processes running behind the scenes that depend on the use of JavaScript to manipulate the HTML and CSS. 

If you scrape a large number of different websites, you will soon run into a situation in which the content you are viewing in your browser does not match the content you see in the source code you’re retrieving from the site. You might view the output of your scraper and scratch your head, trying to figure out where everything you’re seeing on the exact same page in your browser has disappeared to.

The web page might also have a loading page that appears to redirect you to another page of results, but you’ll notice that the page’s URL never changes when this redirect happens.

Both of these are caused by a failure of your scraper to execute the JavaScript that is making the magic happen on the page. Without the JavaScript, the HTML just sort of sits there, and the site might look very different from how it appears in your web browser, which executes the JavaScript without problem.

There are several giveaways that a page might be using Ajax or DHTML to change/load the content, but in situations like this there are only two solutions: scrape the content directly from the JavaScript, or use Python packages capable of executing the JavaScript itself, and scrape the website as you view it in your browser. 

Executing JavaScript in Python with Selenium

Selenium is a powerful web scraping tool developed originally for website testing. These days it’s also used when the accurate portrayal of websites—as they appear in a browser—is required. Selenium works by automating browsers to load the website, retrieve the required data, and even take screenshots or assert that certain actions happen on the website. 

Selenium does not contain its own web browser; it requires integration with third-party browsers in order to run. If you were to run Selenium with Firefox, for example, you would literally see a Firefox instance open up on your screen, navigate to the website, and perform the actions you had specified in the code. Although this might be neat to watch, I prefer my scripts to run quietly in the background, so I use a tool called PhantomJS in lieu of an actual browser. 

PhantomJS is what is known as a “headless” browser. It loads websites into memory and executes JavaScript on the page, but it does so without any graphic rendering of the website to the user. By combining Selenium with PhantomJS, you can run an extremely powerful web scraper that handles cookies, JavaScript, headers, and everything else you need with ease.

You can download the Selenium library from its website or use a third-party installer such as pip to install it from the command line.
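
For example, with pip available, installation is typically a single command:

$ pip install selenium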

PhantomJS can be downloaded from its website. Because PhantomJS is a full (albeit headless) browser and not a Python library, it does require a download and installation to use and cannot be installed with pip.

Although there are plenty of pages that use Ajax to load data (notably Google), I’ve created a sample page at http://bit.ly/1HYuH0L to run our scrapers against. This page contains some sample text, hardcoded into the page’s HTML, that is replaced by Ajax-generated content after a two-second delay. If we were to scrape this page’s data using traditional methods, we’d only get the loading page, without actually getting the data that we want.

The Selenium library is an API that operates on an object called the WebDriver. The WebDriver is a bit like a browser in that it can load websites, but it can also be used like a BeautifulSoup object to find page elements, interact with elements on the page (send text, click, etc.), and do other actions to drive the web scrapers. 

The following code retrieves text behind an Ajax “wall” on the test page:

from selenium import webdriver
import time

driver = webdriver.PhantomJS(executable_path='')
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
time.sleep(3)
print(driver.find_element_by_id("content").text)
driver.close()

This creates a new Selenium WebDriver using the PhantomJS library, tells the WebDriver to load a page, and then pauses execution for three seconds before looking at the page to retrieve the (hopefully loaded) content. 

Depending on the location of your PhantomJS installation, you might also need to explicitly point Selenium in the right direction when creating a new PhantomJS WebDriver:

driver = webdriver.PhantomJS(
    executable_path='/path/to/download/phantomjs-1.9.8-macosx/bin/phantomjs')

If everything is configured correctly the script should take a few seconds to run and result in the following text:

Here is some important text you want to retrieve!
A button to click!

Note that although the page itself contains an HTML button, Selenium’s .text function retrieves the text value of the button in the same way that it retrieves all other content on the page. 

If the time.sleep pause is changed to one second instead of three, the text returned changes to the original:

This is some content that will appear on the page while it's loading.
 You don't care about scraping this.

Although this solution works, it is somewhat inefficient and implementing it could cause problems on a large scale. Page load times are inconsistent, depending on the server load at any particular millisecond, and natural variations occur in connection speed. Although this page load should take just over two seconds, we’re giving it an entire three seconds to make sure that it loads completely. A more efficient solution would repeatedly check for the existence of some element on a fully loaded page and return only when that element exists. 

This code uses the presence of the button with the id loadedButton to declare that the page has been fully loaded:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS(executable_path='')
driver.get("http://pythonscraping.com/pages/javascript/ajaxDemo.html")
try:
    element = WebDriverWait(driver, 10).until(
                       EC.presence_of_element_located((By.ID, "loadedButton")))
finally:
    print(driver.find_element_by_id("content").text)
    driver.close()

There are several new imports in this script, most notably WebDriverWait and expected_conditions, which are combined here to form what Selenium calls an explicit wait.

An explicit wait pauses until some specified state in the DOM occurs, rather than for a hardcoded length of time, as the previous example does with its three-second sleep. The triggering DOM state is defined by an expected condition (note that the expected_conditions module is imported as EC here, a common convention used for brevity). Expected conditions can be many things in the Selenium library, among them: 

  • An alert box pops up
  • An element (such as a text box) is put into a “selected” state
  • The page’s title changes, or some text is now displayed on the page or in a specific element
  • An element is now visible to the DOM, or an element disappears from the DOM

Of course, most of these expected conditions require that you specify an element to watch for in the first place. Elements are specified using locators. Note that locators are not the same as selectors (see previous sidebar for more on selectors). A locator is an abstract query language, using the By object, which can be used in a variety of ways, including to make selectors. 

In the following example code, a locator is used to find elements with the id loadedButton:

EC.presence_of_element_located((By.ID, "loadedButton"))

Locators can also be used to create selectors, using the find_element WebDriver function:

print(driver.find_element(By.ID, "content").text)

Which is, of course, functionally equivalent to the line in the example code:

print(driver.find_element_by_id("content").text)

If you do not need to use a locator, don’t; it will save you an import. However, it is a very handy tool that is used for a variety of applications and has a great degree of flexibility. 

The following locator selection strategies can be used with the By object; a short sketch combining a few of them follows the list:

ID
Used in the example; finds elements by their HTML id attribute.
CLASS_NAME
Used to find elements by their HTML class attribute. Why is this function CLASS_NAME and not simply CLASS? Using the form object.CLASS would create problems for Selenium’s Java library, where .class is a reserved method. In order to keep the Selenium syntax consistent between different languages, CLASS_NAME was used instead.
CSS_SELECTOR
Finds elements by their class, id, or tag name, using the #idName, .className, tagName convention.
LINK_TEXT
Finds HTML <a> tags by the text they contain. For example, a link that says “Next” can be selected using (By.LINK_TEXT, "Next").
PARTIAL_LINK_TEXT
Similar to LINK_TEXT, but matches on a partial string.
NAME
Finds HTML tags by their name attribute. This is handy for HTML forms.
TAG_NAME
Finds HTML tags by their tag name.
XPATH
Uses an XPath expression (the syntax of which is described in the upcoming sidebar) to select matching elements.
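
Here is a short sketch combining a few of these strategies against the Ajax demo page used earlier; all three lines locate the same content element, and the PhantomJS path is a placeholder:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.PhantomJS(executable_path='<Path to Phantom JS>')
driver.get('http://pythonscraping.com/pages/javascript/ajaxDemo.html')
# Three different strategies, all pointing at the same element
by_id = driver.find_element(By.ID, 'content')
by_css = driver.find_element(By.CSS_SELECTOR, '#content')
by_xpath = driver.find_element(By.XPATH, '//*[@id="content"]')
print(by_id.text == by_css.text == by_xpath.text)
driver.close()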

Handling Redirects

Client-side redirects are page redirects that are executed in your browser by JavaScript, rather than redirects performed on the server before the page content is sent. It can sometimes be tricky to tell the difference when visiting a page in your web browser. The redirect might happen so fast that you don’t notice any delay in loading time and assume that a client-side redirect is actually a server-side redirect.

However, when scraping the Web, the difference is obvious. A server-side redirect, depending on how it is handled, can be easily traversed by Python’s urllib library without any help from Selenium (for more information on doing this, see Chapter 3). Client-side redirects won’t be handled at all unless something is actually executing the JavaScript. 

Selenium is capable of handling these JavaScript redirects in the same way that it handles other JavaScript execution; however, the primary issue with these redirects is when to stop page execution—that is, how to tell when a page is done redirecting. A demo page at http://bit.ly/1SOGCBn gives an example of this type of redirect, with a two-second pause. 

We can detect that redirect in a clever way by “watching” an element in the DOM when the page initially loads, then repeatedly calling that element until Selenium throws a StaleElementReferenceException; that is, the element is no longer attached to the page’s DOM and the site has redirected:

from selenium import webdriver
import time
from selenium.webdriver.remote.webelement import WebElement
from selenium.common.exceptions import StaleElementReferenceException

def waitForLoad(driver):
    # Hold on to the <html> element as it exists on the initial page
    elem = driver.find_element_by_tag_name("html")
    count = 0
    while True:
        count += 1
        if count > 20:
            print("Timing out after 10 seconds and returning")
            return
        time.sleep(.5)
        try:
            # Comparing the old element raises StaleElementReferenceException
            # once the redirect has detached it from the DOM
            elem == driver.find_element_by_tag_name("html")
        except StaleElementReferenceException:
            return

driver = webdriver.PhantomJS(executable_path='<Path to Phantom JS>')
driver.get("http://pythonscraping.com/pages/javascript/redirectDemo1.html")
waitForLoad(driver)
print(driver.page_source)

This script checks the page every half second, with a timeout of 10 seconds, although the times used for the checking time and timeout can be easily adjusted up or down as needed.
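
The same check can also be written with the WebDriverWait and expected_conditions tools introduced earlier, using the staleness_of condition; this is a sketch of that alternative rather than a repeat of the approach above, and the PhantomJS path is again a placeholder:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS(executable_path='<Path to Phantom JS>')
driver.get('http://pythonscraping.com/pages/javascript/redirectDemo1.html')
# Wait up to 10 seconds for the original <html> element to go stale,
# which indicates the client-side redirect has replaced the page
html = driver.find_element_by_tag_name('html')
WebDriverWait(driver, 10).until(EC.staleness_of(html))
print(driver.page_source)
driver.close()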

1 Dave Methvin’s blog post, “The State of jQuery 2014,” January 13, 2014, contains a more detailed breakdown of the statistics.

2 W3Techs, “Usage Statistics and Market Share of Google Analytics for Websites” (http://w3techs.com/technologies/details/ta-googleanalytics/all/all).
