Chapter 8. News – The Collective Social Media!

News is ubiquitous, be it a breaking news flash, an opinion piece about the latest issues, or a gossip column on page 3 of your favorite daily. In the new age of media outlets such as Twitter and Facebook, the line between what constitutes news and what constitutes social media content has blurred. We share the notion that the unique position news occupies in today's electronic world makes it eligible to be termed social media. News can be described as a collective social media outlet: although individuals do not produce the news articles directly, the articles collectively represent the beliefs, hopes, and dreams of society.

In this chapter, we will go through the various steps involved in analyzing news from different sources. We will work with news data and build use cases that serve as an introduction to the more complex analyses that can be performed on such data. We will cover these broad topics as we progress through the chapter:

  • Identifying news sources that can be used for data collection
  • Gaining API access to the aforementioned news sources to facilitate large-scale data collection
  • Scraping article pages to extract the full text from the URLs provided by the APIs
  • Sentiment analysis of news articles to extract collective sentiments from a news database
  • Topic modeling for news articles to make sense of the vast amount of textual data in an easily palatable way

News data mostly comprises large bundles of text, and text mining/analytics is a whole area of study in itself. This chapter will serve as an introduction to that broad area of analysis. An important takeaway from this chapter is the process of collecting data from ordinary web pages. This procedure, called scraping, is a highly coveted skill for any data professional, as it provides access to data that is in the public domain but not often exposed through neat APIs.

News data – news is everywhere

A single search of the term "news" yields around 10.43 billion results, which clearly proves that there is no dearth of news data. A few credible news sources, such as The Guardian and The New York Times, have excellent APIs for data access:

The Guardian main page. Image source: https://www.theguardian.com/international

All the major print newspapers have a very strong online presence. Most allow free access to their content, while some charge for access to their news feeds. But there are a few that not only provide news data for free but also ensure easy access to that data through their APIs.

In addition to individual newspaper websites, an excellent source of news data is a news aggregator. A news aggregator keeps track of news from different sources and gives you access to a collection of sources instead of a single one. One such news aggregator is Metanews.

The Metanews main page. Image source: http://metanews.com/

Accessing news data

The introduction to this section highlighted the abundance of news data, but the sad truth is that not all of that data is easily available. News data is often commercial in nature, and many data providers don't open up access to their data in order to maintain a commercial edge. This makes it a huge problem for the data professional to access the data of interest. In theory, because news data is in the public domain, that is, on the Internet, you can extract it and use it, but the complexity of web scraping alone makes it a tough ask.

This is where news providers such as The Guardian (https://www.theguardian.com/international) and The New York Times (https://www.nytimes.com/) stand out. They maintain an excellent set of APIs that ease access to their news data. For this chapter, we will use The Guardian's APIs and a little web scraping to get the required data. The APIs have extensive documentation, available at http://open-platform.theguardian.com/documentation/, which you can read through to get comfortable with them. We encourage the reader to go through it and, especially, to get acquainted with the terms of usage.

Creating applications for data access

The first step in securing access to any API is the creation of a developer account. We will do the same to get access credentials for our news APIs. We will use The Guardian's APIs to get the required news data for our analysis.

The step-by-step process for creating a Guardian developer account is as follows:

  1. Go to The Guardian open platform web page located at http://open-platform.theguardian.com/.
  2. Click on the Get Started link and then go to the Register developer key page.
  3. This will take you to the following page, where you need to fill in the required details and then click on Register. This completes the registration process. Please see the following image for more details:
    The Guardian registration page

Once the registration process is complete, you will receive your API key via e-mail at the address that you provided on the registration page:

E-mail with API key

This key will be required for each of your API calls. Please keep the key secret, as it is used for accounting purposes by the data provider.
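
Rather than pasting the key directly into every script, you may prefer to keep it out of your source code. Here is a minimal sketch, not part of the original text, that stores the key in an environment variable and reads it back in R; the variable name GUARDIAN_API_KEY is our own choice:

# Set the key once per session (or define GUARDIAN_API_KEY in your .Renviron file)

Sys.setenv(GUARDIAN_API_KEY = "xxxxxxxxxxxxx")

# Read the key back whenever an API call needs it

api_key <- Sys.getenv("GUARDIAN_API_KEY")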

Note

Important reminder for API access

One important aspect of API-driven data access is the applicable fair-use policy. Please get acquainted with the limits that apply to you; otherwise, you risk punitive actions against your account, ranging from throttling to outright blocking.

Data extraction – not just an API call

The data access procedure in our earlier excursions was relatively simple compared to the process required for extracting news data. Here, the procedure involves two steps (we will detail both of them below):

  1. In the first step, we will get the necessary data from the API call. Normally, this data will give us the URL of the news article.
  2. In the next step, we will use the URL to extract the textual data of the article.

The API call and JSON monster

The Guardian's APIs have an R wrapper around them, but it doesn't seem to work properly. So, we will borrow the API access mechanism that we developed in Chapter 4, Foursquare—Are You Checked in Yet?

To jog your memory about the steps involved in the process, here is a recap of them:

  1. Find the endpoints required for the data we need.
  2. Construct the required URL for data access using the API key.
  3. Get the JSON response by querying the URL.
  4. Parse the JSON response into a tabular format.

The Guardian's API has a very friendly resource at http://open-platform.theguardian.com/explore/, which allows us to skip tediously going through the documentation to arrive at the required URL. Go to that address and you will be greeted with the following page:

Exploring The Guardian's API

This useful web page allows you to construct your access URL by filling in the various available filters. The filters can be selected from the Add filters... option box. Adding a filter adds a corresponding text box to the page, which can be used to specify the value for that filter.

As an example, suppose we want to search for articles mentioning brexit in the Opinion section of The Guardian between 01-06-2016 and 25-06-2016 (please note the British dd-mm-yyyy date format, as this is a UK website). The required API endpoint URL can be constructed from the preceding page by selecting the filters and then specifying their values. Take a look at the following completed web page to get the hang of the process:

Interactively creating the API URL

We can use the generated URL directly to make our API call. This interface makes URL creation a very simple process. The next step is to query the URL and then parse the JSON response to extract the required information.

Once we have generated the URL, we make the API call and then use the JSON returned by the call to extract the data. Let's use the same example as before to extract the required data:

# Libraries used for the API call and for JSON parsing

library(RCurl)
library(tidyjson)
library(magrittr)

# The URI extracted from the online explorer page (replace the api-key value with your own key)

URI = "http://content.guardianapis.com/search?from-date=2016-06-01&to-date=2016-06-25&section=commentisfree&q=brexit&api-key=xxxxxxxxxxxxx"

# Making the API call

json_res <- getURL(URI, .mapUnicode = TRUE)

Once we have the response from the API call, we again have to parse the JSON result to get the data. To parse JSON, it is very important to know the structure of the object. Here we are in luck, as the same web page that helped us create the URL also provides a sample of the results, which we can use to work out the structure of the JSON response. The following image shows the structure for the API call we have just executed:

JSON structure for parsing

The structure of the result is very simple, which is good news for us as it simplifies our data extraction logic. Please revisit the parsing process explained in an earlier chapter to see how this JSON can be parsed.
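
If you prefer to inspect the structure directly from R rather than on the web page, one option (an addition of ours, not from the original text) is to pretty-print the raw response with the jsonlite package, assuming it is installed:

# Pretty-print the raw JSON string returned by the API to inspect its structure

cat(jsonlite::prettify(json_res))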

To parse this JSON, we need to enter the results object, gather all its elements as an array, and then extract the fields we need as strings. The code that does this is quite straightforward and can easily be written based on our earlier excursions. But there is an extra complication here: an article search will typically return multiple results, and not all of them are returned in a single API call. We need to solve this problem if we want to extract the information for all of our news articles.

To do this, we pay attention to two other fields in the JSON response: pagesize and pages. The pagesize parameter tells us the number of results in a page, and the pages parameter gives us the total number of pages. We can combine these two pieces of information and write code that iterates over the required number of pages, making multiple calls, and combines all of that information into a resultant DataFrame, which will be our final output. Here is the code that completes this process:

# Find out the total number of result pages

num_pages = json_res %>%
    enter_object("response") %>%
    spread_values(num_pages = jnumber("pages"))
num_pages = num_pages$num_pages

# Initialize an empty data frame and fetch the results page by page

out_df = data.frame()
uriTemplate <- "http://content.guardianapis.com/search?from-date=2016-06-01&to-date=2016-06-25&section=commentisfree&q=brexit&api-key=xxxxxxxxxxxxxxxxxxxxxxxxx&page=%s"
for (num in 1:num_pages){
    # Construct the URL for the current page and query the API
    apiurl <- sprintf(uriTemplate, num)
    json_res <- getURL(apiurl, .mapUnicode = TRUE)
    # Parse the results array of the response into a data frame
    urls <- as.data.frame(json_res %>%
        enter_object("response") %>%
        enter_object("results") %>%
        gather_array() %>%
        spread_values(url = jstring("webUrl"),
                      type = jstring("type"),
                      sectionName = jstring("sectionName"),
                      webTitle = jstring("webTitle"),
                      sectionId = jstring("sectionId")))
    # Drop the bookkeeping columns added by tidyjson
    urls$document.id <- NULL
    urls$array.index <- NULL
    out_df = rbind(out_df, urls)
    # Pause between calls to stay within the API's rate limits
    Sys.sleep(10)
}

An important part of the preceding code is the Sys.sleep(10) instruction; it pauses the execution of our code for a specified time period, here 10 seconds, which matters when we are querying a data provider repeatedly and want to avoid hitting their API rate limit.
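
Repeated calls can also fail transiently because of network hiccups or rate-limit responses. The following is a minimal sketch, not part of the original code, of a retry wrapper around getURL that waits and tries again a few times before giving up; the function name get_with_retry and its default values are our own:

# Hypothetical helper: retry a request a few times, pausing between attempts

get_with_retry <- function(url, attempts = 3, wait = 10) {
    for (i in 1:attempts) {
        res <- tryCatch(getURL(url, .mapUnicode = TRUE), error = function(e) NULL)
        if (!is.null(res)) return(res)
        Sys.sleep(wait)
    }
    stop("Request failed after ", attempts, " attempts: ", url)
}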

The preceding paging loop will compile a DataFrame, part of which is shown here:

DataFrame with the links data

HTML scraping from the links – the bigger monster

Even after this elaborate first step, we are still not close to extracting the textual data. The procedure for extracting text is easy to explain: iterate through the DataFrame, visit each URL, and extract the text data. Those who are familiar with the complexities of web page scraping will appreciate the irony of the previous sentence. We will be using the rvest, tidyjson, magrittr, and RCurl libraries for our scraping process, which we assume are installed on your system.

The general steps involved in any web scraping task are as follows:

  1. Extract the HTML source of the web page.
  2. Analyze the HTML source of the web page.
  3. Find tags of interest that contain the information.
  4. Extract the data for those tags programmatically and repeat the whole process.

The first step is extracting the HTML source of the page. The DataFrame of results that we extracted in the last step contains a column that gives us the URL of each news story. We can then use the read_html function to extract the HTML tree of that URL:

# Extracting the HTML data (url is one entry from the url column of out_df)

library(rvest)
library(curl)

b <- read_html(curl(url, handle = curl::new_handle("useragent" = "Mozilla/5.0")))

It is important to add the handle argument, as it identifies our system as a legitimate client and helps us avoid being rejected or timed out by the server.

The next step is the toughest part of the process. Every web page is unique, yet similar in some sense. As we are looking at web pages from a single source only, the pages will be similar, but we will still have to analyze the source to find out how to extract the textual data. Take a look at the HTML source of one of the URLs that we extracted:

Web source of a sample URL

You can observe that the HTML source of our URL is a scary sight, and it runs for some 2,000 lines. How can we ever get the required information from this huge garbled text?
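
One practical trick, which is our addition and not from the original text, is to dump the parsed HTML to a local file so that you can search through it comfortably in a text editor; a minimal sketch, assuming the b object from the earlier read_html call:

# Save the raw HTML of the page to a file for offline inspection

writeLines(as.character(b), "article_source.html")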

This is the part of scraping where you go through the source looking for clues that can be used for the parsing. In our case, the clue is the text of the article itself. We also have access to the web page that is generated by this source, and the web page shows us the text that we are looking for in the source. See the next image, in which we have searched for some of the text snippets from the web page; it reveals a very repeatable structure that we can use:

Web source of a sample URL (continued)

If you look closely at the image, you will observe that all the textual data is contained within <p> tags (the paragraph tag of HTML). This is useful information, as it will make our parsing process a breeze.

Note

Please keep in mind that the HTML structures of web pages are not only unique but also very dynamic in nature. A simple change to the structure can break our nicely written scraping routine. To fix any such breakage, the process is the same as before: you go fishing in the source and find information that can be used for parsing.

Now that we know the tags of interest, we can easily find them using the html_nodes function from the rvest library:

# Find the nodes for paragraph tags

paragraph_nodes = html_nodes(b, xpath = ".//p")

The preceding line of code will search for all paragraph nodes in the tree structure of the HTML source and extract them for processing. Once you have the nodes extracted, the next step is super simple:

# Tidy up by removing newlines and extra whitespace, then collapse all nodes into a single string

nodes <- trimws(html_text(paragraph_nodes))
nodes <- gsub("\n", "", nodes)
nodes <- gsub("  ", "", nodes)
content = paste(nodes, collapse = " ")

Now we have the textual content of the web page extracted. The following image gives the extracted textual content:

Extracted textual content

We can iterate this whole process over every row of the DataFrame to end up with the content for each of our URLs; a sketch of this final step follows.
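
Here is a minimal sketch, not the book's own code, of how that iteration might look: it wraps the scraping steps shown above into a helper function and applies it to every URL in out_df. The function name extract_article_text and the one-second pause are our own choices:

# Hypothetical helper: scrape the paragraph text of a single article URL

extract_article_text <- function(url) {
    page <- read_html(curl(url, handle = curl::new_handle("useragent" = "Mozilla/5.0")))
    paragraphs <- html_nodes(page, xpath = ".//p")
    text <- trimws(html_text(paragraphs))
    paste(gsub("\n", "", text), collapse = " ")
}

# Apply the helper to every URL, storing the article text alongside the metadata

out_df$content <- sapply(out_df$url, function(u) {
    Sys.sleep(1)  # be polite to the server between requests
    tryCatch(extract_article_text(u), error = function(e) NA)
})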
