Chapter 10. Clustering News Articles

In most of the previous chapters, we performed data mining knowing what we were looking for. Our use of target classes allowed us to learn how our variables model those targets during the training phase. This type of learning, where we have targets to train against, is called supervised learning. In this chapter, we consider what we do when we don't have those targets. This is unsupervised learning, which is much more of an exploratory task. Rather than classifying with our model, the goal in unsupervised learning is to explore the data and find insights.

In this chapter, we look at clustering news articles to find trends and patterns in the data. We also look at how we can extract data from different websites, using a link aggregation website as a source of a variety of news stories.

The key concepts covered in this chapter include:

  • Obtaining text from arbitrary websites
  • Using the reddit API to collect interesting news stories
  • Cluster analysis for unsupervised data mining
  • Extracting topics from documents
  • Online learning for updating a model without retraining it
  • Cluster ensembling to combine different models

Obtaining news articles

In this chapter, we will build a system that takes a live feed of news articles and groups them together, where the groups have similar topics. You could run the system over several weeks (or longer) to see how trends change over that time.

Our system will start with the popular link aggregation website reddit, which stores lists of links to other websites, as well as a comments section for discussion. Links on reddit are organized into categories, called subreddits. There are subreddits devoted to particular TV shows, funny images, and many other things. What we are interested in are the subreddits for news. We will use the /r/worldnews subreddit in this chapter, but the code should work with any other subreddit.

In this chapter, our goal is to download popular stories, and then cluster them to see any major themes or concepts that occur. This will give us an insight into the popular focus, without having to manually analyze hundreds of individual stories.

Using a Web API to get data

We have used web-based APIs to extract data in several of our previous chapters. For instance, in Chapter 7, Discovering Accounts to Follow Using Graph Mining, we used Twitter's API to extract data. Collecting data is a critical part of the data mining pipeline, and web-based APIs are a fantastic way to collect data on a variety of topics.

There are three things you need to consider when using a web-based API for collecting data: authorization methods, rate limiting, and API endpoints.

Authorization methods allow the data provider to know who is collecting the data, in order to ensure that they are being appropriately rate-limited and that data access can be tracked. For most websites, a personal account is often enough to start collecting data, but some websites will ask you to create a formal developer account to get this access.

Rate limiting is applied to data collection, particularly on free services. It is important to be aware of the rules when using APIs, as they can and do change from website to website. Twitter's API limit is 180 requests per 15 minutes (depending on the particular API call). Reddit, as we will see later, allows 30 requests per minute. Other websites impose daily limits, while some limit on a per-second basis. Even within a single website, there can be drastic differences between API calls. For example, Google Maps has different limits per resource, with different allowances for the number of requests per hour.
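As a concrete illustration, one simple way to respect a per-minute limit is to enforce a minimum pause between consecutive requests. The helper below is a generic sketch, not part of any particular API; a 2-second pause keeps you at or below reddit's 30 requests per minute:

from time import sleep
import requests

def fetch_all(urls, min_interval=2.0, **kwargs):
    # Fetch each URL in turn, pausing between requests to stay under the rate limit
    responses = []
    for url in urls:
        responses.append(requests.get(url, **kwargs))
        sleep(min_interval)
    return responses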

Note

If you find you are creating an app or running an experiment that needs more requests and faster responses, most API providers have commercial plans that allow for more calls.

API endpoints are the actual URLs that you use to extract information. These vary from website to website. Most often, web-based APIs follow a RESTful interface (REST is short for Representational State Transfer). RESTful interfaces often use the same verbs that HTTP does: GET, POST, and DELETE are the most common. For instance, to retrieve information on a resource, we might use the following API endpoint: www.dataprovider.com/api/resource_type/resource_id/.

To get the information, we just send an HTTP GET request to this URL. This will return information on the resource with the given type and ID. Most APIs follow this structure, although there are some differences in the implementation. Most websites with APIs will have them properly documented, giving you details of all the endpoints you can call and the data you can retrieve.
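For instance, with the requests library (which we use throughout this chapter), retrieving such a resource takes only a few lines. The URL below is the hypothetical example from above, not a real service:

import requests

# Hypothetical endpoint from the example above; real APIs document their own URLs
response = requests.get("http://www.dataprovider.com/api/resource_type/resource_id/")
if response.ok:
    data = response.json()  # most web-based APIs return JSON describing the resource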

First, we set up the parameters to connect to the service. To do this, you will need a developer key for reddit. In order to get this key, log in to reddit at https://www.reddit.com/login and go to https://www.reddit.com/prefs/apps. From here, click on are you a developer? create an app… and fill out the form, setting the type as script. You will get your client ID and a secret, which you can add to a new IPython Notebook:

CLIENT_ID = "<Enter your Client ID here>"
CLIENT_SECRET = "<Enter your Client Secret here>"

Reddit also asks that, when you use their API, you set the user agent to a unique string that includes your username. Create a user agent string that uniquely identifies your application. I used the name of the book, chapter 10, and a version number of 0.1 to create my user agent, but it can be any string you like. Note that not doing this will result in your connection being heavily rate-limited:

 USER_AGENT = "python:<your unique user agent> (by /u/<your reddit username>)"
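For example, a filled-in user agent following that description might look like the line below; the exact string is only an illustration, so substitute your own application name and reddit username:

USER_AGENT = "python:LearningDataMining.ch10:v0.1 (by /u/examplereddituser)"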

In addition, you will need to log in to reddit using your username and password. If you don't have an account already, sign up for one (it is free, and you don't need to verify with any personal information either).

Tip

You will need your password to complete the next step, so be careful to remove it before sharing your code with others. If you don't want to put your password in the script, set it to None and you will be prompted to enter it. However, due to the way IPython Notebooks work, you'll need to enter it into the command-line terminal that started the IPython server, not into the notebook itself. If you can't do this, you'll need to set it in the script. The developers of the IPython Notebook are working on a plugin to fix this, but it was not yet available at the time of writing.

Now let's create the username and password:

USERNAME = "<your reddit username>"
PASSWORD = "<your reddit password>"

Next, we are going to create a function to log in with this information. The reddit login API will return a token that you can use for further connections; that token will be the result of this function.
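The function relies on the getpass module (to prompt for the password if it isn't set) and the requests library (to make the HTTP calls), so import both first:

import getpass
import requests

With those imported, we can define the function itself: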

def login(username, password):

First, if you don't want to add your password to the script, you can set it to None and you will be prompted, as explained previously. The code is as follows:

    if password is None:
        password = getpass.getpass("Enter reddit password for user {}: ".format(username))

It is very important that you set the user agent to a unique value, or your connection might be severely restricted. The code is as follows:

    headers = {"User-Agent": USER_AGENT}

Next, we set up an HTTP authorization object to allow us to log in to reddit's servers:

    client_auth = requests.auth.HTTPBasicAuth(CLIENT_ID, CLIENT_SECRET)

To log in, we make a POST request to the access_token endpoint. The data we send is our username and password, along with the grant type, which is set to password for this example:

    post_data = {"grant_type": "password", "username": username, "password": password}

Finally, we use the requests library to make the login request (this is done via an HTTP POST request) and return the result, which is a dictionary of values. One of these values is the token we will need for future requests. The code is as follows:

    response = requests.post("https://www.reddit.com/api/v1/access_token", auth=client_auth, data=post_data, headers=headers)
    return response.json()

We can now call our function to get a token:

token = login(USERNAME, PASSWORD)

The token object is just a dictionary, but it contains the access_token string that we will pass with future requests. It also contains other information, such as the scope of the token (which will be everything) and the number of seconds until it expires. For example:

{'access_token': '<semi-random string>',
 'expires_in': 3600,
 'scope': '*',
 'token_type': 'bearer'}
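Before moving on, it is worth checking that the login actually succeeded. A minimal check, based on the dictionary layout shown above, is to confirm that the access_token key is present; if the credentials were wrong, this key will not be there:

assert 'access_token' in token, "Login failed: {}".format(token)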

Reddit as a data source

Reddit (www.reddit.com) is a link aggregation website used by millions worldwide, although the English versions are US-centric. Any user can contribute a link to a website they found interesting, along with a title for that link. Other users can then upvote it, indicating that they liked the link, or downvote it, indicating they didn't like the link. The highest-voted links are moved to the top of the page, while the lower-voted ones are not shown. Older links are removed over time (depending on how many upvotes they have). Users who have stories upvoted earn points called karma, providing an incentive to submit only good stories.

Reddit also allows nonlink content, called self-posts. These contain a title and some text that the submitter enters. They are used for asking questions and starting discussions, but do not count towards a person's karma. For this chapter, we will be considering only link-based posts, not self-posts.

Posts are separated into different sections of the website called subreddits. A subreddit is a collection of posts that are related. When a user submits a link to reddit, they choose which subreddit it goes into. Subreddits have their own administrators, and have their own rules about what is valid content for that subreddit.

By default, posts are sorted by Hot, which is a function of the age of a post, the number of upvotes, and the number of downvotes it has received. There is also New, which just gives you the most recently posted stories (and therefore contains lots of spam and bad posts), and Top, which is the highest voted stories for a given time period. In this chapter, we will be using Hot, which will give us recent, higher-quality stories (there really are a lot of poor-quality links in New).

Using the token we previously created, we can now obtain sets of links from a subreddit. To do that, we will use the /r/<subredditname> API endpoint that, by default, returns the Hot stories. We will use the /r/worldnews subreddit:

subreddit = "worldnews"

Using this endpoint, we can create the full URL with string formatting:

url = "https://oauth.reddit.com/r/{}".format(subreddit)

Next, we need to set the headers. This is needed for two reasons: to allow us to use the authorization token we received earlier and to set the user agent to stop our requests from being heavily restricted. The code is as follows:

headers = {"Authorization": "bearer {}".format(token['access_token']),
           "User-Agent": USER_AGENT}

Then, as before, we use the requests library to make the call, ensuring that we set the headers:

response = requests.get(url, headers=headers)

Calling json() on the response will give us a Python dictionary containing the information returned by reddit. It will contain the top 25 results from the given subreddit. We can get the title by iterating over the stories in this response. The stories themselves are stored under the dictionary's data key. The code is as follows:

result = response.json()
for story in result['data']['children']:
    print(story['data']['title'])
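The same pattern works for the other sort orders described earlier. For example, the subreddit's Top stories for the past week can be requested by pointing the URL at the top listing; the /top path and the t parameter follow reddit's API documentation, but treat this snippet as a sketch rather than code we rely on later:

url_top = "https://oauth.reddit.com/r/{}/top?t=week".format(subreddit)
top_response = requests.get(url_top, headers=headers)
for story in top_response.json()['data']['children']:
    print(story['data']['score'], story['data']['title'])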

Getting the data

Our dataset is going to consist of posts from the Hot list of the /r/worldnews subreddit. We saw in the previous section how to connect to reddit and how to download links. To put it all together, we will create a function that extracts the title, link, and score for each item in a given subreddit.

We will iterate through the subreddit, getting a maximum of 100 stories at a time and using pagination to get more results. We could read a large number of pages before reddit stops us, but we will limit it to 5 pages, giving up to 500 stories in total.

As our code will be making repeated calls to an API, it is important to remember to rate-limit our calls. To do so, we will need the sleep function:

from time import sleep

Our function will accept a subreddit name and an authorization token. We will also accept a number of pages to read, although we will set a default of 5:

def get_links(subreddit, token, n_pages=5):

We then create a list to store the stories in:

    stories = []

We saw in Chapter 7, Discovering Accounts to Follow Using Graph Mining, how pagination works for the Twitter API. We get a cursor with our returned results, which we send with our request. Twitter will then use this cursor to get the next page of results. The reddit API does almost exactly the same thing, except it calls the parameter after. We don't need it for the first page, so we initially set it to None. We will set it to a meaningful value after our first page of results. The code is as follows:

    after = None

We then iterate for the number of pages we want to return:

    for page_number in range(n_pages):

Inside the loop, we initialize our URL structure as we did before:

        headers = {"Authorization": "bearer {}".format(token['access_token']),
            "User-Agent": USER_AGENT}
        url = "https://oauth.reddit.com/r/{}?limit=100".format(subreddit)

From the second iteration onwards, we need to set the after parameter (otherwise, we will just get multiple copies of the same page of results). This value is set in the previous iteration of the loop: the first iteration sets the after parameter for the second, and so on. If present, we append it to the end of our URL, telling reddit to get us the next page of data. The code is as follows:

        if after:
            url += "&after={}".format(after)

Then, as before, we use the requests library to make the call and turn the result into a Python dictionary using json():

        response = requests.get(url, headers=headers)
        result = response.json()

This result will give us the after parameter for the next time the loop iterates, which we can now set as follows:

        after = result['data']['after']

We then sleep for 2 seconds to avoid exceeding the API limit:

        sleep(2)

As the last action inside the loop, we get each of the stories from the returned result and add them to our stories list. We don't need all of the data—we only get the title, URL, and score. The code is as follows:

        stories.extend([(story['data']['title'], story['data']['url'], story['data']['score'])
                       for story in result['data']['children']])

Finally (and outside the loop), we return all the stories we have found:

    return stories

Calling the get_links function is a simple case of passing the subreddit name and the authorization token:

stories = get_links("worldnews", token)
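As a quick sanity check, you can see how many stories were collected and inspect the first one (the exact output depends on what is currently on the Hot list):

print(len(stories))
print(stories[0])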

The returned results should contain the title, URL, and score of up to 500 stories, which we will now use to extract the actual text from the linked websites.
