Using the Embedly API to download story bodies

We have all the URLs for our stories, but URLs alone aren't enough to train on; we need the full article body. Extracting it could become a huge challenge if we rolled our own scraper, especially when pulling stories from dozens of sites: we would need to write code that targets the article body while carefully avoiding all the other site gunk that surrounds it. Fortunately, there are a number of free services that will do this for us. I'm going to use Embedly, but there are other services that you could use instead.

The first step is to sign up for Embedly API access at https://app.embed.ly/signup. It is a straightforward process. Once you confirm your registration, you will receive an API key. That's really all you need: you pass the key as a parameter in your HTTP request. Let's do that now:

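import json
import urllib.parse

import requests

EMBEDLY_KEY = 'your_embedly_api_key_here'

def get_html(x):
    """Return the extracted article HTML for URL x, or None on failure."""
    try:
        # Percent-encode the story URL so it can be placed in the query string
        qurl = urllib.parse.quote(x)
        rhtml = requests.get('https://api.embedly.com/1/extract?url=' + qurl + '&key=' + EMBEDLY_KEY)
        # The response is JSON; the article body is stored under 'content'
        ctnt = json.loads(rhtml.text).get('content')
    except Exception:
        return None
    return ctnt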

The preceding code results in the following output:

[Figure: HTTP requests]

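And with that, once get_html has been applied to each story URL and the results stored in the html column of our DataFrame, we have the HTML of each story.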
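The request and the per-URL loop can also be sketched without touching the network. The following is a minimal illustration, not the book's code: build_extract_url, fetch_bodies, and fake_fetch are hypothetical names, and fake_fetch merely stands in for get_html so the example runs offline. It uses urlencode from the standard library to encode both query parameters, and it skips any URL for which the fetcher returns None.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Same endpoint that get_html calls above
EXTRACT_ENDPOINT = "https://api.embedly.com/1/extract"

def build_extract_url(story_url, api_key):
    """Return the request URL with both parameters percent-encoded."""
    return EXTRACT_ENDPOINT + "?" + urlencode({"url": story_url, "key": api_key})

def fetch_bodies(urls, fetch):
    """Call fetch(url) for each URL and keep only the successful results.

    fetch is any callable that returns HTML or None, such as get_html.
    """
    bodies = {}
    for url in urls:
        html = fetch(url)
        if html is not None:
            bodies[url] = html
    return bodies

# Hypothetical stand-in for get_html so the sketch runs without an API key
def fake_fetch(url):
    return None if "broken" in url else "<p>Story at " + url + "</p>"

request_url = build_extract_url("http://example.com/a story?id=1", "demo_key")

# Decoding the query string recovers the original story URL unchanged
decoded = parse_qs(urlsplit(request_url).query)["url"][0]

bodies = fetch_bodies(
    ["http://example.com/ok", "http://example.com/broken"], fake_fetch
)
```

Passing the fetcher in as an argument keeps the loop easy to test; in real use you would pass get_html itself.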

Since the content is embedded in HTML markup, and we want to feed plain text into our model, we'll use a parser to strip out the markup tags:

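from bs4 import BeautifulSoup

def get_text(x):
    # Parse the HTML (the 'html5lib' parser requires the html5lib package)
    soup = BeautifulSoup(x, 'html5lib')
    # Keep only the text nodes, dropping all markup tags
    text = soup.get_text()
    return text

# Add a plain-text column derived from the html column
df.loc[:,'text'] = df['html'].map(get_text)

df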

The preceding code results in the following output:

[Figure: the DataFrame with the new text column]
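To see what tag stripping does on a small input, here is a dependency-free sketch built on the standard library's html.parser. It is an illustrative stand-in rather than a replacement for the BeautifulSoup version above: TextExtractor and strip_tags are hypothetical names, and skipping script and style content, returning an empty string for None, and joining text nodes with a space are design choices of this sketch only.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring tags and script/style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element
        if self._skip_depth == 0:
            self.parts.append(data)

def strip_tags(html):
    """Return the visible text of an HTML fragment, or '' for None."""
    if html is None:
        return ""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join text nodes with a space, then collapse runs of whitespace
    return " ".join(" ".join(parser.parts).split())

sample = "<div><h1>Title</h1><p>First &amp; second.</p><script>var x = 1;</script></div>"
plain = strip_tags(sample)
```

Joining with a space keeps headings from running into the following paragraph, at the cost of occasionally splitting a word that is broken up by inline tags.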

And with that, we have our training set ready. We can now move on to a discussion of how to transform our text into something that a model can work with.
