Exploring the story content

In the last section, we created a function to examine the common n-grams that are found in the headlines of our stories. Now, let's apply that to explore the full content of our stories.

We'll start by exploring bi-grams with the stop words removed. Because headlines are so short compared to the body of a story, it made sense to examine them with the stop words intact; within the story text, it typically makes sense to eliminate them:

# Bi-grams (n=2) from the story text, with stop words removed
hw, hl = get_word_stats(dfc['text'], 2, 1)

hw

This generates the following output:

Interestingly, we can see that the frivolity we saw in the headlines has completely disappeared. The text is now filled with content discussing terrorism, politics, and race relations.

How is it possible that the headlines are light-hearted, while the text is dark and controversial? I would suggest that this is because articles such as "13 Puppies Who Look Like Elvis" are going to have substantially less text than "The History of the Islamic State". Since we are counting n-grams across all of the story text, the longer, more serious articles contribute far more of them.
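To make the length effect concrete, here is a minimal sketch in plain Python. It is not the get_word_stats function from the previous section; ngram_counts, the tiny STOP_WORDS set, and the two sample stories are all hypothetical stand-ins invented for illustration. It simply counts n-grams across a small corpus and shows how one longer story can supply the top bi-gram on its own.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; real lists are much longer)
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "who", "like"}


def ngram_counts(texts, n, remove_stops=True):
    """Count n-grams across all texts, optionally dropping stop words first."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        if remove_stops:
            tokens = [t for t in tokens if t not in STOP_WORDS]
        # Slide a window of size n over the remaining tokens
        counts.update(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts


# Made-up sample stories: one short and light, one longer and serious
short_story = "Puppies who look like Elvis."
long_story = (
    "The supreme court ruled today. "
    "Critics of the supreme court responded. "
    "The supreme court declined comment."
)

counts = ngram_counts([short_story, long_story], 2)
print(counts.most_common(3))
```

In this toy corpus, the most common bi-gram comes entirely from the longer story, while each bi-gram from the short story appears only once. The same imbalance, at a much larger scale, is one plausible reason the body text looks so different from the headlines.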

Let's take a look at one more. We'll evaluate the tri-grams for the story bodies:

# Tri-grams (n=3) from the story text, with stop words removed
hw, hl = get_word_stats(dfc['text'], 3, 1)

hw

This code generates the following output:

We appear to have suddenly entered the land of advertising and social pandering. With that, let's move on to building a predictive model for content scoring.
