Spark evaluating historical data

In this example, we combine the previous sections to look at some historical data and determine a number of useful attributes.

The historical data we are using is the guest list for The Daily Show with Jon Stewart. A typical record from the data looks as follows:

1999,actor,1/11/99,Acting,Michael J. Fox 

The record contains the year, the guest's occupation, the date of appearance, a logical grouping of occupations, and the guest's name.
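As a quick illustration, we can parse one such record into named fields. The field names here are taken from the CSV header used by the script later in this section (note that `GoogleKnowlege_Occupation` is spelled that way in the file itself):

```python
import csv
import io

# One sample record from the guest-list CSV.
sample = "1999,actor,1/11/99,Acting,Michael J. Fox"

# Column names as they appear in the file's header row.
fields = ["YEAR", "GoogleKnowlege_Occupation", "Show", "Group", "Raw_Guest_List"]

# csv.reader handles quoting/commas correctly; zip pairs each field name
# with its value to give us a dictionary, much like csv.DictReader does.
record = dict(zip(fields, next(csv.reader(io.StringIO(sample)))))
print(record["YEAR"], record["Raw_Guest_List"])  # 1999 Michael J. Fox
```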

For our analysis, we will be looking at the number of appearances per year, the occupation that appears most frequently, and the personality who appears most frequently.

We will be using this script:

#Spark Daily Show Guests
import pyspark
import csv
import operator

if 'sc' not in globals():
    sc = pyspark.SparkContext()

years = {}
occupations = {}
guests = {}

#file header contains the column descriptors (the occupation column
#is misspelled in the file itself):
#YEAR, GoogleKnowlege_Occupation, Show, Group, Raw_Guest_List

with open('daily_show_guests.csv', 'rt', errors='ignore') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        year = row['YEAR']
        if year in years:
            years[year] = years[year] + 1
        else:
            years[year] = 1

        occupation = row['GoogleKnowlege_Occupation']
        if occupation in occupations:
            occupations[occupation] = occupations[occupation] + 1
        else:
            occupations[occupation] = 1

        guest = row['Raw_Guest_List']
        if guest in guests:
            guests[guest] = guests[guest] + 1
        else:
            guests[guest] = 1

#sort by descending number of occurrences
syears = sorted(years.items(), key=operator.itemgetter(1), reverse=True)
soccupations = sorted(occupations.items(), key=operator.itemgetter(1), reverse=True)
sguests = sorted(guests.items(), key=operator.itemgetter(1), reverse=True)

#print out the top fives
print(syears[:5])
print(soccupations[:5])
print(sguests[:5])

The script has a number of features:

  •  We import the packages we use.
  •  It has the familiar context preamble.
  •  We create dictionaries for years, occupations, and guests. A dictionary maps keys to values. Here, each key is the raw value from the CSV, and each value is the number of occurrences in the dataset.
  •  We open the file and read it line by line using a reader object, passing errors='ignore' because there are a couple of nulls in the file.
  •  On each line, we take the value of interest (year, occupation, name).
  •  We check whether the value is already present in the appropriate dictionary.
  •  If it is, we increment its counter.
  •  Otherwise, we initialize an entry in the dictionary.
  •  We then sort each dictionary in descending order of the number of appearances.
  •  Finally, we display the top five values for each dictionary.

If we run this in a Notebook, we see the tail of the script followed by its output: the top five counts for years, occupations, and guests.

There may be a smarter way to do all of this, but I am not aware of it.

The build-up of the accumulators is pretty standard, regardless of what language you are using. I think there is an opportunity to use a map() function here.
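One tidier alternative to the hand-rolled accumulators (using collections.Counter rather than map(), as a sketch) tallies all three dictionaries in a single pass and replaces the explicit sort-and-slice with most_common():

```python
from collections import Counter

def top_counts(rows, n=5):
    """Tally years, occupations, and guests from dict rows in one pass."""
    years, occupations, guests = Counter(), Counter(), Counter()
    for row in rows:
        years[row['YEAR']] += 1
        occupations[row['GoogleKnowlege_Occupation']] += 1
        guests[row['Raw_Guest_List']] += 1
    # most_common(n) replaces sorted(d.items(), key=..., reverse=True)[:n]
    return years.most_common(n), occupations.most_common(n), guests.most_common(n)

# Fed with the same csv.DictReader rows as the script above:
# with open('daily_show_guests.csv', 'rt', errors='ignore') as f:
#     ytop, otop, gtop = top_counts(csv.DictReader(f))
```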

I really liked being able to trim the lists with a simple slice instead of having to call a function.
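The slice is equivalent to an explicit top-N helper; for comparison, heapq.nlargest does the same job in one call (a small sketch with made-up counts):

```python
import heapq
import operator

counts = {'actor': 50, 'comedian': 20, 'journalist': 30, 'politician': 10}

# Slice a fully sorted list, as in the script above:
top2_slice = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)[:2]

# Or ask heapq for just the n largest items directly:
top2_heap = heapq.nlargest(2, counts.items(), key=operator.itemgetter(1))

print(top2_slice)  # [('actor', 50), ('journalist', 30)]
```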

The number of guests per year is very consistent. Actors are prevalent, probably because they are the group of most interest to the audience. The guest list was a little surprising: the guests are mostly actors, but I think all have strong political leanings.
