Using Python to compare ratings

In the previous examples we used R to work through data frames built from JSON files that had been converted to CSV. The Yelp business ratings file is much smaller, so we can process it with Python directly and produce similar results.

In this example, we gather cuisines from the Yelp file by selecting the businesses whose categories include Restaurants. We accumulate the ratings for every cuisine and then produce an average for each.

We read the JSON file in as separate lines and convert each line into a Python object. We decode the file as UTF-8 with the errors='ignore' option, because the data file contains many erroneous characters:

import io
import json

filein = 'c:/Users/Dan/yelp_academic_dataset_business.json'
# Decode as UTF-8 and drop any bytes that cannot be decoded
with io.open(filein, encoding='utf-8', errors='ignore') as f:
    lines = list(f)

We use a dictionary to hold the ratings for each cuisine. The key is the name of the cuisine and the value is a list of ratings for that cuisine:

ratings = {}
for line in lines:
    obj = json.loads(line)
    # Some businesses have no categories at all
    if obj['categories'] is None:
        continue
    if 'Restaurants' in obj['categories']:
        rating = obj['stars']
        # Record this rating under every category the business lists
        for category in obj['categories']:
            if category not in ratings:
                ratings[category] = []
            ratings[category].append(rating)
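
The grouping step can also be tried on a few in-memory records before running it against the full file. The following is a minimal sketch, not the code used to produce the results in this section: the sample records and the group_ratings helper are made up for illustration, and it assumes that categories is either a list of strings or null, as in the loop above.

```python
import json
from collections import defaultdict

# Hypothetical sample records, one JSON object per line like the Yelp file
sample_lines = [
    '{"name": "A", "stars": 4.0, "categories": ["Restaurants", "Thai"]}',
    '{"name": "B", "stars": 3.0, "categories": ["Restaurants", "Thai", "Bars"]}',
    '{"name": "C", "stars": 5.0, "categories": ["Plumbing"]}',
    '{"name": "D", "stars": 2.0, "categories": null}',
]

def group_ratings(lines):
    """Map each category of a restaurant to the list of its star ratings."""
    grouped = defaultdict(list)
    for line in lines:
        obj = json.loads(line)
        # Treat a missing or null category list as empty
        categories = obj.get('categories') or []
        if 'Restaurants' not in categories:
            continue
        for category in categories:
            grouped[category].append(obj['stars'])
    return dict(grouped)

sample_ratings = group_ratings(sample_lines)
print(sample_ratings)
```

Using defaultdict(list) removes the need to check whether a key already exists before appending. Note that Restaurants and Bars show up as keys even though neither is a cuisine, which is the same problem we see in the real data.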

Now that we have gathered all of the ratings, we can produce a new dictionary of cuisines with average ratings. We also accumulate a total to produce an overall average and track the highest rated cuisine:

cuisines = {}
total = 0
cmax = ''
maxc = 0
for cuisine in ratings:
    clist = ratings[cuisine]
    # Skip categories with fewer than 10 ratings to reduce noise
    if len(clist) < 10:
        continue
    avg = float(sum(clist))/len(clist)
    cuisines[cuisine] = avg
    total = total + avg
    # Track the highest average seen so far
    if avg > maxc:
        maxc = avg
        cmax = cuisine

print("Highest rated cuisine is", cmax, "at", maxc)
print("Average cuisine rating is", total/len(cuisines))

print(cuisines)

It is interesting that Personal Chefs is the highest rated. I have only heard of celebrities having a personal chef, but the data suggests it may be worthwhile. Note that the overall average is divided by len(cuisines), the number of cuisines that passed the filter. Dividing by len(ratings) instead, which also counts every category we skipped, gives 1.6, an abysmal and misleading figure that is inconsistent with the ratings we saw when we looked at the data earlier. Looking through the resulting output, there are also many items that are not cuisines, even though the Restaurants category is present. I tried to eliminate the bad data by only counting cuisines with 10 or more ratings, which removed some of it, but many erroneous records are still in play.
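
One way to reduce the number of non-cuisine entries is to skip known non-cuisine labels in addition to applying the minimum-count threshold. The following is only a sketch under stated assumptions: the sample data, the NOT_CUISINES set, and the average_ratings helper are hypothetical, and a real exclusion list would have to be built by reviewing the categories in the output.

```python
# Hypothetical ratings in the same shape as the ratings dictionary built earlier
sample = {
    'Thai': [4.0, 5.0, 3.0],
    'Pizza': [2.0, 4.0],
    'Restaurants': [4.0, 5.0, 3.0, 2.0, 4.0],
    'Bars': [1.0],
}

# Assumed labels that are not cuisines; the real list would need manual review
NOT_CUISINES = {'Restaurants', 'Bars'}

def average_ratings(ratings, min_count=2, exclude=NOT_CUISINES):
    """Average each cuisine, skipping excluded labels and small samples."""
    averages = {}
    for name, values in ratings.items():
        if name in exclude or len(values) < min_count:
            continue
        averages[name] = sum(values) / float(len(values))
    return averages

averages = average_ratings(sample)
# Highest rated cuisine among those that were kept
best = max(averages, key=averages.get)
# Overall average over the kept cuisines only
overall = sum(averages.values()) / len(averages)
print(best, averages[best], overall)
```

The threshold is set to 2 here only because the sample is tiny; the analysis above uses 10. The overall average is taken over the cuisines that were kept, so skipped categories do not drag it down.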
