In the previous examples we used R to work through data frames built from JSON files that had been converted to CSV. The Yelp business ratings file is much smaller, so we can process it directly in Python and get similar results.
In this example, we gather cuisines from the Yelp file based on whether the business category includes restaurants. We accumulate the ratings for all cuisines and then produce averages for each.
We read the JSON file into a list of lines. Each line holds one JSON record, which we later convert into a Python object:
import json

#filein = 'c:/Users/Dan/business.json'
filein = 'c:/Users/Dan/yelp_academic_dataset_business.json'
lines = list(open(filein))
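The one-record-per-line layout can be tried without the full Yelp file. The following is a minimal sketch in Python 3; the helper name `parse_json_lines` and the two sample records are made up for illustration and are not taken from the dataset.

```python
import json


def parse_json_lines(lines):
    """Convert an iterable of JSON text lines into a list of Python objects."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines rather than failing on them
            continue
        records.append(json.loads(line))
    return records


# Made-up records that mimic the one-object-per-line layout.
sample_lines = [
    '{"name": "Cafe A", "stars": 4.5, "categories": ["Restaurants", "Cafes"]}\n',
    '{"name": "Garage B", "stars": 3.0, "categories": null}\n',
]

records = parse_json_lines(sample_lines)
print(records[0]["stars"])       # 4.5
print(records[1]["categories"])  # None (JSON null becomes Python None)
```

With a real file, the same helper could be fed an open file handle, since iterating over a text file yields its lines.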
We use a dictionary to hold the ratings for each cuisine. The key is the name of the cuisine and the value is the list of ratings for that cuisine:
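ratings = {}
for line in lines:
    # drop bytes that cannot be decoded (unicode() is Python 2 only)
    line = unicode(line, errors='ignore')
    obj = json.loads(line)
    # some businesses have no categories at all
    if obj['categories'] is None:
        continue
    if 'Restaurants' in obj['categories']:
        rating = obj['stars']
        # record the rating under every category the restaurant lists
        for category in obj['categories']:
            if category not in ratings:
                ratings[category] = []
            clist = ratings.get(category)
            clist.append(rating)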
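The same grouping can be sketched in Python 3 with `collections.defaultdict`, which removes the explicit check for a missing key. The function name `collect_ratings` and the sample records are hypothetical, and the sketch assumes `categories` is a list, as the loop above implies.

```python
from collections import defaultdict


def collect_ratings(records):
    """Group star ratings by category for businesses tagged as restaurants."""
    ratings = defaultdict(list)
    for obj in records:
        categories = obj.get("categories")
        # Skip businesses with no categories or without the Restaurants tag.
        if not categories or "Restaurants" not in categories:
            continue
        for category in categories:
            ratings[category].append(obj["stars"])
    return ratings


# Made-up records for illustration only.
sample_records = [
    {"stars": 4.5, "categories": ["Restaurants", "Cafes"]},
    {"stars": 2.0, "categories": ["Restaurants", "Cafes", "Pizza"]},
    {"stars": 5.0, "categories": ["Auto Repair"]},
    {"stars": 3.0, "categories": None},
]

ratings = collect_ratings(sample_records)
print(dict(ratings))
```

Note that `Restaurants` itself becomes a key, just as it does in the original loop, because every category of a matching business is recorded.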
Now that we have gathered all of the ratings, we can produce a new dictionary of cuisines with average ratings. We also accumulate a total to produce an overall average and track the highest rated cuisine:
cuisines = {}
total = 0
cmax = ''
maxc = 0
for cuisine in ratings:
    clist = ratings[cuisine]
    if len(clist) < 10:
        continue
    avg = float(sum(clist))/len(clist)
    cuisines[cuisine] = avg
    total = total + avg
    if avg > maxc:
        maxc = avg
        cmax = cuisine

print ("Highest rated cuisine is ",cmax," at ",maxc)
print ("Average cuisine rating is ",total/len(ratings))
print (cuisines)
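As a point of comparison, here is a minimal Python 3 sketch of the averaging step that divides the overall total by the number of cuisines actually kept. The function name `average_by_cuisine`, its `min_count` parameter, and the sample data are assumptions made for illustration.

```python
def average_by_cuisine(ratings, min_count=10):
    """Return {cuisine: mean rating} for cuisines with at least min_count ratings."""
    return {
        cuisine: sum(clist) / len(clist)
        for cuisine, clist in ratings.items()
        if len(clist) >= min_count
    }


# Made-up ratings; the threshold is lowered to 2 so the tiny sample passes it.
sample = {"Thai": [4.0, 5.0, 3.0], "Pizza": [2.0, 3.0], "Vegan": [5.0]}
cuisines = average_by_cuisine(sample, min_count=2)

# Highest rated cuisine: the key with the largest average.
best = max(cuisines, key=cuisines.get)

# Overall average over the cuisines that were kept (assumes at least one was).
overall = sum(cuisines.values()) / len(cuisines)

print("Highest rated cuisine is", best, "at", cuisines[best])
print("Average cuisine rating is", overall)
```

Using `len(cuisines)` as the divisor keeps the skipped categories from pulling the overall figure down.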
It is interesting that Personal Chefs is the highest rated. I have only heard of celebrities having a personal chef, but the data suggests it may be worthwhile. An average of 1.6 across all cuisines is abysmal. The data did not appear to have a balance of high and low ratings when we looked earlier. Part of the explanation is in the code: the total is divided by len(ratings), which counts every category seen, including those skipped for having fewer than 10 ratings, so the reported figure understates the true average. Looking through the resulting output, there are also many items that are not cuisines, even though the Restaurants category is present. I tried to eliminate the bad data by counting only cuisines with 10 or more ratings. That removed some of it, but many erroneous records are still in play.
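One way to go beyond the 10-rating threshold is to drop labels that are known not to be cuisines. The sketch below is in Python 3 and is only illustrative: the exclusion list `NON_CUISINES`, the helper `drop_non_cuisines`, and the sample averages are hypothetical and were not derived from the actual output.

```python
# Hypothetical exclusion list: labels assumed not to be cuisines.
NON_CUISINES = frozenset({"Restaurants", "Food", "Nightlife", "Bars"})


def drop_non_cuisines(cuisines, excluded=NON_CUISINES):
    """Return a copy of the averages without the excluded labels."""
    return {name: avg for name, avg in cuisines.items() if name not in excluded}


# Made-up averages for illustration only.
sample_averages = {"Thai": 4.0, "Restaurants": 3.4, "Nightlife": 3.6, "Pizza": 2.5}

cleaned = drop_non_cuisines(sample_averages)
print(cleaned)
```

In practice such a list would have to be built by inspecting the categories that appear in the output, so it is a manual cleaning step rather than a general fix.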