For this chapter, we will be using the Yelp data available from the challenge at https://www.yelp.com/dataset/challenge. This section uses the dataset from round 9 of the challenge. For background, Yelp is a site where users rate and review businesses and services, and Yelp publishes those ratings for other users to see.
The download is very large (a few gigabytes) and contains several sets of rating information: business ratings, reviews, tips (short suggestions, as in "this would be a nice place to visit"), and user data. We are interested in the review data.
When dealing with files this large, it is useful to find an editor that can handle them so you can poke into the data. On Windows, most of the standard editors are limited to a few megabytes; I used the Large Text File Viewer program to open these JSON files.
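If you prefer to stay in Python, you can peek at the first few lines of a large file without loading the whole thing into memory. This is a small sketch (the `peek` helper name is my own, not part of any library):

```python
from itertools import islice

def peek(path, n=5):
    """Return the first n lines of a file without reading it all."""
    with open(path, encoding='utf-8') as f:
        return list(islice(f, n))
```

Calling `peek('yelp_academic_dataset_review.json', 3)` would show the first three JSON records, which is usually enough to see the field names.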
All of the files are in JSON format. JSON is a human-readable format with nested elements (for example, a city object containing street objects). While JSON is convenient to read, the format is clumsy when dealing with large numbers of elements; the reviews file alone contains a few million rows. So, we first convert the JSON to a flat CSV format to allow for easier processing, using this script:
import datetime
import json, csv

print(datetime.datetime.now().time())
headers = True
filein = 'c:/Users/Dan/yelp_academic_dataset_review.json'
fileout = 'c:/Users/Dan/yelp_academic_dataset_review.csv'
with open(filein) as jsonf, open(fileout, 'w', newline='') as csvf:
    for line in jsonf:
        data = json.loads(line)
        # remove the review text
        data.pop('text')
        if headers:
            w = csv.DictWriter(csvf, data.keys())
            w.writeheader()
            headers = False
        w.writerow(data)
print(datetime.datetime.now().time())
I am printing out the start and end times to get an idea of how long this takes. On my machine it took 1.5 minutes to convert the file. I had tried several versions of this code before I got the preceding code working at a satisfactory speed. While developing this script, I took a small subset of the original data file (2,000 rows) and worked with that until things progressed sufficiently.
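Creating such a subset is itself a one-liner style task. A minimal sketch (the `sample_file` helper and its file names are illustrative, not from the original script):

```python
from itertools import islice

def sample_file(src, dst, n=2000):
    """Copy the first n lines of src to dst for quick experiments."""
    with open(src, encoding='utf-8') as fin, \
         open(dst, 'w', encoding='utf-8') as fout:
        fout.writelines(islice(fin, n))
```

Since each review is one JSON object per line, the first 2,000 lines of the reviews file form a valid, self-contained sample.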
As you can see, I am reading the raw JSON file as provided from Yelp and writing out a corresponding CSV file.
The script reads each line of the JSON file (one line contains an entire object) and writes out the corresponding CSV record. I stripped out the review text because I was not evaluating it and it took up a lot of space; the review file dropped from 3 gigabytes to 300 megabytes with this change. We also make sure to write the headers to the CSV as the first record. I then used a separate script/notebook entry to read in the CSV and process it.
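That follow-up step might look like the sketch below, which reads the converted CSV back with the standard library and tallies reviews per star rating (the `star_counts` helper is my own illustration; it assumes the `stars` column from the Yelp review records survived the conversion):

```python
import csv
from collections import Counter

def star_counts(path):
    """Read the converted CSV and count reviews per star rating."""
    counts = Counter()
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            counts[row['stars']] += 1
    return counts
```

With a few million rows this streaming approach stays memory-friendly; loading the 300-megabyte CSV into a pandas DataFrame would also be a reasonable choice at that size.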