Converting JSON to CSV

For this chapter, we will be using the Yelp data available from the challenge at https://www.yelp.com/dataset/challenge. This section uses the dataset from round 9 of the challenge. For background, Yelp is a site for rating different products and services where Yelp publishes the ratings to users.

The dataset file is a very large (a few gigabytes) amount of ratings. There are several sets of rating information in the download-for business ratings, reviews, tips (as in this would be a nice place to visit), and a user set. We are interested in the review data.

When dealing with such large files it may be useful to find and use a large file editor so you can poke into the data file. On Windows, most of the standard editors are limited to a few megabytes. I used the Large Text File Viewer program to open these JSON files.

All of the files are in JSON format. JSON is a human readable format with structured elements—for example, a city object containing street objects. While it is convenient to read JSON the format is clumsy when dealing with large numbers of elements. In the reviews file there are a few million rows. So, we first convert the JSON to a flat CSV format to allow for easier processing using this script:

import time
import datetime
import json, csv
print( datetime.datetime.now().time())
headers = True
#with open('c:/Users/Dan/reviews.json') as jsonf, open('c:/Users/Dan/reviews.csv', "wb") as csvf:
filein = 'c:/Users/Dan/yelp_academic_dataset_review.json'
fileout = 'c:/Users/Dan/yelp_academic_dataset_review.csv'
with open(filein) as jsonf, open(fileout, "wb") as csvf:
for line in jsonf:
data = json.loads(line)
#remove the review text
data.pop('text')
if headers:
w = csv.DictWriter(csvf, data.keys())
w.writeheader()
headers = False
w.writerow(data)

print( datetime.datetime.now().time())

I am printing out the start and end times to get an idea of how long this takes. For my machine it took 1.5 minutes to convert the file. I had tried several versions of this code before I got the preceding code working at a satisfactory pace. While developing this script I took a small subset of the original data file (2000 rows) and worked with that file until things progressed sufficiently.

As you can see, I am reading the raw JSON file as provided from Yelp and writing out a corresponding CSV file.

The script reads each line of the JSON (one line contains an entire object) and writes out the corresponding CSV. I stripped out the review text as I was not evaluating the text of reviews and the review text took a lot of space. The review file size dropped from 3 gigabytes to 300 megabytes using this coding. Other than that we made sure to write the headers out to the CSV as the first record. I then used a separate script/notebook entry to read in the CSV and process it.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.218.172.249