Determining relationships between number of ratings and ratings

Given the preceding results, it appears that people mostly vote positively. We can check whether there is a relationship between the number of votes a company has received and its rating.

First, we accumulate the dataset using the following script, extracting the number of votes and rating for each firm:

#determine relationship between number of reviews and star rating
import json
import pandas as pd
from pandas import DataFrame as df
import numpy as np

dfr2 = pd.DataFrame(columns=['reviews', 'rating'])
mynparray = dfr2.values

for line in lines:
    line = unicode(line, errors='ignore')
    obj = json.loads(line)
    reviews = int(obj['review_count'])
    rating = float(obj['stars'])
    arow = [reviews,rating]
    mynparray = np.vstack((mynparray,arow)) 

dfr2 = df(mynparray)
print (len(dfr2))  

This code just builds the data frame with our two variables. We use a NumPy array because appending a row to it is more straightforward than appending to a data frame. Once all records have been processed, we convert the NumPy array back to a pandas data frame.
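As an alternative sketch, the rows can be collected in a plain Python list and turned into a data frame once at the end, which keeps the column names and numeric dtypes and avoids re-copying the array for every record. The sample_lines records and the build_frame helper below are made up for illustration and stand in for the lines read from the Yelp file. The sketch assumes Python 3, where each line is already a str.

```python
import json
import pandas as pd

# Hypothetical stand-in for the lines read from the Yelp business file.
sample_lines = [
    '{"review_count": 12, "stars": 4.5}',
    '{"review_count": 3, "stars": 3.0}',
    '{"review_count": 250, "stars": 4.0}',
]

def build_frame(lines):
    """Collect (reviews, rating) pairs, then build the frame in one step."""
    rows = []
    for line in lines:
        obj = json.loads(line)
        rows.append((int(obj['review_count']), float(obj['stars'])))
    return pd.DataFrame(rows, columns=['reviews', 'rating'])

dfr2_alt = build_frame(sample_lines)
print(len(dfr2_alt))
print(dfr2_alt.dtypes)
```

Because the frame is built directly with named columns, no separate step is needed to restore them.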

The column names have been lost in the translation, so we put those back in and draw some summary statistics:

dfr2.columns = ['reviews', 'rating']
dfr2.describe()

In the output, shown as follows, we see the layout of the reviews and rating data we have collected. Yelp has not constrained its data entry for this dataset. There should be 5 unique values for rating:

Next, we plot the data for a visual clue to the relationship, using the following:

import matplotlib.pyplot as plt
dfr2.plot(kind='scatter', x='rating', y='reviews')
plt.show()  

So the data, after all, appears to follow a clear Poisson distribution, in contrast to the earlier business_rating histogram.

Next, we compute the regression parameters:

#compute regression
import statsmodels.formula.api as smf

# create a fitted model in one line
lm = smf.ols(formula='rating ~ reviews', data=dfr2).fit()

# print the coefficients
lm.params  

We computed intercepts for all rating values. I had expected a single value.
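One possible explanation, offered as an assumption rather than a confirmed diagnosis: the array was stacked onto the .values of an empty data frame, which has object dtype, so both columns may be stored as generic objects rather than numbers. A formula-based fit can then treat rating as a category with one output per level. The sketch below reproduces the construction on made-up rows, converts the columns to floats, and uses np.polyfit (swapped in here for smf.ols to keep the example small) to show that a straight-line fit on numeric columns gives one slope and one intercept.

```python
import numpy as np
import pandas as pd

# Rebuild a tiny frame the same way as above, using made-up rows.
demo = pd.DataFrame(columns=['reviews', 'rating'])
arr = demo.values                      # empty array with object dtype
for row in [[10, 4.0], [3, 3.5], [40, 4.5], [7, 2.0]]:
    arr = np.vstack((arr, row))

demo = pd.DataFrame(arr, columns=['reviews', 'rating'])
print(demo.dtypes)                     # inspect how the columns are stored

# Convert both columns to floating-point numbers.
demo_num = demo.astype(float)
print(demo_num.dtypes)

# Degree-1 least-squares fit: returns [slope, intercept].
slope, intercept = np.polyfit(demo_num['reviews'], demo_num['rating'], 1)
print(slope, intercept)
```

If the same check on dfr2 shows object columns, applying dfr2 = dfr2.astype(float) before fitting would be a reasonable thing to try.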

Now, we determine the range of the observed data using the following:

#min, max observed values
X_new = pd.DataFrame({'reviews': [dfr2.reviews.min(), dfr2.reviews.max()]})
X_new.head()  

So, as we guessed earlier, some businesses have a very large number of reviews.

Now, we can make predictions for these extreme data points:

#make corresponding predictions
preds = lm.predict(X_new)
preds  
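For reference, if the fit yields a single intercept and a single slope for reviews (an assumption, given the note about the coefficients above), each prediction is the fitted line evaluated at the chosen number of reviews. Here the two inputs are the minimum and maximum observed counts. The symbols are notation introduced for this note only.

```latex
\widehat{\text{rating}} = \hat{\beta}_0 + \hat{\beta}_1 \cdot \text{reviews},
\qquad \text{reviews} \in \{\text{reviews}_{\min},\ \text{reviews}_{\max}\}
```

Because the maximum review count is very large, even a small slope can move the second prediction far from the first.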

We are seeing a much bigger range of predicted values than expected. Plot out the observed and predicted data:

# first, plot the observed data
dfr2.plot(kind='scatter', x='reviews', y='rating')

# then, plot the least squares line
plt.plot(X_new, preds, c='red', linewidth=2)
plt.show()  

In the plot displayed as follows, there does not appear to be a relationship between the number of reviews and the review score for a firm. It appears to be a numbers game: if you get people to review your firm, on average they will give you a high score.
