Determining the correlation between ratings and number of reviews

We can look at the correlation between star rating and number of reviews using the following script:

# count the reviews at each star level, then correlate star level with review count
# (assumes the reviews data frame is already loaded)
library(sqldf)
reviews_stars = sqldf("select stars, count(*) as reviews from reviews group by stars")
reviews_stars
cor(reviews_stars)

The output shows three times as many 5-star reviews as 1-star reviews, and a very high correlation between star level and number of reviews (0.8632361). Note that this coefficient is computed over only five rows, one per star level. It suggests that people mostly bother to review firms they like. That makes it interesting to use Yelp simply to check whether a firm is reviewed at all: if a firm has no reviews (or very few), the unwritten reviews are likely bad.
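To make it clearer what cor() is doing with the grouped counts, here is a minimal sketch in base R. The reviews_toy data frame and its values are made up for illustration and are not the Yelp data; aggregate() stands in for the sqldf query.

```r
# Toy stand-in for the Yelp reviews data frame (values are invented for illustration)
reviews_toy <- data.frame(
  business_id = c("a", "a", "b", "b", "b", "c", "c", "c", "c", "c"),
  stars       = c(1, 2, 3, 4, 4, 4, 5, 5, 5, 5)
)

# Count reviews per star level (base R equivalent of the group-by query)
star_counts <- aggregate(business_id ~ stars, data = reviews_toy, FUN = length)
names(star_counts) <- c("stars", "reviews")

# cor() on a two-column numeric data frame returns a 2 x 2 correlation matrix;
# the off-diagonal entry is the Pearson coefficient between the two columns
cor_matrix <- cor(star_counts)
r <- cor(star_counts$stars, star_counts$reviews)

# With one row per star level there are only five data points,
# so cor.test() is a useful check on how much weight the coefficient can bear
ct <- cor.test(star_counts$stars, star_counts$reviews)
ct$p.value
```

Because the coefficient rests on so few points, it is best read as a rough indication of the trend rather than a precise measurement.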

We can visualize how average ratings are distributed across companies using the following script:

# average star rating per business, highest first
library(sqldf)
business_rating = sqldf("select business_id, avg(stars) as rating from reviews group by business_id order by 2 desc")
head(business_rating)
# histogram of the per-business average ratings
hist(business_rating$rating)

Here, the business_rating data frame lists each business with its average star rating. The resultant histogram is as follows:

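This looks like a Poisson distribution, though the resemblance is only visual: average ratings are continuous and bounded between 1 and 5, whereas a Poisson variable is a non-negative count. It is interesting that the distribution of firm ratings takes such a natural shape.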
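The histogram shows ratings alone. To relate each company's rating to its number of reviews, one possible approach is to compute both per business and correlate them. The sketch below uses base R and the same kind of invented toy data; reviews_toy and per_business are illustrative names, not part of the original script.

```r
# Toy stand-in for the Yelp reviews data frame (values are invented for illustration)
reviews_toy <- data.frame(
  business_id = c("a", "a", "b", "b", "b", "c", "c", "c", "c", "c"),
  stars       = c(1, 2, 3, 4, 4, 4, 5, 5, 5, 5)
)

# Average rating and review count for each business
rating    <- tapply(reviews_toy$stars, reviews_toy$business_id, mean)
n_reviews <- tapply(reviews_toy$stars, reviews_toy$business_id, length)

per_business <- data.frame(
  business_id = names(rating),
  rating      = as.numeric(rating),
  reviews     = as.integer(n_reviews)
)

# Pearson correlation between a business's average rating and its review count
business_cor <- cor(per_business$rating, per_business$reviews)
business_cor
```

On the real data, plot(per_business$reviews, per_business$rating) would show the same relationship as a scatter plot.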
