Determining the correlation between ratings and number of reviews

We could look at the correlation between star rating and number of reviews using the following script:

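# Correlation between star rating and number of reviews
# (assumes the reviews data frame has already been loaded)
library(sqldf)
# Count the reviews at each star level
reviews_stars <- sqldf("select stars, count(*) as reviews from reviews group by stars")
reviews_stars
# Correlation matrix of the stars and reviews columns
cor(reviews_stars)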

So, we see three times as many 5-star reviews as 1-star reviews. We also see a very high correlation between number of reviews and number of stars (0.8632361), although this coefficient is computed over only one row per star level. This suggests that people mostly bother to rate firms they like. That makes it interesting to use Yelp simply to determine whether a firm is reviewed at all: if the firm is not rated (or not rated much), the reviews that were never written are probably bad ones.
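As a minimal sketch of the same calculation without sqldf, the following uses base R only. The toy `reviews` data frame is hypothetical and merely stands in for the real Yelp data, so the numbers it prints say nothing about the actual result. It also shows how to pull out the single coefficient and, using `cor.test`, how to check how much weight so few data points can carry.

```r
# Hypothetical toy data standing in for the book's `reviews` data frame
reviews <- data.frame(
  business_id = c("a", "a", "a", "b", "b", "c", "c", "c", "c", "d"),
  stars       = c(5, 4, 5, 1, 2, 5, 4, 3, 5, 4)
)

# Count reviews per star level (base R counterpart of the GROUP BY query)
counts <- as.data.frame(table(stars = reviews$stars), stringsAsFactors = FALSE)
reviews_stars <- data.frame(
  stars   = as.numeric(counts$stars),
  reviews = counts$Freq
)
print(reviews_stars)

# Pearson correlation between star level and review count, as a single number
r <- cor(reviews_stars$stars, reviews_stars$reviews)
print(r)

# Test of the correlation; with one row per star level the sample is tiny
ct <- cor.test(reviews_stars$stars, reviews_stars$reviews)
print(ct$p.value)
```

Calling `cor` on the two columns returns one number, whereas `cor(reviews_stars)` on the whole data frame returns a 2 x 2 matrix whose off-diagonal entries hold that same value.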

We could visualize how average ratings are distributed across companies using the following script:

# Average star rating per business, highest first
library(sqldf)
business_rating <- sqldf("select business_id, avg(stars) as rating from reviews group by business_id order by 2 desc")
head(business_rating)
# Histogram of the average ratings
hist(business_rating$rating)

Here, the business_rating data frame lists each business with its average star rating. The resultant histogram is as follows:

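In shape, this looks like a Poisson distribution, although the resemblance is only visual: average ratings are bounded between 1 and 5 and are not counts. It is interesting that the distribution of firm ratings takes such a natural dispersion.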
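The histogram summarizes average ratings only. To relate a company's rating to how many reviews it has, one could compute both figures per business and plot them against each other. The sketch below does this in base R; the toy `reviews` data frame and the name `business_summary` are hypothetical and not part of the original scripts.

```r
# Hypothetical toy data standing in for the book's `reviews` data frame
reviews <- data.frame(
  business_id = c("a", "a", "a", "b", "b", "c", "c", "c", "c", "d"),
  stars       = c(5, 4, 5, 1, 2, 5, 4, 3, 5, 4)
)

# Average rating and review count per business
# (both results come back sorted by business_id, so the rows line up)
rating    <- aggregate(stars ~ business_id, data = reviews, FUN = mean)
n_reviews <- aggregate(stars ~ business_id, data = reviews, FUN = length)

business_summary <- data.frame(
  business_id = rating$business_id,
  rating      = rating$stars,
  reviews     = n_reviews$stars
)
print(business_summary)

# Scatter plot of average rating against number of reviews
plot(business_summary$reviews, business_summary$rating,
     xlab = "Number of reviews", ylab = "Average star rating")
```

On the real data, `cor(business_summary$reviews, business_summary$rating)` would then give a per-business correlation, which is a different quantity from the per-star-level figure computed earlier.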
