We could look at the correlation between star rating and the number of reviews using the following script:
# correlation between number of reviews and number of stars
library(sqldf)
reviews_stars = sqldf("select stars, count(*) as reviews from reviews group by stars")
reviews_stars
cor(reviews_stars)
So, we see three times as many 5-star reviews as 1-star reviews. We also see a high correlation between the star level and the number of reviews at that level (0.8632361). Note that this figure is computed over one row per star level, so it rests on very few data points. It suggests that people mostly bother to rate firms they like. That makes it interesting to use Yelp simply to check whether a firm is reviewed at all: if a firm is not rated (or not rated much), the unwritten reviews are probably bad.
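To make the shape of that calculation concrete, here is a minimal sketch on made-up data. The toy `reviews` data frame and its counts are assumptions for illustration, not the Yelp data, and base R's `aggregate` stands in for the `sqldf` query so the example runs without extra packages.

```r
# Hypothetical toy data standing in for the Yelp reviews table;
# the number of reviews per star level is made up.
reviews <- data.frame(stars = rep(1:5, times = c(2, 1, 3, 5, 6)))

# Base-R equivalent of:
#   select stars, count(*) as reviews from reviews group by stars
counts <- aggregate(
  x   = list(reviews = reviews$stars),
  by  = list(stars = reviews$stars),
  FUN = length
)
counts

# cor() on a two-column data frame returns a 2x2 correlation matrix;
# the off-diagonal entry is the stars/reviews correlation.
cor_matrix <- cor(counts)
cor_matrix

# The same value as a single number.
cor(counts$stars, counts$reviews)
```

Because the grouped table has only one row per star level, the coefficient summarizes a handful of points and should be read as a rough indication rather than a firm result.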
We could visualize how average ratings are distributed across companies using the following script:
# average rating per business
library(sqldf)
business_rating = sqldf("select business_id, avg(stars) as rating from reviews group by business_id order by 2 desc")
head(business_rating)
hist(business_rating$rating)
Here the business_rating data frame lists each business with its average star rating, sorted from highest to lowest. The resultant histogram is as follows:
The shape resembles a Poisson distribution, although average ratings are continuous and bounded, so the resemblance is only visual. It is interesting that the distribution of firm ratings takes such a natural dispersion.
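The earlier idea that rarely reviewed firms tend to be the poorly regarded ones could also be probed per business. The following is a sketch under stated assumptions: the toy `reviews` data frame, its business ids, and the `per_business` name are hypothetical, and base R is used in place of `sqldf` so the block is self-contained.

```r
# Hypothetical toy reviews; business ids and stars are made up.
reviews <- data.frame(
  business_id = c("a", "a", "a", "a", "b", "b", "c", "d", "d", "d"),
  stars       = c(5, 4, 5, 4, 3, 2, 1, 4, 5, 3)
)

# Average rating and review count per business. With sqldf this could be:
#   select business_id, avg(stars) as rating, count(*) as reviews
#   from reviews group by business_id
rating <- aggregate(stars ~ business_id, data = reviews, FUN = mean)
names(rating)[2] <- "rating"
volume <- aggregate(stars ~ business_id, data = reviews, FUN = length)
names(volume)[2] <- "reviews"
per_business <- merge(rating, volume, by = "business_id")
per_business

# Distribution of average ratings across businesses.
hist(per_business$rating,
     main = "Average rating per business",
     xlab = "Average stars")

# Rank correlation between review volume and average rating.
cor(per_business$reviews, per_business$rating, method = "spearman")
```

A rank correlation is used here because review counts per business are usually heavily skewed; that is a design choice for this sketch, not something the original script does.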