Finally, we apply sentiment analysis to the significantly larger Yelp business review dataset, which has five outcome classes (the star ratings from one to five). Yelp provides the data to encourage data science innovation; it consists of several files with information on businesses, users, reviews, and other aspects.
We will use around six million reviews written over the 2010-2018 period (see the relevant notebook for details). The following charts show the number of reviews and the average number of stars per year:
Figure: Number of reviews and average number of stars per year
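The per-year summary behind such charts could be computed along the following lines. This is a minimal sketch on a toy DataFrame, not the notebook's code; the column names `date` and `stars` and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the review data; the column names `date` and `stars`
# are assumptions, not taken from the notebook.
reviews = pd.DataFrame({
    "date": pd.to_datetime(
        ["2016-03-01", "2016-07-15", "2017-01-20", "2018-05-05"]
    ),
    "stars": [5, 3, 4, 1],
})

# Group by calendar year, then count the reviews and average the stars.
yearly = (
    reviews
    .groupby(reviews["date"].dt.year)["stars"]
    .agg(n_reviews="count", avg_stars="mean")
)
print(yearly)
```

On the full dataset, calling `yearly.plot(subplots=True)` would give one panel per column, assuming matplotlib is installed.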
In addition to the text features derived from the review texts, we will also use other information submitted with the review or about the user.
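One common way to combine text features with other review or user attributes is to stack a sparse document-term matrix next to the additional columns. The sketch below assumes scikit-learn's `CountVectorizer` and SciPy's `sparse.hstack`; the example texts and the two extra features (useful votes and the user's review count) are hypothetical and only illustrate the mechanics.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical review texts.
texts = ["great food and service", "terrible service", "food was okay"]

# Hypothetical non-text features, one row per review:
# [useful votes on the review, number of reviews by the user].
extra = np.array([[3, 120], [0, 4], [1, 37]], dtype=float)

# Sparse document-term matrix of token counts.
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(texts)

# Append the extra columns to the right of the token counts, keeping
# the result sparse so that it scales to millions of reviews.
X = sparse.hstack([dtm, sparse.csr_matrix(extra)]).tocsr()
print(X.shape)
```

Keeping the combined matrix sparse matters at this scale, because a dense matrix with one column per vocabulary term would not fit in memory for several million reviews.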
We will train various models on data through 2017 and use 2018 as the test set.
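A time-based split of this kind can be sketched as follows. The toy DataFrame and the `date` and `stars` column names are assumptions for illustration; the point is only that every 2018 review goes to the test set and every earlier review to the training set.

```python
import pandas as pd

# Toy stand-in for the review data; column names are assumptions.
reviews = pd.DataFrame({
    "date": pd.to_datetime(
        ["2016-03-01", "2017-12-31", "2018-01-01", "2018-05-05"]
    ),
    "stars": [5, 3, 4, 1],
})

year = reviews["date"].dt.year

# Train on everything through 2017, test on 2018 only.
train = reviews[year <= 2017]
test = reviews[year == 2018]
print(len(train), len(test))
```

Splitting by date rather than at random keeps the evaluation out-of-sample in time: no model sees reviews written later than the period it is trained on.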