Exploration of the data

Let's look at the shape of our dataset:

hotel_reviews.shape

(35912, 19)

This tells us that we are working with 35,912 rows and 19 columns. Eventually, we will be concerned only with the column that contains the text data, but for now, let's look at the first few rows to get a better sense of what is included in our data:

hotel_reviews.head()

This gives us the following table:

   address                 categories  city      country  latitude   longitude
0  Riviera San Nicol 11/a  Hotels      Mableton  US       45.421611  12.376187
1  Riviera San Nicol 11/a  Hotels      Mableton  US       45.421611  12.376187
2  Riviera San Nicol 11/a  Hotels      Mableton  US       45.421611  12.376187
3  Riviera San Nicol 11/a  Hotels      Mableton  US       45.421611  12.376187
4  Riviera San Nicol 11/a  Hotels      Mableton  US       45.421611  12.376187

   name                postalCode  province  reviews.date          reviews.dateAdded
0  Hotel Russo Palace  30126       GA        2013-09-22T00:00:00Z  2016-10-24T00:00:25Z
1  Hotel Russo Palace  30126       GA        2015-04-03T00:00:00Z  2016-10-24T00:00:25Z
2  Hotel Russo Palace  30126       GA        2014-05-13T00:00:00Z  2016-10-24T00:00:25Z
3  Hotel Russo Palace  30126       GA        2013-10-27T00:00:00Z  2016-10-24T00:00:25Z
4  Hotel Russo Palace  30126       GA        2015-03-05T00:00:00Z  2016-10-24T00:00:25Z

   reviews.doRecommend  reviews.id  reviews.rating  reviews.text
0  NaN                  NaN         4.0             Pleasant 10 min walk along the sea front to th...
1  NaN                  NaN         5.0             Really lovely hotel. Stayed on the very top fl...
2  NaN                  NaN         5.0             Ett mycket bra hotell. Det som drog ner betyge...
3  NaN                  NaN         5.0             We stayed here for four nights in October. The...
4  NaN                  NaN         5.0             We stayed here for four nights in October. The...

   reviews.title                       reviews.userCity  reviews.username  reviews.userProvince
0  Good location away from the crouds  NaN               Russ (kent)       NaN
1  Great hotel with Jacuzzi bath!      NaN               A Traveler        NaN
2  Lugnt läge                          NaN               Maud              NaN
3  Good location on the Lido.          NaN               Julie             NaN
4  ������ ���������������                NaN               sungchul          NaN
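
The preview hints at why the country column alone may not be a reliable filter: every row shown is labelled US with the city Mableton, yet a latitude of about 45.4 and a longitude of about 12.4 fall in northern Italy. As a minimal sketch, assuming a rough latitude/longitude box around the contiguous United States, the following finds rows whose country label disagrees with their coordinates. The toy DataFrame `sample`, its second row, and the box bounds are illustrative assumptions, not part of the dataset:

```python
import pandas as pd

# Toy stand-in for hotel_reviews. The first row copies values from the
# preview; the second row is hypothetical.
sample = pd.DataFrame({
    'country': ['US', 'US'],
    'city': ['Mableton', 'Atlanta'],
    'latitude': [45.421611, 33.749],
    'longitude': [12.376187, -84.388],
})

# Assumed rough bounding box around the contiguous US (inclusive bounds).
in_box = (
    sample['latitude'].between(24.0, 50.0)
    & sample['longitude'].between(-122.0, -65.0)
)

# Rows labelled US whose coordinates fall outside the box.
mismatched = sample[(sample['country'] == 'US') & ~in_box]
print(mismatched[['city', 'latitude', 'longitude']])
```

Running the same check on the full DataFrame would show how many US-labelled rows sit outside the box, which is one way to decide whether to filter on coordinates or on the country label.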

 

Let's keep only reviews from the United States, which raises the chance that the remaining reviews are written in English. First, let's plot our data, like so:

# plot the lats and longs of reviews
hotel_reviews.plot.scatter(x='longitude', y='latitude')

The output looks something like this:

For the purpose of making our dataset a bit easier to work with, let's use pandas to subset the reviews and only include those that came from the United States:

# Keep only reviews inside a rough latitude/longitude box around the contiguous US
hotel_reviews = hotel_reviews[
    (hotel_reviews['latitude'] >= 24.0) & (hotel_reviews['latitude'] <= 50.0)
    & (hotel_reviews['longitude'] >= -122.0) & (hotel_reviews['longitude'] <= -65.0)
]

# Plot the lats and longs again
hotel_reviews.plot.scatter(x='longitude', y='latitude')
# Only looking at reviews that are coming from the US

The output is as follows:

It looks like a map of the U.S.! Let's check the shape of our filtered dataset now:

hotel_reviews.shape

We have 30,692 rows and 19 columns. When we write reviews for hotels, we usually write about several different things in the same review. For this reason, we will attempt to assign topics to single sentences rather than to entire reviews.

To do so, let's grab the text column from our data, like so:

texts = hotel_reviews['reviews.text']
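
One way to get from whole reviews to single sentences is sketched below. It is an assumption rather than the method used in the rest of this chapter: it splits naively on whitespace that follows a period, exclamation mark, or question mark, so abbreviations such as "St. Louis" will be split incorrectly. The helper `split_sentences` and the list `sample_texts` are hypothetical names standing in for the values of `texts`:

```python
import re

# Hypothetical reviews standing in for the values of hotel_reviews['reviews.text'].
sample_texts = [
    "Really lovely hotel. Stayed on the very top floor!",
    "Good location. The breakfast was average.",
]

def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    # Missing reviews show up as NaN (a float), so skip anything that is not a string.
    if not isinstance(text, str):
        return []
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [part for part in parts if part]

# Flatten every review into one list of sentences.
sentences = [s for text in sample_texts for s in split_sentences(text)]
print(len(sentences))
```

A dedicated sentence tokenizer from an NLP library would usually handle abbreviations and other edge cases better than this regular expression.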