Inspecting and preparing the data

Let's begin by inspecting the values in each of our columns, looking for odd and outlying entries. We will start with the bedroom and bathroom columns.

In the following code, we look at the unique values for bedrooms:

df['beds'].unique()

The preceding code results in the following output:

Now, let's look at bathrooms. We do that in the following code:

df['baths'].unique()

The preceding code results in the following output:

Based on the output from the two preceding queries, we see that we need to correct some items that have a leading underscore. Let's do that now:

df['beds'] = df['beds'].map(lambda x: x[1:] if x.startswith('_') else x) 
df['baths'] = df['baths'].map(lambda x: x[1:] if x.startswith('_') else x)

In the preceding code, we used the pandas map method with a lambda function that checks whether each element begins with an underscore and, if so, removes it. A quick check of the unique values for beds and baths should reveal that our erroneous leading underscores have been removed:

df['beds'].unique()

The preceding code results in the following output:

Let's execute the following line of code and look at the results:

df['baths'].unique()

The preceding code results in the following output:
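
To make the underscore cleanup easier to follow in isolation, here is a minimal sketch of the same check written as a named function and run on a plain list. The function name strip_leading_underscore and the sample values are illustrative assumptions, not taken from the dataset. Unlike the lambda above, this version also passes non-string values through untouched, since calling startswith on a missing value would raise an error.

```python
def strip_leading_underscore(value):
    """Drop a single leading underscore; leave everything else alone."""
    # Guard (an addition to the original lambda): non-string entries such as
    # None or NaN have no startswith method, so return them unchanged.
    if not isinstance(value, str):
        return value
    return value[1:] if value.startswith('_') else value


# Hypothetical sample values; the real column contents may differ.
sample_beds = ['_1_Bed', '2_Bed', 'Studio', None]
cleaned_beds = [strip_leading_underscore(v) for v in sample_beds]
print(cleaned_beds)
```

A named function like this could be passed to map in place of the lambda, for example df['beds'].map(strip_leading_underscore), which keeps the two column cleanups identical.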

Next, we want to look at some descriptive statistics to better understand our data. One way to do that is with the describe method. Let's try that in the following code:

df.describe()

The preceding code results in the following output:

While we were hoping to get metrics such as the average number of beds and baths and the maximum rent, what we received was far less. The problem is that the columns do not have the correct data type for these operations: pandas can't compute those statistics on string objects. We need to clean up our data further and convert it to the correct data types. We will do that in the following code:

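# Strip the dollar sign and thousands separators, then convert rent to an integer
df['rent'] = df['rent'].map(lambda x: str(x).replace('$','').replace(',','')).astype('int')
# Remove the _Bed suffix and count Studio and Loft as zero bedrooms
df['beds'] = df['beds'].map(lambda x: x.replace('_Bed', ''))
df['beds'] = df['beds'].map(lambda x: x.replace('Studio', '0'))
df['beds'] = df['beds'].map(lambda x: x.replace('Loft', '0')).astype('int')
# Remove the _Bath suffix; float allows fractional bath counts
df['baths'] = df['baths'].map(lambda x: x.replace('_Bath', '')).astype('float')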

In the preceding code, we removed everything non-numeric from each value. We stripped the dollar sign and commas from rent, removed _Bed and _Bath to leave just the number, and replaced Studio and Loft with the actual number of bedrooms, which is zero.
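
The sketch below restates those conversions as small standalone functions and shows why the type change matters. The helper names (parse_rent, parse_beds, parse_baths) and the sample rents are assumptions made for illustration; they are not part of the original code or data. On strings, max compares character by character, so the result is misleading until the values are converted to numbers.

```python
def parse_rent(value) -> int:
    """Turn a rent string such as '$3,400' into the integer 3400."""
    return int(str(value).replace('$', '').replace(',', ''))


def parse_beds(value: str) -> int:
    """Strip the _Bed suffix; Studio and Loft count as zero bedrooms."""
    cleaned = value.replace('_Bed', '')
    cleaned = cleaned.replace('Studio', '0').replace('Loft', '0')
    return int(cleaned)


def parse_baths(value: str) -> float:
    """Strip the _Bath suffix and return the count as a float."""
    return float(value.replace('_Bath', ''))


# Hypothetical rents; the real listings may look different.
raw_rents = ['$3,400', '$975', '$12,000']

# String comparison is lexicographic: '9' sorts after '3' and '1'.
string_max = max(raw_rents)

# After conversion, max reflects the actual largest rent.
numeric_max = max(parse_rent(r) for r in raw_rents)

print(string_max, numeric_max)
```

With these sample values, the string maximum is the smallest rent, which is the kind of silent error that converting to int and float avoids.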
