Handling missing data

Missing values are represented by NaN (Not a Number), which indicates that no value is specified for that particular index. A value can be NaN for several reasons:

  • Data retrieved from an external source contains incomplete values.
  • Two datasets are joined and some values have no match.
  • Errors during data collection leave values unrecorded.
  • The shape of the data changes, adding rows or columns whose values are not yet determined.
  • The data is reindexed, which can leave entries without values.
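
Two of the causes listed above, joining and reindexing, are easy to reproduce. The following is a minimal sketch; the sales and prices frames and their figures are hypothetical and used only for illustration:

```python
import pandas as pd

# Hypothetical data, used only for illustration.
sales = pd.DataFrame({'fruit': ['apple', 'banana', 'kiwi'],
                      'store1': [15, 18, 21]})
prices = pd.DataFrame({'fruit': ['apple', 'kiwi', 'mango'],
                       'price': [1.2, 0.8, 2.5]})

# An outer join keeps unmatched rows from both sides
# and fills the cells that have no match with NaN.
joined = sales.merge(prices, on='fruit', how='outer')
print(joined)

# Reindexing with a label that is absent from the original index
# ('grapes' here) adds a row whose values are all NaN.
reindexed = sales.set_index('fruit').reindex(['apple', 'banana', 'grapes'])
print(reindexed)
```

In the joined frame, banana has no price and mango has no store1 value. In the reindexed frame, the grapes row is empty.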

Let's see how we can work with the missing data:

  1. Let's assume we have a dataframe as shown here:
import numpy as np
import pandas as pd

# Sales figures: 5 fruits (rows) across 3 stores (columns)
data = np.arange(15, 30).reshape(5, 3)
dfx = pd.DataFrame(data, index=['apple', 'banana', 'kiwi', 'grapes', 'mango'], columns=['store1', 'store2', 'store3'])
dfx

And the output of the preceding code is as follows:

        store1  store2  store3
apple       15      16      17
banana      18      19      20
kiwi        21      22      23
grapes      24      25      26
mango       27      28      29

Assume we have a chain of fruit stores all over town. Currently, the dataframe is showing sales of different fruits from different stores. None of the stores are reporting missing values.

  2. Let's add some missing values to our dataframe:
dfx['store4'] = np.nan
dfx.loc['watermelon'] = np.arange(15, 19)
dfx.loc['oranges'] = np.nan
dfx['store5'] = np.nan
# Use a single .loc call; chained indexing may modify a temporary copy
dfx.loc['apple', 'store4'] = 20.
dfx

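And the output will now look like the following:

            store1  store2  store3  store4  store5
apple         15.0    16.0    17.0    20.0     NaN
banana        18.0    19.0    20.0     NaN     NaN
kiwi          21.0    22.0    23.0     NaN     NaN
grapes        24.0    25.0    26.0     NaN     NaN
mango         27.0    28.0    29.0     NaN     NaN
watermelon    15.0    16.0    17.0    18.0     NaN
oranges        NaN     NaN     NaN     NaN     NaN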

Note that we've added two more stores, store4 and store5, and two more types of fruit, watermelon and oranges. Assume that we know how many kilos of apples and watermelons were sold at store4, but we have not collected any data from store5. Moreover, none of the stores reported sales of oranges. We are quite a big fruit dealer, aren't we?

Note the following characteristics of missing values in the preceding dataframe:

  • An entire row can contain NaN values. 
  • An entire column can contain NaN values. 
  • Some (but not necessarily all) values in both a row and a column can be NaN.
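
These characteristics can also be checked programmatically. The following is a minimal sketch, assuming the dataframe is rebuilt as in the preceding steps; the names nan_mask, all_nan_rows, all_nan_cols, and nan_per_column are introduced here only for illustration:

```python
import numpy as np
import pandas as pd

# Rebuild the dataframe from the preceding steps.
data = np.arange(15, 30).reshape(5, 3)
dfx = pd.DataFrame(data,
                   index=['apple', 'banana', 'kiwi', 'grapes', 'mango'],
                   columns=['store1', 'store2', 'store3'])
dfx['store4'] = np.nan
dfx.loc['watermelon'] = np.arange(15, 19)
dfx.loc['oranges'] = np.nan
dfx['store5'] = np.nan
dfx.loc['apple', 'store4'] = 20.

# Boolean mask: True wherever a value is NaN.
nan_mask = dfx.isna()

# Rows and columns that consist entirely of NaN values.
all_nan_rows = dfx.index[nan_mask.all(axis=1)].tolist()
all_nan_cols = dfx.columns[nan_mask.all(axis=0)].tolist()

# Number of NaN values in each column; a count between 0 and the
# number of rows means the column is only partially missing.
nan_per_column = nan_mask.sum()

print(all_nan_rows)
print(all_nan_cols)
print(nan_per_column)
```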

Based on these characteristics, let's examine NaN values in the next section. 
