Analysis and visualization using pandas and matplotlib

In this section, we will explore a few basic concepts for analyzing data with pandas and plotting common chart types with matplotlib.

pandas is one of the most popular Python libraries for data analysis. Analysis and visualization are major tasks in any data project, and both can be performed with pandas together with libraries such as matplotlib.

For more details and documentation on pandas and matplotlib, please visit their official sites at https://pandas.pydata.org/ and https://matplotlib.org/.

pandas is sometimes described as, and used like, a raw spreadsheet. It supports mathematical, statistical, and query-type operations, and allows you to read from and write to various file formats. It is popular among developers and analysts because it provides easy-to-use functions and properties for handling data that has a row and column structure.

Exploring pandas using the Python IDE

In this section, we will be reading data from the bookdetails.csv file and conducting analysis and visualization using the file's data. Let's import the libraries that are required, that is, pandas and matplotlib.pyplot. We will be using the pd and plt aliases, respectively, and reading the data from the file:

import pandas as pd
import matplotlib.pyplot as plt

dataSet = pd.read_csv('bookdetails.csv') #loads the file content as dataframe.

print(type(dataSet)) #<class 'pandas.core.frame.DataFrame'>

As we can see, the read_csv() function reads the content of a CSV file and returns a DataFrame object. pandas can also load data from other sources through functions such as read_html(), read_excel(), read_json(), and read_sql_table().
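To try read_csv() without the bookdetails.csv file at hand, the following minimal sketch reads CSV text from an in-memory buffer. The rows are made-up placeholder values that only mimic the column layout used in this section; they are not the contents of the real file:

```python
import io
import pandas as pd

# Hypothetical sample rows; NOT the real contents of bookdetails.csv.
csv_text = """Title,Price,Stock,Rating
Book A,10.50,In stock,4
Book B,20.00,Out of stock,2
Book C,15.25,In stock,5
"""

# read_csv() accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))  # the DataFrame class
print(sample)        # the parsed rows with a default 0-based index
```

Reading from a buffer like this is mostly useful for experiments and tests; for real data, passing the file name as shown earlier is the usual approach.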

Here, dataSet is a pandas DataFrame object. A DataFrame represents a two-dimensional tabular structure with rows, columns, and an index. DataFrames support query-style analysis, conditional statements, filtering, grouping, and more against the data in their rows and columns:

print(dataSet)

The following screenshot displays the content that's now available in dataSet:

Dataset contents from a CSV file
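Printing a whole DataFrame becomes unwieldy for larger files. As a rough sketch, the following inspects a small frame piece by piece instead. The frame is built from hypothetical placeholder values (an assumption for illustration, not the data in bookdetails.csv):

```python
import pandas as pd

# Hypothetical placeholder data with the same columns as dataSet.
sample = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C", "Book D"],
    "Price": [10.50, 20.00, 15.25, 30.00],
    "Stock": ["In stock", "Out of stock", "In stock", "In stock"],
    "Rating": [4, 2, 5, 3],
})

print(sample.head(2))      # first two rows only
print(sample.shape)        # (number of rows, number of columns)
print(sample.dtypes)       # data type inferred for each column
print(list(sample.index))  # default row labels, starting at 0
```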

The row index is shown as well; by default, it starts at 0 (zero). General summary statistics can be obtained by using the describe() function:

print(dataSet.describe())
#print(dataSet['Price'].describe()) will only generate values for the Price column
  
      Price      Rating
count 13.000000  13.000000
mean  35.633077  3.230769
std   14.239014  1.535895
min   12.360000  1.000000
25%   28.800000  2.000000
50%   35.020000  3.000000
75%   45.000000  5.000000
max   57.250000  5.000000

As we can see, by default, describe() selects only the numeric columns and returns the following statistics for each of them:

count: Number of non-null values
mean: Average value of the column
std: Standard deviation
min: Minimum value found
25%: The 25th percentile
50%: The 50th percentile (the median)
75%: The 75th percentile
max: Maximum value found
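Each entry in the describe() output can also be computed on its own. The sketch below does so for a short Series of hypothetical prices (placeholder values chosen so the arithmetic is easy to follow) and can be compared against the summary:

```python
import pandas as pd

# Hypothetical prices, not taken from bookdetails.csv.
prices = pd.Series([10.0, 20.0, 30.0, 40.0], name="Price")

summary = prices.describe()
print(summary)

# The same figures computed one at a time.
print(prices.count())         # number of non-null values
print(prices.mean())          # arithmetic mean
print(prices.std())           # sample standard deviation (ddof=1)
print(prices.quantile(0.25))  # 25th percentile, linearly interpolated
print(prices.median())        # 50th percentile
```

For these four values, the mean and median are both 25.0, and the 25th percentile is 17.5 because pandas interpolates linearly between 10.0 and 20.0 by default.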

In the following code, we list all of the columns in the dataset using dataSet.columns and select the individual Price and Rating columns. Multiple columns can be selected together by passing a list of names, as in dataSet[['Price','Rating']]:

print(dataSet.columns)
Index(['Title', 'Price', 'Stock', 'Rating'], dtype='object')

print(sum(dataSet['Price']))
463.23

print(sum(dataSet['Rating']))
42

print(dataSet['Price'][0:5])
0 35.02
1 57.25
2 37.32
3 18.03
4 31.19
Name: Price, dtype: float64

The following code selects the Price column as price_group, a DataFrame with a single column, and prints its data:

#dataSet[['Price','Rating']] will select both columns
price_group = dataSet[['Price']] #selecting the 'Price' column only
print(price_group)

  Price
0 35.02
1 57.25
2 37.32
.....
11 55.66
12 28.80
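The difference between single and double brackets is easy to miss: single brackets return a one-dimensional Series, whereas a list of names returns a DataFrame. The following minimal sketch illustrates this with hypothetical placeholder data (not the real file contents):

```python
import pandas as pd

# Hypothetical placeholder data with some of the columns of dataSet.
sample = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C"],
    "Price": [10.50, 20.00, 15.25],
    "Rating": [4, 2, 5],
})

single = sample["Price"]            # Series (one-dimensional)
frame = sample[["Price"]]           # DataFrame with one column
both = sample[["Price", "Rating"]]  # DataFrame with two columns

print(type(single).__name__, type(frame).__name__)
print(both.shape)

# Series also provide their own sum(), which is the more idiomatic
# alternative to calling Python's built-in sum() on a column.
print(sample["Price"].sum())
```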

pandas DataFrames also support conditions, or filters, on columns. In the following code, a filter is applied to Rating for values that are >=4.0, and only the Title and Price columns are returned:

print(dataSet[dataSet['Rating']>=4.0][['Title','Price']])

  Title                  Price
0 Rip it Up and ...      35.02
1 Our Band Could Be ...  57.25
4 Please Kill Me: The ...31.19
5 Kill 'Em and Leave: ...45.00
9 No One Here Gets ...   20.02
10 Life                  31.58
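A filter such as the one above is a Boolean mask: a Series of True and False values, one per row. As a sketch under the assumption of hypothetical placeholder data, the following combines two conditions with the & operator and selects rows and columns in one step with loc:

```python
import pandas as pd

# Hypothetical placeholder data, not taken from bookdetails.csv.
sample = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C", "Book D"],
    "Price": [10.50, 20.00, 15.25, 35.00],
    "Rating": [4, 2, 5, 4],
})

# Each comparison yields a Boolean Series; parentheses are required
# because & binds more tightly than the comparison operators.
mask = (sample["Rating"] >= 4) & (sample["Price"] < 30)

# loc takes the row mask and the list of columns together.
result = sample.loc[mask, ["Title", "Price"]]
print(result)
```

Here, Book D is excluded because its price is not below 30, even though its rating qualifies.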

Similarly, string-based filters can also be applied. In the following code, the rows whose Stock value contains the text Out are selected, and all of the columns are returned for those rows. The contains() function accepts both plain strings and regular expressions:

print(dataSet[dataSet.Stock.str.contains(r'Out')])

   Title                     Price Stock        Rating
3  Love Is a Mix ...         18.03 Out of stock 1
6  Chronicles, Vol. 1        52.60 Out of stock 2
11 Old Records Never Die: ...55.66 Out of Stock 2

#will return only column 'Price'
#print(dataSet[dataSet.Stock.str.contains(r'Out')]['Price'])
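Text columns often contain inconsistent capitalization or missing values. The following sketch, which uses hypothetical placeholder data, shows how the case and na parameters of contains() can make a string filter more tolerant of both:

```python
import pandas as pd

# Hypothetical placeholder data; the last Stock value is missing on purpose.
sample = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C", "Book D"],
    "Stock": ["In stock", "Out of stock", "Out of Stock", None],
})

# case=False ignores capitalization; na=False treats missing values
# as non-matching instead of propagating them into the mask.
mask = sample["Stock"].str.contains("out of stock", case=False, na=False)

print(sample[mask])
```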

The between() function is supplied with a lower and an upper bound for Rating, both of which are inclusive by default, and the Title of each matching book is returned:

print(dataSet[dataSet.Rating.between(3.5,4.5)]['Title'])

1 Our Band Could Be ...
4 Please Kill Me: The ...
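With its default settings, between() behaves like a pair of >= and <= comparisons. The sketch below checks this equivalence on a few hypothetical ratings (placeholder values chosen to sit on and around the bounds):

```python
import pandas as pd

# Hypothetical placeholder data, not taken from bookdetails.csv.
sample = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C", "Book D", "Book E"],
    "Rating": [1.0, 3.5, 4.0, 4.5, 5.0],
})

in_range = sample["Rating"].between(3.5, 4.5)
explicit = (sample["Rating"] >= 3.5) & (sample["Rating"] <= 4.5)

print(in_range.equals(explicit))  # both masks select the same rows
print(sample[in_range]["Title"])
```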

Since we have the price_group data, we can call its plot() function to create a chart and then display it with the help of the show() function:

bar_plot = price_group.plot() #default plot (line chart)
bar_plot.set_xlabel("No of Books") #set X axis label
bar_plot.set_ylabel("Price") #set Y axis label
plt.show() #displays the plot or chart created

The preceding code will generate a line chart with default properties, such as colors and legend placements, as follows:

Default line chart for the Price column

We can also change the kind of chart, for example, to line, bar, and more.

Visit matplotlib at https://matplotlib.org/gallery/index.html to find out more about various functional chart types and their additional associated properties.

In the following code, kind='bar' overrides the default line type:

bar_plot = price_group.plot(kind='bar') #kind='bar'
bar_plot.set_xlabel("No of Books")  #Label for X-Axis
bar_plot.set_ylabel("Price") #label for Y-Axis
plt.show()

The preceding code generates the following bar chart:

Bar chart for the Price column
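In the charts so far, the x axis shows the row index. As a rough sketch, the following uses a column for the x axis instead and writes the chart to an in-memory PNG rather than opening a window. The data is hypothetical, and the Agg backend is selected on the assumption that no display is available:

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumes no display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical placeholder data, not taken from bookdetails.csv.
sample = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C"],
    "Price": [10.50, 20.00, 15.25],
})

# x= picks the column used for the tick labels, y= the bar heights.
ax = sample.plot(kind="bar", x="Title", y="Price", legend=False)
ax.set_xlabel("Book")
ax.set_ylabel("Price")
ax.figure.tight_layout()

# savefig() accepts a file name or a file-like object.
buffer = io.BytesIO()
ax.figure.savefig(buffer, format="png")
print(buffer.getbuffer().nbytes, "bytes written")
```

Passing a file name such as a .png path to savefig() instead of the buffer would store the chart on disk.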

So far, we have used a basic chart type with a single column. In the following code, we are plotting a bar chart with the Price and Rating values:

price_group = dataSet[['Price','Rating']]  #obtain both columns
#title: generates a title for plot
bar_plot = price_group.plot(kind='bar',title="Book Price and Rating")
bar_plot.set_xlabel("No of Books")
bar_plot.set_ylabel("Price")
plt.show()

We receive the following output:

Bar chart with Price and Rating columns
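When two columns differ a lot in scale, as prices and ratings typically do, the smaller values can be hard to read on a shared axis. One possible alternative, sketched here with hypothetical data and the Agg backend (assuming no display is available), is to give each column its own subplot:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumes no display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical placeholder data, not taken from bookdetails.csv.
sample = pd.DataFrame({
    "Price": [10.50, 20.00, 15.25],
    "Rating": [4, 2, 5],
})

# subplots=True draws one chart per column and returns an array of Axes.
axes = sample[["Price", "Rating"]].plot(kind="bar", subplots=True, legend=False)
axes[0].set_ylabel("Price")
axes[1].set_ylabel("Rating")
axes[1].set_xlabel("No of Books")
axes[0].figure.tight_layout()

print(len(axes))  # one Axes object per plotted column
```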

So far, we have successfully plotted line and bar charts. The following code plots a pie chart for the first six items from the Price column and labels them with the first six titles available from dataSet:

prices = dataSet['Price'][0:6] #Price from first 6 items
labels = dataSet['Title'][0:6] #Book Titles from first 6 items
#without autopct, pie() returns the wedge patches and the label texts
wedges,texts = plt.pie(prices, labels=labels, shadow=True, startangle=45)
plt.legend(wedges, prices, loc="best") #legend built using Prices
plt.show()

The values from Price are used as the legend entries. We receive the following output:

Pie chart with Price and Title column data
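A pie chart is often easier to read when each slice also shows its share of the total. The following sketch adds percentage labels through the autopct parameter of pie(). It uses hypothetical prices and titles, and selects the Agg backend on the assumption that no display is available:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumes no display
import matplotlib.pyplot as plt

# Hypothetical placeholder values, not taken from bookdetails.csv.
prices = [10.0, 20.0, 30.0, 40.0]
labels = ["Book A", "Book B", "Book C", "Book D"]

fig, ax = plt.subplots()

# autopct formats each slice's percentage of the total.
ax.pie(prices, labels=labels, autopct="%1.1f%%", startangle=45)
ax.axis("equal")  # keep the pie circular
ax.set_title("Share of total price")

print(len(ax.patches))  # one wedge per price
```

In an interactive session, calling plt.show() afterward would display the chart, as in the earlier examples.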

There's a lot more to explore in terms of using pandas and matplotlib. In this section, we have displayed the basic features that are available from both libraries. Now, we will look at ML.
