Analysis and visualization using pandas and matplotlib

In this section, we will explore a few basic concepts for analyzing data with pandas and plotting common chart types with matplotlib.

pandas is one of the most popular Python libraries for data analysis. Data analysis and visualization are major tasks, and both can be performed with pandas together with libraries such as matplotlib.

For more details and documentation on pandas and matplotlib, please visit their official sites at https://pandas.pydata.org/ and https://matplotlib.org/.

pandas is often described as, and used like, a spreadsheet in code. It supports mathematical, statistical, and query-type operations, and allows you to read from and write to various file formats. It is popular among developers and analysts because its functions and properties make it easy to handle data that's in a row and column structure.

Exploring pandas using the Python IDE

In this section, we will read data from the bookdetails.csv file and use it for analysis and visualization. Let's import the required libraries, that is, pandas and matplotlib.pyplot, using the pd and plt aliases, respectively, and then read the data from the file:

import pandas as pd
import matplotlib.pyplot as plt

dataSet = pd.read_csv('bookdetails.csv') #loads the file content as dataframe.

print(type(dataSet)) #<class 'pandas.core.frame.DataFrame'>

As we can see, the read_csv() function reads the content from a CSV file and generates a DataFrame object. pandas also supports various data files via the use of functions such as read_html(), read_excel(), read_json(), and read_sql_table().
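Since bookdetails.csv is not reproduced here, the following is a minimal sketch that feeds read_csv() from an in-memory string instead. The three rows and the sample name are hypothetical stand-ins; only the column names are taken from the file used in this section:

```python
import io
import pandas as pd

# Hypothetical sample rows standing in for bookdetails.csv
csv_text = """Title,Price,Stock,Rating
Book A,35.02,In stock,5
Book B,18.03,Out of stock,1
Book C,52.60,Out of stock,2
"""

# read_csv() accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))   # the DataFrame class
print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # Price is parsed as float64, Rating as int64
```

Working from a string like this can be handy for trying out filters before running them against the real file.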

Here, dataSet is an object of the pandas DataFrame. The DataFrame represents a two-dimensional tabular structure with rows, columns, and indexes. Query-level analysis, conditional statements, filtering, grouping, and more are supported by DataFrames against data in rows and columns:

print(dataSet)

The following screenshot displays the content that's now available in dataSet:

Dataset contents from a CSV file

The row index is also shown; by default, it starts at 0 (zero). General summary statistics can be obtained by using the describe() function:

print(dataSet.describe())
#print(dataSet['Price'].describe()) will only generate values for the Price column

           Price     Rating
count  13.000000  13.000000
mean   35.633077   3.230769
std    14.239014   1.535895
min    12.360000   1.000000
25%    28.800000   2.000000
50%    35.020000   3.000000
75%    45.000000   5.000000
max    57.250000   5.000000

As we can see, by default, describe() selects the numeric columns and returns the following statistics for each of them:

  • count: Number of non-null values in the column

  • mean: Average value for the related column

  • min: Minimum value found

  • max: Maximum value found

  • std: Calculated standard deviation

  • 25%: Returns the 25th percentile

  • 50%: Returns the 50th percentile

  • 75%: Returns the 75th percentile
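Each of the statistics listed above can also be computed individually. The following is a small sketch using a hypothetical Series of five prices (not the values from bookdetails.csv) to show how the entries in the describe() output map to dedicated methods:

```python
import pandas as pd

# Hypothetical prices, chosen so the statistics are easy to check by hand
prices = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0], name="Price")

summary = prices.describe()    # same statistics, for one column only
print(summary)

# Each entry can be reproduced with a dedicated method
print(prices.count())          # number of non-null values
print(prices.mean())           # arithmetic mean
print(prices.std())            # sample standard deviation
print(prices.quantile(0.25))   # 25th percentile
print(prices.median())         # 50th percentile
```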

All of the columns from the dataset can be listed using dataSet.columns, and an individual column such as Price can be selected with dataSet['Price']. Multiple columns can be selected by using the dataSet[['Price','Rating']] format. The following code lists the columns, sums the Price and Rating columns, and prints the first five Price values:

print(dataSet.columns)
Index(['Title', 'Price', 'Stock', 'Rating'], dtype='object')

print(sum(dataSet['Price']))
463.23

print(sum(dataSet['Rating']))
42

print(dataSet['Price'][0:5])
0 35.02
1 57.25
2 37.32
3 18.03
4 31.19
Name: Price, dtype: float64

The following code selects the Price column on its own as price_group and prints it:

#dataSet[['Price','Rating']] will select both columns
price_group = dataSet[['Price']] #selecting the 'Price' column only.
print(price_group)

Price
0 35.02
1 57.25
2 37.32
.....
11 55.66
12 28.80
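The difference between single and double brackets is easy to miss. The sketch below uses a hypothetical three-row DataFrame named books to show that single brackets return a Series, whereas double brackets return a DataFrame, and that the sum() method can be used in place of Python's built-in sum():

```python
import pandas as pd

# Hypothetical rows; only the column names follow bookdetails.csv
books = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C"],
    "Price": [35.02, 18.03, 52.60],
    "Rating": [5, 1, 2],
})

as_series = books["Price"]       # single brackets: one-dimensional Series
as_frame = books[["Price"]]      # double brackets: DataFrame with one column

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (3, 1)

# The sum() method aggregates without a Python-level loop
print(books["Rating"].sum())     # 8
print(round(books["Price"].sum(), 2))
```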

pandas DataFrames also support filtering rows by applying conditions to columns. In the following code, the filter keeps the rows where Rating is >=4.0, and only the Title and Price columns are returned:

print(dataSet[dataSet['Rating']>=4.0][['Title','Price']])

    Title                    Price
0   Rip it Up and ...        35.02
1   Our Band Could Be ...    57.25
4   Please Kill Me: The ...  31.19
5   Kill 'Em and Leave: ...  45.00
9   No One Here Gets ...     20.02
10  Life                     31.58
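Conditions can also be combined. The following sketch, which uses a hypothetical four-row DataFrame and an arbitrary price threshold of 40, builds a boolean mask from two comparisons and passes it to .loc together with the columns to return:

```python
import pandas as pd

# Hypothetical rows; only the column names follow bookdetails.csv
books = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C", "Book D"],
    "Price": [35.02, 57.25, 18.03, 31.19],
    "Rating": [5, 4, 1, 4],
})

# Each comparison yields a boolean Series; wrap each one in parentheses
# and combine them with & (and) or | (or) rather than Python's and/or.
mask = (books["Rating"] >= 4) & (books["Price"] < 40)

# .loc selects rows by the mask and columns by name in one step
affordable_top_rated = books.loc[mask, ["Title", "Price"]]
print(affordable_top_rated)
```

Using .loc with a mask and a column list is one way to express the chained form shown earlier in a single indexing step.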

Similarly, string-based filters can be applied. Here, the rows whose Stock value contains the text Out are selected, and all the columns are returned for the matching rows. The contains() function accepts both plain strings and regular expressions:

print(dataSet[dataSet.Stock.str.contains(r'Out')])

    Title                       Price  Stock         Rating
3   Love Is a Mix ...           18.03  Out of stock  1
6   Chronicles, Vol. 1          52.60  Out of stock  2
11  Old Records Never Die: ...  55.66  Out of Stock  2

#will return only column 'Price'
#print(dataSet[dataSet.Stock.str.contains(r'Out')]['Price'])
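Text columns often contain inconsistent capitalization or missing values. The following sketch, built on hypothetical rows, shows three optional parameters of str.contains() that help in such cases: case, na, and regex:

```python
import pandas as pd

# Hypothetical rows with mixed capitalization and one missing value
books = pd.DataFrame({
    "Title": ["Book A", "Book B", "Book C", "Book D"],
    "Stock": ["In stock", "Out of stock", "Out of Stock", None],
})

# case=False ignores letter case; na=False treats missing values as non-matches
out_mask = books["Stock"].str.contains("out of stock", case=False, na=False)
print(books[out_mask])

# regex=False matches the pattern as a literal, case-sensitive string
literal_mask = books["Stock"].str.contains("Out of stock", regex=False, na=False)
print(books[literal_mask])
```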

The between() function filters Rating to the values that fall within the supplied range (both bounds are included by default), and the Title of each matching book is returned:

print(dataSet[dataSet.Rating.between(3.5,4.5)]['Title'])

1 Our Band Could Be ...
4 Please Kill Me: The ...
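To make the boundary behavior of between() explicit, here is a short sketch on a hypothetical Series of ratings. It compares between() with the equivalent pair of comparisons, and shows the inclusive parameter, which accepts string values in pandas 1.3 and later:

```python
import pandas as pd

# Hypothetical ratings
ratings = pd.Series([1, 3, 4, 5], name="Rating")

# Both bounds are included by default
inclusive_mask = ratings.between(3, 4)

# Equivalent explicit comparison
manual_mask = (ratings >= 3) & (ratings <= 4)

# Excluding both bounds (string values require pandas 1.3 or later)
exclusive_mask = ratings.between(3, 5, inclusive="neither")

print(ratings[inclusive_mask])
print(ratings[exclusive_mask])
```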

Since we have the price_group data, we can call its plot() function to draw a chart and then display it with plt.show():

line_plot = price_group.plot()  #default plot (a line chart)
line_plot.set_xlabel("No of Books") #set X axis: label
line_plot.set_ylabel("Price") #set Y axis: label
plt.show() #displays the plot or chart created

The preceding code will generate a line chart with default properties, such as colors and legend placements, as follows: 

Default line chart for the Price column

We can also change the kind of chart, for example, to a line chart, a bar chart, and more.

Visit matplotlib at https://matplotlib.org/gallery/index.html to find out more about various functional chart types and their additional associated properties.

In the following code, kind='bar' overrides the default line type:

bar_plot = price_group.plot(kind='bar') #kind='bar'
bar_plot.set_xlabel("No of Books") #Label for X-Axis
bar_plot.set_ylabel("Price") #label for Y-Axis
plt.show()

The preceding code generates the following bar chart:

Bar chart for the Price column
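When no display is available (for example, when running a script on a server), a chart can be written to an image file instead of being shown. The following is a minimal sketch under that assumption; the Agg backend choice, the three sample prices, and the price_bar.png file name are illustrative and not part of the original example:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumes no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample: three prices in a single-column DataFrame
prices = pd.DataFrame({"Price": [35.02, 57.25, 37.32]})

ax = prices.plot(kind="bar", legend=False)  # plot() returns a matplotlib Axes
ax.set_xlabel("Book index")
ax.set_ylabel("Price")

fig = ax.get_figure()
fig.savefig("price_bar.png")  # hypothetical output file name
plt.close(fig)                # release the figure once it has been saved
```

Because plot() returns a matplotlib Axes object, any Axes method (labels, titles, limits) can be applied before saving or showing the chart.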

So far, we have used a basic chart type with a single column. In the following code, we are plotting a bar chart with the Price and Rating values:

price_group = dataSet[['Price','Rating']]  #obtain both columns
#title: generates a title for plot
bar_plot = price_group.plot(kind='bar',title="Book Price and Rating")
bar_plot.set_xlabel("No of Books")
bar_plot.set_ylabel("Price")
plt.show()

We receive the following output:

Bar chart with Price and Rating columns

So far, we have successfully plotted line and bar charts. The following code plots a pie chart for the first six items from the Price column and labels them with the first six Title values from dataSet:

prices = dataSet['Price'][0:6] #Price from first 6 items
labels = dataSet['Title'][0:6] #Book Titles from first 6 items
#plt.pie() returns the wedge patches and the label text objects
wedges,texts = plt.pie(prices, labels=labels, shadow=True, startangle=45)
plt.legend(wedges, prices, loc="best") #legend built using Prices
plt.show()

The values from Price are used as the legend entries. We receive the following output:

Pie chart with Price and Title column data
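A pie chart can also print each slice's share of the total. The sketch below assumes hypothetical prices and titles and uses the autopct parameter of pie(), in which case pie() returns a third list containing the percentage text objects. As before, the Agg backend and the output file name are assumptions made so that the example runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumes no display is needed
import matplotlib.pyplot as plt

# Hypothetical values, chosen so the percentages are easy to verify
prices = [20.0, 30.0, 50.0]
labels = ["Book A", "Book B", "Book C"]

fig, ax = plt.subplots()

# With autopct set, pie() returns wedges, label texts, and percentage texts
wedges, texts, autotexts = ax.pie(
    prices, labels=labels, autopct="%1.1f%%", startangle=45
)
ax.legend(wedges, labels, loc="best")  # legend built from the titles
ax.set_title("Share of total price")

fig.savefig("price_pie.png")  # hypothetical output file name
plt.close(fig)
```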

There's a lot more to explore in terms of using pandas and matplotlib. In this section, we have covered only the basic features that are available from both libraries. Next, we will look at machine learning (ML).
