Showing the density of bivariate data with hexbin plots

A scatter plot is a common way to show the distribution of data in a fairly raw form. But once the data become dense enough, points overlap and we lose information about the actual distribution, so a scatter plot may no longer be the best choice.

A hexbin plot makes data density easier to interpret: it divides the plane into hexagonal bins and uses color intensity to show the density of data in each bin.
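The binning idea can be sketched directly with Matplotlib's `Axes.hexbin`. This is a minimal illustration, not the book's example: the variable names, the random data, the `Blues` colormap, and the choice of the Agg backend are all assumptions made here. When no `C` argument is given, the value attached to each hexagon is simply the number of points that fall inside it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption); drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: 2500 points from a 2D standard normal distribution
rng = np.random.default_rng(0)
n_points = 2500
x = rng.standard_normal(n_points)
y = rng.standard_normal(n_points)

fig, ax = plt.subplots()
# With no C argument, each hexagon's value is the number of points inside it
hb = ax.hexbin(x, y, gridsize=25, cmap="Blues")
fig.colorbar(hb, ax=ax, label="points per hexagon")

# The per-hexagon values can be read back from the returned collection
counts = np.asarray(hb.get_array())
print("hexagons drawn:", counts.size)
print("largest count in one hexagon:", int(counts.max()))
```

Hexagons near the middle of the cloud hold many points and are drawn darker, while hexagons at the edges hold few or none.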

Here is an example that compares the two visualizations on the same dataset, which is concentrated in the center:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Prepare 2500 random data points clustered around x = 0
np.random.seed(123)

df = pd.DataFrame(np.random.randn(2500, 2), columns=['x', 'y'])
# Spread the points along y by adding a linear offset
df['y'] = df['y'] + np.arange(2500)
# Third variable, aggregated per hexagon by reduce_C_function
df['z'] = np.random.uniform(0, 3, 2500)

# Plot the scatter plot
ax1 = df.plot.scatter(x='x', y='y')
# Plot the hexbin plot; each hexagon is colored by the maximum z it contains
ax2 = df.plot.hexbin(x='x', y='y', C='z', reduce_C_function=np.max, gridsize=25)

plt.show()

In the scatter plot (ax1), we can see that many data points overlap.

In the hexbin plot (ax2), the individual raw data points are no longer shown, but we can clearly see how the data distribution varies around the center.
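The example passes C='z' with reduce_C_function=np.max, so each hexagon is colored by the largest z value it contains rather than by a point count. One way to see why this still tracks density: the maximum of n independent draws from a uniform distribution on [0, 3] has expected value 3n/(n+1), which rises toward 3 as n grows. The simulation below is a small sketch of that relationship; the sample sizes and trial count are arbitrary choices made here, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(123)
upper = 3.0        # z is drawn from uniform(0, 3) in the example
n_trials = 20000   # arbitrary number of simulated hexagons per setting

simulated_means = {}
for n in (1, 2, 5, 20):
    # Each row plays the role of one hexagon holding n points
    samples = rng.uniform(0, upper, size=(n_trials, n))
    simulated_means[n] = samples.max(axis=1).mean()
    expected = upper * n / (n + 1)
    print(f"n={n:2d}  simulated={simulated_means[n]:.2f}  expected={expected:.2f}")
```

Hexagons that hold more points therefore tend toward the top of the color scale. To color by the raw number of points instead, the `C` and `reduce_C_function` arguments can be omitted.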
