Numerical data

For numerical data, we use summary statistics to check both how the values are distributed and whether any values are missing:

df[['watchers_count','size','forks_count','open_issues']].describe()

statistic	watchers_count	size	forks_count	open_issues
count	787.000000	7.870000e+02	787.000000	787.000000
mean	696.656925	2.079364e+04	129.127065	13.270648
std	1652.487989	2.056537e+05	208.778257	30.909030
min	0.000000	0.000000e+00	33.000000	0.000000
25%	0.000000	3.250000e+01	58.000000	0.000000
50%	124.000000	4.540000e+02	76.000000	3.000000
75%	656.500000	6.859000e+03	99.000000	12.000000
max	20792.000000	5.657792e+06	2589.000000	458.000000

The count row reports 787 non-missing values for each of the four variables, so none of watchers_count, size, forks_count, and open_issues has missing values. The watchers_count ranges from 0 to 20,792, while forks_count starts at 33 and goes up to 2,589. At least a quarter of the repositories have no open issues (the 25th percentile is 0), while the top 25% have 12 or more. It is also worth noting that one repository in our dataset has 458 open issues, far above the 75th percentile.
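Reading missing values off the count row works, but they can also be counted directly. The following is a minimal sketch rather than the book's own code: the small DataFrame is a hypothetical stand-in for df with made-up values (one of them deliberately missing), and the names numeric_cols, missing_per_column, and counts are introduced here only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the chapter's df; values are made up for illustration.
df = pd.DataFrame({
    "watchers_count": [0, 124, 656, 20792],
    "size": [0, 454, 6859, 5657792],
    "forks_count": [33, 76, 99, 2589],
    "open_issues": [0, 3, 12, None],  # one value deliberately missing
})

numeric_cols = ["watchers_count", "size", "forks_count", "open_issues"]

# Count missing values per column directly.
missing_per_column = df[numeric_cols].isna().sum()
print(missing_per_column)

# describe() counts only non-missing values, so a count below
# the number of rows signals that a column has gaps.
counts = df[numeric_cols].describe().loc["count"]
print(counts == len(df))
```

In this toy data only open_issues would be flagged. On the real dataset, a count of 787 in every column means no missing values only if the DataFrame itself has 787 rows, which is why comparing against len(df) is the safer check.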
