For the numerical columns, we use summary statistics to check both how the values are distributed and whether any values are missing:
df[['watchers_count','size','forks_count','open_issues']].describe()
| | watchers_count | size | forks_count | open_issues |
|---|---|---|---|---|
| count | 787.000000 | 7.870000e+02 | 787.000000 | 787.000000 |
| mean | 696.656925 | 2.079364e+04 | 129.127065 | 13.270648 |
| std | 1652.487989 | 2.056537e+05 | 208.778257 | 30.909030 |
| min | 0.000000 | 0.000000e+00 | 33.000000 | 0.000000 |
| 25% | 0.000000 | 3.250000e+01 | 58.000000 | 0.000000 |
| 50% | 124.000000 | 4.540000e+02 | 76.000000 | 3.000000 |
| 75% | 656.500000 | 6.859000e+03 | 99.000000 | 12.000000 |
| max | 20792.000000 | 5.657792e+06 | 2589.000000 | 458.000000 |
We see that none of the four variables (watchers_count, size, forks_count, and open_issues) has missing values: each has a count of 787. The watchers_count ranges from 0 to 20,792, while forks_count ranges from 33 to 2,589. The mean watchers_count (about 697) is well above the median (124), which indicates a right-skewed distribution. At least a quarter of the repositories have no open issues, while the top 25% have 12 or more. It is also worth noting that one repository in our dataset has 458 open issues.
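Reading missing values off the count row works because `describe()` counts only non-null entries. A more explicit check is sketched below. The `toy` DataFrame and its values are hypothetical stand-ins for the article's `df`, with one gap inserted on purpose so the check has something to find.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the article's df; the values are made up
# for illustration, and one forks_count entry is deliberately missing.
toy = pd.DataFrame({
    "watchers_count": [0, 124, 656, 20792],
    "size": [0, 454, 6859, 5657792],
    "forks_count": [33, 58, 76, np.nan],
    "open_issues": [0, 3, 12, 458],
})

numeric_cols = ["watchers_count", "size", "forks_count", "open_issues"]

# Explicit number of missing values in each column.
missing_per_column = toy[numeric_cols].isna().sum()

# describe() counts only non-null entries, so a count below the
# number of rows signals missing values in that column.
summary = toy[numeric_cols].describe()
columns_with_gaps = summary.loc["count"] < len(toy)

print(missing_per_column)
print(columns_with_gaps)
```

On the real dataset, the same two checks would be applied to `df` instead of `toy`. If every entry of `missing_per_column` is zero, that confirms the conclusion drawn from the count row.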