Numerical data

For the numerical data, we use summary statistics to check both how the values are distributed and whether any values are missing:

df[['watchers_count','size','forks_count','open_issues']].describe() 

       watchers_count          size  forks_count  open_issues
count      787.000000  7.870000e+02   787.000000   787.000000
mean       696.656925  2.079364e+04   129.127065    13.270648
std       1652.487989  2.056537e+05   208.778257    30.909030
min          0.000000  0.000000e+00    33.000000     0.000000
25%          0.000000  3.250000e+01    58.000000     0.000000
50%        124.000000  4.540000e+02    76.000000     3.000000
75%        656.500000  6.859000e+03    99.000000    12.000000
max      20792.000000  5.657792e+06  2589.000000   458.000000

None of the four variables (watchers_count, size, forks_count, and open_issues) has missing values: each reports a count of 787. The watchers_count ranges from 0 to 20,792, while forks_count ranges from 33 to 2,589. At least a quarter of the repositories have no open issues (the first quartile is 0), while the top quarter have 12 or more (the third quartile is 12). It is also worth noting that one repository in our dataset has 458 open issues.
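The count row of describe() only hints at missing values, so it can help to check for them explicitly. The following is a minimal sketch, not the book's own code: the small DataFrame is a hypothetical stand-in for the real dataset (one value is deliberately left missing to show what a gap looks like), and the names numeric_cols, summary, missing, and incomplete are introduced here purely for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the repository dataset (not the real data).
# One open_issues value is deliberately missing to illustrate the check.
df = pd.DataFrame({
    "watchers_count": [0, 124, 656, 20792],
    "size": [32, 454, 6859, 5657792],
    "forks_count": [33, 58, 99, 2589],
    "open_issues": [0, 3, None, 458],
})

numeric_cols = ["watchers_count", "size", "forks_count", "open_issues"]

# describe() reports the number of non-null values per column in its "count" row
summary = df[numeric_cols].describe()

# Explicit number of missing values per column
missing = df[numeric_cols].isna().sum()

# Columns whose non-null count is lower than the number of rows
counts = summary.loc["count"]
incomplete = counts[counts < len(df)].index.tolist()

print(missing)
print("Columns with missing values:", incomplete)
```

In this toy frame only open_issues is flagged; on the real dataset, the same check would be expected to return an empty list, consistent with the equal counts in the summary table.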
