Factors/categorical data

R refers to categorical variables as factors. The cut() function breaks a continuous numerical variable into ranges and treats those ranges as the levels of a factor, that is, as a categorical variable. The same approach can be used to group the values of an existing variable into larger bins.

An R example using cut()

Here is an example in R:

clinical.trial <- data.frame(patient = 1:1000,
                             age = rnorm(1000, mean = 50, sd = 5),
                             year.enroll = sample(paste("19", 80:99, sep = ""),
                                                  1000, replace = TRUE))

> clinical.trial <- data.frame(patient = 1:1000,
+                              age = rnorm(1000, mean = 50, sd = 5),
+                              year.enroll = sample(paste("19", 80:99, sep = ""),
+                                                   1000, replace = TRUE))
> summary(clinical.trial)
    patient            age         year.enroll
 Min.   :   1.0   Min.   :31.14   1995   : 61
 1st Qu.: 250.8   1st Qu.:46.77   1989   : 60
 Median : 500.5   Median :50.14   1985   : 57
 Mean   : 500.5   Mean   :50.14   1988   : 57
 3rd Qu.: 750.2   3rd Qu.:53.50   1990   : 56
 Max.   :1000.0   Max.   :70.15   1991   : 55
                                  (Other):654
> ctcut <- cut(clinical.trial$age, breaks = 5)
> table(ctcut)
ctcut
(31.1,38.9] (38.9,46.7] (46.7,54.6] (54.6,62.4] (62.4,70.2]
         15         232         558         186           9

The source of the preceding example can be found at: http://www.r-bloggers.com/r-function-of-the-day-cut/.

The pandas solution

Here is the pandas equivalent of the cut() function described earlier (it applies only to version 0.15 and later):

In [79]: import numpy as np
         import pandas as pd
         pd.set_option('display.precision', 4)
         clinical_trial = pd.DataFrame({
             'patient': range(1, 1001),
             'age': np.random.normal(50, 5, size=1000),
             'year_enroll': [str(x) for x in np.random.choice(
                 range(1980, 2000), size=1000, replace=True)]})

In [80]: clinical_trial.describe()
Out[80]:             age   patient
         count  1000.000  1000.000
         mean     50.089   500.500
         std       4.909   288.819
         min      29.944     1.000
         25%      46.572   250.750
         50%      50.314   500.500
         75%      53.320   750.250
         max      63.458  1000.000


In [81]: clinical_trial.describe(include=['O'])
Out[81]:        year_enroll
         count         1000
         unique          20
         top           1992
         freq            62


In [82]: clinical_trial.year_enroll.value_counts()[:6]
Out[82]: 1992    62
         1985    61
         1986    59
         1994    59
         1983    58
         1991    58
         dtype: int64

In [83]: ctcut = pd.cut(clinical_trial['age'], 5)

In [84]: ctcut.head()
Out[84]: 0    (43.349, 50.052]
         1    (50.052, 56.755]
         2    (50.052, 56.755]
         3    (43.349, 50.052]
         4    (50.052, 56.755]
         Name: age, dtype: category
         Categories (5, object): [(29.91, 36.646] < (36.646, 43.349] < (43.349, 50.052] < (50.052, 56.755] < (56.755, 63.458]]
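
When pd.cut() is given an integer, as in In [83], it divides the observed range of the data into that many equal-width bins. The following is a minimal sketch of one way to inspect the edges it chose, using the retbins=True argument of pd.cut(). The seed, the synthetic data, and the names ages, binned, edges, and widths are assumptions made for illustration and are not part of the original session.

```python
import numpy as np
import pandas as pd

# Synthetic ages; the seed is arbitrary and only makes the run repeatable.
rng = np.random.default_rng(0)
ages = pd.Series(rng.normal(50, 5, size=1000), name="age")

# retbins=True makes pd.cut return the bin edges alongside the binned data.
binned, edges = pd.cut(ages, bins=5, retbins=True)

# Five bins need six edges. The bins are equal in width, except that the
# lowest edge is pushed slightly below the minimum so that the smallest
# value falls inside the first (left-open) interval.
widths = np.diff(edges)
print(edges)
print(widths)
```

Because the data is regenerated here, the edges will differ from those shown in Out[84].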

In [85]: ctcut.value_counts().sort_index()
Out[85]: (29.91, 36.646]       3
         (36.646, 43.349]     82
         (43.349, 50.052]    396
         (50.052, 56.755]    434
         (56.755, 63.458]     85
         dtype: int64
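
Equal-width bins are not the only option. The following sketch, which is not part of the original session, passes explicit bin edges and labels to pd.cut(), first to form age bands and then to group the 20 enrollment years into two decades. The edges, the labels, the seed, and the column names age_band and decade are hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild a comparable data set; the seed is arbitrary.
rng = np.random.default_rng(0)
clinical_trial = pd.DataFrame({
    "patient": range(1, 1001),
    "age": rng.normal(50, 5, size=1000),
    "year_enroll": rng.choice(np.arange(1980, 2000), size=1000).astype(str),
})

# Hypothetical age bands. right=False makes each bin closed on the left,
# so an age of exactly 40 falls in "40-49" rather than "under 40".
age_edges = [0, 40, 50, 60, np.inf]
age_labels = ["under 40", "40-49", "50-59", "60 and over"]
clinical_trial["age_band"] = pd.cut(
    clinical_trial["age"], bins=age_edges, labels=age_labels, right=False
)

# Group the enrollment years into decades. The years are stored as
# strings, so they are converted to integers before binning.
clinical_trial["decade"] = pd.cut(
    clinical_trial["year_enroll"].astype(int),
    bins=[1980, 1990, 2000],
    labels=["1980s", "1990s"],
    right=False,
)

print(clinical_trial["age_band"].value_counts().sort_index())
print(clinical_trial["decade"].value_counts().sort_index())
```

Supplying the edges directly keeps the categories stable across data sets, whereas an integer bin count produces edges that depend on the minimum and maximum of each sample.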