Indexing or subsetting dataframes

When working on a client dataset with a large number of observations, you often need to subset the data based on some selection criteria, with or without replacement-based sampling. Indexing is the process of extracting a subset of data from the data frame based on some logical conditions. The which function can be used inside the index to pick out the rows that satisfy those conditions:

> newdata <- audit[ which(audit$Gender=="Female" & audit$Age > 65), ]
> rownames(newdata)
 [1] "49"   "537"  "552"  "561"  "586"  "590"  "899"  "1200" "1598" "1719"

The preceding code selects those observations from the audit dataset where the gender is female and the age is more than 65 years. The which function returns the positions of the rows that satisfy both criteria, and those positions are then used to index the data frame. There are 10 observations satisfying the condition; their row names are printed above. A similar result can be obtained by using the subset function. It is often more convenient than which when passing multiple conditions, because the columns can be referenced by name without the audit$ prefix. Let's take a look at the way the subset function is used:

> newdata <- subset(audit, Gender=="Female" & Age > 65, select=Employment:Income)
> rownames(newdata)
 [1] "49"   "537"  "552"  "561"  "586"  "590"  "899"  "1200" "1598" "1719"

The additional select argument makes the subset function more convenient, as it also lets you pick specific columns (here, Employment through Income) from the rows of the data frame where the logical condition is satisfied.
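
The two approaches can be compared side by side on a small example. The following is a minimal sketch, not the original audit data: the toy data frame and its values are made up for illustration, and only the column names Gender, Age, Employment, and Income come from the code shown earlier (Education is an assumed intermediate column). It also shows how both approaches treat a missing value in the condition.

```r
# Hypothetical toy data standing in for the audit dataset (values are made up)
toy <- data.frame(
  Gender     = c("Female", "Male", "Female", "Female", "Male"),
  Age        = c(70, 68, 40, NA, 30),
  Employment = c("Private", "Private", "Consultant", "Private", "PSLocal"),
  Education  = c("College", "Bachelor", "Master", "HSgrad", "College"),
  Income     = c(81838, 72099, 154677, 27744, 7568)
)

# Indexing with which(): which() returns the positions of the TRUE values,
# so a row whose condition evaluates to NA (row 4, missing Age) is left out
by_which <- toy[which(toy$Gender == "Female" & toy$Age > 65), ]

# subset(): columns are referenced by name; rows with an NA condition
# are left out as well
by_subset <- subset(toy, Gender == "Female" & Age > 65)

# Both approaches keep the same rows
same_rows <- identical(rownames(by_which), rownames(by_subset))

# select= additionally restricts the columns, here the range
# from Employment to Income
cols <- subset(toy, Gender == "Female" & Age > 65, select = Employment:Income)
```

Without which, a plain logical index such as toy[toy$Gender == "Female" & toy$Age > 65, ] would return an extra row filled with NA for the observation with the missing age, which is one reason to wrap the condition in which or to use subset.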
