When working on a client dataset with a large number of observations, you often need to subset the data based on some selection criteria, or to draw a sample with or without replacement. Indexing is the process of extracting a subset of data from a data frame based on logical conditions. The subset function extracts elements from a data frame in much the same way. First, consider indexing:
> newdata <- audit[ which(audit$Gender=="Female" & audit$Age > 65), ]
> rownames(newdata)
[1] "49" "537" "552" "561" "586" "590" "899" "1200" "1598" "1719"
The preceding code selects those observations from the audit dataset where the gender is female and the age is more than 65 years. The which function returns the positions of the rows that satisfy both criteria, and those positions are used to index the audit data frame. Ten observations satisfy the condition; their row names are printed above. A similar result can be obtained with the subset function. It is often preferable to which when passing multiple conditions, because column names can be written directly without the audit$ prefix. Let's take a look at the way the subset function is used:
> newdata <- subset(audit, Gender=="Female" & Age > 65, select=Employment:Income)
> rownames(newdata)
[1] "49" "537" "552" "561" "586" "590" "899" "1200" "1598" "1719"
The additional select argument makes the subset function more convenient still: besides keeping only the rows where the logical condition is satisfied, it restricts the result to specific columns of the data frame, here the range from Employment to Income.
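To see the two approaches side by side without the audit dataset at hand, the following is a minimal sketch on a small made-up data frame. The name toy and all of its values are hypothetical; only the column names mirror those used above. It also shows how each approach behaves when the condition evaluates to NA.

```r
# Hypothetical toy data standing in for the audit dataset (values are made up).
toy <- data.frame(
  Gender     = c("Female", "Male", "Female", "Female", "Male"),
  Age        = c(70, 68, 40, NA, 72),
  Employment = c("Private", "Private", "Consultant", "Private", "PSLocal"),
  Income     = c(12000, 30000, 45000, 8000, 51000),
  stringsAsFactors = FALSE
)

# Indexing with which(): which() returns the positions where the condition
# is TRUE, so rows where the condition is NA (row 4 here) are left out.
by_which <- toy[which(toy$Gender == "Female" & toy$Age > 65), ]

# subset(): columns are referred to by name, and rows where the condition
# is NA are treated as not matching.
by_subset <- subset(toy, Gender == "Female" & Age > 65)

# Both approaches pick the same rows.
same_rows <- identical(rownames(by_which), rownames(by_subset))
print(same_rows)  # TRUE

# select keeps only the columns from Employment through Income.
by_cols <- subset(toy, Gender == "Female" & Age > 65, select = Employment:Income)
print(by_cols)
```

Note that plain logical indexing without which, such as toy[toy$Gender == "Female" & toy$Age > 65, ], would return an extra row of NA values for the observation with a missing age, which is one reason to wrap the condition in which or to use subset.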