String manipulation

String manipulation, also called character manipulation, is an important part of any data management system. In a typical real-world dataset, the same customer name may be written in several ways, such as J H Smith, John h Smith, and John h smith. On verification, all three turn out to belong to the same person. It is therefore important to standardize the text columns or variables in a dataset: R is case sensitive and compares strings exactly, so it treats every variant spelling as a distinct value. The same applies to many other variables, such as the name or model of a vehicle, a product description, and so on. Let's look at how text can be standardized using some functions:

> x<-"data Mining is not a difficult subject, anyone can master the subject"
> class(x)
[1] "character"
> substr(x, 1, 12)
[1] "data Mining "

The object x in the preceding script is a character string. The substr function extracts a substring from the string between the start and stop positions passed to it; here, characters 1 through 12. If a pattern in the text needs to be changed, the sub function can be used. There are four important arguments that the user needs to pass: the pattern to search for, the replacement text, the string to search in, and whether the match should ignore case. Note that sub replaces only the first match; gsub replaces every match. Let's look at a sample script:

> sub("data mining", "The Data Mining", x, ignore.case = TRUE, fixed = FALSE)
[1] "The Data Mining is not a difficult subject, anyone can master the subject"
> strsplit(x, "")
[[1]]
 [1] "d" "a" "t" "a" " " "M" "i" "n" "i" "n" "g" " " "i" "s" " " "n" "o" "t" " " "a" " "
[22] "d" "i" "f" "f" "i" "c" "u" "l" "t" " " "s" "u" "b" "j" "e" "c" "t" "," " " "a" "n"
[43] "y" "o" "n" "e" " " "c" "a" "n" " " "m" "a" "s" "t" "e" "r" " " "t" "h" "e" " " "s"
[64] "u" "b" "j" "e" "c" "t"

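When the split argument is an empty string, as here, the strsplit function breaks the string into its individual characters and returns them as a list with one element per input string. The sub function replaces the first occurrence of a pattern in the string. The ignore.case argument lets the user switch case sensitivity on or off while searching for the pattern; with ignore.case = TRUE, the pattern "data mining" matches "data Mining" in x.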
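
To connect these functions back to the customer-name problem from the start of the section, here is a minimal sketch of one possible standardization step. The customers vector and the standardize_name helper are hypothetical, made up for illustration; tolower, gsub, trimws, sub, and strsplit are base R functions. The sketch folds case and cleans up whitespace only, so it would not reconcile an abbreviated form such as J H Smith with John H Smith.

```r
# Hypothetical customer names (illustrative data, not from a real dataset)
customers <- c("John H Smith", "john h smith", "  JOHN  h  SMITH ")

# Hypothetical helper: lower-case, collapse repeated whitespace, trim the ends
standardize_name <- function(s) {
  s <- tolower(s)
  s <- gsub("\\s+", " ", s)  # regex: any run of whitespace becomes one space
  trimws(s)                  # drop leading and trailing whitespace
}

clean <- standardize_name(customers)
unique(clean)  # the three variants collapse to a single value

# sub() changes only the first match, gsub() changes every match
x <- "data Mining is not a difficult subject, anyone can master the subject"
first_only <- sub("subject", "topic", x)
all_matches <- gsub("subject", "topic", x)

# Splitting on a space gives words instead of single characters
words <- strsplit(x, " ")[[1]]
```

After this runs, first_only still ends in "the subject" while all_matches has both occurrences replaced, which is the practical difference between sub and gsub when cleaning a whole column.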
