Managing data in R

Before we start any serious programming in R, we need to learn how to import data into an R environment and which data types R supports. Often, for a particular analysis, we will not use the entire dataset. Therefore, we need to also learn how to select a subset of the data for any analysis. This section will cover these aspects.

Data Types in R

R has five basic data types as follows:

  • Integer
  • Numeric (real)
  • Complex
  • Character
  • Logical (TRUE/FALSE)

By default, R represents numbers as double-precision real numbers (numeric). If you want an integer representation explicitly, you need to add the suffix L. For example, simply entering 1 at the command prompt will store 1 as a numeric object. To store 1 as an integer, you need to enter 1L. The command class(x) gives the class (type) of the object x. Therefore, entering class(1) at the command prompt returns "numeric", whereas entering class(1L) returns "integer".
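The difference can be checked directly at the prompt. The following is a minimal sketch; the variable names a and b are only illustrative:

```r
a <- 1    # stored as a double-precision numeric by default
b <- 1L   # the L suffix requests an integer

class(a)        # "numeric"
class(b)        # "integer"
is.integer(a)   # FALSE
a == b          # TRUE: the values compare equal even though the types differ
```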

R also has a special number Inf that represents Infinity. The number NaN (not a number) is used to represent an undefined value such as 0/0. Missing values are represented by using the symbol NA.
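As a short illustration of how these special values behave, consider the sketch below. The vector scores is hypothetical data made up for this example:

```r
1 / 0    # Inf
-1 / 0   # -Inf
0 / 0    # NaN

scores <- c(4, NA, 7)        # NA marks a missing observation
is.na(scores)                # FALSE TRUE FALSE
mean(scores)                 # NA, because the missing value propagates
mean(scores, na.rm = TRUE)   # 5.5, computed from the non-missing values

is.na(NaN)    # TRUE: is.na() also flags NaN
is.nan(NA)    # FALSE: a missing value is not NaN
```

Many summary functions accept na.rm = TRUE, which is usually the simplest way to keep missing values from propagating into a result.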

Data structures in R

The data structures in R can be classified as either homogeneous (all elements containing the same data type) or heterogeneous (elements containing different data types). Furthermore, each of these has different structures depending upon the number of dimensions:

  • Homogeneous:
    • Atomic vector: one-dimensional
    • Matrix: two-dimensional
    • Array: N-dimensional
  • Heterogeneous:
    • List: one-dimensional
    • Data frame: two-dimensional

The most basic object in R is a vector. To create an integer vector of length 10, with every element initialized to 0, enter the following command on the R prompt:

> v <- vector("integer", 10)
> v
 [1] 0 0 0 0 0 0 0 0 0 0

You can assign a value to the nth component of the vector by indexing it. For example, the following command sets the fifth component to 1:

> v[5] <- 1
> v
 [1] 0 0 0 0 1 0 0 0 0 0

Readers should note that unlike in many programming languages, the array index in R starts with 1 and not 0.

Whereas a vector can only contain objects of the same type, a list, although similar to the vector, can contain objects of different types. The following command will create a list containing integers, real numbers, and characters:

> l <- list(1L, 2L, 3, 4, "a", "b")
> str(l)
List of 6
 $ : int 1
 $ : int 2
 $ : num 3
 $ : num 4
 $ : chr "a"
 $ : chr "b"

Here, we used the str() function in R that shows the structure of any R object.

R has a special function c() to combine several values into a vector or list. For example, c(1,3,6,2,-1) will produce a vector containing the numbers 1, 3, 6, 2, and -1, in that order:

> c(1, 3, 6, 2, -1)
[1]  1  3  6  2 -1
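Because an atomic vector can hold only one type, c() converts mixed inputs to a common type. The sketch below illustrates this; the names mixed and nums are only illustrative:

```r
mixed <- c(1, "a", TRUE)   # numeric, character, and logical inputs
mixed                      # "1" "a" "TRUE"
class(mixed)               # "character"

nums <- c(1L, 2.5)         # integer and numeric inputs
class(nums)                # "numeric"
```

When the types must be preserved, list() is the appropriate constructor instead.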

A matrix is the generalization of a vector into two dimensions. Consider the following command:

> m <- matrix(c(1:9), nrow = 3, ncol = 3)

This command will generate a matrix m of size 3 x 3 containing the numbers from 1 to 9, filled column by column.
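Printing the matrix makes the column-wise fill order visible. The sketch below also uses the byrow argument of matrix() to fill by rows instead; the name m_row is only illustrative:

```r
m <- matrix(1:9, nrow = 3, ncol = 3)
m
#      [,1] [,2] [,3]
# [1,]    1    4    7
# [2,]    2    5    8
# [3,]    3    6    9

dim(m)   # 3 3

# byrow = TRUE fills the matrix row by row instead
m_row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
m_row[1, ]   # 1 2 3
```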

The most common data structure used for storing data in R is a data frame. A data frame, like a list, can contain data of different types (numeric, integer, logical, or character). It is essentially a list of vectors of equal length, which gives it the same two-dimensional structure as a matrix. The length of a data frame (found using length( )) is the length of the underlying list, that is, the number of columns in the data frame. The functions nrow( ) and ncol( ) return the number of rows and columns of a data frame. Two further attributes, the row and column names, can be read or set using rownames( ) and colnames( ).
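These functions can be tried on a small data frame. The following is a minimal sketch; the data frame df and its contents are hypothetical:

```r
# Hypothetical data: four people with a name, an age, and a membership flag
df <- data.frame(
  name   = c("John", "Mary", "Alice", "Bob"),
  age    = c(12L, 18L, 24L, 17L),
  member = c(TRUE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

length(df)   # 3, the number of columns
nrow(df)     # 4
ncol(df)     # 3

colnames(df)                            # "name" "age" "member"
rownames(df) <- c("a", "b", "c", "d")   # set the row names
sapply(df, class)                       # one type per column: character, integer, logical
```

Setting stringsAsFactors = FALSE explicitly keeps the name column as character regardless of the R version in use.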

Importing data into R

Data that is in the form of a table can be easily loaded into R using the read.table(…) function. It has several arguments to make the import very flexible. Some of the useful arguments are the following:

  • file: The name of a file or a complete URL
  • header: A logical value indicating whether the file has a header line containing names of the variables
  • sep: The character that separates fields (columns) within a line
  • row.names: A vector of row names
  • col.names: A vector of names for variables
  • skip: The number of lines in the data file to be skipped before reading the data
  • nrows: The maximum number of rows to read from the file
  • stringsAsFactors: A logical value indicating whether character variables should be converted to factors

For small datasets, one can call read.table("filename.txt") without specifying other arguments, and R will infer the rest. Another useful function is read.csv(), a variant of read.table() whose defaults (header = TRUE, sep = ",") are suited to CSV files.
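The sketch below shows both functions reading the same file. To stay self-contained, it first writes a small hypothetical dataset to a temporary file; the file contents and the names path, d1, and d2 are assumptions made for illustration:

```r
# Write a small hypothetical dataset to a temporary file
path <- tempfile(fileext = ".csv")
writeLines(c("name,age", "John,12", "Mary,18", "Alice,24"), path)

# read.table() needs the header and separator spelled out for a CSV file
d1 <- read.table(path, header = TRUE, sep = ",", stringsAsFactors = FALSE)

# read.csv() uses header = TRUE and sep = "," by default
d2 <- read.csv(path, stringsAsFactors = FALSE)

nrow(d2)       # 3
names(d2)      # "name" "age"
mean(d2$age)   # 18

unlink(path)   # remove the temporary file
```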

In addition to loading data from text files, data can be imported into R by connecting to external databases through various interfaces. One such popular interface is Open Database Connectivity (ODBC). The RODBC package in R provides access to different databases through the ODBC interface. This package contains different functions for connecting with a database and performing various operations. Some of the important functions in the RODBC package are as follows:

  • odbcConnect(dsn, uid="user_name", pwd="password"): Used to open a connection to an ODBC database having registered data source name dsn.
  • sqlFetch(channel, sqtable): Used to read a table from an ODBC database to a data frame.
  • sqlQuery(channel, query): Used to submit a query to an ODBC database and return the results.
  • sqlSave(channel, mydf, tablename = sqtable, append = FALSE): Used to write or update (append = TRUE) a data frame to a table in the ODBC database.
  • close(channel): Used to close the connection. Here, channel is the connection handle as returned by odbcConnect.

Slicing and dicing datasets

Often, in data analysis, one needs to slice and dice the full data frame to select a few variables or observations. This is called subsetting. R has some powerful and fast methods for doing this.

To extract subsets of R objects, one can use the following three operators, along with negative index values:

  • Single bracket [ ]: This returns an object of the same class as the original. The single bracket operator can be used to select more than one element of an object. Some examples are as follows:
    > x <- c(10, 20, 30, 40, 50)
    > x[1:3]
    [1] 10 20 30
    
    > x[x > 25]
    [1] 30 40 50
    
    > f <- x > 30
    > x[f]
    [1] 40 50
    
    > m <- matrix(c(1:9), nrow = 3, ncol = 3)
    > m[1, ]   # select the entire first row
    [1] 1 4 7
    
    > m[, 2]   # select the entire second column
    [1] 4 5 6
  • Double bracket [[ ]]: This is used to extract a single element of a list or data frame. The returned object need not be the same type as the initial object. Some examples are as follows:
    > y <- list("a", "b", "c", "d", "e")
    
    > y[1]
    [[1]]
    [1] "a"
    
    > class(y[1])
    [1] "list"
    
    > y[[1]]
    [1] "a"
    
    > class(y[[1]])
    [1] "character"
  • Dollar sign $: This is used to extract elements of a list or data frame by name. Some examples are as follows:
    > z <- list(John = 12, Mary = 18, Alice = 24, Bob = 17, Tom = 21)
    
    > z$Bob
    [1] 17
  • Use of negative index values: This is used to drop particular elements or columns. To do so, subset with a negative sign on each index to be dropped. For example, to drop Mary and Bob from the preceding list, use the following code:
    > y <- z[c(-2, -4)]
    > y
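The same operators apply to data frames. The following sketch repeats the negative-index example and then subsets a small data frame; the data frame df and the name kept are hypothetical and introduced only for illustration:

```r
z <- list(John = 12, Mary = 18, Alice = 24, Bob = 17, Tom = 21)
kept <- z[c(-2, -4)]   # drop the 2nd and 4th elements (Mary and Bob)
names(kept)            # "John" "Alice" "Tom"

# Hypothetical data frame with two columns
df <- data.frame(name = c("John", "Mary", "Alice"),
                 age  = c(12, 18, 24),
                 stringsAsFactors = FALSE)

class(df["age"])     # "data.frame": single bracket keeps the class
class(df[["age"]])   # "numeric": double bracket extracts the column itself
df$age               # 12 18 24

older <- df[df$age > 15, ]      # logical index on rows: Mary and Alice
names(df[, -1, drop = FALSE])   # "age": negative index drops the first column
```

The drop = FALSE argument keeps the result a data frame even when only one column remains.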

Vectorized operations

In R, many operations, such as arithmetical operations involving vectors and matrices, can be done very efficiently using vectorized operations. For example, if you are adding two vectors x and y, their corresponding elements are added in a single operation. This also makes the code more concise and easier to understand. For example, one does not need a for( ) loop to add two vectors in the code:

> x <- c(1, 2, 3, 4, 5)
> y <- c(10, 20, 30, 40, 50)

> z <- x + y
> z
[1] 11 22 33 44 55

> w <- x * y
> w
[1]  10  40  90 160 250
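For comparison, the sketch below writes the same addition as an explicit loop and confirms that both forms give the same result. The names z_loop and z_vec are only illustrative:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(10, 20, 30, 40, 50)

# Explicit loop: preallocate the result, then fill it one element at a time
z_loop <- numeric(length(x))
for (i in seq_along(x)) {
  z_loop[i] <- x[i] + y[i]
}

# Vectorized form: one expression, same result
z_vec <- x + y

identical(z_loop, z_vec)   # TRUE
```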

Another very useful example of vectorized operations is in the case of matrices. If X and Y are two matrices of compatible dimensions, the following operations can be carried out in R in a vectorized form:

> X * Y     ## Element-wise multiplication
> X / Y     ## Element-wise division
> X %*% Y   ## Standard matrix multiplication
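A small worked example shows how element-wise and matrix multiplication differ. The two 2 x 2 matrices below are hypothetical values chosen for illustration:

```r
X <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)
Y <- matrix(5:8, nrow = 2)   # columns are (5, 6) and (7, 8)

X * Y     # element-wise product
#      [,1] [,2]
# [1,]    5   21
# [2,]   12   32

X %*% Y   # matrix product
#      [,1] [,2]
# [1,]   23   31
# [2,]   34   46

X / Y     # element-wise division, for example X[1, 1] / Y[1, 1] is 0.2
```

The element-wise operators require X and Y to have the same dimensions, whereas %*% requires the number of columns of X to equal the number of rows of Y.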