© Thomas Mailund 2019
Thomas Mailund, R Data Science Quick Reference
https://doi.org/10.1007/978-1-4842-4894-2_2

2. Importing Data: readr

Thomas Mailund1 
(1)
Aarhus, Denmark
 
Before we can analyze data, we need to load it into R. The main Tidyverse package for this is called readr, and it is loaded when you load the tidyverse package .
library(tidyverse)
But you can also load it explicitly using
library(readr)

Tabular data is usually stored in text files or compressed text files with rows and columns matching the table’s structure. Each line in the file is a row in the table and columns are separated by a known delimiter character. The readr package is made for such data representation and contains functions for reading and writing variations of files formatted in this way. It also provides functionality for determining the types of data in each column, either by inferring types or through user specifications.

Functions for Reading Data

The readr package provides the following functions for reading tabular data:

Function         File format
read_csv()       Comma-separated values
read_csv2()      Semicolon-separated values
read_tsv()       Tab-separated values
read_delim()     General column delimiters1
read_table()     Space-separated values (fixed length columns)
read_table2()    Space-separated values (variable length columns)
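The general read_delim function follows the same pattern as the others; here is a minimal sketch using hypothetical pipe-separated data given as a literal string:

```r
library(readr)

# hypothetical pipe-separated data; read_delim() takes the
# delimiter through its delim argument
my_data <- read_delim(
    "A|B|C|D
1|a|a|1.2
2|b|b|2.1",
    delim = "|"
)
```

The result is the same kind of tibble that read_csv would give you for comma-separated data.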

The interfaces to these functions differ little. In the following, I describe read_csv, and I highlight where the other functions differ. The read_csv function reads data from a file with comma-separated values. Such a file could look like this:
A,B,C,D
1,a,a,1.2
2,b,b,2.1
3,c,c,13.0
Unlike the base R read.csv function, read_csv will also handle files with spaces between the columns, so it will interpret the following data the same as the preceding file.
A, B, C,   D
1, a, a,  1.2
2, b, b,  2.1
3, c, c, 13.0

If you use R’s read.csv function instead, the spaces before columns B and C will be included as part of the data, and the text columns will be interpreted as factors.

The first line in the file will be interpreted as a header, naming the columns, and the remaining three lines as data rows.

Assuming the file is named data/data.csv, you read its data like this:
my_data <- read_csv(file = "data/data.csv")
## Parsed with column specification:
## cols(
##   A = col_double(),
##   B = col_character(),
##   C = col_character(),
##   D = col_double()
## )

When reading the file, read_csv will infer that columns A and D are numbers and columns B and C are strings.

If the file contains tab-separated values
A   B   C   D
1   a   a   1.2
2   b   b   2.1
3   c   c   13.0
you should use read_tsv() instead.
my_data <- read_tsv(file = "data/data.tsv")
The file you read with read_csv can be compressed. If the suffix of the filename is .gz, .bz2, .xz, or .zip, it will be uncompressed before read_csv loads the data.
my_data <- read_csv(file = "data/data.csv.gz")

If the filename is a URL (i.e., has prefix http://, https://, ftp://, or ftps://), the file will automatically be downloaded.

You can also provide a string as the file object:
read_csv(
    "A, B, C,    D
     1, a, a,  1.2
     2, b, b,  2.1
     3, c, c, 13.0
")

This is rarely useful in a data analysis project, but you can use it to create examples or for debugging.

File Headers

The first line in a comma-separated file is not always the column names; that information might be available from elsewhere outside the file. If you do not want to interpret the first line as column names, you can use the option col_names = FALSE.
read_csv(
    file = "data/data.csv",
    col_names = FALSE
)
Since the data/data.csv file has a header, the header line is now interpreted as part of the data, and because the header consists of strings, read_csv infers that all the column types are strings. If we did not have the header, for example, if we had the file data/data-no-header.csv:
1, a, a, 1.2
2, b, b, 2.1
3, c, c, 13.0
then we would get the same data frame as before, except that the names would be autogenerated:
read_csv(
    file = "data/data-no-header.csv",
    col_names = FALSE
)
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   X2 = col_character(),
##   X3 = col_character(),
##   X4 = col_double()
## )
## # A tibble: 3 x 4
##      X1 X2    X3       X4
##   <dbl> <chr> <chr> <dbl>
## 1     1 a     a       1.2
## 2     2 b     b       2.1
## 3     3 c     c      13

The autogenerated column names all start with X and are followed by the number the columns have from left to right in the file.

If you have data in a file without a header, but you do not want the autogenerated names, you can provide column names to the col_names option:
read_csv(
    file = "data/data-no-header.csv",
    col_names = c("X", "Y", "Z", "W")
)

If there is a header line, but you want to rename the columns, you cannot just provide the names to read_csv using col_names. The first row will still be interpreted as data. This gives you data you do not want in the first row, and it also affects the inferred types of the columns.

You can, however, skip lines before read_csv parses rows as data. Since we have a header line in data/data.csv, we can skip one line and set the column names.
read_csv(
    file = "data/data.csv",
    col_names = c("X", "Y", "Z", "W"),
    skip = 1
)
You can also put a limit on how many data rows you want to load using the n_max option.
read_csv(
    file = "data/data.csv",
    col_names = c("X", "Y", "Z", "W"),
    skip = 1,
    n_max = 2
)
If your input file has comment lines, identifiable by a character where the rest of the line should be considered a comment, you can skip them if you provide the comment option:
read_csv(
    "A, B, C,    D
    # this is a comment
    1, a, a,  1.2 # another comment
    2, b, b,  2.1
    3, c, c, 13.0",
    comment = "#")

For more options affecting how input files are interpreted, read the function documentation: ?read_csv.

Column Types

When read_csv parses a file, it infers the type of each column. This inference can be slow or, worse, incorrect. If you know a priori what the types should be, you can specify them using the col_types option. If you do, read_csv will not guess the types. It will, however, replace values that it cannot parse as the specified type with NA.2

String-based Column Type Specification

In the simplest string specification format, you provide a string with the same length as you have columns, where each character in the string specifies the type of one column. The characters specifying the different types are these:

Character    Type
c            Character
i            Integer
n            Number
d            Double
l            Logical
f            Factor
D            Date
T            Date time
t            Time
?            Guess (default)
_ / -        Skip the column

By default, read_csv guesses, so we could make this explicit using the type specification "????":
read_csv(
    file = "data/data.csv",
    col_types = "????"
)
The results of the guesses are double for columns A and D and character for columns B and C. If we wanted to make this explicit, we could use "dccd".
read_csv(
    file = "data/data.csv",
    col_types = "dccd"
)
If you want an integer type for column A, you can use "iccd":
read_csv(
    file = "data/data.csv",
    col_types = "iccd"
)

If you try to interpret column D as integers as well, you will get a list of warning messages, and the values in column D will all be NA; the numbers in column D cannot be interpreted as integers and read_csv will not round them to integers.

If you specify that a column should have type d, the numbers in the column must be integers or decimal numbers. If you use the type n (the default that read_csv will guess), you will also get doubles, but the latter type can handle strings that can be interpreted as numbers such as dollar amounts, percentages, and group separators in numbers. The column type n will ignore leading and trailing text and handle number separators:

With this function call
read_csv(
    'A, B, C,  D,       E
    $1, a, a,  1.2%, "1,100,200"
    $2, b, b,  2.1%,   "140,000"
    $3, c, c, 13.0%, "2,005,000"',
    col_types = "nccnn")

columns A, D, and E will be read as numbers. If you use the type specification d, they would not, and all the values would be NA.

The decimal indicator and group delimiter vary around the world. By default, read_csv uses the US convention with a dot for decimal notation and comma for grouping in numbers. In many European countries, it is the opposite. You can use the locale option to change these:
read_csv(
    'A, B, C,  D,            E
    $1, a, a,  "1,2%",    "1.100.200"
    $2, b, b,  "2,1%",      "140.000"
    $3, c, c, "13,0%",    "2.005.000"',
    locale = locale(decimal_mark = ",", grouping_mark = "."),
    col_types = "nccnn")

In the preceding example, I explicitly specified how read_csv should interpret numbers, but you can also use ISO 639-1 language codes.3 If you do, you also get the local time conventions and local day and month names. The default is English; if your data is from Denmark, for example, and you want to use Danish conventions, you would use the locale locale("da"). For French data, you would use fr: locale("fr"). If you type these into an R console, you will see the month and week names, including their abbreviated forms, in these languages.

See the ?locale documentation for more options.

In files that use commas as decimal points and “.” for number groupings, the column delimiter is usually “;” rather than “,”. This way, it is not necessary to put decimal numbers in quotes. The read_csv2 function works as read_csv but uses “;” as column delimiter and “.” for number groupings.
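As a quick sketch of this, here is read_csv2 on hypothetical semicolon-separated data with commas as decimal marks:

```r
library(readr)

# hypothetical semicolon-separated data; "," is the decimal mark
my_data <- read_csv2(
    "A;B;C
1,2;a;x
2,1;b;y"
)
```

Column A is parsed as the doubles 1.2 and 2.1, with no quoting needed around the decimal numbers.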

The logical type is used for boolean values. If a column only contains TRUE and FALSE (case doesn’t matter),
read_csv(
    'A,  B, C,     D
    TRUE,   a, a,  1.2
    false,   b, b,  2.1
    true,   c, c, 13')

then read_csv will guess that the type is logical.

It is not unusual, however, to code boolean values as 0 and 1, and since these will be interpreted as numbers by default, you can make their type explicit using l:
read_csv(
    'A, B,  C,   D
    1, a, a,  1.2
    0, b, b,  2.1
    1, c, c, 13',
    col_types = "lccn")

If you use type l, you can mix TRUE/FALSE (ignoring case) with 0/1. Any other number or string will be translated into NA.

The D, t, and T types are for dates, times, and datetimes, in that order. Dates and times are what you might expect. A date specifies a range of days, for example, a single day, a week, a month, or a year. A time specifies a point in the day, for example, an hour, a minute, or a second. A datetime combines a day and a time, that is, it specifies a specific time on a specific day.
read_csv(
    'D, T, t
    "2018-08-23", "2018-08-23T14:30",  14:30',
    col_types = "DTt"
)
## # A tibble: 1 x 3
##   D          T                   t
##   <date>     <dttm>              <drtn>
## 1 2018-08-23 2018-08-23 14:30:00 14:30

If you use one of these type specifications, the time and dates should be in ISO 8601 format.4 Local conventions for writing time and date, however, differ substantially and are rarely ISO 8601. When your time data are not ISO 8601, you need to tell read_csv how to read them.

The default time parser handles times in the hh:mm, hh:mm:ss formats and handles am and pm suffixes; it suffices for most time formats (but notice that it wants time in hh:mm or hh:mm:ss formats; it is flexible in the number of characters you use for hours, and you can leave out seconds, but you cannot leave out minutes). Date and datetime vary much more than time formats, and there, you usually need to specify the encoding format.
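The flexibility of the default time parser is easy to demonstrate; both of the following calls give six o'clock in the evening:

```r
library(readr)

# the default time parser accepts both 24-hour and am/pm forms
parse_time("18:00")
parse_time("6:00 pm")
```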

You can use the locale option to change how read_csv parses dates (D) and time (t).
read_csv(
     'D,  t
     "23 Oct 2018", pm',
     col_types = "Dt",
     locale = locale(
         date_format = "%d %b %Y",
         time_format = "%I%p"
     )
)

The date_format "%d %b %Y" says that dates are written as day, three-letter month abbreviation, and year with four digits, and each of the three separated by a space. The time_format "%I%p" says that we want time to be written as a number from 1 to 12, with no minute information, the hour immediately followed by am/pm without any space between.

For datetimes (T), we cannot specify the format using locale(); we need the more verbose function-based type specification, which we return to below. We also return to format specifications for parsing dates and times later.

Columns that are not immediately parsed as numbers, booleans, dates, or times will be parsed as strings. If you want these to be factors instead, you use the f type specification.
read_csv(
     'A, B, C,    D
     1, a, a,  1.2
     0, b, b,  2.1
     1, c, c, 13',
     col_types = "lcfn")
## # A tibble: 3 x 4
##   A     B     C         D
##   <lgl> <chr> <fct> <dbl>
## 1 TRUE  a     a       1.2
## 2 FALSE b     b       2.1
## 3 TRUE  c     c      13
If you only want to use some of the columns, you can skip the rest using the “type” - or _:
read_csv(
    file = "data/data.csv",
    col_types = "_cc-"
)
## # A tibble: 3 x 2
##   B     C
##   <chr> <chr>
## 1 a     a
## 2 b     b
## 3 c     c
If you specify the column types using a string, you should specify the types of all columns. If you only want to define the types of a subset of columns, you can use the function cols() to specify types. You call this function with named parameters, where the names are column names and the arguments are types.
read_csv(
    file = "data/data.csv",
    col_types = cols(A = "c")
)
read_csv(
    file = "data/data.csv",
    col_types = cols(A = "c", D = "c")
)

Function-based Column Type Specification

If you are like me, you might find it hard to remember the single-character codes for different types. If so, you can use longer type names that you specify using function calls. These functions have names that start with col_, so you can use autocomplete to get a list of them. The types you can specify using functions are the same as those you can specify using characters, of course, and the functions are:

Function           Type
col_character()    Character
col_integer()      Integer
col_number()       Number
col_double()       Double
col_logical()      Logical
col_factor()       Factor
col_date()         Date
col_datetime()     Date time
col_time()         Time
col_guess()        Guess (default)
col_skip()         Skip the column

You need to wrap the function-based type specifications in a call to cols.
read_csv(
    file = "data/data.csv",
    col_types = cols(A = col_integer())
)
## # A tibble: 3 x 4
##       A B     C         D
##   <int> <chr> <chr> <dbl>
## 1     1 a     a       1.2
## 2     2 b     b       2.1
## 3     3 c     c       13
read_csv(
    file = "data/data.csv",
    col_types = cols(D = col_character())
)
## # A tibble: 3 x 4
##       A B     C     D
##   <dbl> <chr> <chr> <chr>
## 1     1 a     a     1.2
## 2     2 b     b     2.1
## 3     3 c     c     13.0

Most of the col_ functions do not take any arguments, but they are affected by the locale parameter the same way that the string specifications are.

For factors, date, time, and datetime types, however, you have more control over the format using the col_ functions. You can use arguments to these functions for specifying how read_csv should parse dates and how it should construct factors.

For factors, you can explicitly set the levels. If you do not, then the column parser will set the levels in the order it sees the different strings in the column. For example, in data/data.csv, the strings in columns B and C appear in the order a, b, and c:
A, B, C,   D
1, a, a,  1.2
2, b, b,  2.1
3, c, c, 13.0
By default, the two columns will be interpreted as characters, but if we specify that C should be a factor, we get one where the levels are a, b, and c, in that order.
my_data <- read_csv(
    file = "data/data.csv",
    col_types = cols(C = col_factor())
)
my_data$C
## [1] a b c
## Levels: a b c
If we want the levels in a different order, we can give col_factor() a levels argument.
my_data <- read_csv(
    file = "data/data.csv",
    col_types = cols(
        C = col_factor(levels = c("c", "b", "a"))
    )
)
my_data$C
## [1] a b c
## Levels: c b a
We can also make factors ordered using the ordered argument.
my_data <- read_csv(
    file = "data/data.csv",
    col_types = cols(
        B = col_factor(ordered = TRUE),
        C = col_factor(levels = c("c", "b", "a"))
    )
)
my_data$B
## [1] a b c
## Levels: a < b < c
my_data$C
## [1] a b c
## Levels: c b a

Parsing Time and Dates

The most complex types to read (or write) are dates and time (and datetime), just because these are written in many different ways. You can specify the format that dates and datetime are in using a string with codes that indicate how time information is represented.

The codes are these:

Code    Time format                       Example string     Interpretation
%Y      4-digit year                      1975               The year 1975
%y      2-digit year5                     75                 Also the year 1975
%m      2-digit month                     02                 February
%b      Abbreviated month name6           Feb                February
%B      Full month name                   February           February
%d      2-digit day                       15                 The 15th of a month
%H      Hour on a 24-hour clock           18                 Six o'clock in the evening
%I      Hour on a 12-hour clock7          6 pm               18:00 hours
%p      am/pm indicator                   6 pm               18:00 hours
%M      2-digit minutes                   18:30              Half past six
%S      Integer seconds                   18:30:10           Ten seconds past 18:30
%Z      Time zone as name8                America/Chicago    Central Time
%z      Time zone as offset from UTC      "+0100"            Central European Time

There are shortcuts for frequently used formats:

Shortcut    Format
%D          %m/%d/%y
%x          %y/%m/%d
%F          %Y-%m-%d
%R          %H:%M
%T          %H:%M:%S
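Using the shortcuts, an ISO-style datetime can be parsed with a compact format string:

```r
library(readr)

# %F expands to %Y-%m-%d and %T to %H:%M:%S
parse_datetime("1975-02-15 18:30:10", format = "%F %T")
```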

As we saw earlier, you can set the date and time format using the locale() function. If you do not, the default codes will be %AD for dates and %AT for times (there is no locale() argument for datetimes). These codes specify YMD and H:M / H:M:S formats, respectively, but are more relaxed in matching the patterns. The date parser, for example, will allow different separators. For dates, both "1975-02-15" and "1975/02/15" will be read as February the 15th 1975, and for times, both "18:00" and "6:00 pm" will be six o'clock in the evening.

In the following, I give a few examples. I will use the functions parse_date, parse_time, and parse_datetime rather than read_csv with column type specifications. These functions are used by read_csv when you specify a date, time, or datetime column type, but using read_csv for the examples would be unnecessarily verbose. Each takes a vector of strings representing dates or times. For more examples, you can read the function documentation: ?col_datetime.

Parsing time is simplest; there is not much variation in how time points are written. The main differences are in whether you use 24-hour clocks or 12-hour clocks. The %R and %T codes expect 24-hour clocks and differ in whether seconds are included or not.
parse_time(c("18:00"), format = "%R")
parse_time(c("18:00:30"), format = "%T")
There is no shortcut for 12-hour clocks; you must combine %I with %p to read am/pm formats.
parse_time(c("6 pm"), format = "%I %p")

Here, I have specified that the input only includes hours and not minutes. If we want hours (and not minutes) in 24-hour clocks, we need to use %H rather than %R.

For dates, ISO 8601 says that the format should be YYYY-MM-DD. The default date parser will accept this format, but the explicit format string is
parse_date(c("1975/02/05"), format = "%Y/%m/%d")
If you do not want to include the day, and you want to use two-digit years, you need
parse_date(c("75-02"), format = "%y-%m")

This is February 1975; remember that the %y code assumes that numbers above 68 are in the twentieth century.
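A small sketch of the cutoff, assuming readr follows this convention exactly:

```r
library(readr)

# two-digit years on either side of the 68/69 cutoff
parse_date("68-02", format = "%y-%m")  # 2068-02-01
parse_date("69-02", format = "%y-%m")  # 1969-02-01
```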

Dates written on the form 05/02/75 can mean both February 5th 1975 and May 2nd 1975, depending on where you are in the world. Europe uses the sensible DD/MM/YY format, where the order goes from the smallest time unit, days, to the medium time unit, months, and then to years. In the United States, they use the MM/DD/YY format. To get the 5th of February, you need one of these formats:
parse_date(c("75/02/05"), format = "%y/%m/%d")
parse_date(c("75/05/02"), format = "%y/%d/%m")
Date specifications that only use numbers are not affected by the local language, but if you include the names of months, they are. The names of months and their abbreviations vary from language to language, obviously. So do the names of weekdays, but at the time of writing, parsing weeks and weekdays is not supported by readr. You can get the name information from locale() if you use a language code. In the following examples, I parse dates in English and Danish. The month names are almost the same, but abbreviations in Danish require a following dot, and the day is followed by a dot as well.
parse_date(c("Feb 15 1975"),
           format = "%b %d %Y", locale = locale("en"))
parse_date(c("15. feb. 1975"),
           format = "%d. %b %Y", locale = locale("da"))
parse_date(c("February 15 1975"),
           format = "%B %d %Y", locale = locale("en"))
parse_date(c("15. feb. 1975"),
           format = "%d. %b %Y", locale = locale("da"))
parse_date(c("Oct 15 1975"),
           format = "%b %d %Y", locale = locale("en"))
parse_date(c("15. okt. 1975"),
           format = "%d. %b %Y", locale = locale("da"))
parse_date(c("October 15 1975"),
           format = "%B %d %Y", locale = locale("en"))
parse_date(c("15. oktober 1975"),
           format = "%d. %B %Y", locale = locale("da"))

Datetimes can be parsed using combinations of date and time strings. With these, you also want to consider time zones. You can ignore those for dates and time, but unless you are sure that you will never have to consider time zones, you should not rely on the default time zone (which is UTC).9

You can either specify that time zones are relative to UTC with %z or location-based, with %Z if the time zone is given in the input, or you can use locale() if it is the same for all the input.

If you specify a time zone based on a location, R will automatically adjust for daylight saving time, but if you use offsets relative to UTC, it will not; UTC does not have daylight savings. Central European Time (CET) is "+0100" and with daylight saving time "+0200". US Pacific Standard Time is "-0800"; with daylight saving time, it is "-0700". When the switch to and from daylight saving time happens is determined by your location.

These two datetimes are the same
parse_datetime(c("Feb 15 1975 18:00 US/Pacific"),
                format = "%b %d %Y %R %Z")
parse_datetime(c("Feb 15 1975 18:00 -0800"),
                format = "%b %d %Y %R %z")
as are these two
parse_datetime(c("May 15 1975 18:00 US/Pacific"),
                format = "%b %d %Y %R %Z")
parse_datetime(c("May 15 1975 18:00 -0700"),
                format = "%b %d %Y %R %z")
If you use locale() to specify a time zone, you cannot use zones relative to UTC. The point of using locale() is local formats, not time zones. The parser will still handle daylight savings for you, however. These two are the same datetimes
parse_datetime(c("Aug 15 1975 18:00"),
                format = "%b %d %Y %R",
                locale = locale(tz = "US/Pacific"))
parse_datetime(c("Aug 15 1975 18:00 US/Pacific"),
                format = "%b %d %Y %R %Z")
If you print the objects you parse, there is a difference between using locale() and using %Z, but the time will be the same. Using %Z you will automatically translate the time into UTC; using locale() you will not.
x <- parse_datetime(c("Aug 15 1975 18:00"),
                     format = "%b %d %Y %R",
                     locale = locale(tz = "US/Pacific"))
x
## [1] "1975-08-15 18:00:00 PDT"
y <- parse_datetime(c("Aug 15 1975 18:00 US/Pacific"),
                     format = "%b %d %Y %R %Z")
y
## [1] "1975-08-16 01:00:00 UTC"
x == y
## [1] TRUE

The output looks like a string, but the object classes are not character, which, among other things, is why the comparison works.
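We can verify this directly; both objects are POSIXct datetimes, which is why == compares the underlying time points rather than the printed strings:

```r
library(readr)

x <- parse_datetime("Aug 15 1975 18:00",
                    format = "%b %d %Y %R",
                    locale = locale(tz = "US/Pacific"))
y <- parse_datetime("Aug 15 1975 18:00 US/Pacific",
                    format = "%b %d %Y %R %Z")

class(x)  # "POSIXct" "POSIXt"
x == y    # TRUE
```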

Space-separated Columns

The preceding functions all read delimiter-separated columns. They expect a single character to separate one column from the next. If the argument trim_ws is TRUE, they trim whitespace around field values. This argument is TRUE by default for read_csv, read_csv2, and read_tsv, but FALSE for read_delim.

The functions read_table and read_table2 take a different approach and separate columns by one or more spaces. The simpler of the two is read_table2. It expects any sequence of whitespace to separate columns. Consider
read_table2(
    "A B C D
     1 2 3 4
    15 16 17 18"
)
## # A tibble: 2 x 4
##       A     B     C     D
##   <dbl> <dbl> <dbl> <dbl>
## 1     1     2     3     4
## 2    15    16    17    18

The header names are separated by two spaces. The first data line has spaces before the first value since the string is indented the way it is. Between columns, there are also two spaces. For the second data line, we again have several spaces before the first value, but this time only a single space between the columns. If we used a delimiter character to specify that a single space separates columns, we would have to have exactly the same number of spaces between each pair of columns.

The read_table function instead reads the data as fixed-width columns. It uses the whitespace in the file to figure out the width of the columns. After this, each line will be split into characters that match the width of the columns and assigned to those columns.

For example, in
read_table(
"
    A      B      C  D
    121    xyz   14  15
    22     abc   24  25
"
)
## # A tibble: 2 x 4
##       A B         C     D
##   <dbl> <chr> <dbl> <dbl>
## 1   121 xyz      14    15
## 2    22 abc      24    25
the columns are aligned, and the rows are interpreted as we might expect. Aligned, here, means that we have aligned spaces at some position between the columns. If you do not have spaces at the same location in all rows, columns will be merged.
read_table(
"
     A    B    C D
     121 xyz 14 15
     22  abc 24 25
"
)
## # A tibble: 2 x 3
##       A B     `C D`
##   <dbl> <chr> <chr>
## 1   121 xyz   14 15
## 2    22 abc   24 25

Here, the header C is at the position that should separate columns C and D, and these columns are therefore merged.

If some rows have data where other rows have only the column-separating spaces, you will get an error. For example, if your data looks like this
read_table(
    "
    A      B    C D
  121    xyz x 14 15
   22    abc   24 25
"
)

where the x in the first data line sits between two all-space columns. If you need more specialized fixed-width files, you might want to consider the read_fwf function. See its documentation for details: ?read_fwf.

The read_table and read_table2 functions take many of the same arguments as the delimiter-based parser, so you can, for example, specify column types and set the locale in the same way as the preceding data.

The readxl package is not part of the core Tidyverse, that is, the packages loaded when you load tidyverse, so you must load it explicitly. Its read_excel function does exactly what it says on the tin: it reads Excel spreadsheets into R. Its interface is similar to the functions in readr; where it differs is in Excel-specific options, such as which sheet to read. Such options are clearly only needed when reading Excel files.
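A minimal sketch, assuming a hypothetical spreadsheet at data/data.xlsx:

```r
# readxl is installed with the tidyverse but must be loaded explicitly
library(readxl)

# sheet can be a name or an index; here we read the first sheet
my_data <- read_excel("data/data.xlsx", sheet = 1)
```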

Functions for Writing Data

Writing data to a file is more straightforward than reading because we already have the data in the correct types and do not need to deal with different formats. With readr's writing functions, we have fewer options for formatting our output. For example, we cannot give the functions a locale() and we cannot specify date and time formats; dates and times are written in ISO 8601, which is what the reading functions use as their default.

The functions are write_delim(), write_csv(), write_csv2(), and write_tsv(), and for formats that Excel can read, write_excel_csv() and write_excel_csv2(). The difference between write_csv() and write_excel_csv() and between write_csv2() and write_excel_csv2() is that the Excel functions include a UTF-8 byte order mark so Excel knows that the file is UTF-8 encoded.

The first argument to these functions is the data we want to write and the second is the path to the file we want to write to. If this file has suffix .gz, .bz2, or .xz, the output is automatically compressed.

I will not list all the arguments for these functions here, but you can read the documentation for them from the R console. The argument you are most likely to use is col_names which, if true, means that the function will write the column names as the first line in the output, and if false, will not. If you use write_delim(), you might also want to specify the delimiter character using the delim argument . By default it is a single space; if you write to a file using write_delim() with the default options, you get the data in a format that you can read using read_table2().

The delimiter characters and the decimal points for write_csv(), write_csv2(), and write_tsv() are the same as for the corresponding read functions.
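A quick round trip through a temporary file shows the defaults in action:

```r
library(readr)

my_data <- tibble::tibble(A = c(1, 2, 3), B = c("a", "b", "c"))
path <- tempfile(fileext = ".csv")
write_csv(my_data, path)     # writes a header line by default
my_data2 <- read_csv(path)   # reads the same data back
```

The data read back matches what we wrote out.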
