Reading and writing CSV files

A CSV file is a comma-separated file. The data fields in each line are separated by commas (,) or by another delimiter, such as a semicolon (;). These files are the de facto standard for exchanging small and medium amounts of tabular data. Each line holds the data for one object, so we need a way to read and process the file line by line. As an example, we will use the Chapter 8\winequality.csv data file, which contains 1,599 sample measurements with 12 data columns per sample, such as pH and alcohol, separated by semicolons. In the following screenshot, you can see the top 20 rows:

In general, the readdlm function from the DelimitedFiles package is used to read in the data from the CSV files:

# code in Chapter 8\csv_files.jl:
fname = "winequality.csv" 
using DelimitedFiles
data = DelimitedFiles.readdlm(fname, ';')

The second argument is the delimiter character (here, it is ;). The result is a 1600x12 Array{Any,2}. Its element type is Any because the strings in the first line and the numbers in the remaining lines share no common concrete type:

 "fixed acidity"   "volatile acidity"      "alcohol"   "quality"
   7.4                0.7                     9.4         5.0
   7.8                0.88                    9.8         5.0
   7.8                0.76                    9.8         5.0
...

The problem with what we have done so far is that the header (the column titles) was read as part of the data. Fortunately, we can pass the header=true keyword argument to let Julia put the first line in a separate array. The data array then naturally gets the correct datatype, Float64. We can also specify the type explicitly, like this:

data3 = DelimitedFiles.readdlm(fname, ';', Float64, '\n', header=true)

The third argument here is the type of the data, which is a numeric type, String, or Any. The fourth argument is the line-separator character, and the header keyword argument indicates whether or not there is a header line with the field (column) names. If so, then data3 is a tuple with the data as the first element and the header as the second, in our case, ([7.4 0.7 ... 9.4 5.0; 7.8 0.88 ... 9.8 5.0; ... ; 5.9 0.645 ... 10.2 5.0; 6.0 0.31 ... 11.0 6.0], AbstractString["fixed acidity" "volatile acidity" ... "alcohol" "quality"]). The actual data is then given by data3[1] and the header by data3[2]. readdlm accepts other optional arguments as well; type ?DelimitedFiles.readdlm in the REPL to see them.
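The tuple returned with header=true can also be destructured in a single assignment. The following is a minimal sketch, not the chapter's own code: it writes a small hypothetical three-column sample to a temporary file (standing in for winequality.csv) and reads it back, and the names path, measurements, and header are chosen for illustration only.

```julia
using DelimitedFiles

# Hypothetical three-column sample standing in for winequality.csv
path = tempname()
write(path, "fixed acidity;pH;alcohol\n7.4;3.51;9.4\n7.8;3.2;9.8\n")

# With header=true, readdlm returns a (data, header) tuple,
# which can be unpacked directly into two variables
measurements, header = readdlm(path, ';', Float64, '\n', header=true)

println(size(measurements))    # 2 rows of data, 3 columns
println(eltype(measurements))  # element type of the data matrix
println(vec(header))           # the header as a flat vector of column names
```

Unpacking the tuple this way avoids indexing with [1] and [2] later on.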

Let's continue working with the variable data. The data forms a matrix, and we can get the rows and columns of data using the normal array-matrix syntax (refer to the Matrices section in Chapter 5, Collection Types). For example, the third row is given by row3 = data[3, :] and contains 7.8 0.88 0.0 2.6 0.098 25.0 67.0 0.9968 3.2 0.68 9.8 5.0, representing the measurements for all the characteristics of a certain wine. Because row 1 of data holds the header, this is the second wine in the file.

The measurements of a certain characteristic for all wines are given by a data column; for example, col3 = data[:, 3] represents the measurements of citric acid and returns a 1600-element Array{Any,1} column vector: "citric acid" 0.0 0.0 0.04 0.56 0.0 0.0 ... 0.08 0.08 0.1 0.13 0.12 0.47.

If we need columns two to four (volatile acidity to residual sugar) for all wines, extract the data with x = data[:, 2:4]. If we need these measurements only for the wines on rows 70-75, get these with y = data[70:75, 2:4], returning a 6 x 3 Array{Any,2} output, as follows:

0.32   0.57  2.0
0.705  0.05  1.9
...
0.675  0.26  2.1

To get a matrix with the data from columns 3, 6, and 11, execute the following command:

z = [data[:,3] data[:,6] data[:,11]]

This includes the headers; if you don't want these, use the following:

z = [data[2:end,3] data[2:end,6] data[2:end,11]]
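The same selection can be written by indexing with a vector of column numbers, which avoids repeating the row range. The sketch below uses a small hypothetical table (not the wine data) to show that both forms give the same result, and that the selection can be converted to Float64 once the header row has been dropped.

```julia
# Hypothetical 3x4 table with a header row, mimicking what readdlm
# returns when header=true is not passed
table = Any["a" "b" "c" "d";
            1.0 2.0 3.0 4.0;
            5.0 6.0 7.0 8.0]

# Concatenating individual columns, as in the text
z1 = [table[2:end, 1] table[2:end, 3] table[2:end, 4]]

# Indexing with a vector of column numbers selects the same columns in one step
z2 = table[2:end, [1, 3, 4]]

# With the header row gone, the Any elements can be converted to Float64
z3 = Float64.(z2)

println(z1 == z2)
println(eltype(z3))
```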

It would be useful to create a Wine type in the code.

For example, if the data is to be passed around functions, it will improve the code quality to encapsulate all the data in a single data type, like this:

struct Wine
    fixed_acidity::Float64
    volatile_acidity::Float64
    citric_acid::Float64
    # other fields
    quality::Float64
end

Each field holds a single measurement, so one row of the matrix fills one Wine object. Then, we can create objects of this type and work with them, much as in an object-oriented language, for example, wine1 = Wine(data[2, :]...), where the elements of the row are splatted with the ... operator into the Wine constructor. Row 1 of data holds the header strings, so the first wine is in row 2.
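To illustrate the splatting of a row into a constructor, here is a minimal sketch with a hypothetical reduced type, WineSample, that has only three scalar fields, and a small hand-made matrix instead of the full dataset. It assumes Julia 1.1 or later for eachrow.

```julia
# Hypothetical reduced record with three of the twelve columns
struct WineSample
    fixed_acidity::Float64
    volatile_acidity::Float64
    quality::Float64
end

# Two hand-made rows standing in for rows of the numeric data matrix
rows = [7.4 0.7  5.0;
        7.8 0.88 5.0]

# Splat each row into the default constructor, one object per row
wines = [WineSample(row...) for row in eachrow(rows)]

println(length(wines))
println(wines[2].volatile_acidity)
```

The splat only works when the number of elements in the row matches the number of fields, so a full Wine type would need all 12 fields declared.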

To write to a CSV file, the simplest way is to use the writedlm function from the DelimitedFiles package, which takes the delimiter as its third argument: ',' for a comma-separated file, or any other separator you need. (The older writecsv function is no longer available in Julia 1.0 and later.) For example, to write the array data to a partial.dat file, execute the following command:

writedlm("partial.dat", data, ';')
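A quick way to check that a file was written as intended is to read it back with the same delimiter. The following sketch does this round trip for a small hypothetical matrix written to a temporary file; the names sample, path, and restored are illustrative.

```julia
using DelimitedFiles

# Small hypothetical matrix standing in for the wine data
sample = [7.4 0.7  9.4;
          7.8 0.88 9.8]
path = tempname()

# Write with a semicolon delimiter, then read back with the same delimiter
writedlm(path, sample, ';')
restored = readdlm(path, ';', Float64)

println(restored == sample)
```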

If more control is necessary, you can easily combine the more basic functions from the previous section. For example, the following code snippet writes 10 tuples of three numbers each to a file:

# code in Chapter 8\tuple_csv.jl
fname = "savetuple.csv"
csvfile = open(fname, "w")
# writing headers:
write(csvfile, "ColName A, ColName B, ColName C\n")
for i = 1:10
  # a tuple of three random Float64 values
  tup = tuple(rand(Float64, 3)...)
  # join the values with commas and end the line
  write(csvfile, join(tup, ","), "\n")
end
close(csvfile)
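As a variation on the snippet above, the file can be opened with a do block, which closes it automatically even if an error occurs while writing. This is a sketch under the assumption that a temporary file is acceptable for the demonstration; it also reads the result back with readdlm to confirm its shape.

```julia
using DelimitedFiles

path = tempname()

# The do block receives the open stream and closes it when the block ends
open(path, "w") do io
    println(io, "ColName A,ColName B,ColName C")
    for _ in 1:10
        # three random Float64 values joined with commas on one line
        println(io, join(rand(3), ","))
    end
end

# Read the file back: 10 rows of data plus a separate header
written, header = readdlm(path, ',', Float64, header=true)

println(size(written))
println(vec(header))
```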
