Parsing Comma-Separated Data

Problem

You have a string or a file of lines containing comma-separated values (CSV) that you need to read in. Many MS-Windows-based spreadsheets and some databases use CSV to export data.

Solution

Use my CSV class or a regular expression (see Chapter 4).

Discussion

CSV is deceptive. It looks simple at first glance, but the values may be quoted or unquoted. If quoted, they may further contain escaped quotes. This far exceeds the capabilities of the StringTokenizer class (Section 3.3). Either considerable Java coding or the use of regular expressions is required. I’ll show both ways.
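To make the StringTokenizer limitation concrete, here is a minimal sketch. The class name TokenizerPitfall, the naiveSplit( ) helper, and the sample lines are hypothetical, invented only for this illustration. It splits on commas with no knowledge of quoting, which is all StringTokenizer can do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Sketch: why a plain StringTokenizer cannot parse quoted CSV fields. */
class TokenizerPitfall {

    /** Split a line on commas with StringTokenizer, ignoring quotes entirely. */
    static List<String> naiveSplit(String line) {
        List<String> fields = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line, ",");
        while (st.hasMoreTokens()) {
            fields.add(st.nextToken());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical input: two fields, the first with a comma inside quotes.
        String quoted = "\"Smith, John\",42";
        for (String f : naiveSplit(quoted)) {
            System.out.println("[" + f + "]");
        }

        // Hypothetical input: three fields, the middle one empty.
        String withEmpty = "a,,b";
        System.out.println(naiveSplit(withEmpty).size() + " tokens");
    }
}
```

The first line is meant to hold two fields, but the tokenizer cuts it into three pieces at the comma inside the quotes, and the quote characters stay attached. The second line is meant to hold three fields, but the tokenizer treats adjacent separators as one and reports only two tokens, so the empty field is lost.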

First, a Java program. Assume for now that we have a class called CSV that has a no-argument constructor, and a method called parse( ) that takes a string representing one line of the input file. The parse( ) method returns a list of fields. For flexibility, this list is returned as an Iterator (see Section 7.5). I simply use the Iterator’s hasNext( ) method to control the loop, and its next( ) method to get the next object.

import java.util.*;

/* Simple demo of CSV parser class.
 */
public class CSVSimple {    
    public static void main(String[] args) {
        CSV parser = new CSV(  );
        Iterator it = parser.parse(
            "\"LU\",86.25,\"11/4/1998\",\"2:19PM\",+4.0625");
        while (it.hasNext(  )) {
            System.out.println(it.next(  ));
        }
    }
}

Once the compiler has processed the escaped quotes in the string literal, the string actually being parsed is the following:

"LU",86.25,"11/4/1998","2:19PM",+4.0625

Running CSVSimple yields the following output:

> java CSVSimple
LU
86.25
11/4/1998
2:19PM
+4.0625
>

But what about the CSV class itself? Oh yes, here it is. This is my translation of a CSV program written in C++ by Brian W. Kernighan and Rob Pike that appeared in their book The Practice of Programming. Their version commingled the input processing with the parsing; my CSV class does only the parsing, since the input could be coming from any of a variety of sources. The main work is done in parse( ), which delegates handling of individual fields to advquoted( ) in cases where the field begins with a quote, and otherwise to advplain( ).

import com.darwinsys.util.*;
import java.util.*;

/** Parse comma-separated values (CSV), a common Windows file format.
 * Sample input: "LU",86.25,"11/4/1998","2:19PM",+4.0625
 * <p>
 * Inner logic adapted from a C++ original that was
 * Copyright (C) 1999 Lucent Technologies
 * Excerpted from 'The Practice of Programming'
 * by Brian W. Kernighan and Rob Pike.
 * <p>
 * Included by permission of the http://tpop.awl.com/ web site, 
 * which says:
 * "You may use this code for any purpose, as long as you leave 
 * the copyright notice and book citation attached." I have done so.
 * @author Brian W. Kernighan and Rob Pike (C++ original)
 * @author Ian F. Darwin (translation into Java and removal of I/O)
 */
public class CSV {    

    public static final String SEP = ",";

    /** Construct a CSV parser, with the default separator (`,'). */
    public CSV(  ) {
        this(SEP);
    }

    /** Construct a CSV parser with a given separator. Must be
     * exactly the string that is the separator, not a list of
     * separator characters!
     */
    public CSV(String sep) {
        fieldsep = sep;
    }

    /** The fields in the current String */
    protected ArrayList list = new ArrayList(  );

    /** the separator string for this parser */
    protected String fieldsep;

    /** parse: break the input String into fields
     * @return java.util.Iterator containing each field 
     * from the original as a String, in order.
     */
    public Iterator parse(String line)
    {
        StringBuffer sb = new StringBuffer(  );
        list.clear(  );            // discard previous, if any
        int i = 0;

        if (line.length(  ) == 0) {
            list.add(line);
            return list.iterator(  );
        }

        do {
            sb.setLength(0);
            if (i < line.length(  ) && line.charAt(i) == '"')
                i = advquoted(line, sb, ++i);    // skip quote
            else
                i = advplain(line, sb, i);
            list.add(sb.toString(  ));
            i++;
        } while (i < line.length(  ));

        return list.iterator(  );
    }

    /** advquoted: quoted field; return index of next separator */
    protected int advquoted(String s, StringBuffer sb, int i)
    {
        int j;

        // Loop through input s, handling escaped quotes
        // and looking for the ending " or , or end of line.

        for (j = i; j < s.length(  ); j++) {
            // found end of field if find unescaped quote.
            if (s.charAt(j) == '"' && s.charAt(j-1) != '\\') {
                int k = s.indexOf(fieldsep, j);
                Debug.println("csv", "j = " + j + ", k = " + k);
                if (k == -1) {    // no separator found after this field
                    k += s.length(  );
                    for (k -= j; k-- > 0; ) {
                        sb.append(s.charAt(j++));
                    }
                } else {
                    --k;    // omit quote from copy
                    for (k -= j; k-- > 0; ) {
                        sb.append(s.charAt(j++));
                    }
                    ++j;    // skip over quote
                }
                break;
            }
            sb.append(s.charAt(j));    // regular character.
        }
        return j;
    }

    /** advplain: unquoted field; return index of next separator */
    protected int advplain(String s, StringBuffer sb, int i)
    {
        int j;

        j = s.indexOf(fieldsep, i); // look for separator
        Debug.println("csv", "i = " + i + ", j = " + j);
        if (j == -1) {                   // none found
            sb.append(s.substring(i));
            return s.length(  );
        } else {
            sb.append(s.substring(i, j));
            return j;
        }
    }
}

In the online source directory you’ll find CSVFile.java, which reads a file a line at a time and runs it through parse( ). You’ll also find Kernighan and Pike’s original C++ program.
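The following is not the CSVFile.java from the source directory, only a minimal sketch of the same idea: read a line at a time and hand each line to a parser. The class name CSVLinesSketch, the parseAll( ) method, and the stand-in parser in main( ) are all assumptions made for illustration. The stand-in simply splits on commas so that the sketch compiles on its own. A real driver would supply a parser with the same shape as the parse( ) method shown above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/** Sketch: feed input to a per-line field parser, one line at a time. */
class CSVLinesSketch {

    /**
     * Read every line from the Reader and collect the fields of each line.
     * lineParser is any function from one input line to an Iterator of fields.
     */
    static List<List<String>> parseAll(Reader in,
            Function<String, Iterator<String>> lineParser) {
        List<List<String>> rows = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(in)) {
            String line;
            while ((line = br.readLine()) != null) {
                // Copy the fields out before parsing the next line, in case
                // the parser reuses its internal list between calls.
                List<String> row = new ArrayList<>();
                Iterator<String> it = lineParser.apply(line);
                while (it.hasNext()) {
                    row.add(it.next());
                }
                rows.add(row);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Stand-in parser (an assumption): a naive split that ignores quoting.
        Function<String, Iterator<String>> standIn =
            line -> Arrays.asList(line.split(",", -1)).iterator();

        Reader in = new StringReader("a,b\nc,d,e\n");
        for (List<String> row : parseAll(in, standIn)) {
            System.out.println(row);
        }
    }
}
```

Copying each row matters with a parser like the CSV class above, because its parse( ) method clears and refills the same list on every call. An Iterator kept from an earlier line would no longer be valid.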

We haven’t discussed regular expressions yet (we will in Chapter 4). However, many readers will be familiar with REs in a general way, so the following example will demonstrate the power of REs as well as provide code for you to reuse. Note that this program replaces all the code in both CSV.java and CSVFile.java. The key to understanding REs is that a little specification can match a lot of data.

import com.darwinsys.util.Debug;
import java.io.*;
import org.apache.regexp.*;

/* Simple demo of CSV matching using Regular Expressions.
 * Does NOT use the "CSV" class defined in the Java CookBook.
 * RE Pattern from Chapter 7, Mastering Regular Expressions (p. 205, first edn.)
 */
public class CSVRE {    
    /** The rather involved pattern used to match CSV's consists of three
     * alternations: the first matches quoted fields, the second unquoted,
     * the third null fields
     */
    public static final String CSV_PATTERN =
        "\"([^\"\\\\]*(\\\\.[^\"\\\\]*)*)\",?|([^,]+),?|,";

    public static void main(String[] argv) throws IOException, RESyntaxException
    {
        String line;
    
        // Construct a new Regular Expression parser.
        Debug.println("regexp", "PATTERN = " + CSV_PATTERN); // debug
        RE csv = new RE(CSV_PATTERN);

        BufferedReader is = new BufferedReader(new InputStreamReader(System.in));

        // For each line...
        while ((line = is.readLine(  )) != null) {
            System.out.println("line = `" + line + "'");

            // For each field
            for (int fieldNum = 0, offset = 0; csv.match(line, offset);
                fieldNum++) {

                // Print the field (0=null, 1=quoted, 3=unquoted).
                int n = csv.getParenCount(  )-1;
                if (n==0)    // null field
                    System.out.println("field[" + fieldNum + "] = `'");
                else
                    System.out.println("field[" + fieldNum + "] = `" +
                        csv.getParen(n) + "'");

                // Skip what already matched.
                offset += csv.getParen(0).length(  );
            }
        }
    }
}

It is sometimes downright scary how much mundane code you can eliminate with a single, well-formulated regular expression.
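The same pattern can also be tried with the java.util.regex package that ships with the JDK, instead of the Apache regexp classes used above. What follows is a sketch under that assumption, not a port of CSVRE. The class name CSVRegexSketch and its parse( ) method are hypothetical. It relies on Matcher.find( ) to step through the line and on group( ) returning null for a group that did not take part in the match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: the three-alternative CSV pattern driven by java.util.regex. */
class CSVRegexSketch {

    /** Quoted field (group 1), or unquoted field (group 3), or a lone comma. */
    static final Pattern CSV_PATTERN = Pattern.compile(
        "\"([^\"\\\\]*(\\\\.[^\"\\\\]*)*)\",?|([^,]+),?|,");

    /** Return the fields of one line, in order, without surrounding quotes. */
    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = CSV_PATTERN.matcher(line);
        while (m.find()) {
            if (m.group(1) != null) {
                fields.add(m.group(1));      // quoted field
            } else if (m.group(3) != null) {
                fields.add(m.group(3));      // unquoted field
            } else {
                fields.add("");              // lone comma: empty field
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "\"LU\",86.25,\"11/4/1998\",\"2:19PM\",+4.0625";
        List<String> fields = parse(line);
        for (int i = 0; i < fields.size(); i++) {
            System.out.println("field[" + i + "] = `" + fields.get(i) + "'");
        }
    }
}
```

Two limits are worth knowing. Backslash escapes inside a quoted field are matched but not removed, so they remain in the returned text. An empty field at the very end of a line (a trailing comma) is not reported, because nothing is left for the pattern to match.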
