Parsing custom data formats

If you work with data for long enough, you'll eventually come across data that you can't find a library for, and you'll need to write your own parser. Some formats might be simple enough for regular expressions, but if you need to balance syntactic structures in the input, or handle anything too complicated, you're probably better off creating a custom parser. Sometimes, custom parsers can be slower than regular expressions for very large inputs, but sometimes they're still your best option.

Clojure, and most functional languages, are great for parsing, and many have parser-combinator libraries that make writing parsers extremely simple.

For this recipe, as an example of a data format that needs parsing, we'll work with some FASTA data (http://en.wikipedia.org/wiki/FASTA_format). FASTA is a file format that's used in bioinformatics to exchange nucleotide and peptide sequences. Of course, there are parsers already for this, but it's a simple, yet non-trivial format, which makes a good example case for this recipe.

The first line of FASTA data starts with a > character followed by a unique identifier. This line often contains other information about the specimen described, the database it came from, and more. After this line comes one or more lines that list the sequence information. A more detailed explanation of the FASTA format is available at http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml. This is how an example FASTA record looks:

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPY
IGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIP
FHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRS
VPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPY
TIIGQMASILYFSIILAFLPIAGXIENY

We'll use the parse-ez library (https://github.com/protoflex/parse-ez) to build the parser.

Getting ready

We need to make sure that parse-ez is listed in our Leiningen project.clj file:

(defproject cleaning-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [parse-ez "0.3.6"]])

We also need to make it available to our script or REPL:

(use 'protoflex.parse)

How to do it…

To define a parser, we just define functions that parse the different parts of the input and then combine them to parse larger structures:

  1. It would be useful to have a way to parse two things and throw away the results of the second. This function will do that:
    (defn <| [l r]
      (let [l-output (l)]
        (r)
        l-output))
  2. Also, we'll define a parser for the end of a line. It matches either a carriage return or a new line:
    (defn nl []
      (chr-in #{
    ewline 
    eturn}))
  3. Let's start putting the pieces together. The first function parses the sequence definition line by accepting a > character, followed by anything up to the end of the line:
    (defn defline []
      (chr >)
      (<| #(read-to-re #"[
    
    ]+") nl))
  4. We parse a sequence of amino acid or nucleic acid codes by defining a parser for a single code and then building on that to create a parser for a line of code:
    (defn acid-code []
      (chr-in #{A B C D E F G H I K L M
                N P Q R S T U V W X Y 
                - *}))
    
    (defn acid-code-line []
      (<| #(multi+ acid-code) #(attempt nl)))
  5. Next, we combine these parsers into one that parses an entire FASTA record and populates a map with our data. Moreover, we define a combinator that parses multiple FASTA records:
    (defn fasta []
      (ws?)
      (let [dl (defline)
            gls (apply str (flatten
                             (multi+ acid-code-line)))]
        {:defline dl, :gene-seq gls}))
    (defn multi-fasta []
      (<| #(multi+ fasta)
          ws?))
  6. Finally, we create a wrapper function that passes our parser to parse-ez and configures the parsing engine the way we need it:
    (defn parse-fasta [input]
      (parse multi-fasta input
             :eof false :auto-trim false))

Now, we can use this function to parse the example record, provided at the beginning of this recipe:

user=> parse-fasta test-data)
{:defline
 "gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]",
 :gene-seq
"LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLVEWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLGLLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVILGLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGXIENY"}

How it works…

At the most abstract level, parsers are functions. They take a string as input and return a data structure. More complex, advanced parsers are built by combining simpler elements.

The <| function is a good example of this. It does not parse anything by itself. However, this function makes it possible to combine two other parsers in a useful way: it parses both parts and throws away the results of the second.

The acid-code function is an example of how to create a parser from a basic component. It matches any of the characters in the set.

acid-code-line then combines the acid-code parser. It has to match one or more acid-code characters, optionally followed by a newline. It uses the <| combinator to throw away the newline and return the sequence of acid-codes.

This entire parser is built up in this way, by composing complex structures from simple parts. While this is a very basic parser, it's possible to create quite complex parsers in this way, leveraging the full power of Clojure while keeping the code readable and maintainable.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.147.60.63