Wikipedia defines the
parsing process
as follows:
“The process of analyzing a string of symbols, either in natural language, computer languages, or data structures, conforming to the rules of a formal grammar. The term is used to refer to the formal analysis by a computer of a sentence or other string of words into its constituents, resulting in a parse tree showing their syntactic relation to each other, which may also contain semantic and other information.”
This sounds a little complex, and it really is, involving grammars, parsing trees, some tokens, expressions, terms, and so on. In this chapter, you will learn something easier: how to parse binary data in Haskell.
This chapter is inspired from [2] and will use the examples provided there. Note that the examples may not work with newer versions of ghc. To run them, use ghc 8.0.2.
When you have a string of bytes, you will want to do something with them. A package that deals with binary data in Haskell is called
Binary
. First you need to install it, so open a terminal and type the following:
The
Binary package
has three main components.
Get is a state monad, so it keeps a state and changes it when an action is applied on that state. In this situation, the state is actually an offset of the lazy string of bytes that will be parsed. In scenarios where you get a failure when trying to parse a bytestring, the failure will be interpreted as an exception—these are handled in the IO monad
due to laziness—which will probably be thrown in a different place than you expect. In such situations, you can use a stricter version of the Get monad
.
Here is an example with a regular
Get:
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get
import Data.Word
deserialiseHeader :: Get (Word32, Word32, Word32)
deserialiseHeader = do
alen <- getWord32be
plen <- getWord32be
chksum <- getWord32be
return (alen, plen, chksum)
main :: IO ()
main = do
input <- BL.getContents
print $ runGet deserialiseHeader input
The inputs are three 32-bit numbers in big-endian format, and the output is a
tuple
.
% runhaskell /ch17/parsing.hs << EOF
heredoc> 123412341235
heredoc> EOF
(825373492,825373492,825373493)
If the input is too short, you get an exception.
% runhaskell / ch17/parsing.hs << EOF
tooshort
EOF
parsing.hs: Data.Binary.Get.runGet at position 8: not enough bytes
CallStack (from HasCallStack):
error, called at libraries/binary/src/Data/Binary/Get.hs:351:5 in binary-0.8.6.0:Data.Binary.Get
In the previous example, the function getWord32 takes the input
, goes through it, and returns a value.
The next example decodes a list of numbers that end with
EOF in a recursive manner:
listOfWord16 = do
empty <- isEmpty
if empty
then return []
else do v <- getWord64be
rest <- listOfWord16
return (v : rest)
You saw in the first example that you get an exception when the input is too short. There are two ways to handle exceptions: write your own parser or handle the exceptions in the
IO monad
. The second way is simpler because it involves a stricter version of the
Get monad, in which the parser
Get a taken as input to
runGet results in
(Either String a, ByteString). This means the first value is either a string (i.e., the exception message) or the result, while the second value is the remaining
bytestring
. The following is the modified version of the first example with the strict
Get being used:
import qualified Data.ByteString as B
import Data.Binary.Strict.Get
import Data.Word
deserialiseHeader :: Get (Word32, Word32, Word32)
deserialiseHeader = do
alen <- getWord32be
plen <- getWord32be
chksum <- getWord32be
return (alen, plen, chksum)
main :: IO ()
main = do
input <- B.getContents
print $ runGet deserialiseHeader input
Note that this example requires
binary-strict, which needs to be installed using
cabal. The changes in the code are that it is using a strict bytestring instead of a lazy
bytestring
and it is importing
Data.Binary.Strict.Get. If the example is run again, you will obtain:
% runhaskell /ch17/parsing.hs << EOF
heredoc> 123412341235
heredoc> EOF
(Right (825373492,825373492,825373493),"
")
Now it works correctly because the output is a
Right, and a new line was added instead of being consumed by the parser. Let’s run it with the shorter input, as shown here:
% runhaskell /ch17/parsing.hs << EOF
heredoc> tooshort
heredoc> EOF
(Left "too few bytes","
")
Now, the output is a Left instead of an exception and can be handled in the IO monad. The fail can be called inside the parser.
Operations on
bits
are allowed. Importing
Data.Bits, the following operators can be used:
Operator | Symbol |
---|
AND
|
.&.
|
OR
|
.|.
|
XOR
|
`xor`
|
NOT
|
`complement`
|
Left shift
|
`shiftL`
|
Right shift
|
`shiftR`
|
Don’t forget that working with bits needs special attention
.
Summary
In this chapter, you learned how to parse a bytestring using binary and binary-strict.
For a parser created from scratch, check out [3].
References
- 1.
G. Hutton and E. Meijer, “Monadic Parsing in Haskell,” Journal of Functional Programming, 8.4: 437–444 (1998)
- 2.
- 3.