Lesson 26. Capstone: Processing binary files and book data


This capstone covers

Learning about a unique binary format used by libraries
Writing tools to bulk-process binary data by using ByteString
Working with Unicode data by using the Text type
Structuring a large program performing a complicated I/O task

In this capstone, you’re going to use the data on books created by libraries to make a simple HTML document. Libraries collectively spend a huge amount of time cataloging every possible book in existence. Thankfully, much of this data is freely available to anyone who wants to explore it. Harvard Library alone has released 12 million book records to be used for free by the public (http://library.harvard.edu/open-metadata). The Open Library project contains millions of additional records for use (https://archive.org/details/ol_data).

In a time when data science is a hot trend, it would be great to make some fun projects with all this data. But there’s a big challenge to using this data. Libraries store their book-related metadata in a rather obscure format called a MARC record (for Machine-Readable Cataloging record). This makes using library data much more challenging than if it were in a more common format such as JSON or XML. MARC records are in a binary format that also makes heavy use of Unicode to properly store character encodings. To use MARC records, you have to be careful about separating when you’re working with bytes from when you’re working with text. This is a perfect problem to explore all you’ve learned in this unit!

Our goal for this capstone is to take a collection of MARC records and convert it into an HTML document that lists the titles and authors of every book in the collection. This will leave you with a solid foundation to further explore extracting data from MARC records:

You’ll start your journey by creating a type for the book data you want to store and converting that to HTML.
Next you have to learn how MARC records are formatted.
Then you’ll break apart a bulk of records serialized into a single file into a list of individual records.
Once you’ve split the records up, you’ll be able to parse individual files to find the information you need.
Finally, you’ll put all of this together into a single program that will process your MARC records into HTML files.

You’ll be writing all of your code in a single file, marc_to_html.hs. To get started, you’ll need the following imports (plus your OverloadedStrings extension).

Listing 26.1. The necessary imports for marc_to_html.hs

{-# LANGUAGE OverloadedStrings #-}                      -- 1
import qualified Data.ByteString as B                   -- 2
import qualified Data.Text as T                         -- 3
import qualified Data.Text.IO as TIO                    -- 4
import qualified Data.Text.Encoding as E                -- 5
import Data.Maybe                                       -- 6

1 Your OverloadedStrings LANGUAGE pragma, so you can use string literals for all string types
2 Because you’re working with binary data, you need a way to manipulate bytes.
3 Anytime you’re working with text, especially Unicode, you need the Text type.
4 The IO functions for Text are imported separately.
5 Part of working with Unicode is safely encoding and decoding it to and from binary data.
6 You’ll be using Maybe types, and the isJust and fromJust functions from the Data.Maybe module are useful.

You may have noticed that you’re not importing Data.ByteString.Char8. This is because when working with Unicode data, you never want to confuse Unicode text with ASCII text. The best way to ensure this is to use plain old ByteStrings for manipulating bytes and Text for everything else.
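To see why bytes and text need to be kept apart, here’s a minimal sketch that encodes a short name and compares the two views of it. The names sample, sampleBytes, charCount, and byteCount are made up for this illustration and aren’t part of marc_to_html.hs.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

-- A hypothetical author name with one non-ASCII character.
-- The escape \235 is U+00EB, a lowercase e with a diaeresis.
sample :: T.Text
sample = "Bront\235"

-- The same name encoded as UTF-8 bytes, as it would sit inside a file.
sampleBytes :: B.ByteString
sampleBytes = E.encodeUtf8 sample

-- Number of Unicode characters in the Text value.
charCount :: Int
charCount = T.length sample      -- 6

-- Number of bytes after encoding; U+00EB takes two bytes in UTF-8.
byteCount :: Int
byteCount = B.length sampleBytes -- 7
```

Loading this in GHCi and evaluating charCount and byteCount shows that the two counts differ, which is why offsets and lengths in a binary format should be measured on the ByteString rather than on the decoded Text.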

26.1. Working with book data

Unpacking MARC records is going to be a bit of work, so it’s good to figure out where you want to end up before you get lost. Your primary goal is to convert a list of books into an HTML document. The books being in an obscure format is one obstacle to our goal. In this capstone, you’re concerned with recording only the author and title of the books. You can use a type synonym for these properties. You could use String, but as mentioned in lesson 23, as a general rule it’s much better to use Text when dealing with any large task consisting mostly of text data.

Now you can create your type synonyms for Author and Title.

Listing 26.2. Type synonyms for `Author` and `Title`

type Author = T.Text
type Title = T.Text

Your Book type will be the product type of Author and Title.

Listing 26.3. Create a `Book` type

data Book = Book {
    author :: Author
   ,title :: Title } deriving Show

Your final function for this will be called booksToHtml and will have the type [Book] -> Html. Before implementing this function, you first need to determine what type Html will be, and ideally how to make an individual book into a snippet of HTML. You can use the Text type once again to model your HTML.

Listing 26.4. `Html` type synonym

type Html = T.Text

To make it easier to turn a list of books into HTML, you’ll start by creating a snippet of HTML for a single book. Your HTML will create a paragraph element, and then denote the title with a <strong> tag and the author with an <em> tag.

Listing 26.5. `bookToHtml` creates an individual snippet of HTML from a book

bookToHtml :: Book -> Html
bookToHtml book = mconcat ["<p>\n"
                          ,titleInTags
                          ,authorInTags
                          ,"</p>\n"]
   where titleInTags = mconcat ["<strong>",(title book),"</strong>\n"]
         authorInTags = mconcat ["<em>",(author book),"</em>\n"]

Next you need some sample books you can work with.

Listing 26.6. A collection of sample books

book1 :: Book
book1 = Book {
    title = "The Conspiracy Against the Human Race"
   ,author = "Ligotti, Thomas"
   }

book2 :: Book
book2 = Book {
    title = "A Short History of Decay"
   ,author = "Cioran, Emil"
   }

book3 :: Book
book3 = Book {
    title = "The Tears of Eros"
   ,author = "Bataille, Georges"
   }

In GHCi, you can test this bit of code:

GHCi> bookToHtml book1
"<p>\n<strong>The Conspiracy Against the Human Race</strong>\n<em>Ligotti, Thomas</em>\n</p>\n"

To transform a list of books, you can map your bookToHtml function over the list. You also need to make sure you add html, head, and body tags as well.

Listing 26.7. Turning a list of books into an HTML document with `booksToHtml`

booksToHtml :: [Book] -> Html
booksToHtml books = mconcat ["<html>\n"
                             , "<head><title>books</title>"
                             ,"<meta charset='utf-8'/>"                -- 1
                             ,"</head>\n"
                             , "<body>\n"
                             , booksHtml
                             , "\n</body>\n"
                             , "</html>"]
   where booksHtml = (mconcat . (map bookToHtml)) books

1 Because you’re dealing with Unicode data, it’s important to declare your charset.

To test this out, you can put your books in a list:

myBooks :: [Book]
myBooks = [book1,book2,book3]

Finally, you can build a main and test out your code so far. You’ll assume you’re writing to a file called books.html. Remember that your Html type is Text. To write Text to a file, you’ll use writeFile from Data.Text.IO, which you imported as TIO.

Listing 26.8. Temporary `main` to write your books list to HTML

main :: IO ()
main = TIO.writeFile "books.html" (booksToHtml  myBooks)

Running this program will output your books.html file. Opening it up, you can see that it looks like you’d expect (see figure 26.1).

Figure 26.1. Your book data rendered as HTML

With the ability to write books to a file, you can tackle the more complicated issue of working with MARC records.

26.2. Working with MARC records

The MARC record is the standard used in libraries for recording and transmitting information about books (called bibliographic data). If you’re interested in data about books, MARC records are an important format to understand. There are many large, freely available collections of MARC records online. You’ll be using the Oregon Health & Science University library records in this capstone. As noted earlier, MARC stands for Machine-Readable Cataloging record. As indicated by the name, MARC records are designed to be read by machines. Unlike formats such as JSON and XML, they aren’t designed to be human-readable. If you open a MARC record file, you’ll see something that looks like figure 26.2.

Figure 26.2. The content of a raw MARC record

If you’ve ever worked with the ID3 tag format for storing MP3 metadata, you’ll find MARC records are similar.

26.2.1. Understanding the structure of a MARC record

The MARC record standard was developed in the 1960s with the primary aim of making it efficient to store and transmit information. Because of this, MARC records are much less flexible and extensible than formats such as XML or JSON. The MARC record consists of three main parts:

The leader
The directory
The base record

Figure 26.3 shows an annotated version of your raw MARC record to help visualize how the record is laid out.

Figure 26.3. Annotated version of the MARC record

The leader contains information about the record itself, such as the length of the record and where to find the base record. The directory of the record tells you about the information contained in the record and how to access it. For example, you care only about the author and title of the book. The directory will tell you that the record contains this information and where to look in the file to find it. Finally, the base record is where all the information you need is located. But without the leader and directory, you don’t have the information needed to make sense of this part of the file.
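The three-part layout can be sketched on a tiny hand-built record. Everything here is an assumption for illustration: toyRecord is made-up data rather than anything from a real catalog, RecordParts and splitParts are hypothetical names, and the position of the base address inside the leader (bytes 12 to 16) is covered later in this lesson. Byte 31 is the subfield delimiter you’ll meet later, and byte 30 is the terminator that the MARC format uses to close the directory and each field.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

-- A hypothetical 44-byte record: 24 (leader) + 13 (directory) + 7 (field).
toyRecord :: B.ByteString
toyRecord = B.concat
  [ "00044nam a2200037 a 4500"    -- leader: length 00044, base address 00037
  , "245000700000"                -- one entry: tag 245, length 7, start 0
  , B.singleton 30                -- terminator closing the directory
  , "10", B.singleton 31, "aHi"   -- indicators, delimiter, subfield a
  , B.singleton 30                -- terminator closing the field
  ]

-- The three parts of a record, kept as raw bytes.
data RecordParts = RecordParts
  { leaderPart    :: B.ByteString
  , directoryPart :: B.ByteString
  , basePart      :: B.ByteString
  } deriving (Show, Eq)

-- Split a record by using the base address stored in the leader.
splitParts :: B.ByteString -> RecordParts
splitParts record = RecordParts ldr dir base
  where
    (ldr, afterLeader) = B.splitAt 24 record
    rawBase = B.take 5 (B.drop 12 ldr)
    baseAddress = read (T.unpack (E.decodeUtf8 rawBase)) :: Int
    (dir, base) = B.splitAt (baseAddress - 24) afterLeader
```

In GHCi, splitParts toyRecord shows the 24-byte leader, a 13-byte directory (one 12-byte entry plus its terminator), and the 7 bytes of field data that make up the base record.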

26.2.2. Getting the data

The first thing you need to do is get some MARC record data you can work with. Thankfully, archive.org has a great collection of freely available MARC records. For this project, you’re going to use a collection of records from the Oregon Health & Science University library. Go to the project page on archive.org:

https://archive.org/download/marc_oregon_summit_records/catalog_files/

Download the ohsu_ncnm_wscc_bibs.mrc file. For this lesson, you’ll rename the file sample.mrc. At 156 MB, this file is the smallest of the bunch, but if you’d like to play around with the others, they should all work equally well.

26.2.3. Checking the leader and iterating through your records

Your .mrc file isn’t a single MARC record but rather a collection of records. Before worrying about the details of a single record, you need to figure out how to separate all the records in this collection. Unlike with many text formats for serialized data, you won’t split your ByteString stream on a delimiter character to get your list of records. Instead, you need to look into the leader of each record to see how long it is. By looking at the length, you can then iterate through the stream and collect records as you go. To begin, let’s create type synonyms for MarcRecordRaw and MarcLeaderRaw.

Listing 26.9. Type synonyms for `MarcRecordRaw` and `MarcLeaderRaw`

type MarcRecordRaw = B.ByteString
type MarcLeaderRaw = B.ByteString

Because you’re primarily manipulating bytes, nearly all of your types when working with the raw MARC record are going to be ByteStrings. But using type synonyms will make it much easier to read your code and understand your type signatures. The first thing you want to do is to get the leader from the record:

getLeader :: MarcRecordRaw -> MarcLeaderRaw

The leader is the first 24 bytes of the record, as shown in figure 26.4.

Figure 26.4. The leader in your record highlighted

You can declare a variable to keep track of your leader length.

Listing 26.10. Declaring the length of the leader to be 24

leaderLength :: Int
leaderLength = 24

Getting the leader from a MARC record is as straightforward as taking the first 24 bytes of the MarcRecordRaw.

Listing 26.11. `getLeader` grabs the first 24 bytes of the record

getLeader :: MarcRecordRaw -> MarcLeaderRaw
getLeader record = B.take leaderLength record

Just as the first 24 bytes of the MARC record are the leader, the first 5 bytes of the leader contain a number telling you the length of the record. For example, in figure 26.4 you see that the record starts with 01292, which means this record is 1,292 bytes long. To get the length of your entire record, you need to take these first five bytes and then convert them to an Int. You’ll create a useful helper function, rawToInt, which will decode your ByteString into Text, then convert that Text to a String, and finally use read to parse an Int.

Listing 26.12. `rawToInt` and `getRecordLength`

rawToInt :: B.ByteString -> Int
rawToInt = (read . T.unpack . E.decodeUtf8)

getRecordLength :: MarcLeaderRaw -> Int
getRecordLength leader = rawToInt (B.take 5 leader)
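Both decodeUtf8 and read raise a runtime error on input they can’t handle, so a damaged file would crash rawToInt. If that’s a concern, one possible variant returns a Maybe instead. This is only a sketch: rawToIntMaybe and getRecordLengthMaybe are hypothetical names that the rest of this lesson doesn’t use.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E
import Text.Read (readMaybe)

-- Hypothetical safe variant of rawToInt: returns Nothing when the bytes
-- aren't valid UTF-8 or don't spell out a number.
rawToIntMaybe :: B.ByteString -> Maybe Int
rawToIntMaybe raw =
  case E.decodeUtf8' raw of
    Left _     -> Nothing
    Right text -> readMaybe (T.unpack text)

-- Hypothetical safe variant of getRecordLength built on rawToIntMaybe.
getRecordLengthMaybe :: B.ByteString -> Maybe Int
getRecordLengthMaybe leader = rawToIntMaybe (B.take 5 leader)
```

For example, rawToIntMaybe "01292" evaluates to Just 1292, whereas rawToIntMaybe "12x45" evaluates to Nothing. The trade-off is that every caller then has to deal with the Maybe, which is why the lesson keeps the simpler version.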

Now that you have a way to figure out the length of a single record, you can think about separating all the records that you find into a list of MarcRecords. You’ll consider your file a ByteString. You want a function that will take that ByteString and separate it into a pair of values: the first record and the rest of the remaining ByteString. You’ll call this function nextAndRest, which has the following type signature:

nextAndRest :: B.ByteString -> (MarcRecordRaw,B.ByteString)

You can think of this pair of values as being the same as getting the head and tail of a list. To get this pair, you need to get the length of the first record in the stream and then split the stream at this value.

Listing 26.13. `nextAndRest` breaks a stream of records into a head and tail

nextAndRest :: B.ByteString -> (MarcRecordRaw,B.ByteString)
nextAndRest marcStream =  B.splitAt recordLength marcStream
  where recordLength = getRecordLength marcStream

To iterate through the entire file, you recursively use this function to take a record and the rest of the file. You then put the record in a list and repeat on the rest of the file until you reach the end.

Listing 26.14. Converting a stream of raw data into a list of records

allRecords :: B.ByteString -> [MarcRecordRaw]
allRecords marcStream = if marcStream == B.empty
                        then []
                        else next : allRecords rest
  where (next, rest) = nextAndRest marcStream
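You can try the splitting logic on a toy stream before pointing it at a 156 MB file. The sketch below repeats the helpers from the listings above so that it stands alone, written with guards and B.null as a matter of taste. The toyStream value is made up: its two fake records aren’t valid MARC, they only share the five-digit length prefix.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

rawToInt :: B.ByteString -> Int
rawToInt = read . T.unpack . E.decodeUtf8

getRecordLength :: B.ByteString -> Int
getRecordLength leader = rawToInt (B.take 5 leader)

-- Split off the first record by using its length prefix.
nextAndRest :: B.ByteString -> (B.ByteString, B.ByteString)
nextAndRest marcStream = B.splitAt (getRecordLength marcStream) marcStream

-- Keep splitting until the stream is empty.
allRecords :: B.ByteString -> [B.ByteString]
allRecords marcStream
  | B.null marcStream = []
  | otherwise         = next : allRecords rest
  where (next, rest) = nextAndRest marcStream

-- Hypothetical toy stream: two fake records of 10 and 8 bytes.
toyStream :: B.ByteString
toyStream = "00010hello00008abc"

toyRecords :: [B.ByteString]
toyRecords = allRecords toyStream  -- ["00010hello","00008abc"]
```

One thing to keep in mind is that a record claiming a length of 0 would make this recursion loop forever, because rest would be the same stream again. Well-formed files shouldn’t contain such a record, but a guard may be worth adding if you process data of unknown quality.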

You can test allRecords by rewriting your main to read in your sample.mrc file and print out the number of records in that file:

main :: IO ()
main = do
  marcData <- B.readFile "sample.mrc"
  let marcRecords = allRecords marcData
  print (length marcRecords)

You can run your main by either compiling your program or loading it into GHCi and calling main:

GHCi> main
140328

There are 140,328 records in this collection! Now that you’ve split up all of your records, you can move on to figuring out exactly how to get all of your Title and Author data.

26.2.4. Reading the directory

MARC records store all the information about a book in fields. Each field has a tag and subfields that tell you more about the information that’s in a book (such as author, title, subject, and publication date). Before you can worry about processing the fields, you need to look up all the information about those fields in the directory. Like everything else in our MARC records, the directory is a ByteString, but you can create another synonym for readability.

Listing 26.15. Type synonym for `MarcDirectoryRaw`

type MarcDirectoryRaw = B.ByteString

Unlike the leader, which is always the first 24 bytes, the directory can be of variable size. This is because each record may contain a different number of fields. You know that the directory starts after the leader, but you have to figure out where the directory ends. Unfortunately, the leader doesn’t tell you this information directly. Instead it tells you the base address, which is where the base record begins. The directory, then, is whatever lies between the end of the leader and the start of the base record.

The base address is stored in the leader in bytes 12 through 16 (5 bytes in total), counting from 0. To access it, you take the leader, drop the first 12 bytes, and then take the next 5 of the 12 bytes that remain. After this, you have to convert this value from a ByteString to an Int, just as you did with the record length.

Listing 26.16. Getting the base address to determine the size of the directory

getBaseAddress :: MarcLeaderRaw -> Int
getBaseAddress leader = rawToInt (B.take 5 remainder)
  where remainder = B.drop 12 leader

Then, to calculate the length of the directory, you subtract (leaderLength + 1) from the base address. The extra 1 accounts for the single terminator byte that the MARC format places at the end of the directory, which isn’t part of any directory entry.

Listing 26.17. Calculating the length of the directory with `getDirectoryLength`

getDirectoryLength :: MarcLeaderRaw -> Int
getDirectoryLength leader = getBaseAddress leader - (leaderLength + 1)
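The arithmetic is easier to follow on a concrete leader. In the sketch below, sampleLeader is a hypothetical value: its record length, 01292, matches the example from figure 26.4, but the base address of 00385 and the remaining bytes are made up for illustration. The helper functions are copies of the ones in the listings above.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

leaderLength :: Int
leaderLength = 24

rawToInt :: B.ByteString -> Int
rawToInt = read . T.unpack . E.decodeUtf8

getRecordLength :: B.ByteString -> Int
getRecordLength leader = rawToInt (B.take 5 leader)

getBaseAddress :: B.ByteString -> Int
getBaseAddress leader = rawToInt (B.take 5 (B.drop 12 leader))

getDirectoryLength :: B.ByteString -> Int
getDirectoryLength leader = getBaseAddress leader - (leaderLength + 1)

-- Hypothetical 24-byte leader; bytes 0-4 and 12-16 hold the two numbers.
sampleLeader :: B.ByteString
sampleLeader = "01292cam a2200385 a 4500"

recordLen, baseAddr, dirLen, entryCount :: Int
recordLen  = getRecordLength sampleLeader     -- 1292
baseAddr   = getBaseAddress sampleLeader      -- 385
dirLen     = getDirectoryLength sampleLeader  -- 385 - 25 = 360
entryCount = dirLen `div` 12                  -- 30 entries of 12 bytes each
```

With these made-up numbers, the directory would occupy bytes 24 through 383 of the record, byte 384 would be its terminator, and the base record would begin at byte 385.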

You can now put all these pieces together to get the directory. You look up the directory length from the record, drop the leader from the front of the record, and then take that many bytes from what remains.

Listing 26.18. Putting everything together to `getDirectory`

getDirectory :: MarcRecordRaw -> MarcDirectoryRaw
getDirectory record = B.take directoryLength afterLeader
    where directoryLength = getDirectoryLength record
          afterLeader = B.drop leaderLength record

At this point, you’ve come a long way in understanding this rather opaque format. Now you have to make sense of what’s inside the directory.

26.2.5. Using the directory to look up fields

At this point, your directory is a big ByteString, which you still need to make sense of. As mentioned earlier, the directory allows you to look up fields in the base record. It also tells you what fields there are. Thankfully, each instance of this field metadata is exactly the same size: 12 bytes.

Listing 26.19. `MarcDirectoryEntryRaw` type synonym and `dirEntryLength`

type MarcDirectoryEntryRaw = B.ByteString

dirEntryLength :: Int
dirEntryLength = 12

Next you need to split up your directory into a list of MarcDirectoryEntries. Here’s the type signature of this function:

splitDirectory :: MarcDirectoryRaw -> [MarcDirectoryEntryRaw]

This is a fairly straightforward function: you take a chunk of 12 bytes and add it to a list until there’s no directory left.

Listing 26.20. `splitDirectory` breaks down the directory into its entries

splitDirectory directory = if directory == B.empty
                           then []
                           else nextEntry : splitDirectory restEntries
  where (nextEntry, restEntries) = B.splitAt dirEntryLength directory

Now that you have this list of raw DirectoryEntries, you’re close to finally getting your author and title data.

26.2.6. Processing the directory entries and looking up MARC fields

Each entry in the directory is like a miniature version of the record leader. The metadata for each entry has the following information:

Tag of the field (first three characters)
Length of the field (next four characters)
Where the field starts relative to the base address (remaining five characters)

Because you want to use all this information, you’re going to create a data type for your FieldMetadata.

Listing 26.21. `FieldMetadata` type

data FieldMetadata = FieldMetadata { tag         :: T.Text
                                   , fieldLength :: Int
                                   , fieldStart  :: Int } deriving Show

Next you have to process your list of MarcDirectoryEntryRaw into a list of FieldMetadata. As is often the case whenever you’re working with lists, it’s easier to start with transforming a single MarcDirectoryEntryRaw into a FieldMetadata type.

Listing 26.22. Converting a raw directory entry into a `FieldMetadata` type

makeFieldMetadata :: MarcDirectoryEntryRaw -> FieldMetadata
makeFieldMetadata entry = FieldMetadata textTag theLength theStart
  where (theTag,rest) = B.splitAt 3 entry
        textTag = E.decodeUtf8 theTag
        (rawLength,rawStart) = B.splitAt 4 rest
        theLength = rawToInt rawLength
        theStart = rawToInt rawStart

Now converting a list of one type to a list of another is as simple as using map.

Listing 26.23. Mapping `makeFieldMetadata` to `[FieldMetadata]`

getFieldMetadata ::  [MarcDirectoryEntryRaw] -> [FieldMetadata]
getFieldMetadata rawEntries = map makeFieldMetadata rawEntries
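Here’s a small worked example of the split-then-parse steps on a made-up directory. The toyDirectory value is hypothetical: it holds two 12-byte entries, one for tag 100 (length 20, start 0) and one for tag 245 (length 81, start 20). The functions are copies of the listings above, with Eq added to the deriving clause so the results can be compared.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

data FieldMetadata = FieldMetadata { tag         :: T.Text
                                   , fieldLength :: Int
                                   , fieldStart  :: Int } deriving (Show, Eq)

rawToInt :: B.ByteString -> Int
rawToInt = read . T.unpack . E.decodeUtf8

-- Cut the directory into 12-byte entries.
splitDirectory :: B.ByteString -> [B.ByteString]
splitDirectory directory
  | B.null directory = []
  | otherwise        = nextEntry : splitDirectory restEntries
  where (nextEntry, restEntries) = B.splitAt 12 directory

-- 3 bytes of tag, 4 bytes of length, 5 bytes of start offset.
makeFieldMetadata :: B.ByteString -> FieldMetadata
makeFieldMetadata entry = FieldMetadata textTag theLength theStart
  where (theTag, rest) = B.splitAt 3 entry
        textTag = E.decodeUtf8 theTag
        (rawLength, rawStart) = B.splitAt 4 rest
        theLength = rawToInt rawLength
        theStart = rawToInt rawStart

-- Hypothetical directory with two entries.
toyDirectory :: B.ByteString
toyDirectory = "100002000000245008100020"

toyMetadata :: [FieldMetadata]
toyMetadata = map makeFieldMetadata (splitDirectory toyDirectory)
```

Evaluating toyMetadata in GHCi should give one FieldMetadata for tag "100" with length 20 and start 0, followed by one for tag "245" with length 81 and start 20.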

With getFieldMetadata, you can write a function that lets you look up the field itself. Now that you’re looking up fields, you need to stop thinking in bytes and start thinking in text. Your fields will have information about author and title, and other text data. You’ll create another type synonym for your FieldText.

Listing 26.24. Type synonym for `FieldText`

type FieldText = T.Text

What you want now is to take a MarcRecordRaw and a FieldMetadata and get back a FieldText so you can start looking up useful values!

To do this, you first have to drop both the leader and the directory from your MarcRecordRaw so you end up with the base record. Then you can drop fieldStart bytes from the base record and finally take fieldLength bytes from what remains.

Listing 26.25. Getting the `FieldText`

getTextField :: MarcRecordRaw -> FieldMetadata -> FieldText
getTextField record fieldMetadata = E.decodeUtf8 byteStringValue
  where baseAddress = getBaseAddress record
        baseRecord = B.drop baseAddress record
        baseAtEntry = B.drop (fieldStart fieldMetadata) baseRecord
        byteStringValue = B.take (fieldLength fieldMetadata) baseAtEntry

You’ve come a long way in understanding this mysterious format. You have just one step to go, which is processing the FieldText into something you can use.

26.2.7. Getting Author and Title information from a MARC field

In MARC records, each special value is associated with a tag. For example, the Title tag is 245. Unfortunately, this isn’t the end of the story. Each field is made up of subfields that are separated by a delimiter, the ASCII character number 31. You can use toEnum to get this character.

Listing 26.26. Getting the field delimiter

fieldDelimiter :: Char
fieldDelimiter = toEnum 31

You can use T.split to split the FieldText into subfields. Each subfield is identified by a single-character code and contains a value, such as a title or an author. The subfield code is a single letter that comes immediately before the value, as shown in figure 26.5.

Figure 26.5. An example title subfield `a`. Notice that `a` is the first character of the title text you receive.

To fetch your title, you want field 245 and subfield a, with subfield a being the main title. For your author, you want field 100 and subfield a.

Listing 26.27. Tags and subfield codes for title and author

titleTag :: T.Text
titleTag = "245"

titleSubfield :: Char
titleSubfield = 'a'

authorTag :: T.Text
authorTag = "100"

authorSubfield :: Char
authorSubfield = 'a'

To get the value of a field, you need to look up its location in the record by using FieldMetadata. Then you split the raw field into its subfields. Finally, you look at the first character in each subfield to see whether the subfield you want is there.

Now you have another problem. You don’t know for certain that the field you want will be in your record, and you also don’t know that your subfield will be in your field. You need to use the Maybe type to check both of these. You’ll start with lookupFieldMetadata, which will check the directory for the FieldMetadata that you’re looking for. If the field doesn’t exist, it returns Nothing; otherwise, it returns Just your field.

Listing 26.28. Safely looking up `FieldMetadata` from the directory

lookupFieldMetadata :: T.Text -> MarcRecordRaw -> Maybe FieldMetadata
lookupFieldMetadata aTag record = if length results < 1
                                  then Nothing
                                  else Just (head results)

  where metadata = (getFieldMetadata . splitDirectory . getDirectory)
                   record
        results = filter ((== aTag) . tag) metadata
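The filter-then-head pattern can also be expressed with find from Data.List, which already returns a Maybe. The sketch below is one possible alternative rather than the lesson’s own code. To stay self-contained, it searches an already-parsed list instead of a raw record, and findFieldMetadata and toyMetadata are hypothetical names.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import Data.List (find)

data FieldMetadata = FieldMetadata { tag         :: T.Text
                                   , fieldLength :: Int
                                   , fieldStart  :: Int } deriving (Show, Eq)

-- Hypothetical helper: the first entry with a matching tag, if any.
findFieldMetadata :: T.Text -> [FieldMetadata] -> Maybe FieldMetadata
findFieldMetadata aTag = find ((== aTag) . tag)

-- Hypothetical parsed directory used to try the helper out.
toyMetadata :: [FieldMetadata]
toyMetadata = [ FieldMetadata "100" 20 0
              , FieldMetadata "245" 81 20 ]
```

Here findFieldMetadata "245" toyMetadata evaluates to Just the second entry, and findFieldMetadata "999" toyMetadata evaluates to Nothing, without any explicit length check or call to head.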

Because you’ll always look up a field and a subfield together, you’ll pass this Maybe FieldMetadata into the function that looks up a subfield. The lookupSubfield function will take a Maybe FieldMetadata argument, the subfield Char, and the MarcRecordRaw, returning a Maybe T.Text of the data inside the subfield.

Listing 26.29. Safely looking up a potentially missing subfield

lookupSubfield :: (Maybe FieldMetadata) -> Char ->
                  MarcRecordRaw -> Maybe T.Text
lookupSubfield Nothing subfield record = Nothing                   -- 1
lookupSubfield (Just fieldMetadata) subfield record =
    if results == []                                               -- 2
    then Nothing                                                   -- 3
    else Just ((T.drop 1 . head) results)                          -- 4
  where rawField = getTextField record fieldMetadata
        subfields = T.split (== fieldDelimiter) rawField
        results = filter ((== subfield) . T.head) subfields

1 If the metadata is missing, clearly you can’t look up a subfield.
2 If the results of your search for the subfield are empty, the subfield isn’t there.
3 Empty results mean you return nothing.
4 Otherwise, you take the first matching subfield and drop its first character, which is the subfield code.
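The subfield search can be tried on its own with a hand-built field. In this sketch, toyField is made-up data (two indicator characters followed by subfields a and c), and findSubfield is a hypothetical variant that works directly on the field text. It uses T.uncons instead of T.head, so an empty chunk between two delimiters is skipped rather than causing a runtime error.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import Data.Maybe (listToMaybe)

fieldDelimiter :: Char
fieldDelimiter = toEnum 31

-- Hypothetical raw title field: indicators, then subfields a and c.
toyField :: T.Text
toyField = T.intercalate (T.singleton fieldDelimiter)
  [ "10", "aThe tears of Eros /", "cGeorges Bataille." ]

-- Hypothetical variant of the subfield search: split on the delimiter,
-- keep chunks whose first character is the wanted code, and return the
-- first match with that code removed.
findSubfield :: Char -> T.Text -> Maybe T.Text
findSubfield code rawField = listToMaybe
  [ value
  | chunk <- T.split (== fieldDelimiter) rawField
  , Just (c, value) <- [T.uncons chunk]
  , c == code ]
```

With this data, findSubfield 'a' toyField evaluates to Just "The tears of Eros /", and findSubfield 'b' toyField evaluates to Nothing because the toy field has no subfield b.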

All you care about is the value for a specific field/subfield combo. Next you’ll create a general lookupValue function that takes a tag, a subfield char, and a record.

Listing 26.30. General `lookupValue` function for looking up tag-subfield code pairs

lookupValue :: T.Text -> Char -> MarcRecordRaw -> Maybe T.Text
lookupValue aTag subfield record = lookupSubfield entryMetadata
                                                  subfield record
  where entryMetadata = lookupFieldMetadata aTag record

You can wrap up getting your values by making two helper functions, lookupAuthor and lookupTitle, by using partial application.

Listing 26.31. Specific cases of looking up `Title` and `Author`

lookupTitle :: MarcRecordRaw -> Maybe Title
lookupTitle = lookupValue titleTag titleSubfield

lookupAuthor :: MarcRecordRaw -> Maybe Author
lookupAuthor = lookupValue authorTag authorSubfield

At this point, you’ve completely abstracted away the details of working with your MARC record format, and can build your final main, which will tie this all together.

26.3. Putting it all together

You’ve tackled the mess of writing a parser for your MARC records, and now you have access to a wide range of book information you can use. Remember that you want as little in your main IO action as possible, so the goal is to reduce everything to a single conversion from a ByteString (representing the MARC file) to Html (representing your output file). The first step is to convert your ByteString to a list of (Maybe Title, Maybe Author) pairs.

Listing 26.32. Raw MARC records to `Maybe Title`, `Maybe Author` pairs

marcToPairs :: B.ByteString -> [(Maybe Title, Maybe Author)]
marcToPairs marcStream = zip titles authors
 where records = allRecords marcStream
       titles = map lookupTitle records
       authors = map lookupAuthor records

Next you’d like to change these Maybe pairs into a list of books. You’ll do this by only making a Book when both Author and Title are Just values. You’ll use the fromJust function found in Data.Maybe to help with this.

Listing 26.33. Convert `Maybe` values into `Books`

pairsToBooks :: [(Maybe Title, Maybe Author)] -> [Book]
pairsToBooks pairs = map (\(title, author) -> Book {
                              title = fromJust title
                             ,author = fromJust author
                             }) justPairs
 where justPairs = filter (\(title,author) -> isJust title
                                              && isJust author) pairs
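If you’d rather avoid fromJust, which raises an error when given Nothing, the same filtering can be written as a list comprehension that pattern matches on the pairs. This is only an alternative sketch: pairsToBooks' and toyPairs are hypothetical names, and the types are repeated so the example stands alone.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

type Author = T.Text
type Title = T.Text

data Book = Book {
    author :: Author
   ,title :: Title } deriving (Show, Eq)

-- Hypothetical alternative to pairsToBooks: pairs that don't match the
-- (Just t, Just a) pattern are silently skipped by the comprehension.
pairsToBooks' :: [(Maybe Title, Maybe Author)] -> [Book]
pairsToBooks' pairs =
  [ Book { title = t, author = a } | (Just t, Just a) <- pairs ]

-- Hypothetical input: only the first pair has both values.
toyPairs :: [(Maybe Title, Maybe Author)]
toyPairs = [ (Just "A Short History of Decay", Just "Cioran, Emil")
           , (Just "Untitled", Nothing)
           , (Nothing, Just "Anonymous") ]
```

Evaluating pairsToBooks' toyPairs gives a list with a single Book, because the other two pairs are each missing a value.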

You already have your booksToHtml function from before, so now you can compose all these functions together to get your final processRecords function. Because there are so many records in your files, you’ll also provide a parameter to specify the number of records you’re looking up.

Listing 26.34. Putting it all together in `processRecords`

processRecords :: Int -> B.ByteString -> Html
processRecords n = booksToHtml . pairsToBooks . (take n) .  marcToPairs

Despite this being a lesson on I/O, and this being a fairly intensive I/O task, you might be surprised at how remarkably minimal your final main IO action is:

main :: IO ()
main = do
   marcData <- B.readFile  "sample.mrc"
   let processed = processRecords 500 marcData
   TIO.writeFile "books.html" processed

Now you’ve successfully converted your raw MARC records into a much more readable format. Notice that Unicode values also came out okay!

With lookupValue, you also have a nice, general tool you can use to look up any tag and subfield specified in the MARC standard.

Summary

In this capstone, you

Modeled textual book data by using the Text type
Wrote tools to perform binary file processing by using ByteString to manipulate bytes
Safely managed Unicode text within a binary document by using decodeUtf8
Successfully transformed an opaque binary format into readable HTML

Extending the exercise

Now that you know the basics of processing MARC records, there’s a world of interesting book data out there to explore. If you’d like to extend this exercise, look into fleshing out more of the details of processing the MARC record. For example, you may have noticed that trailing punctuation sometimes appears after our title. This is because a subfield b contains the rest of the extended title. Combining subfields a and b will give you the full title. The Library of Congress (LoC) provides extensive information on MARC records, and you can start exploring at www.loc.gov/marc/bibliographic/.

Another challenge you didn’t tackle is dealing with an annoying non-Unicode character encoding that exists in a large number of MARC records called MARC-8. In MARC-8, a small subset of the Unicode characters is represented differently for historical reasons. The LoC has resources to add in this conversion: www.loc.gov/marc/specifications/speccharconversion.html. Whether a record is encoded in MARC-8 or standard Unicode can be determined from the leader. See the “Character Coding Scheme” section of the official LoC documentation: www.loc.gov/marc/bibliographic/bdleader.html.
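As a starting point for that exercise, here’s a minimal sketch of reading the character coding scheme from the leader. It assumes, following the LoC leader documentation, that byte 9 of the leader (counting from 0) is the letter a for Unicode and a blank for MARC-8. The CharCoding type and getCharCoding function are hypothetical names that aren’t part of the lesson’s code.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B

data CharCoding = Marc8 | Unicode | UnknownCoding deriving (Show, Eq)

-- Hypothetical helper: inspect byte 9 of the leader.
getCharCoding :: B.ByteString -> CharCoding
getCharCoding leader
  | B.length leader <= 9 = UnknownCoding   -- too short to hold byte 9
  | otherwise =
      case B.index leader 9 of
        97 -> Unicode        -- ASCII 'a'
        32 -> Marc8          -- ASCII space (blank)
        _  -> UnknownCoding
```

For instance, getCharCoding "01292cam a2200385 a 4500" evaluates to Unicode. Records that come back as Marc8 would need the extra conversion step before decodeUtf8 can be trusted to produce the right characters.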
