In this capstone, you’re going to use the data on books created by libraries to make a simple HTML document. Libraries collectively spend a huge amount of time cataloging every possible book in existence. Thankfully, much of this data is freely available to anyone who wants to explore it. Harvard Library alone has released 12 million book records to be used for free by the public (http://library.harvard.edu/open-metadata). The Open Library project contains millions of additional records for use (https://archive.org/details/ol_data).
In a time when data science is a hot trend, it would be great to make some fun projects with all this data. But there’s a big challenge to using this data. Libraries store their book-related metadata in a rather obscure format called a MARC record (for Machine-Readable Cataloging record). This makes using library data much more challenging than if it were in a more common format such as JSON or XML. MARC records are in a binary format that also makes heavy use of Unicode to properly store character encodings. To use MARC records, you have to be careful about separating when you’re working with bytes from when you’re working with text. This is a perfect problem to explore all you’ve learned in this unit!
Our goal for this capstone is to take a collection of MARC records and convert it into an HTML document that lists the titles and authors of every book in the collection. This will leave you with a solid foundation to further explore extracting data from MARC records.
You’ll be writing all of your code in a single file, marc_to_html.hs. To get started, you’ll need the following imports (plus your OverloadedStrings extension).
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import qualified Data.Text.Encoding as E
import Data.Maybe
You may have noticed that you’re not importing Data.ByteString.Char8. This is because when working with Unicode data, you never want to confuse Unicode text with ASCII text. The best way to ensure this is to use plain old ByteStrings for manipulating bytes and Text for everything else.
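To see why the distinction between bytes and text matters, here is a minimal sketch. The names sampleText, sampleBytes, charCount, and byteCount are made up for this illustration and aren't part of the capstone code. It assumes the text is encoded as UTF-8, where an accented character takes more than one byte.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

-- Hypothetical sample value: "\233" is the single character e-acute.
sampleText :: T.Text
sampleText = "Caf\233"

-- The same value as raw UTF-8 bytes.
sampleBytes :: B.ByteString
sampleBytes = E.encodeUtf8 sampleText

-- T.length counts characters: 4 here.
charCount :: Int
charCount = T.length sampleText

-- B.length counts bytes: 5 here, because e-acute needs two bytes in UTF-8.
byteCount :: Int
byteCount = B.length sampleBytes
```

Because MARC records describe lengths and offsets in bytes, measuring with T.length instead of B.length would give the wrong positions as soon as a record contains a non-ASCII character.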
Unpacking MARC records is going to be a bit of work, so it’s good to figure out where you want to end up before you get lost. Your primary goal is to convert a list of books into an HTML document. The books being in an obscure format is one obstacle to our goal. In this capstone, you’re concerned with recording only the author and title of the books. You can use a type synonym for these properties. You could use String, but as mentioned in lesson 23, as a general rule it’s much better to use Text when dealing with any large task consisting mostly of text data.
Now you can create your type synonyms for Author and Title.
type Author = T.Text
type Title = T.Text
Your Book type will be the product type of Author and Title.
data Book = Book
  { author :: Author
  , title :: Title
  } deriving Show
Your final function for this will be called booksToHtml and will have the type [Book] -> Html. Before implementing this function, you first need to determine what type Html will be, and ideally how to make an individual book into a snippet of HTML. You can use the Text type once again to model your HTML.
type Html = T.Text
To make it easier to turn a list of books into HTML, you’ll start by creating a snippet of HTML for a single book. Your HTML will create a paragraph element, and then denote the title with a <strong> tag and the author with an <em> tag.
bookToHtml :: Book -> Html
bookToHtml book = mconcat ["<p> "
                          ,titleInTags
                          ,authorInTags
                          ,"</p> "]
  where titleInTags = mconcat ["<strong>",(title book),"</strong> "]
        authorInTags = mconcat ["<em>",(author book),"</em> "]
Next you need some sample books you can work with.
book1 :: Book
book1 = Book
  { title = "The Conspiracy Against the Human Race"
  , author = "Ligotti, Thomas"
  }

book2 :: Book
book2 = Book
  { title = "A Short History of Decay"
  , author = "Cioran, Emil"
  }

book3 :: Book
book3 = Book
  { title = "The Tears of Eros"
  , author = "Bataille, Georges"
  }
In GHCi, you can test this bit of code:
GHCi> bookToHtml book1
"<p> <strong>The Conspiracy Against the Human Race</strong> <em>Ligotti, Thomas</em> </p> "
To transform a list of books, you can map your bookToHtml function over the list. You also need to make sure you add html, head, and body tags as well.
booksToHtml :: [Book] -> Html
booksToHtml books = mconcat ["<html> "
                            ,"<head><title>books</title>"
                            ,"<meta charset='utf-8'/>"
                            ,"</head> "
                            ,"<body> "
                            ,booksHtml
                            ," </body> "
                            ,"</html>"]
  where booksHtml = (mconcat . (map bookToHtml)) books
To test this out, you can put your books in a list:
myBooks :: [Book]
myBooks = [book1,book2,book3]
Finally, you can build a main and test out your code so far. You’ll assume you’re writing to a file called books.html. Remember that your Html type is Text. To write Text to a file, you’ll use the Data.Text.IO module, which you imported as TIO.
main :: IO ()
main = TIO.writeFile "books.html" (booksToHtml myBooks)
Running this program will output your books.html file. Opening it up, you can see that it looks like you’d expect (see figure 26.1).
With the ability to write books to a file, you can tackle the more complicated issue of working with MARC records.
The MARC record is the standard used in libraries for recording and transmitting information about books (called bibliographic data). If you’re interested in data about books, MARC records are an important format to understand. There are many large, freely available collections of MARC records online. You’ll be using the Oregon Health & Science University library records in this capstone. As noted earlier, MARC stands for Machine-Readable Cataloging record. As indicated by the name, MARC records are designed to be read by machines. Unlike formats such as JSON and XML, they aren’t designed to be human-readable. If you open a MARC record file, you’ll see something that looks like figure 26.2.
If you’ve ever worked with the ID3 tag format for storing MP3 metadata, you’ll find MARC records are similar.
The MARC record standard was developed in the 1960s with the primary aim of making it efficient to store and transmit information. Because of this, MARC records are much less flexible and extensible than formats such as XML or JSON. The MARC record consists of three main parts: the leader, the directory, and the base record.
Figure 26.3 shows an annotated version of your raw MARC record to help visualize how the record is laid out.
The leader contains information about the record itself, such as the length of the record and where to find the base record. The directory of the record tells you about the information contained in the record and how to access it. For example, you care only about the author and title of the book. The directory will tell you that the record contains this information and where to look in the file to find it. Finally, the base record is where all the information you need is located. But without the leader and directory, you don’t have the information needed to make sense of this part of the file.
The first thing you need to do is get some MARC record data you can work with. Thankfully, archive.org has a great collection of freely available MARC records. For this project, you’re going to use a collection of records from the Oregon Health & Science University library. Go to the project page on archive.org:
https://archive.org/download/marc_oregon_summit_records/catalog_files/
Download the ohsu_ncnm_wscc_bibs.mrc file. For this lesson, you’ll rename the file sample.mrc. At 156 MB, this file is the smallest of the bunch, but if you’d like to play around with the others, they should all work equally well.
Your .mrc file isn’t a single MARC record but rather a collection of records. Before worrying about the details of a single record, you need to figure out how to separate all the records in this collection. Unlike many other formats for holding serialized data, MARC isn’t designed to be split on a delimiter character. Instead, you need to look into the leader of each record to see how long it is. By looking at the length, you can then iterate through the stream and collect records as you go. To begin, let’s create synonyms for your MarcRecordRaw and MarcLeaderRaw.
type MarcRecordRaw = B.ByteString
type MarcLeaderRaw = B.ByteString
Because you’re primarily manipulating bytes, nearly all of your types when working with the raw MARC record are going to be ByteStrings. But using type synonyms will make it much easier to read your code and understand your type signatures. The first thing you want to do is to get the leader from the record:
getLeader :: MarcRecordRaw -> MarcLeaderRaw
The leader is the first 24 bytes of the record, as shown in figure 26.4.
You can declare a variable to keep track of your leader length.
leaderLength :: Int
leaderLength = 24
Getting the leader from a MARC record is as straightforward as taking the first 24 bytes of the MarcRecordRaw.
getLeader :: MarcRecordRaw -> MarcLeaderRaw
getLeader record = B.take leaderLength record
Just as the first 24 bytes of the MARC record is the leader, the first 5 bytes of the leader contain a number telling you the length of the record. For example, in figure 26.4 you see that the record starts with 01292, which means this record is 1,292 bytes long. To get the length of your entire record, you need to take these first five bytes and then convert them to an Int. You’ll create a useful helper function, rawToInt, which converts your ByteString to Text, then converts that Text to a String, and finally uses read to parse an Int. Note that read throws an error if the bytes don’t form a valid number, so this helper assumes well-formed records.
rawToInt :: B.ByteString -> Int
rawToInt = (read . T.unpack . E.decodeUtf8)

getRecordLength :: MarcLeaderRaw -> Int
getRecordLength leader = rawToInt (B.take 5 leader)
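If you’d rather not risk a runtime error on malformed data, one option is a variant that returns a Maybe. The following is a sketch only; rawToIntMaybe is a hypothetical name and isn’t used in the rest of the capstone. It relies on decodeUtf8' from Data.Text.Encoding and readMaybe from Text.Read.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E
import Text.Read (readMaybe)

-- Hypothetical total variant of rawToInt.
-- Returns Nothing when the bytes are not valid UTF-8
-- or when the decoded text is not a number.
rawToIntMaybe :: B.ByteString -> Maybe Int
rawToIntMaybe raw =
  case E.decodeUtf8' raw of
    Left _     -> Nothing
    Right text -> readMaybe (T.unpack text)
```

The trade-off is that every caller then has to handle the Nothing case, which is why the capstone sticks with the simpler rawToInt.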
Now that you have a way to figure out the length of a single record, you can think about separating all the records that you find into a list of MarcRecords. You’ll consider your file a ByteString. You want a function that will take that ByteString and separate it into a pair of values: the first record and the rest of the remaining ByteString. You’ll call this function nextAndRest, which has the following type signature:
nextAndRest :: B.ByteString -> (MarcRecordRaw,B.ByteString)
You can think of this pair of values as being the same as getting the head and tail of a list. To get this pair, you need to get the length of the first record in the stream and then split the stream at this value.
nextAndRest :: B.ByteString -> (MarcRecordRaw,B.ByteString)
nextAndRest marcStream = B.splitAt recordLength marcStream
  where recordLength = getRecordLength marcStream
To iterate through the entire file, you recursively use this function to take a record and the rest of the file. You then put the record in a list and repeat on the rest of the file until you reach the end.
allRecords :: B.ByteString -> [MarcRecordRaw]
allRecords marcStream = if marcStream == B.empty
                        then []
                        else next : allRecords rest
  where (next, rest) = nextAndRest marcStream
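One thing to be aware of is that this recursion trusts the length stored in each leader. If a damaged file reported a length of 0, the stream would never shrink. The sketch below shows one way to guard against that; allRecordsSafe and recordLengthMaybe are hypothetical names, and stopping at the first malformed record is an assumption about what you’d want, not part of the capstone.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E
import Text.Read (readMaybe)

-- Hypothetical helper: parse the five-byte length at the front of the stream.
recordLengthMaybe :: B.ByteString -> Maybe Int
recordLengthMaybe stream =
  case E.decodeUtf8' (B.take 5 stream) of
    Left _     -> Nothing
    Right text -> readMaybe (T.unpack text)

-- Hypothetical variant of allRecords that stops at the first record
-- whose length field is missing, non-numeric, or not positive.
allRecordsSafe :: B.ByteString -> [B.ByteString]
allRecordsSafe stream
  | B.null stream = []
  | otherwise =
      case recordLengthMaybe stream of
        Just n | n > 0 ->
          let (next, rest) = B.splitAt n stream
          in next : allRecordsSafe rest
        _ -> []
```

For well-formed input this should produce the same list as allRecords; it only differs in how it reacts to bad length fields.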
You can test allRecords by rewriting your main to read in your sample.mrc file and print out the number of records in that file:
main :: IO ()
main = do
  marcData <- B.readFile "sample.mrc"
  let marcRecords = allRecords marcData
  print (length marcRecords)
You can run your main by either compiling your program or loading it into GHCi and calling main:
GHCi> main
140328
There are 140,328 records in this collection! Now that you’ve split up all of your records, you can move on to figuring out exactly how to get all of your Title and Author data.
MARC records store all the information about a book in fields. Each field has a tag and subfields that tell you more about the information that’s in a book (such as author, title, subject, and publication date). Before you can worry about processing the fields, you need to look up all the information about those fields in the directory. Like everything else in our MARC records, the directory is a ByteString, but you can create another synonym for readability.
type MarcDirectoryRaw = B.ByteString
Unlike the leader, which is always the first 24 bytes, the directory can be of variable size. This is because each record may contain a different number of fields. You know that the directory starts after the leader, but you have to figure out where the directory ends. Unfortunately, the leader doesn’t tell you this information directly. Instead it tells you the base address, which is where the base record begins. The directory, then, is whatever lies between where the leader ends and where the base record begins.
The base address is stored in the leader in bytes 12 through 16 (a total of 5 bytes), counting from index 0. To access it, you can take the leader, drop the first 12 bytes from it, and then take the first 5 of the remaining 12 bytes. After this, you have to convert this value from a ByteString to an Int, just as you did with the record length.
getBaseAddress :: MarcLeaderRaw -> Int
getBaseAddress leader = rawToInt (B.take 5 remainder)
  where remainder = B.drop 12 leader
Then, to calculate the length of the directory, you subtract (leaderLength + 1) from the base address. The extra 1 accounts for the one-byte field terminator that ends the directory, so the result is the number of bytes taken up by the directory entries themselves.
getDirectoryLength :: MarcLeaderRaw -> Int
getDirectoryLength leader = getBaseAddress leader - (leaderLength + 1)
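To make the arithmetic concrete, here is a small sketch that works only with numbers. The names directoryLengthFor and entryCountFor are hypothetical, and the base address of 265 used in the comments is a made-up value rather than one read from the sample file. It assumes a 24-byte leader and 12-byte directory entries.

```haskell
-- Hypothetical helper: directory length for a given base address,
-- assuming a 24-byte leader plus a 1-byte terminator after the directory.
directoryLengthFor :: Int -> Int
directoryLengthFor baseAddress = baseAddress - (24 + 1)

-- Hypothetical helper: number of 12-byte entries in that directory.
entryCountFor :: Int -> Int
entryCountFor baseAddress = directoryLengthFor baseAddress `div` 12

-- With a made-up base address of 265:
--   directoryLengthFor 265 is 240
--   entryCountFor 265 is 20
```

A directory length that isn’t a multiple of 12 would be a hint that the record is malformed or that the base address was read incorrectly.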
You can now put all these pieces together to get the directory. You start by looking up the directory length from the record, then drop the leader from the record, and finally take the directory length from what remains.
getDirectory :: MarcRecordRaw -> MarcDirectoryRaw
getDirectory record = B.take directoryLength afterLeader
  where directoryLength = getDirectoryLength record
        afterLeader = B.drop leaderLength record
At this point, you’ve come a long way in understanding this rather opaque format. Now you have to make sense of what’s inside the directory.
Right now, your directory is a big ByteString, which you still need to make sense of. As mentioned earlier, the directory allows you to look up fields in the base record. It also tells you what fields there are. Thankfully, each instance of this field metadata is exactly the same size: 12 bytes.
type MarcDirectoryEntryRaw = B.ByteString

dirEntryLength :: Int
dirEntryLength = 12
Next you need to split up your directory into a list of MarcDirectoryEntries. Here’s the type signature of this function:
splitDirectory :: MarcDirectoryRaw -> [MarcDirectoryEntryRaw]
This is a fairly straightforward function: you take a chunk of 12 bytes and add it to a list until there’s no more directory left.
splitDirectory directory = if directory == B.empty
                           then []
                           else nextEntry : splitDirectory restEntries
  where (nextEntry, restEntries) = B.splitAt dirEntryLength directory
Now that you have this list of raw DirectoryEntries, you’re close to finally getting your author and title data.
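Each entry in the directory is like a miniature version of the record leader. The metadata for each entry has the following information: the tag of the field (the first 3 bytes), the length of the field (the next 4 bytes), and where the field starts relative to the base address (the remaining 5 bytes).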
Because you want to use all this information, you’re going to create a data type for your FieldMetadata.
data FieldMetadata = FieldMetadata
  { tag :: T.Text
  , fieldLength :: Int
  , fieldStart :: Int
  } deriving Show
Next you have to process your list of MarcDirectoryEntryRaw into a list of FieldMetadata. As is often the case whenever you’re working with lists, it’s easier to start with transforming a single MarcDirectoryEntryRaw into a FieldMetadata type.
makeFieldMetadata :: MarcDirectoryEntryRaw -> FieldMetadata
makeFieldMetadata entry = FieldMetadata textTag theLength theStart
  where (theTag,rest) = B.splitAt 3 entry
        textTag = E.decodeUtf8 theTag
        (rawLength,rawStart) = B.splitAt 4 rest
        theLength = rawToInt rawLength
        theStart = rawToInt rawStart
Now converting a list of one type to a list of another is as simple as using map.
getFieldMetadata :: [MarcDirectoryEntryRaw] -> [FieldMetadata]
getFieldMetadata rawEntries = map makeFieldMetadata rawEntries
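To check your understanding of the 3/4/5 byte layout, here is a self-contained sketch that pulls apart a single entry. The value sampleEntry is invented for this illustration, and entryParts is a hypothetical stand-in that returns a plain tuple instead of the FieldMetadata type.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

-- Made-up directory entry: tag 245, length 0081, start 00162.
sampleEntry :: B.ByteString
sampleEntry = "245008100162"

-- Hypothetical helper: split an entry into (tag, length, start).
entryParts :: B.ByteString -> (T.Text, Int, Int)
entryParts entry = (E.decodeUtf8 tagBytes, toInt lenBytes, toInt startBytes)
  where
    (tagBytes, rest) = B.splitAt 3 entry        -- first 3 bytes: the tag
    (lenBytes, startBytes) = B.splitAt 4 rest   -- next 4: length; last 5: start
    toInt :: B.ByteString -> Int
    toInt = read . T.unpack . E.decodeUtf8

-- entryParts sampleEntry is ("245", 81, 162)
```

Read this way, the entry says that the title field is 81 bytes long and begins 162 bytes after the base address.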
With getFieldMetadata, you can write a function that lets you look up the field itself. Now that you’re looking up fields, you need to stop thinking in bytes and start thinking in text. Your fields will have information about author and title, and other text data. You’ll create another type synonym for your FieldText.
type FieldText = T.Text
What you want now is a function that takes a MarcRecordRaw and a FieldMetadata and gives back a FieldText so you can start looking up useful values!
To do this, you first have to drop both the leader and the directory from your MarcRecord so you end up with the base record. Then you can drop the fieldStart from the record and finally take the fieldLength from this remaining bit.
getTextField :: MarcRecordRaw -> FieldMetadata -> FieldText
getTextField record fieldMetadata = E.decodeUtf8 byteStringValue
  where recordLength = getRecordLength record
        baseAddress = getBaseAddress record
        baseRecord = B.drop baseAddress record
        baseAtEntry = B.drop (fieldStart fieldMetadata) baseRecord
        byteStringValue = B.take (fieldLength fieldMetadata) baseAtEntry
You’ve come a long way in understanding this mysterious format. You have just one step to go, which is processing the FieldText into something you can use.
In MARC records, each special value is associated with a tag. For example, the Title tag is 245. Unfortunately, this isn’t the end of the story. Each field is made up of subfields that are separated by a delimiter, the ASCII character number 31. You can use toEnum to get this character.
fieldDelimiter :: Char
fieldDelimiter = toEnum 31
You can use T.split to split the FieldText into subfields. Each subfield begins with a subfield code, a single character that identifies it, followed by the subfield’s value (for example, a title or author), as shown in figure 26.5.
To fetch your title, you want field 245 and subfield a, with subfield a being the main title. For your author, you want field 100 and subfield a.
titleTag :: T.Text
titleTag = "245"

titleSubfield :: Char
titleSubfield = 'a'

authorTag :: T.Text
authorTag = "100"

authorSubfield :: Char
authorSubfield = 'a'
To get the value of a field, you need to look up its location in the record by using FieldMetadata. Then you split the raw field into its subfields. Finally, you look at the first character in each subfield to see whether the subfield you want is there.
Now you have another problem. You don’t know for certain that the field you want will be in your record, and you also don’t know that your subfield will be in your field. You need to use the Maybe type to check both of these. You’ll start with lookupFieldMetadata, which will check the directory for the FieldMetadata that you’re looking for. If the field doesn’t exist, it returns Nothing; otherwise, it returns Just your field metadata.
lookupFieldMetadata :: T.Text -> MarcRecordRaw -> Maybe FieldMetadata
lookupFieldMetadata aTag record = if length results < 1
                                  then Nothing
                                  else Just (head results)
  where metadata = (getFieldMetadata . splitDirectory . getDirectory) record
        results = filter ((== aTag) . tag) metadata
Because you only ever look up a field and a subfield together, you’ll pass this Maybe FieldMetadata into the function that looks up a subfield. The lookupSubfield function takes a Maybe FieldMetadata argument, the subfield Char, and the MarcRecordRaw, returning a Maybe T.Text of the data inside the subfield.
lookupSubfield :: (Maybe FieldMetadata) -> Char -> MarcRecordRaw -> Maybe T.Text
lookupSubfield Nothing subfield record = Nothing
lookupSubfield (Just fieldMetadata) subfield record =
  if results == []
  then Nothing
  else Just ((T.drop 1 . head) results)
  where rawField = getTextField record fieldMetadata
        subfields = T.split (== fieldDelimiter) rawField
        results = filter ((== subfield) . T.head) subfields
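The splitting step is easier to follow on a small, hand-built field. In the sketch below, sampleField is an invented value laid out like a title field (two indicator characters, then subfields separated by character 31, ending with the field terminator, character 30), and subfieldValue is a hypothetical helper that works directly on Text. It also checks for empty chunks before calling T.head, which is an extra precaution rather than something the capstone code does.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

-- Made-up field text: indicators "10", subfield a, subfield c, terminator.
sampleField :: T.Text
sampleField = "10\31aA short history of decay /\31cE.M. Cioran.\30"

-- Hypothetical helper: find the first subfield with the given code
-- and return its value without the leading code character.
subfieldValue :: Char -> T.Text -> Maybe T.Text
subfieldValue code field =
  case filter matches (T.split (== '\31') field) of
    []      -> Nothing
    (hit:_) -> Just (T.drop 1 hit)
  where
    matches chunk = not (T.null chunk) && T.head chunk == code

-- subfieldValue 'a' sampleField is Just "A short history of decay /"
-- subfieldValue 'z' sampleField is Nothing
```

Notice that the value of subfield a still carries the trailing " /" punctuation, which is the same behavior you’ll see in the titles extracted from real records.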
All you care about is the value for a specific field/subfield combo. Next you’ll create a specific lookupValue function that takes a tag, a subfield char, and a record.
lookupValue :: T.Text -> Char -> MarcRecordRaw -> Maybe T.Text
lookupValue aTag subfield record = lookupSubfield entryMetadata subfield record
  where entryMetadata = lookupFieldMetadata aTag record
You can wrap up getting your values by making two helper functions for lookupAuthor and lookupTitle by using partial application.
lookupTitle :: MarcRecordRaw -> Maybe Title
lookupTitle = lookupValue titleTag titleSubfield

lookupAuthor :: MarcRecordRaw -> Maybe Author
lookupAuthor = lookupValue authorTag authorSubfield
At this point, you’ve completely abstracted away the details of working with your MARC record format, and can build your final main, which will tie this all together.
You’ve tackled the mess of writing a parser for your MARC records, and now you have access to a wide range of book information you can use. Remember that you want as little as possible in your main IO action, so you want to reduce the job to converting a ByteString (representing the MARC file) to HTML (representing your output file). The first step is to convert your ByteString to a list of (Maybe Title, Maybe Author) pairs.
marcToPairs :: B.ByteString -> [(Maybe Title, Maybe Author)]
marcToPairs marcStream = zip titles authors
  where records = allRecords marcStream
        titles = map lookupTitle records
        authors = map lookupAuthor records
Next you’d like to change these Maybe pairs into a list of books. You’ll do this by only making a Book when both Author and Title are Just values. You’ll use the fromJust function found in Data.Maybe to help with this.
pairsToBooks :: [(Maybe Title, Maybe Author)] -> [Book]
pairsToBooks pairs = map (\(title, author) -> Book
                            { title = fromJust title
                            , author = fromJust author
                            }) justPairs
  where justPairs = filter (\(title,author) -> isJust title
                                               && isJust author) pairs
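As an aside, the same filtering can be written without fromJust by pattern matching inside a list comprehension. This is an alternative sketch, not the version the capstone uses; pairsToBooks' is a hypothetical name, and the Book type is repeated here (with an added Eq instance) only so the example stands alone.

```haskell
import qualified Data.Text as T

-- Repeated from the capstone so this sketch is self-contained.
data Book = Book
  { author :: T.Text
  , title :: T.Text
  } deriving (Show, Eq)

-- Hypothetical alternative: pairs that don't match (Just t, Just a)
-- are silently skipped by the list comprehension.
pairsToBooks' :: [(Maybe T.Text, Maybe T.Text)] -> [Book]
pairsToBooks' pairs =
  [ Book { title = t, author = a } | (Just t, Just a) <- pairs ]
```

Both versions keep only the pairs where the title and the author are present; this one simply never has to call a partial function.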
You already have your booksToHtml function from before, so now you can compose all these functions together to get your final processRecords function. Because there are so many records in your files, you’ll also provide a parameter to specify the number of records you’re looking up.
processRecords :: Int -> B.ByteString -> Html
processRecords n = booksToHtml . pairsToBooks . (take n) . marcToPairs
Despite this being a lesson on I/O, and this being a fairly intensive I/O task, you might be surprised at how remarkably minimal your final main IO action is:
main :: IO ()
main = do
  marcData <- B.readFile "sample.mrc"
  let processed = processRecords 500 marcData
  TIO.writeFile "books.html" processed
Now you’ve successfully converted your raw MARC records into a much more readable format. Notice that Unicode values also came out okay!
With lookupValue, you also have a nice, general tool you can use to look up any tag and subfield specified in the MARC standard.
In this capstone, you
Now that you know the basics of processing MARC records, there’s a world of interesting book data out there to explore. If you’d like to extend this exercise, look into fleshing out more of the details of processing the MARC record. For example, you may have noticed that trailing punctuation sometimes appears after our title. This is because a subfield b contains the rest of the extended title. Combining subfields a and b will give you the full title. The Library of Congress (LoC) provides extensive information on MARC records, and you can start exploring at www.loc.gov/marc/bibliographic/.
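As a starting point for the full-title extension, here is a sketch of how the two subfield values might be combined once you’ve looked them up. The name fullTitle is hypothetical, and joining the parts with a single space is an assumption about how you’d want the result to read, not a rule from the MARC standard.

```haskell
import qualified Data.Text as T

-- Hypothetical helper: combine the values of subfields a and b.
-- A missing subfield a means there is no title at all;
-- a missing subfield b means the main title is the whole title.
fullTitle :: Maybe T.Text -> Maybe T.Text -> Maybe T.Text
fullTitle Nothing _ = Nothing
fullTitle (Just mainTitle) Nothing = Just mainTitle
fullTitle (Just mainTitle) (Just rest) =
  Just (T.concat [mainTitle, T.pack " ", rest])
```

In the capstone code, the two arguments could come from lookupValue titleTag 'a' and lookupValue titleTag 'b' applied to the same record.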
Another challenge you didn’t tackle is dealing with an annoying non-Unicode character encoding that exists in a large number of MARC records called MARC-8. In MARC-8, a small subset of the Unicode characters is represented differently for historical reasons. The LoC has resources to add in this conversion: www.loc.gov/marc/specifications/speccharconversion.html. Whether a record is encoded in MARC-8 or standard Unicode can be determined from the leader. See the “Character Coding Scheme” section of the official LoC documentation: www.loc.gov/marc/bibliographic/bdleader.html.
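Detecting which encoding a record uses takes only a few lines. The sketch below reads position 9 of the leader, where the LoC documentation places the character coding scheme: a blank means MARC-8 and the letter a means Unicode. The Encoding type and the leaderEncoding function are hypothetical names introduced for this illustration.

```haskell
import qualified Data.ByteString as B

-- Hypothetical type describing what the leader reports.
data Encoding = Marc8 | Unicode | UnknownEncoding
  deriving (Show, Eq)

-- Hypothetical helper: inspect byte 9 of the leader (0-indexed).
leaderEncoding :: B.ByteString -> Encoding
leaderEncoding leader
  | B.length leader <= 9 = UnknownEncoding
  | otherwise =
      case B.index leader 9 of
        0x20 -> Marc8            -- blank: MARC-8
        0x61 -> Unicode          -- 'a': UCS/Unicode
        _    -> UnknownEncoding
```

You could use this to skip MARC-8 records, or to route them through a conversion step before decoding them as UTF-8.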