3
XML and JSON

XML, the eXtensible Markup Language, is one of the most popular formats for exchanging data over the Web. But it is more than that. It is ubiquitous in our daily life. As Harold and Means (2004, xiii) note:

XML has become the syntax of choice for newly designed document formats across almost all computer applications. It's used on Linux, Windows, Macintosh, and many other computer platforms. Mainframes on Wall Street trade stocks with one another by exchanging XML documents. Children playing games on their home PCs save their documents in XML. Sports fans receive real-time game scores on their cell phones in XML. XML is simply the most robust, reliable, and flexible document syntax ever invented.

XML looks familiar to anyone with a basic knowledge of HTML, as it shares the features of a markup language. Nevertheless, HTML and XML each serve their own specific purpose. While HTML is used to shape the display of information, the main purpose of XML is to store data. Therefore, the content of an XML document does not get much nicer when it is opened with a browser—XML is data wrapped in user-defined tags. The user-defined tags make XML much more flexible for storing data than HTML. The main goal of this chapter is not to turn you into an XML coding expert, but to get you used to the key components of XML documents.

We start with a look at a running XML example (Section 3.1) and continue with an inspection of the XML syntax (Section 3.2). There are several ways to limit the endless flexibility of XML markup. We cover technologies that allow extending XML, as well as new standards built on it that simplify the efficient exchange of specific data over the Web, in Sections 3.3 and 3.4. Section 3.5 shows how to handle XML data with R. If your web scraping task does not specifically involve XML data, you might be fine just skimming this part of the chapter, as you are already familiar with the most important concepts of markup languages from the previous chapter.

Another standard for data storage and interchange we frequently find on the Web is the JavaScript Object Notation, abbreviated JSON. JSON is an increasingly popular alternative to XML for data exchange purposes that comes with some preferable features. The second part of this chapter therefore turns to JSON. We introduce the format with a small example (Section 3.6), talk about the syntax (Section 3.7), and learn how to import JSON content into R and process the information (Section 3.8).

3.1 A short example XML document

We start with a short example of an XML file. The XML code in Figure 3.1 provides a sample of three James Bond movies, along with some basic information. Probably the most distinctive feature of XML code is that human readers have no problem in interpreting the data. Values and names are wrapped in meaningful tags. Each of the three movies is attributed with a name, a year, two actors, the budget, and the box office results. Indentation further facilitates reading but is not a necessary component of XML. It highlights the hierarchical structure of the document. The document starts with the root element <bond_movies>, which also closes the document. The elements are repeated for each movie entry—the content varies. Some elements are special. The element in the first line (<?xml...>) is not repeated, and this and the <actors> element hold some additional information between the <...> signs.


Figure 3.1 An XML code example: James Bond movies
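The figure's listing is not reproduced here; the following sketch is consistent with the descriptions in this chapter (titles, years, and actor names are documented facts about the films, while the budget and box office figures are approximate reconstructions):

<?xml version="1.0" encoding="ISO-8859-1"?>
<bond_movies>
  <movie id="1">
    <name>Dr. No</name>
    <year>1962</year>
    <actors bond="Sean Connery" villain="Joseph Wiseman"/>
    <budget>1.1M</budget>
    <boxoffice>59.5M</boxoffice>
  </movie>
  <movie id="2">
    <name>Live and Let Die</name>
    <year>1973</year>
    <actors bond="Roger Moore" villain="Yaphet Kotto"/>
    <budget>7M</budget>
    <boxoffice>126.4M</boxoffice>
  </movie>
  <movie id="3">
    <name>Skyfall</name>
    <year>2012</year>
    <actors bond="Daniel Craig" villain="Javier Bardem"/>
    <budget>175M</budget>
    <boxoffice>1108.6M</boxoffice>
  </movie>
</bond_movies>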

The XML language works quite intuitively. You should have no problem expanding and refining the dataset before even knowing every rule of the syntax. In fact, why not try it? Copy the file, go to Wikipedia, look for other details on the movies, and try to add them to the file! You can check later whether you have written correct XML code. Expanding the file is easy because information is stored as plain text and the tags that allow arranging the data in meaningful ways are entirely user-defined and should be comprehensible. While the tags might not even be necessary to interpret the data, they make XML a computer language and as such useful for communication on and between computers.

The fact that XML is a plain text format is what makes it ultimately compatible. This means that whatever browser, operating system, or PC hardware we use, we can process it. No further information or decoder is needed to interpret the data and their structure. The tags are delivered along with the data and fully describe the document—this is commonly called self-describing. Further, as tags can be nested within each other, XML documents can be used to represent complex data structures (Murrell 2009, p. 116). We will discuss these structures in the following section. To be sure, although XML is so flexible, it possesses a clear set of rules that defines the basic layout of a document. We can use simple tools to check if these rules are obeyed.1 There are also tools to further restrict structure and content in an XML document. Many developers have used the syntax of XML to create new XML-based languages that basically restrict XML to a fixed set of elements, structure, and content, into which we will look more deeply in Sections 3.4.3 and 3.4.4. Still, these derived languages remain valid XML. XML has gained a considerable amount of its popularity through these extensions.

The downside of storing information in XML files is a lack of efficiency. Plain text XML documents often hold a lot of redundant information. Note that in standard XML, the starting and closing tags are repeated for every entry. This can consume more space in the document than the actual data. Especially when we deal with large datasets or data with a highly hierarchical structure, importing and manipulating the data may take up a lot of memory.

The preferred programs for opening XML files are those that are capable of highlighting the syntax of the document and automatically indenting the elements according to their level in the hierarchical structure. Current versions of all mainstream browsers are able to lay out XML files adequately, and it is quite likely that your favorite code editor is capable of XML highlighting as well. Note, however, that XML files can be very large and contain millions of lines of data, so it may take a while to open them.

In the following sections we will talk more about the syntax of XML. We will learn how to import XML data into R and how to transform it into other data formats that are more convenient for analysis. We will also look at other XML “flavors” that are used to store a variety of data types. You might be surprised about the numerous applications that rely on XML and how one can make use of this knowledge for data scraping purposes.

3.2 XML syntax rules

Like any other computer language, XML has a set of syntax rules and key elements we have to know in order to find our way in any document. But fear not: XML rules are very simple.

3.2.1 Elements and attributes

Take another look at Figure 3.1. It helps explain large parts of what we have to know about XML. An XML document always starts with a line that makes declarations for the XML document:

<?xml version="1.0" encoding="ISO-8859-1"?>

version="1.0" indicates the version of XML that is being used. There are currently two versions: XML 1.0 and XML 1.1.2 Additionally, the declaration can, but need not, hold the character encoding of the document, which in our case is encoding="ISO-8859-1".3 Another attribute the declaration can contain—but does not in our example—is the standalone attribute, which takes the values yes or no and indicates whether there are external markup declarations that may affect the content of the document.4

An XML file must contain one and only one root element that embraces the whole document. In our case, it is:

<bond_movies>
...
</bond_movies>

Information is usually stored in elements. An XML element is defined by its start tag and the content. An element frequently has an end tag, but can also be closed in the start tag with a slash /. It can contain

  • other elements.
  • attributes, bits of information that describe the element in more detail. Attributes, like elements, are slots for information, but they cannot contain further elements or attributes.
  • data of any form and length, for example, text, numbers or symbols.
  • a mixture of everything, which sounds complicated but is a rather ordinary case when elements contain other elements that contain data. For example, the <movie> elements in Figure 3.1 all contain an attribute, other elements, and data within the child elements.
  • nothing, which means really nothing—no data, no other element, not even white spaces.

Consider the first <name> element from above:

<name>Dr. No</name>

Its constituent parts are

the element name    name
the start tag       <name>
the end tag         </name>
the data value      Dr. No

We are already familiar with the start tag–end tag logic from HTML. The benefit of this syntax is that we can easily locate data of a certain element in the document, regardless of where, that is, on which line or hierarchical level, it is located. The element <name> occurs three times in the example. We could retrieve all of these elements by building a query like "give me the content of all elements named <name>." We will learn how to build such queries with the query language XPath in Chapter 4. A more compact way of writing elements is

<actors bond="Sean Connery" villain="Joseph Wiseman"/>

This element contains

the element name            actors
the start tag               <actors ... />
first attribute's name      bond
first attribute's value     Sean Connery
second attribute's name     villain
second attribute's value    Joseph Wiseman

In this case there is no end tag but only a start tag. This is a so-called empty element because the element contains no data. Empty elements are closed with a slash /. The element in the example is of course not literally empty. Just like in HTML, XML elements can contain attributes that provide further information. There is no limit to the number of attributes an element can contain. The example element has two attributes. They are separated by a white space. Attributes are always part of a start tag and hold their values in quotes after an equal sign. The information stored in attributes is called the attribute value. Attribute values always have to be put in quotes, either using single quotes like bond='Sean Connery' or double quotes like bond="Daniel Craig". However, if the attribute value itself contains quotes, you should use the opposite pair of quotes for the attribute value:

<actors bond="Roger Moore" villain="Richard 'Jaws' Kiel"/>

As the structure of an XML document is inherently flexible, there are many ways to store the same content. Note how the actors were stored in the running example in Figure 3.1. Another way would have been the following:

<actors>
  <bond>Sean Connery</bond>
  <villain>Joseph Wiseman</villain>
</actors>

All information is retained, but the actors’ names are now stored in elements, not attributes. Both ways are equally valid. The problem with attributes is that they do not allow further branching—attributes cannot be expanded and can only contain one value. Besides, we find them more difficult to read and more inconvenient to extract compared to elements. They are, however, not altogether useless. Take a look at the code in Figure 3.1. Attributes named id are used to make elements with the same name uniquely identifiable. This can be of help when we need to manipulate information in a particular element of the XML tree.

3.2.2 XML structure

Each XML document can be represented as a hierarchical tree. The fact that data are stored in a hierarchical manner is well suited for many data structures we are confronted with: Survey participants are nested within countries. Survey participants’ responses are nested within survey participants. Votes are nested within polling stations that are nested within electoral districts that are nested within countries, and so on. Figure 3.2 gives a graphical representation of the XML data from the XML code in Figure 3.1. At the very top stands the root element <bond_movies>. All other elements have one and only one parent. In fact, we can apply a family tree analogy to the entire document, describing each element as a node:

  • the movie nodes are children of the root node bond_movies;
  • the movie nodes are siblings;
  • the bond_movies node is the parent of the movie nodes, which are parents of the name, ..., boxoffice nodes;
  • the name, ..., boxoffice nodes are grandchildren of bond_movies.

Note that the attributes and their values are presented in the element value boxes in Figure 3.2, even though they could be viewed as further leaves in the XML tree. However, as attributes cannot be parents to other elements or attributes, they are element-describing content rather than autonomous nodes. Strictly speaking, however, they are attribute nodes.

Elements must be strictly nested, which means that no cross-nesting is allowed. An illegal document structure would be:

<family>
  <child>Jonathan
    <family>
      <wife>Julia</wife>
      <child>Jeff</child>
  </child>
    </family>
</family>



Figure 3.2 Tree perspective on an XML document

While it is theoretically sensible that the element <child> with the value Jonathan opens a new <family> branch containing Jonathan’s wife Julia and their child Jeff, Jonathan’s <child> element has to be closed before the new <family> element is opened.

3.2.3 Naming and special characters

One of the strengths of XML is that we are basically free to choose the names of elements. However, there are some naming rules:


  • Element names can be composed of letters, numbers, and other characters, like in <name1> … </name1>. Special characters like ä, ö, ü, é, è, or à are allowed, but not recommended—they might limit the compatibility of XML files across systems.
  • Names must not start with a number, like in <123name> … </123name>.
  • Names must not start with a punctuation character, like in <.name> … </.name>.
  • Names must not start with the letters xml (or XML, or Xml, etc.), like in <xml.rootname> … </xml.rootname>.
  • Element names and attribute names are case sensitive. <movie> is not the same as <MOVIE> or <Movie>.
  • Names must not contain spaces, like in <my family> … </my family>.

As in HTML, there are some characters that cannot be used literally in the content as they are needed for markup. To represent these characters in the content, they have to be replaced by escape sequences. These entities are listed in Table 3.1 and are used as follows

<result>5 &lt; 6 &amp; 6 &gt; 5</result>


Table 3.1 Predefined entities in XML

Character   Entity reference   Description
<           &lt;               Less than
>           &gt;               Greater than
&           &amp;              Ampersand
"           &quot;             Double quotation mark
'           &apos;             Single quotation mark

You do not always need to escape special characters. For example, apostrophes are sometimes left unescaped, like in "Richard 'Jaws' Kiel" in the example above. In this case, the apostrophes are unambiguous because the attribute value is enclosed by double quotes. Using apostrophes in XML element values is usually no problem either, because they have no special meaning in the value slot of the element, only inside tags as delimiters of attribute values.

3.2.4 Comments and character data

XML provides a way to comment content with the syntax

<!-- This is a comment -->

Everything in between <!-- and --> is not treated as part of the XML code and therefore ignored by parsers. Comments may be used between tags or within element content, but not within element or attribute names.

The use of escape sequences can be cumbersome when the characters to be escaped are common in the data values. For example, imagine that the following character sequence needs to be stored in an XML file

As 5 < 6 & 6 > 5, the statement is "true"

In XML code, this would translate to

<result>As 5 &lt; 6 &amp; 6 &gt; 5, the statement is &quot;true&quot;</result>

To avoid this mess, XML provides an environment that prevents the content from being interpreted. It is called CDATA and works as follows

<result><![CDATA[As 5 < 6 & 6 > 5, the statement is "true"]]></result>

All characters in the CDATA section are taken as is. The difference between comments and a CDATA section is that a comment is not part of the document … 

<!-- As 5 < 6 & 6 > 5, the statement is "true" -->

… whereas a CDATA section is:

<result><![CDATA[As 5 < 6 & 6 > 5, the statement is "true"]]></result>

If we write both snippets in an XML file and open it with a browser, the comments are not displayed or explicitly highlighted as part of the XML tree. In contrast, the CDATA section is displayed in the tree. If we delete the CDATA tags, this will produce an error because the browser fails to interpret the raw ampersands and angle brackets.

You may want to try this out yourself. Save the last code snippet with your text editor as an XML file and open it with your browser. Modify the content of the XML file, save it, and reload the content with the browser. Experiment with allowed and disallowed changes. Try special characters, cross-nested tags, and forbidden element names.

3.2.5 XML syntax summary

To sum up, the XML syntax comprises the following set of rules:

  1. An XML document must have a root element.
  2. All elements must have a start tag and be closed, except for the declaration, which is not part of the actual XML document.
  3. XML elements must be properly nested.
  4. XML attribute values must be quoted.
  5. Tags are named with characters and numbers, but may not start with a number or “xml.”
  6. Tag names may not contain spaces and are case sensitive.
  7. Space characters are preserved.
  8. Some characters are illegal and have to be replaced by entity references.
  9. Comments can be included as follows: <!-- comment -->.
  10. Content can be excluded from parsing using: <![CDATA[...]]>.

3.3 When is an XML document well formed or valid?

In short, an XML document is well formed when it follows all of the syntax rules from the previous section. Techniques to extract information from XML documents rely on properly written syntax. If we are in doubt that an XML document is well formed, there are ways to check. For instance, the XML Validator on http://www.xmlvalidation.com/ checks for mismatches between start and end tags, whether attribute values are quoted, whether illegal characters have been used, in short: whether any of the rules are violated.

We can distinguish between well formed and valid XML. An XML document is valid when it

  1. is well formed and
  2. conforms to the rules of a Document Type Definition.

As we have seen, the structure of an XML document is arbitrary—tag names and levels of hierarchy are defined by the user. However, there is a way to restrict this arbitrariness by using Document Type Definitions, DTDs. A DTD is a set of declarations that defines the XML structure, how elements are named, and what kind of data they should contain. A DTD for our running example in Figure 3.1 could look like this

<!DOCTYPE bond_movies [
<!ELEMENT bond_movies (movie*)>
<!ELEMENT movie (name, year, actors, budget, boxoffice)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT actors EMPTY>
<!ELEMENT budget (#PCDATA)>
<!ELEMENT boxoffice (#PCDATA)>
<!ATTLIST movie id CDATA #REQUIRED>
<!ATTLIST actors bond CDATA #REQUIRED>
<!ATTLIST actors villain CDATA #REQUIRED>
]>

In this variant, the DTD is included in the XML document and wrapped in a DOCTYPE definition, <!DOCTYPE bond_movies [...]>. This is called an internal DTD. For the purpose of web scraping we normally do not need to be able to write DTDs, so we will not explain every detail of the declaration syntax but just provide some fundamentals on the appearance of DTDs. Elements can be declared like

<!ELEMENT year (#PCDATA)>

Children of elements are declared as follows

<!ELEMENT movie (name, year, actors, budget, boxoffice)>

It gets a bit more complicated with the declaration of mixed content. If, for example, an element may contain any number of occurrences of the <child1> to <child3> elements, mixed with parsed character data, the declaration would look like

<!ELEMENT element (#PCDATA | child1 | child2 | child3)*>

Declaring attributes can look as follows

<!ATTLIST movie id CDATA #REQUIRED>
<!ATTLIST actors bond CDATA #IMPLIED>

The #IMPLIED keyword means that the corresponding attribute is optional; #REQUIRED would mean that the attribute is required. There are multiple online tools that allow validating XML files against a DTD. Just type “dtd validation” into a search engine and pick one of the first results.

Why should we care whether an XML document is well formed or valid? Above all, it is important to know that many files come with an internal DTD at the beginning of the document. In general, DTDs serve several purposes. Data exchanges can be standardized as senders and receivers know in advance what they are supposed to send and get. As a sender, you can check if your own XML files are valid. As a receiver it is possible to check whether the XML you retrieve is of the kind you or your program expects.

DTD itself is only one of several XML schema languages. Such languages help to describe and constrain the structure and content of an XML document. Another schema language is XML Schema (XSD), developed by W3C. It allows defining a schema in XML syntax and has some merits that are of little interest for our purposes. One area where XML schemas play an important role is XML extensions, which are the topic of the next section.

3.4 XML extensions and technologies

We have seen that XML has advantages compared to HTML for exchanging data on the Web as it is extensible—and thus flexible. However, flexibility also carries the potential for uncertainty or inconsistency, for example, when the same element names are used for different content. Several extensions and technologies exist that improve the usability of XML by suggesting standards or providing techniques to set such standards. Some of the most important of these techniques are described in this section.

3.4.1 Namespaces

Consider the following two pieces of HTML and XML:

<head>
  <title>All about James Bond</title>
</head>


<book>
  <title>The World Is Not Enough</title>
</book>

Both pieces store information in the element <title>. If the XML code were embedded in HTML code, this might create confusion. As we will see, there are many XML extensions to store specific data, for example, geographic, graphical, or financial data. All of these languages are basically XML with limited vocabulary. When several of these XML-based languages are used in one document, element or attribute names can become ambiguous if they are multiply assigned. XML namespaces are used to circumvent such problems. The idea is very simple: Ambiguous elements become distinguishable if some unique identifier is added. Just like zip codes allow distinguishing between many different Springfields and area codes make phone numbers unambiguous, namespaces help make elements and attributes uniquely identifiable.

The implementation of namespaces is straightforward:

<root xmlns:h="http://www.w3.org/TR/html4/"
      xmlns:t="http://funnybooknames.com/titles">
  <h:title>All about James Bond</h:title>
  <t:title>The World Is Not Enough</t:title>
</root>

In this example, namespaces are declared in the root element using the xmlns attribute and two prefixes, h and t. The namespace name, that is, the namespace attribute value, usually carries a Uniform Resource Identifier (URI) that points to some Internet address. The URIs in the example are two URLs that refer to an existing Internet resource on the W3C homepage and the fictional domain funnybooknames.com. When dealing with namespaces, note the following rules:

  1. Namespaces can be declared in the root element or in the start tag of any other element. In the latter case, all children of this element are considered part of this namespace.
  2. The namespace name does not necessarily have to be a working URL, or even a URI at all—parsers never try to resolve it, and any other string is fine. However, it is common practice to use URIs for two reasons: First, as they are a long, unique string of characters, duplicates are unlikely, and second, actual URLs can point the human reader to pages where more information about the namespace is given.5
  3. Prefixes do not have to be stated explicitly, so the declaration can be either xmlns or xmlns:prefix. If the prefix is dropped, the declared namespace is assumed to be the default namespace, and any element without a prefix is considered to be in that namespace. When prefixes are used, each prefix is bound to a namespace in the declaration. Attributes, however, never belong to the default namespace. The sketch below illustrates the difference.
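The following snippet is a minimal sketch (the element names and the library URI are invented for illustration) of a default namespace declared alongside a prefixed one:

<library xmlns="http://funnybooknames.com/library"
         xmlns:h="http://www.w3.org/TR/html4/">
  <book>The World Is Not Enough</book>
  <h:title>All about James Bond</h:title>
</library>

Here, <book> belongs to the default library namespace because it carries no prefix, whereas <h:title> is explicitly bound to the h namespace.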

3.4.2 Extensions of XML

Thus far, we have praised XML for its flexibility and extensibility. However, standardization also has its benefits in data exchange scenarios. Recall how browsers deal with HTML. They “know” what a table looks like, how headings should be formatted, and so on. In general, many data exchange processes can be standardized because sender and recipient agree on the content and structure of the data to be exchanged.

Following this logic, a multitude of extensions of the XML language has been developed that combine the classical XML features of openness with the benefits of standardization. In that sense, XML has become an important metalanguage—it provides the general architecture for other XML markup languages. Varieties of XML rely on XML schemas that specify allowed structure, elements, attributes, and content. Table 3.2 lists some of the most popular XML derivations. Among them are languages for geographic applications like KML or GPX as well as for web feeds and widely used office document formats. You might be surprised to find that MS Word makes heavy use of XML. To gain basic insight into XML extensions that are ubiquitous on the Web, we focus on two popular XML markup languages—RSS and SVG.

Table 3.2 List of popular XML markup languages

Name Purpose Common filename extensions
Atom web feeds .atom
RSS web feeds .rss
EPUB open e-book .epub
SVG vector graphics .svg
KML geographic visualization .kml, .kmz
GPX GPS data (waypoint, tracks, routes) .gpx
Office Open XML Microsoft Office documents .docx, .pptx, .xlsx
OpenDocument Apache OpenOffice documents .odt, .odp, .ods, .odg
XHTML HTML extension and standardization .xhtml

For a more comprehensive list, see http://en.wikipedia.org/wiki/List_of_XML_markup_languages.

3.4.3 Example: Really Simple Syndication

Web users commonly cultivate a list of bookmarks of their favorite webpages. It can be rather tiresome to regularly check for new content on the sites. Really Simple Syndication (RSS)6 was built to solve this problem—both for the user and the content providers. The basic idea is that news sites, blog owners, etc., convert their content into a standardized format that can be syndicated to any user.

We illustrate the logic of RSS in Figure 3.3. Authors of a blog or news site set up an RSS file that contains some information on the news provider, which is stored on a web server. The file is updated whenever new content is published on the blog. Both tasks are usually done by an RSS creation program like RSS Builder. The list of entries or notifications is often called an RSS feed or RSS channel and might be located at http://www.example.net/feed.rss. It is written in XML that follows the rules of the RSS format. Common elements that are allowed in this XML flavor are listed in Table 3.3. There are elements that describe the channel and others that describe single entries. Users subscribe to channels with an RSS reader or aggregator like Feedly, which automatically locates the RSS feed on a given website and lays out the content. These readers automatically update subscribed feeds and offer further management functionalities. This way, users are able to assemble their own online news.


Figure 3.3 How RSS works

Table 3.3 List of common RSS 2.0 elements and their meaning

Element name Meaning
root elements
rss The feed's root element
channel A channel's root element
channel elements
description* Short statement describing the feed
link* URL of the feed's website
title* Name of the feed
item The core information element: each item contains an entry of the feed
item elements
link* URL of the item
title* Title of the item
description* Short description of the item
author Email address of the item's author
category Classification of item's content
enclosure Additional content, for example, audio
guid Unique identifier of the item
image Display of image (with children <url>, <title>, and <link>)
language Language of the feed
pubDate Publishing date of item
source RSS source of the item
ttl “Time-to-live,” number of minutes until the feed is refreshed from the source

Elements marked with “*” are mandatory. For more information on RSS 2.0 specification, see http://www.rssboard.org/rss-specification

There are several versions of RSS, the current one being RSS 2.0. RSS syntax has remained fairly simple, especially for users who are familiar with XML. The rules are strict, that is, there is a very limited set of allowed elements and a clear document structure. Consider the following example of a fictional RSS channel accompanying this book.

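The original listing is not reproduced here; a minimal sketch of such a feed (all channel details invented for illustration) might look like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<rss version="2.0">
 <channel>
  <title>Automated Data Collection with R</title>
  <description>News and code snippets accompanying the book</description>
  <link>http://www.example.net/adcr</link>
  <lastBuildDate>Mon, 03 Mar 2014 12:00:00 GMT</lastBuildDate>
  <item>
   <title>Chapter on XML and JSON finished</title>
   <description>We have just finished the chapter on XML and JSON.</description>
   <link>http://www.example.net/adcr/news</link>
  </item>
 </channel>
</rss>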

RSS documents start with an XML and RSS declaration in the first two lines. The <channel> element wraps around both meta information and the actual entries. The channel's meta block has three required elements—<title>, <description>, and <link>. In the example, there is another optional element, <lastBuildDate>, that indicates the last time content was changed on the channel. The content block consists of a set of <item> elements. Whenever a new story, blog entry, etc., is published, a new <item> element is added to the feed. <item> elements have three obligatory children—again, they are called <title>, <description>, and <link>. The main content is usually stored in the <description> element. Sometimes the whole entry is stored here, sometimes just the first few lines or a summary. In general, RSS syntax obeys the same set of rules as XML syntax.

Take a moment to look at actual RSS feeds. They are all around the Web and indicated with the RSS icon. There are several popular news and blogging platforms about R. For example, have a look at http://planetr.stderr.org/ where new R packages are posted (via Dirk Eddelbuettel's CRANberries blog http://dirk.eddelbuettel.com/cranberries/), and at http://www.r-bloggers.com/, a meta-blogging platform that collects content from the R blogosphere.

RSS 2.0 is not the only content syndication format. Besides various predecessors, another popular standard is Atom, which is also XML-based and has a very similar syntax. In order to grab RSS feeds into R, we can use the same XML extraction tools that are presented in Section 3.5.

3.4.4 Example: scalable vector graphics

A more peculiar but incredibly popular extension of XML is scalable vector graphics (SVG). SVG is used to represent two-dimensional vector graphics. It has been developed at the W3C since 1999 and was initially released in 2001 (Dailey 2010). The idea was to create a vector graphic format that stores graphic information in lightweight, flexible form for exchange over the Web.

Vector graphic formats consist of basic geometric forms such as points, curves, circles, lines, or polygons, all of which can be expressed mathematically. In contrast, raster graphic formats store graphic information as a raster of pixels, that is, rectangular cells of a certain color. In contrast to raster graphics, vector graphics can be resized without any loss of quality and are usually smaller. As the SVG format is based on XML, SVG graphics can be manipulated with an ordinary text editor. There are, however, SVG editors that simplify this task. For example, Inkscape is an open-source graphics editor that implements SVG by default and runs on all common operating systems.7 In order to view SVG files, we can use current versions of the common browsers.

To get a first impression of how SVG works, Figure 3.4 provides the code of a small SVG file. This code generates a stylized representation of the R icon just like the one displayed in Figure 3.5. In fact, if we open an SVG file containing the sample code in our browser, we see the graphic shown in Figure 3.5. The syntax does not only resemble XML, it is XML with a limited set of legal elements and attributes. An SVG file starts with the usual XML declaration. The standalone attribute indicates that the document refers to an external file, in this case an external DTD in lines 2 and 3. This DTD is stored at the www.w3.org webpage and describes which elements and attributes are legal in the current SVG version 1.1 (as of March 2014). The actual SVG code that describes the graphic is enclosed in the <svg> element. It contains a namespace and a version attribute.


Figure 3.4 SVG code example: R logo
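The figure's listing is not reproduced here; the following sketch is consistent with the description in the text (the coordinates and style values are assumptions):

<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
 "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg xmlns="http://www.w3.org/2000/svg" version="1.1">
  <ellipse cx="100" cy="60" rx="85" ry="55" style="fill:grey"/>
  <ellipse cx="85" cy="45" rx="60" ry="35" style="fill:white"/>
  <text x="60" y="115" style="fill:blue;font-size:70px;font-family:serif">R</text>
</svg>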


Figure 3.5 The R logo as SVG image from code in Figure 3.4

SVG uses a predefined set of elements and attributes to represent parts of a graphic (‘SVG shapes’). Among the basic shapes of SVG are lines (<line>), rectangles (<rect>), circles (<circle>), ellipses (<ellipse>), polygons (<polygon>), text (<text>), and, the most general of all, paths (<path>). Each of these elements comes with a specific set of attributes to tune the object's properties, for example, the position of the corners, the size and radius of a circle, and so on. Elements are placed on a virtual coordinate system, with the origin (0,0) in the upper-left corner. Formatted text can also be placed into the graphic. The order of elements is important. A later-listed element covers a previous element—elements can therefore be thought of as layers. Further, there is a palette of special effects like blurs or color gradients. Elements can even be animated. A complex SVG graphic is often generated by quite complex SVG code. The complexity usually does not stem from a highly hierarchical structure—most of the elements are often just children of the root element—but from the mass of elements and their attributes. Our basic graphic in Figure 3.5 is composed of only three elements—two ellipses and one text element. By default, elements come in the compact form of XML element syntax: Elements are usually empty and contain no further information than those given in the attributes.

Back to the example, the locations of the ellipses are defined by their attributes cx and cy, their shape by the horizontal and vertical radius in rx and ry. Colors and other effects can be passed via arguments in the style attribute. The white ellipse is plotted on top of the grey ellipse simply because it appears second in the code, creating the donut effect. Finally, shape, color, font, and location of the “R” are defined in the <text> element.

Beyond the principal advantages of vector over raster graphics, SVG in particular has some features that make it attractive as a graphic standard on the Web: It can be edited with any text editor, opened with the common browsers, follows a familiar syntax as it is basically just XML, and has been developed for a wide range of applications. We have learned that XML is flexible but, because of this flexibility, cannot be interpreted further by a browser. This is not true for XML extensions such as SVG. As the set of elements and attributes is clearly defined, browsers can be programmed to display SVG content as a meaningful graphic, not as code—just as they interpret and display HTML code. In HTML5, SVG graphics can even be embedded as simply as this

<body>
  <svg>
    <ellipse cx="100" cy="60" rx="85" ry="55" style="fill:grey"/>
  </svg>
</body>

Why could SVG be useful in the context of automated data collection? At first glance, SVG is a flexible and widely used vector graphics format. From the data collection perspective, however, it is more than that. The information in these graphics—and often more than just the visible parts—is stored in text form and can therefore be searched, subsetted, etc. SVG is becoming more and more popular on the Web and is used for increasingly complex tasks, for example, to store geographic information, create interactive maps, or visualize massive amounts of data.8

The takeaway message of these two examples is that XML is present in many different areas, and many of these applications hold potentially useful information. And the neat thing is: We will learn how easy it is to retrieve and process this information with R, regardless of whether the information is stored in “pure” XML or any of its extensions.

3.5 XML and R in practice

Let us now turn to practice. How can XML files be viewed, how can they be imported and accessed in an R session, and how can we convert information from an XML document into data structures that are more convenient for further graphical or statistical analysis, like ordinary data frames, for example?

As we said before, XML files can be opened and viewed in all text editors and browsers. However, while text editors usually take the XML file as is, modern web browsers automatically parse the XML and try to represent its structure. This fails when the XML document is not well formed. In this case, the browser might tell you why it thinks the parsing failed, for example, because of an opening and ending tag mismatch on a certain line. From this perspective, the web browser is a decent tool to check if your XML is well formed. In standard web scraping tasks, we usually do not view XML documents file by file but download them in a first step and import them into our R workspace in a second (see Chapter 9).

3.5.1 Parsing XML

We parse XML for the same reason that we parse HTML documents (see Section 2.4.1): to create a structure-aware representation of XML files that allows simple information extraction from these files. Similar to what was outlined in the HTML parsing section, the process of parsing XML essentially includes two steps. First, the symbol sequence that constitutes the XML file is read in and used to build a hierarchical tree-like data structure from its elements at the C level, and second, this data structure is translated into an R data structure via the use of handlers.

The package we use to import and parse XML documents is, appropriately enough, called XML (Temple Lang 2013c). Using the XML package we can read, search, and create XML documents—although we only care about the former two tasks. Let us see how to load XML files into R. For DOM-style parsing of XML files one can use xmlParse(). The arguments of the function coincide with those of htmlParse() for the most part. We illustrate the process with the help of technology.xml, an XML file that holds stock information for three technology companies. The first few lines of the document are presented in Figure 3.6. As we see, the file contains stock information like the closing value, lowest and highest value for a day, and the traded volume. To obtain the XML tree with R, we pass the path of the file to xmlParse()’s file argument:


R> library(XML)
R> parsed_stocks <- xmlParse(file = "stocks/technology.xml")

Figure 3.6 XML example document: stock data
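Only the document's first lines are shown in the figure; a sketch of its likely structure (the child element names and most values are assumptions, except those reported later in this section) is:

<?xml version="1.0"?>
<!DOCTYPE document SYSTEM "technologystocks.dtd">
<document>
  <Apple>
    <date>2013/11/13</date>
    <close>520.634</close>
    <high>530.79</high>
    <low>518.06</low>
    <volume>12542600</volume>
  </Apple>
  ...
</document>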

The xmlParse() function is used to parse the XML document.9 The parsing function offers a set of options that can be ignored in most settings but are still worth knowing. It is possible to treat the input as XML and not as a file name (option asText), to decide whether both namespace URI and prefix should be provided on each node or just the prefix (option fullNamespaceInfo), to determine whether an XML schema is parsed (option isSchema), or to validate the XML against a DTD (option validate). Let us consider this last option in more detail.

Although HTML and XML are very similar in most respects, a noteworthy difference exists in that XML is confined to much stricter specification rules. As we have seen in Section 3.3, valid XML not only has to be well formed, that is, tags must be closed, attribute values must be quoted, etc., but also has to adhere to the specifications in its DTD. To check whether the document conforms to the specification, a validation step can be included after the DOM has been created by setting the validate argument to TRUE. We try to validate technology.xml with the corresponding external technologystocks.dtd (see Figure 3.7), which is deposited in our folder and referred to in line 2 of the XML file (see Figure 3.6):


R> library(XML)
R> parsed_stocks <- xmlParse(file = "stocks/technology.xml", validate = TRUE)

Figure 3.7 DTD of stock data XML file (see Figure 3.6)

There is no complaint; the validation has succeeded. To demonstrate what happens if an XML does not conform to a given DTD, we manipulate the DTD such that the document node is no longer defined. As a consequence, the XML file does not conform to the (corrupted) DTD anymore and the function raises a complaint:


R> library(XML)
R> stocks <- xmlParse(file = "stocks/technology-manip.xml", validate = TRUE)
No declaration for element document
Error: XML document is invalid

In general, the rather bulky logic of XML validation with DTD, XSD, or other schemas should not discourage you from making use of the full power of the XML DOM structure. In most web scraping scenarios, there is no need to validate the files and we can simply process them as they are.

3.5.2 Basic operations on XML documents

Once an XML document is parsed we can access its content using a set of functions in the XML package. While we recommend using the more general and robust XPath for searching and pulling out information from XML documents, here we present some basic operations that might suffice for less complex XML documents. To see how they work, let us go back to our running example: We start by parsing the bond.xml file:

R> library(XML)
R> bond <- xmlParse("bond.xml")

When we type bond into our console, the output looks pretty much like the original XML file. We know, however, that the object is anything but pure character data. For instance, we can perform some basic operations on the root element. The top-level node is extracted with the xmlRoot() function; xmlName() and xmlSize() return the root element's name and the number of children:

R> root <- xmlRoot(bond)
R> xmlName(root)
[1] "bond_movies"
R> xmlSize(root)
[1] 3

Within the node sets, basic navigation or subsetting works in analogy to indexing ordinary lists in R. That is, we can use numerical or named indices to select certain nodes. This is not possible with objects of class XMLInternalDocument that are generated by xmlParse(). We therefore work with the root object, which belongs to the class XMLInternalElementNode. Indexing with predicate “1” yields the first child:

R> root[[1]]
<movie id="1">
  <name>Dr. No</name>
  <year>1962</year>
  <actors bond="Sean Connery" villain="Joseph Wiseman"/>
  <budget>1.1M</budget>
  <boxoffice>59.5M</boxoffice>
</movie>

We have to use double brackets to access the internal node. By adding another index, we can move further down the tree and extract the first child of the first child:


R> root[[1]][[1]]
<name>Dr. No</name>

Element names can be used as predicates, too. Using double brackets yields the first element in the tree, single brackets return objects of class XMLInternalNodeList. To see the difference, compare

R> root[["movie"]]

with

R> root["movie"]

Names and numbers can also be combined. To return the atomic value of the first <name> element, we could write


R> root[["movie"]][[1]][[1]]
Dr. No

The structure of the object is retained and can be used to locate elements and values. However, content retrieval from XML files via ordinary predicates is quite complex, error prone, and anything but convenient. Further, this method does not capitalize on node relations—a core feature of parsed XML documents. For anybody who is seriously working with XML data, there are good reasons to learn the very powerful query language XPath. We will show how this is done in the next chapter.

In general, these methods and all those to follow are applicable to other XML-based languages as well. The parser does not care about the naming and structure of documents as long as the code is valid. Therefore, documents like the RSS sample code from above can be imported just as easily with xmlParse().

3.5.3 From XML to data frames or lists

Sometimes it suffices to transform an entire XML object into common R data structures like vectors, data frames, or lists. The XML package provides some appropriate functions that make such operations straightforward if the original structure is not too complex.

Single vectors can be extracted with xmlSApply(), a wrapper function for sapply() that is built to deal with the children of a given XML node. The function operates on an XML node (provided as the first argument), applies any given function to its children (given as the second argument), and commonly returns a vector. We can use the function in combination with xmlValue() and xmlGetAttr() (and other functions; see Table 4.4) to extract element or attribute values:

R> xmlSApply(root, xmlGetAttr, "id")
movie movie movie 
  "1"   "2"   "3" 

As long as XML documents are flat in the hierarchical sense, that is, the root node's most distant relatives are grandchildren or children, they can usually be transformed easily into a data frame with xmlToDataFrame()

R> movie.df <- xmlToDataFrame(root)

Note, however, that this function already runs into trouble with the <actors> element, which is itself empty except for two attributes. The corresponding variable in the data.frame object is simply left empty.

Similarly, a conversion into a list is possible with xmlToList():


R> movie.list <- xmlToList(bond)

XML and other data exchange formats like JSON can store much more complicated data structures. This is what makes them so powerful for data exchange over the Web. Forcing such structures into one common data frame comes at a certain cost—complicated data transformation tasks or the loss of information. xmlToDataFrame() is not an almighty function that achieves the task for which it is named in every setting. Rather, we are typically forced to develop and apply our own extraction functions, as the following sketch illustrates.
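As a minimal sketch (assuming the parsed bond object from above), we could pull out the values and attributes we care about node by node and assemble them ourselves:

R> root <- xmlRoot(bond)
R> # extract the <year> values and the bond attribute of <actors> per movie
R> years <- as.numeric(xmlSApply(root, function(x) xmlValue(x[["year"]])))
R> bonds <- xmlSApply(root, function(x) xmlGetAttr(x[["actors"]], "bond"))
R> movies.df <- data.frame(year = years, bond = bonds, stringsAsFactors = FALSE)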

3.5.4 Event-driven parsing

While the XML example files in Section 3.5.1 were parsed quickly by R, files of larger size can lead to overloaded working memory and concomitant data management problems. As a format primarily designed for carrying data across services, XML files are oftentimes of substantially greater size than HTML files. In many instances, file sizes can exceed the memory capacity of ordinary desktop PCs and laptops. This problem is aggravated when data streams are concerned, where XML data arrive iteratively. These applications obstruct the DOM-based parsing approach we have been applying in this and the previous chapter and demand a more iterative parsing style.

The root of the problem is the way DOM-style parsers process and store information. The parser creates two copies of a given XML file—one as the C-level node set and the second as the data structure in the R language. We can deal with this problem by employing a parsing technique called event-driven parsing or SAX parsing (Simple API for XML). Event-driven parsing differs from DOM-style parsing in that it skips the construction of the complete DOM at the C level. Instead, event-driven parsers sequentially traverse an XML file, and once they find a specified element of interest they prompt an instant, user-defined reaction to this event. This procedure provides a huge advantage over DOM-style parsers because the machine's memory never has to hold the complete document.

Let us reconsider technology.xml and the problem of extracting information about the Apple stock. Assume we are interested in obtaining Apple's daily closing value along with the date. Once again, we make use of a handler function to specify how to handle a node of interest. Similar to the extraction problem considered in Section 2.4.3, we define the handler as a nested function to combine it with a reference environment and container variables (see Figure 3.8). branchFun() defines two local variables, container_close and container_date, serving as the container variables. Since we are interested in Apple stock information, we suggest the following approach: We start by defining a handler function for the <Apple> nodes. Conditional on these elements, we look for their children called date and close and store their values. A return function getStore() assembles the container variables' contents into a data frame and returns this object.


Figure 3.8 R code for event-driven parsing
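The figure's code is not reproduced here; a sketch consistent with the description in the text (the function and object names follow the text, the internals are assumptions) could read:

branchFun <- function() {
  container_close <- character()
  container_date <- character()
  # handler for <Apple> nodes: store date and closing value
  Apple <- function(node) {
    kids <- xmlChildren(node)
    container_date <<- c(container_date, xmlValue(kids[["date"]]))
    container_close <<- c(container_close, xmlValue(kids[["close"]]))
    # print(xmlValue(kids[["date"]]))  # uncomment to watch the traversal
  }
  # assemble the container variables' contents into a data frame
  getStore <- function() {
    data.frame(date = container_date, close = container_close)
  }
  list(Apple = Apple, getStore = getStore)
}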

To generate a usable instance of the handler function, we execute the function and pass its return value into a new object called h5:

R> h5 <- branchFun()

We are now ready to run the SAX parser over our technology.xml file using XML’s xmlEventParse() function. Instead of the handlers argument, we pass the handler functions to the branches argument. branches is a more general version of handlers that allows specifying functions which operate on the entire node content, including its children. This is exactly what we need for this task, since our handler function in h5 makes use of the xmlChildren() function for retrieving child information. Additionally, we need to pass an empty list to the handlers argument:


R> invisible(xmlEventParse(file = "stocks/technology.xml", branches = h5, handlers = list()))

To get an idea about the iterative traversal through the document, remove the commented line in the handler and rerun the SAX parser. Finally, to fetch the information from the local environment we employ the getStore() function and route the contents into a new object:


R> apple.stock <- h5$getStore()

To verify parsing success, we display the first five rows of the returned data frame:


R> head(apple.stock, 5)
        date   close
1 2013/11/13 520.634
2 2013/11/12  520.01
3 2013/11/11 519.048
4 2013/11/08  520.56
5 2013/11/07 512.492

As we have seen, event-driven parsing works and returns the correct information. Nonetheless, we do not recommend resorting to this style of parsing as the preferred means to obtain data from XML documents. Although event-driven parsing beats DOM-style parsing with respect to speed and may, in the case of really large XML files, be the only practical method, it necessitates a lot of code overhead as well as background knowledge about R functions and environments. Therefore, for the small- to medium-sized documents that we deal with in this book, we will focus in the coming chapters on DOM-style parsing and the extraction methods provided through the XPath query language (Chapter 4).

3.6 A short example JSON document

In this section, we will become acquainted with the benefits of the data exchange standard JSON. The acronym (pronounced “Jason”) stands for JavaScript Object Notation. JSON was designed for the same tasks that XML is often used for—the storage and exchange of human-readable data. Many APIs by popular web applications provide data in the JSON format.

As its name suggests, JSON is a data format that has its origins in the JavaScript programming language. However, JSON itself is language independent and can be parsed with many existing programming languages, including R. JSON has turned into one of the most popular formats for web data provision. It is therefore worth studying for our purposes. We start again with a synthetic example, continue with a more systematic look at the syntax, and in the final part of the chapter learn how to access and process JSON data with R.

The JSON code in Figure 3.9 holds some basic information on the first three Indiana Jones movies. We observe that JSON has a more slender appearance than XML. Data are stored in key/value pairs, for example, "name" : "Raiders of the Lost Ark", which obviates the need for end tags. Different types of brackets (curly and square ones) allow describing hierarchical structures and differentiating between unordered and ordered data. Just as in XML, JSON data structures can become arbitrarily complex regarding nestedness. Apart from differences in the syntax, JSON is as intuitive as XML, particularly when indented like in the example code, although indentation is not a necessary requirement for valid JSON data.


Figure 3.9 JSON code example: Indiana Jones movies
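The figure's listing is not reproduced here; a sketch consistent with the values used in this chapter (the cast and producer details are reconstructions) might read:

{"indy movies" : [
  {
    "name" : "Raiders of the Lost Ark",
    "year" : 1981,
    "actors" : {
      "Indiana Jones" : "Harrison Ford",
      "Dr. René Belloq" : "Paul Freeman"
    },
    "producers" : ["Frank Marshall", "George Lucas", "Howard Kazanjian"]
  },
  {
    "name" : "Indiana Jones and the Temple of Doom",
    "year" : 1984,
    "actors" : {
      "Indiana Jones" : "Harrison Ford",
      "Mola Ram" : "Amrish Puri"
    },
    "producers" : ["Robert Watts"]
  },
  {
    "name" : "Indiana Jones and the Last Crusade",
    "year" : 1989,
    "actors" : {
      "Indiana Jones" : "Harrison Ford",
      "Walter Donovan" : "Julian Glover"
    },
    "producers" : ["Robert Watts", "George Lucas"]
  }
]}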

3.7 JSON syntax rules

JSON syntax is easy to learn. We only have to know (a) how brackets are used to structure the data, (b) how keys and values are identified and separated, and (c) which data types exist and how they are used.

Brackets play a crucial role in structuring the document. As we see in the example data in Figure 3.9, the whole document is enclosed in curly brackets because it is itself an object. This object holds the key indy movies, whose value is an array, that is, an ordered sequence, of the three movie records. Arrays are framed by square brackets. The movies, in turn, are also objects and therefore enclosed by curly brackets. In general, brackets work as follows:

  1. Curly brackets, “{” and “},” embrace objects. Objects work much like elements in XML and can contain collections of key/value pairs, other objects, or arrays.
  2. Square brackets, “[” and “],” enclose arrays. An array is an ordered sequence of objects or values.

Actual data are stored in key/value pairs. The rules for keys and values are

  1. Keys are placed in double quotes; values are placed in double quotes only if they are string data

    "name" : "Raiders of the Lost Ark"
    "year" : 1981

  2. Keys and values are always separated by a colon

    "name" : "Raiders of the Lost Ark"

  3. Key/value pairs are separated by commas

    "name" : "Raiders of the Lost Ark", "year" : 1981

  4. Values in an array are separated by commas

    ["Frank Marshall", "George Lucas", "Howard Kazanjian"]

JSON allows a set of different data types for the value part of key/value pairs. They are listed in Table 3.4.

Table 3.4 Data types in JSON

Data type Meaning
Number integer, real, or floating point (e.g., 1.3E10)
String zero or more Unicode characters (except " and \, which have to be escaped)
Boolean true or false
Null null, an unknown value
Object content in curly brackets
Array ordered content in square brackets

And that is it.10 From the perspective of an XML user, note what is not possible in JSON: We cannot add comments, we do not distinguish between missing values and null values, there are no namespaces and no internal validation syntax like XML's DTD. But this does not make JSON inferior to XML in absolute terms. They are rather based on different concepts. JSON is not a markup language and not even a document format. It is anticipated to be versionless—there is no JSON 1.0—and no change in the grammar is expected. It is just a data interchange standard that is so general that it can be parsed by many languages without effort.

Although there is not much to highlight in JSON data, there are some tools that facilitate accessing JSON documents for human readers. The JSON Formatter & Validator at http://jsonformatter.curiousconcept.com/ is just one of several tools on the Web that automatically indent JSON input. This makes it much easier to read because JSON data frequently come without indentation or line breaks. The tool also helps check for bugs in the data. If you want to convert XML to JSON data, take a look at http://www.freeformatter.com/xml-to-json-converter.html or similar tools. However, such conversions are never isomorphic and rules have to be set to deal with, for example, attributes and namespaces.

Why is JSON so important for the Web even though XML already provides a popular data exchange format? First of all, there are some technical properties that make JSON preferable to XML. Generally, it is more lightweight due to its less verbose syntax, and it allows only a limited set of data types that are compatible with many if not most existing programming languages. Regarding compatibility, JSON has another crucial feature: We cover only the basics of JavaScript in this book (see Chapter 6), but JavaScript is a major player on the Web for generating dynamic content and user–browser interactions. JSON is ultimately compatible with JavaScript and can be directly parsed into JavaScript objects. From a practical point of view, JSON seems to be becoming the most widely used data exchange format for web APIs; Twitter as well as YouTube and many bigger and smaller web services have begun using JSON-only APIs.

3.8 JSON and R in practice

While R has one standard set of tools to handle XML-type data—the XML package—there are several packages that allow importing, exporting, and manipulating JSON data. The first published package was rjson (Couture-Beil 2013) and is still used in some R-based API wrappers. The package that is currently more established, however, is RJSONIO (Temple Lang 2013b), which we will use in this section. Finally, we also discuss the recently published package jsonlite (Ooms and Temple Lang 2014), which builds on RJSONIO and improves mapping between R objects and JSON strings.

We begin the discussion with an inspection of the RJSONIO package. In its current version (1.0.3), the package offers 24 functions, most of which we usually do not apply directly. We now return to the running example, the data in the indy.json file. Using the isValidJSON() function, we first check whether the document consists of valid JSON data:


R> isValidJSON("indy.json")
[1] TRUE

This seems to be the case. The two core functions are fromJSON() and toJSON(). fromJSON() reads content in JSON format and converts it to R objects, toJSON() does the opposite:


R> indy <- fromJSON(content = "indy.json")

content is the function's main argument. In our case, indy.json is a file in the working directory, but it could also be a character string possibly from the Web via getURL() or imported with readLines(). The fromJSON() function offers several other useful arguments, and as the package is well maintained, the documentation—accessible with ?fromJSON—is worth a look. A very useful argument is simplify, controlling whether the function tries to combine identical elements to vectors. Otherwise the individual elements remain separate list elements. The nullValue argument allows specifying how to deal with JSON nulls. In general, JSON data types (see Table 3.4) match R data types nicely (numeric, integer, character, logical). The null value is a little more differentiated in R, however. There is NULL for empty objects and NA for indicating a missing value. Therefore, the nullValue argument helps to specify how to deal with these cases, like turning them into NAs. The function maps the JSON data structure into an R list object:


R> class(indy)
[1]"list"

From this point on we can deal with the data the standard R way, that is, decompose or subset the list or force (parts of) it into vectors, data frames, or other structures. We have already observed that seemingly powerful functions like xmlToDataFrame() can be of limited use when we face real data. Data frames are useful to represent a simple “observations by variables” structure, but become very complex if they are used to represent highly hierarchical data. In contrast, JSON and XML can represent far more complex data structures. When loading JSON or XML data into R, one often has to decide which subsets of information are necessary and need to be inserted into a data frame. Consequently, there cannot be a global and universal function for JSON/XML to R data format conversion. We have to build our own tools case by case. In our example, we might want to try to map the list to a data frame, consisting of three observations and several variables. The problem is that actors and producers have several values. One option is to extract the information variable by variable and merge in the end. This could work as follows:

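The original code figure is not reproduced here; a minimal sketch of the approach, with variable names of our own, could read:

R> library(stringr)
R> indy.vec <- unlist(indy, recursive = TRUE, use.names = TRUE)
R> indy.names <- indy.vec[str_detect(names(indy.vec), "name")]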

This strategy first flattens the complex list structure into one vector. The recursive argument ensures that all components of the list are unlisted. Since the key names are retained in the vector by setting use.names to TRUE, we can identify all original key/value pairs named name with a simple regular expression and the str_detect() function from the stringr package (see also Chapter 8). This strategy has its drawbacks. First, all list elements are coerced to a common mode, resulting in character vectors in most cases. This is useful for the names variable, but less appropriate for the years variable. Further, this step-by-step approach is tedious when many variables have to be extracted. A slightly more comfortable option uses sapply() and feeds it the [[ operator and the variable name for element subsetting, calling indy[[1]][[1]][["name"]], indy[[1]][[2]][["name"]], and so on:


R> sapply(indy[[1]], "[[", "year")
[1] 1981 1984 1989

The benefit of this approach over the first is that data types are retained. Finally, to pull all variables and directly assemble them into a data frame, we have to take into account that some variables do not exist or vary in structure from observation to observation in the sample data. For example, the number of producers varies. We do the conversion as follows:


R> library(plyr)
R> indy.unlist <- sapply(indy[[1]], unlist)
R> indy.df <- do.call("rbind.fill", lapply(lapply(indy.unlist, t), data.frame, stringsAsFactors = FALSE))

We first unlist the elements within the list. The final command is more complex: we transpose each list element, turn the results into data frames, and finally make use of the rbind.fill() function from the plyr package to combine the data frames into one single data frame, which takes care of the fact that some variables do not exist in some of the data frames. The result reveals that we would have to continue with some data cleansing—note, for example, the split-up producer variables:

[Output not reproduced: the printed indy.df, with the producer values split over several columns]

It is clear that importing JSON data, or working with lists in general, can be painful. Even if data structures are simpler, we need to use apply functions. Consider this last example of a JSON data import with a simple Peanuts dataset:

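The original file listing is not reproduced here. As an illustrative stand-in (not the original data), a small dataset of this kind—a JSON array of objects with an occasional null value—might look as follows:

[
  { "name" : "Charlie Brown", "species" : "human",  "age" : 8    },
  { "name" : "Snoopy",        "species" : "beagle", "age" : null },
  { "name" : "Woodstock",     "species" : "bird",   "age" : null }
]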

We turn the data into an ordinary data frame with the following expression:


R> peanuts.json <- fromJSON("peanuts.json", nullValue = NA,
       simplify = FALSE)
R> peanuts.df <- do.call("rbind", lapply(peanuts.json, data.frame, 
stringsAsFactors = FALSE))

We parse the JSON snippet with the fromJSON() function and tell the parser to turn null values into NAs via the nullValue argument. We also set simplify to FALSE in order to retain the list structure in all elements; otherwise, the parser would convert the second entry to a character vector, rendering the subsequent data.frame() conversion useless. We use the lapply() function to turn the lists into data frames and keep strings as strings with the stringsAsFactors = FALSE argument. Finally, we join the data frames with a do.call() on rbind(). The result looks acceptable:

[Output not reproduced: the printed peanuts.df data frame]

To do the conversion the other way round, that is, from R objects to JSON data, the function we need is toJSON():


R> peanuts.json <- toJSON(peanuts.df, pretty = TRUE)
R> file.output <- file("peanuts_out.json")
R> writeLines(peanuts.json, file.output)
R> close(file.output)

Transforming JSON data into appropriate R objects cannot always be done with preexisting functions alone but often requires some postprocessing of the resulting objects. The recently developed jsonlite package offers more consistency between both data structures. It builds upon the parser of the RJSONIO package and likewise provides the main functions fromJSON() and toJSON(), but implements a different mapping scheme (see Ooms 2013). A set of rules ensures that data from an external source like an API are transformed in a way that guarantees consistent results. Some important conventions for JSON-to-R conversions of arrays are

  • arrays are encoded as character data if at least one value is of type character;
  • null values are encoded as NA;
  • true and false values are encoded as 1 and 0 in numerical vectors and TRUE and FALSE in character and logical vectors.

There are more conventions for the transformation of vectors, matrices, lists, and data frames. They are documented in Ooms (2013). For our purposes, the rules concerning JSON-to-R conversion are most important, as this is part of the regular scraping workflow. Consider the following set of transformations from JSON arrays into R objects to see how the conventions cited above work in practice:

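The original output figure is not reproduced here. As a sketch (using jsonlite's fromJSON(); the results indicated in the comments follow the rules listed above, and their exact formatting may vary across package versions):

R> library(jsonlite)
R> fromJSON('[1, 2, true, false]') # numeric vector: 1 2 1 0
R> fromJSON('["a", true, false]')  # character vector: "a" "TRUE" "FALSE"
R> fromJSON('[1, "a", null]')      # character vector with NA: "1" "a" NA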

The consistent mapping rules of jsonlite not only ensure that data are transformed adequately on the vector level, but also make mapping of JSON data into R data frames a lot easier. Reconsidering the Peanuts example with jsonlite, it turns out that the JSON data are conveniently mapped into the desired R object of type data.frame right away:

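A sketch of the call (not the original code figure; fromJSON() here is jsonlite's version, applied to the peanuts.json file from above):

R> library(jsonlite)
R> peanuts.df <- fromJSON("peanuts.json")
R> class(peanuts.df)
[1] "data.frame"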

In the Indiana Jones example, the Indy JSON data are also mapped into a list. However, the only element in the list is a data frame with the desired content. We simply pull the data frame from the list to access the variables:

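Again as a sketch rather than the original code figure:

R> indy <- fromJSON("indy.json")
R> class(indy)
[1] "list"
R> indy.df <- indy[[1]]
R> class(indy.df)
[1] "data.frame"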

In short, where RJSONIO often returns a list when we would expect a data frame, jsonlite manages to generate tabular data from JSON data structures whenever this is appropriate. Its mapping scheme acknowledges that tabular data in R are stored column based, whereas in JSON—and many other formats, languages, or databases—they are stored row based (see Ooms 2013).
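A tiny example illustrates the row-based encoding (a sketch assuming jsonlite's toJSON(); the exact formatting of the output may differ across versions):

R> df <- data.frame(x = 1:2, y = c("a", "b"))
R> toJSON(df)
[{"x":1,"y":"a"},{"x":2,"y":"b"}]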

To be sure, the functionality of jsonlite does not solve all problems of JSON-to-R transfer. However, the choice of rules implemented in jsonlite makes the import of JSON data into R more consistent. We therefore suggest making this package the standard tool when working with JSON data, even though it is still in an early version.

Summary

Both XML and JSON are very important standards for data exchange on the Web, and as such will occur several times in the course of this book (for example in Chapter 4 and the case study on Twitter, Chapter 14). Knowing how to handle both data types is helpful in many web data collection tasks.

We have seen that XML serves as a basic standard for many other formats, such as GPX, KML, RSS, SVG, and XHTML. Whenever we encounter such data on the Web, we are able to import and process them in R, too. JSON is an increasingly popular alternative to XML for the exchange of data on the Web, especially when working with web services/web APIs. JSON is derived from JavaScript and can be parsed in many languages, including R.

Further reading

There are many books that go far beyond this basic introduction to XML and JSON. If you have acquired a taste for the languages of the Web and plan to go deeper into web development, you could have a look at XML in a Nutshell by Harold and Means (2004) or at Ray (2003). For the web scraping tasks presented in this book, however, deeper knowledge of XML should not be necessary.

If you want to dig deeper into JSON and JavaScript, the book JavaScript: The Good Parts by JSON developer Douglas Crockford (2008) might be a good start. For a quick overview, the excellent website http://www.json.org/ is highly recommended.

Problems

  1. Describe the relationship between XML and HTML.

  2. What are possible ways to import XML data into R? What are the advantages and disadvantages of each approach?

  3. What is the purpose of namespaces in XML-style documents?

  4. What are the main elements of the JSON syntax?

  5. Write the smallest well-formed XML document you can think of.

  6. Why do we need an escape sequence for the ampersand in XML?

  7. Take a look at the invalid XML code snippet in Section 3.2.2. How could the family structure be represented in a valid XML document so that it is possible to identify Jonathan both as a child and as a father?

  8. Go to your vinyl record, CD, DVD, or Blu-ray Disc shelf and randomly pick three titles. Create an XML document that holds useful information about your sample of discs.

  9. Inform yourself about the Election Markup Language (EML).

    1. Find out the purpose of EML.
    2. Look for the current specification of the language and identify the key concepts.
    3. Search for a real EML document, load it into R and turn parts of it into native data structures.
  10. Working with SVG files.

    1. Manipulate the ricon.svg file such that the icon is framed with a black box. Redefine the color, size, and font of the image.
    2. Rebuild the RSS icon as an SVG document.
  11. Find the formatting errors in the following JSON piece.

    [JSON snippet not reproduced]

  12. Convert the James Bond XML example from Figure 3.1 into valid JSON.

  13. Convert the Indiana Jones example from Figure 3.9 into valid XML.

  14. Import the indy.json file into R and extract the values of all budget keys.

  15. The XML file potus.xml (available in the book's materials) contains biographical information on US presidents.

    1. Use the DOM-style XML parser and parse the document into an R object called potus. Inspect the source code. The <occupation> nodes contain additional white space at the end of the text string. Find the appropriate argument to remove them in the parsing stage.
    2. The XML file contains <salary> nodes. Discard them while parsing the file. Remove the additional white space in the <occupation> nodes by using a custom handler function and a string manipulation function (see Section 8.2).
    3. Write a handler for extracting the <hometown> nodes’ value and pass it to the DOM-style parser. Repeat the process with an event-driven parser. Inspect the results.
