XML, like JSON, is a ubiquitous format for data transfer over the Internet. In addition to being used on the web, XML is also a popular data format for application configuration files and the like. In fact, newer Microsoft Office documents (with the extension .docx or .xlsx) are stored as zipped collections of XML files.
Here's what our simple Beatles dataset may look like in XML:
example_xml1 <- '
<the_beatles>
  <formed>1960</formed>
  <members>
    <member>
      <first_name>George</first_name>
      <last_name>Harrison</last_name>
    </member>
    <member>
      <first_name>Ringo</first_name>
      <last_name>Starr</last_name>
    </member>
    <member>
      <first_name>Paul</first_name>
      <last_name>McCartney</last_name>
    </member>
    <member>
      <first_name>John</first_name>
      <last_name>Lennon</last_name>
    </member>
  </members>
</the_beatles>'
Much like JSON, XML is stored in a tree structure, which is called a DOM (Document Object Model) tree in XML parlance. Each piece of information in an XML document (surrounded by names in angle brackets) is called an element or node. In the hierarchical structure, subnodes are called children. In the preceding code, formed is a child of the_beatles, and member is a child of members. Each node may have zero or more children, which may have child nodes of their own. For example, the members node has four children, each of which has two children, first_name and last_name. The common ancestor of all the elements (whether direct parent or great-great-grandparent) is the root node, which doesn't have a parent.
As with JSON, XML and its import functions are an enormous topic. We'll only briefly cover some of the more common and basic know-how in this chapter. Fortunately, R has built-in help and documentation. For this package, help(package="XML") indicates that more documentation is available at the package's URL: http://www.omegahat.net/RSXML/
We will read the preceding XML with the XML package. If you don't have it already, make sure you install it.
library(XML)
the_beatles <- xmlTreeParse(example_xml1)
print(names(the_beatles))
-------------------
[1] "doc" "dtd"

print(the_beatles$doc)
---------------------
$file
[1] "<buffer>"

$version
[1] "1.0"

$children
$children$the_beatles
<the_beatles>
 <formed>1960</formed>
 <members>
  <member>
   <first_name>George</first_name>
   <last_name>Harrison</last_name>
  </member>
  ..........
 </members>
</the_beatles>

attr(,"class")
[1] "XMLDocumentContent"
xmlTreeParse reads and parses the DOM, and stores it as an R list. The actual content is stored in the children element of the doc element. We can access the year The Beatles were formed like so:
print(xmlValue(the_beatles$doc$children$the_beatles[["formed"]]))
----------------------
[1] "1960"
Here, we use the xmlValue function to extract the value stored in the formed node.
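The parent, child, and root vocabulary can be checked directly on a parsed tree. The following is a minimal sketch rather than part of the original walkthrough: it assumes the XML package is installed, and it uses a shortened two-member copy of the document (the name beatles_xml is made up for this example) so that it runs on its own.

```r
library(XML)

# Shortened, hypothetical copy of the Beatles document (two members only)
beatles_xml <- '<the_beatles>
  <formed>1960</formed>
  <members>
    <member><first_name>George</first_name><last_name>Harrison</last_name></member>
    <member><first_name>Ringo</first_name><last_name>Starr</last_name></member>
  </members>
</the_beatles>'

# asText = TRUE states explicitly that the input is XML content, not a file name
doc  <- xmlTreeParse(beatles_xml, asText = TRUE)
root <- xmlRoot(doc)

root_name     <- xmlName(root)               # name of the root node
root_children <- names(xmlChildren(root))    # names of the root's direct children
member_count  <- xmlSize(root[["members"]])  # number of children of <members>

print(root_name)
print(root_children)
print(member_count)
```

xmlName, xmlChildren, and xmlSize are enough to walk a small tree by hand, which is handy for checking a document's shape before writing extraction code.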
If we want to get the first names of all the members, we have to store the root node of the DOM and iterate over the children of the members node. In particular, we use the sapply function (which applies a function to each element of a vector) over the children with a function that returns the XML value of the first_name node. Concretely:
root <- xmlRoot(the_beatles)
sapply(xmlChildren(root[["members"]]), function(x){
  xmlValue(x[["first_name"]])
})
-------------------------------------------
  member   member   member   member
"George"  "Ringo"   "Paul"   "John"
Though it's possible to work with the DOM in this manner, it is much more common to interrogate XML using XPath.
XPath is kind of like an XML query language—like SQL, but for XML. It allows us to select nodes that match a particular pattern or location. For matching, it uses path expressions that identify nodes based on their name, location, or relationships with other nodes.
This powerful tool also comes with a proportionally steep learning curve. Luckily, it is somewhat easy to get started. In addition, there are a lot of great tutorials online. The excellent tutorial that taught me XPath is available at http://www.w3schools.com/xsl/xpath_intro.asp.
To use XPath, we have to re-import the XML using the xmlParse (not xmlTreeParse) function, which uses a different, optimized internal representation. To replicate the results of the previous code snippet using XPath, we are going to use the following XPath statement:
all_first_names <- "//member/first_name"
The preceding statement roughly translates to "for all member nodes occurring anywhere in the document, get the child node named first_name".
the_beatles <- xmlParse(example_xml1)
getNodeSet(the_beatles, all_first_names)
--------
[[1]]
<first_name>George</first_name>

[[2]]
<first_name>Ringo</first_name>

[[3]]
<first_name>Paul</first_name>

[[4]]
<first_name>John</first_name>

attr(,"class")
[1] "XMLNodeSet"
Equivalent XPath expressions could also be written thus:
getNodeSet(the_beatles, "//first_name")
getNodeSet(the_beatles, "/the_beatles/members/member/first_name")
And just the XML values for each node can be extracted thus:
sapply(getNodeSet(the_beatles, all_first_names), xmlValue)
-------------------------------
[1] "George" "Ringo"  "Paul"   "John"
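XPath can also filter nodes by position or by the content of their children, using predicates in square brackets. The following sketch is an illustration that goes beyond the examples in this section: it uses xpathSApply, a convenience function in the XML package that runs a query and applies a function to each match, on a self-contained copy of the document (the name beatles_xml is made up for this example).

```r
library(XML)

# Hypothetical self-contained copy of the element-based Beatles document
beatles_xml <- '<the_beatles>
  <formed>1960</formed>
  <members>
    <member><first_name>George</first_name><last_name>Harrison</last_name></member>
    <member><first_name>Ringo</first_name><last_name>Starr</last_name></member>
    <member><first_name>Paul</first_name><last_name>McCartney</last_name></member>
    <member><first_name>John</first_name><last_name>Lennon</last_name></member>
  </members>
</the_beatles>'

doc <- xmlParse(beatles_xml, asText = TRUE)

# Positional predicate: the first member node under its parent
first_member <- xpathSApply(doc, "//member[1]/first_name", xmlValue)

# Content predicate: the member whose last_name child equals 'Starr'
starr_first <- xpathSApply(doc, "//member[last_name='Starr']/first_name", xmlValue)

# last() selects the final member node under its parent
last_surname <- xpathSApply(doc, "//member[last()]/last_name", xmlValue)

print(first_member)   # "George"
print(starr_first)    # "Ringo"
print(last_surname)   # "Lennon"
```

Note that XPath positions start at 1, not 0.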
There is more than one way to represent the same information in XML. The following XML is another way of representing the same data about The Beatles. This uses XML attributes instead of nodes for formed, first_name, and last_name:
example_xml2 <- '
<the_beatles formed="1960">
  <members>
    <member first_name="George" last_name="Harrison"/>
    <member first_name="Richard" last_name="Starkey"/>
    <member first_name="Paul" last_name="McCartney"/>
    <member first_name="John" last_name="Lennon"/>
  </members>
</the_beatles>'
In this case, retrieving a vector of all first names can be done using this snippet:
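# re-parse, so that the_beatles holds the attribute-based document
the_beatles <- xmlParse(example_xml2)
sapply(getNodeSet(the_beatles, "//member[@first_name]"), function(x){
  xmlAttrs(x)[["first_name"]]
})
-----------
[1] "George"  "Richard" "Paul"    "John"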
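A slightly shorter route to the same attribute values is sketched below, offered as an alternative and not as the approach used in this chapter. It relies on xmlGetAttr from the XML package, which returns a single attribute of a node and accepts a default for attributes that are absent. The variable names and the middle_name attribute are made up for illustration.

```r
library(XML)

# Hypothetical self-contained copy of the attribute-based members list
members_xml <- '<members>
  <member first_name="George" last_name="Harrison"/>
  <member first_name="Richard" last_name="Starkey"/>
  <member first_name="Paul" last_name="McCartney"/>
  <member first_name="John" last_name="Lennon"/>
</members>'

doc <- xmlParse(members_xml, asText = TRUE)

# Extra arguments to xpathSApply are passed on to the function (here, the attribute name)
first_names <- xpathSApply(doc, "//member", xmlGetAttr, "first_name")

# Attributes that do not exist fall back to the supplied default
middle_names <- xpathSApply(doc, "//member", xmlGetAttr, "middle_name",
                            default = NA_character_)

print(first_names)
print(middle_names)
```

Supplying a default keeps the result a plain vector even when some nodes lack the attribute.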
Working through a real-life example may help in understanding XML processing in R.
There is a repository of music information called MusicBrainz (http://musicbrainz.org). Like Last.fm, this website kindly allows custom queries against their info database, and returns the results in XML format.
We will use this service to extend the recommendation system that we created just using tags from Last.fm by combining them with tags from MusicBrainz.
To query the database for a particular artist, the format is as follows:
http://musicbrainz.org/ws/2/artist/?query=artist:<THE_ARTIST>
For example, the query for Kate Bush is: http://musicbrainz.org/ws/2/artist/?query=artist:Kate%20Bush
If you visit that link, you'll see that it returns an XML document that contains a list of artists that match the search to varying degrees. The list contains, among others, John Bush, Shelly Bush, and Bush. Luckily, the matches are in order of descending matchiness and, for all the artists that we'll be working with, the correct artist is the first artist in the node artist-list.
In case you can't view the link yourself, the following is essentially the structure of it:
<metadata xmlns="http://musicbrainz.org/ns/mmd-2.0#">
  <artist-list>
    <artist>
      <name>Kate Bush</name>
      <tag-list>
        <tag count="1">
          <name>kent</name>
        </tag>
        <tag count="1">
          <name>english</name>
        </tag>
        <tag count="3">
          <name>british</name>
        </tag>
      </tag-list>
    </artist>
  </artist-list>
</metadata>
This means that the XPath expression that selects all the tags (of the first artist) is given by: //artist[1]/tag-list/tag/name
As with JSON/Last.fm, let's write the function that, for any given artist, returns the appropriate query URL:
create_artist_query_url_mb <- function(artist){
  encoded_artist <- URLencode(artist)
  return(paste0("http://musicbrainz.org/ws/2/artist/?query=artist:",
                encoded_artist))
}

create_artist_query_url_mb("Depeche Mode")
-------
[1] "http://musicbrainz.org/ws/2/artist/?query=artist:Depeche%20Mode"
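One caveat worth sketching: by default, URLencode leaves reserved URL characters such as / and & untouched, so an artist name containing them would be pasted into the query string unescaped. The snippet below only illustrates the reserved argument of URLencode with base R. Whether the function above needs it depends on the artists being queried, and the names used here are just examples.

```r
# Spaces are always percent-encoded
plain <- URLencode("Kate Bush")

# Reserved characters are left alone by default ...
default_slash <- URLencode("AC/DC")

# ... and encoded only when reserved = TRUE
encoded_slash <- URLencode("AC/DC", reserved = TRUE)
encoded_amp   <- URLencode("Simon & Garfunkel", reserved = TRUE)

print(plain)          # "Kate%20Bush"
print(default_slash)  # "AC/DC"
print(encoded_slash)  # "AC%2FDC"
print(encoded_amp)    # "Simon%20%26%20Garfunkel"
```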
Now, let's write the function that returns the list of tags for a particular artist.
Because nothing is ever easy, the XPath mentioned in the preceding code will not work as is. This is because the MusicBrainz XML uses an XML namespace. Though it makes our job (marginally) harder, an XML namespace is generally a good thing, because it eliminates ambiguity when referring to element names between different XML documents whose element names are arbitrarily defined by the developer.
As the response suggests, the namespace is given by http://musicbrainz.org/ns/mmd-2.0#. In order to use this in our tag extraction function and XPath selecting, we need to store and name this namespace first:
ns <- "http://musicbrainz.org/ns/mmd-2.0#"
names(ns)[1] <- "ns"
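The effect of the named namespace can be tried offline before querying the live service. The following sketch parses a hand-written document that mimics the structure shown earlier. It is not a real MusicBrainz response, and the variable names are made up for this example. It then runs a prefixed XPath query against it.

```r
library(XML)

# Hand-written stand-in for a MusicBrainz response (not real data from the service)
mb_xml <- '<metadata xmlns="http://musicbrainz.org/ns/mmd-2.0#">
  <artist-list>
    <artist>
      <name>Kate Bush</name>
      <tag-list>
        <tag count="1"><name>english</name></tag>
        <tag count="3"><name>british</name></tag>
      </tag-list>
    </artist>
  </artist-list>
</metadata>'

doc <- xmlParse(mb_xml, asText = TRUE)

# The prefix "ns" is our own choice; only the URI has to match the document
ns <- c(ns = "http://musicbrainz.org/ns/mmd-2.0#")

# Every step of the path carries the prefix
tag_names <- xpathSApply(doc, "//ns:artist[1]/ns:tag-list/ns:tag/ns:name",
                         xmlValue, namespaces = ns)

# The count attribute of each tag, returned as character strings
tag_counts <- xpathSApply(doc, "//ns:tag", xmlGetAttr, "count", namespaces = ns)

print(tag_names)
print(tag_counts)
```

The prefix used in the XPath expression does not have to appear anywhere in the document itself. It is bound to the namespace URI through the named vector.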
Now we have all we need to write the MusicBrainz counterpart to the get_tag_vector_lfm function.
get_tag_vector_mb <- function(an_artist, ns){
  artist_url <- create_artist_query_url_mb(an_artist)
  the_xml <- xmlParse(artist_url)
  xpath <- "//ns:artist[1]/ns:tag-list/ns:tag/ns:name"
  the_nodes <- getNodeSet(the_xml, xpath, ns)
  return(unlist(lapply(the_nodes, xmlValue)))
}

get_tag_vector_mb("Depeche Mode", ns)
-------------------------------------
[1] "electronica"       "post punk"         "alternative dance"
[4] "electronic"        "dark wave"         "britannique"
............
Like fromJSON, xmlParse handles URLs natively.
Let's finish this up:
our_artists <- list("Kate Bush", "Peter Tosh", "Radiohead",
                    "The Smiths", "The Cure", "Black Uhuru")
our_artists_tags_mb <- lapply(our_artists, get_tag_vector_mb, ns)
names(our_artists_tags_mb) <- our_artists

sim_matrix <- similarity_matrix(our_artists_tags_mb, jaccard_index)
print(sim_matrix)
-------
            Kate Bush Peter Tosh Radiohead The Smiths The Cure Black Uhuru
Kate Bush        1.00       0.00      0.24       0.27     0.24        0.00
Peter Tosh       0.00       1.00      0.00       0.00     0.00        0.17
Radiohead        0.24       0.00      1.00       0.23     0.23        0.00
The Smiths       0.27       0.00      0.23       1.00     0.38        0.00
The Cure         0.24       0.00      0.23       0.38     1.00        0.00
Black Uhuru      0.00       0.17      0.00       0.00     0.00        1.00

sim_matrix[order(sim_matrix[,4], decreasing=TRUE), 4]
-------------------------------
 The Smiths    The Cure   Kate Bush   Radiohead  Peter Tosh Black Uhuru
       1.00        0.38        0.27        0.23        0.00        0.00
This yields results that are quite similar to the recommendation system that uses tags from only Last.fm. Personally, I like the former better, but how about we combine both? We can do this easily by taking the set union of artists' tags between the two services.
for(i in 1:length(our_artists_tags)){
  the_artist <- names(our_artists_tags)[i]
  # the_artist now holds the current artist's name
  combined_tags <- union(our_artists_tags[[the_artist]],
                         our_artists_tags_mb[[the_artist]])
  our_artists_tags[[the_artist]] <- combined_tags
}

sim_matrix <- similarity_matrix(our_artists_tags, jaccard_index)
print(sim_matrix)
--------
            Kate Bush Peter Tosh Radiohead The Smiths The Cure Black Uhuru
Kate Bush        1.00       0.04      0.29       0.24     0.19        0.03
Peter Tosh       0.04       1.00      0.01       0.03     0.03        0.29
Radiohead        0.29       0.01      1.00       0.29     0.30        0.03
The Smiths       0.24       0.03      0.29       1.00     0.40        0.05
The Cure         0.19       0.03      0.30       0.40     1.00        0.05
Black Uhuru      0.03       0.29      0.03       0.05     0.05        1.00
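To see how merging tag sets can shift a similarity score, here is a toy sketch in base R. The tag vectors are invented, and jaccard_sketch is a stand-in written for this example (intersection size divided by union size). It is not necessarily identical to the jaccard_index function used in this chapter.

```r
# Stand-in Jaccard index: |A intersect B| / |A union B|  (assumed definition)
jaccard_sketch <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Invented tag vectors for two hypothetical artists, A and B
lfm_tags <- list(A = c("rock", "indie"),    B = c("rock", "pop"))
mb_tags  <- list(A = c("british", "indie"), B = c("british"))

# Element-wise union of the two services' tags; duplicates are dropped
combined <- Map(union, lfm_tags, mb_tags)

sim_lfm_only <- jaccard_sketch(lfm_tags$A, lfm_tags$B)   # 1 shared tag out of 3
sim_combined <- jaccard_sketch(combined$A, combined$B)   # 2 shared tags out of 4

print(combined)
print(sim_lfm_only)
print(sim_combined)
```

Map does the same job as the for loop above in one call, and it keeps the list names from its first argument.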
Super!