JavaScript Object Notation (JSON) is a standardized, human-readable data format that plays an enormous role in communication between web browsers and web servers. JSON was originally born out of a need to represent arbitrarily complex data structures in JavaScript (a web scripting language), but it has since grown into a language-agnostic data serialization format.
It is a common need to import and parse JSON in R, particularly when working with web data. For example, it is very common for websites to offer web services that take an arbitrary query from a web browser, and return the response as JSON. We will see an example of this very use case later in this section.
For our first look into JSON parsing for R, we'll use the jsonlite package to read a small JSON string, which serializes some information about the best musical act in history, The Beatles:
library(jsonlite)

example.json <- '
{
  "thebeatles": {
    "formed": 1960,
    "members": [
      { "firstname": "George", "lastname": "Harrison" },
      { "firstname": "Ringo", "lastname": "Starr" },
      { "firstname": "Paul", "lastname": "McCartney" },
      { "firstname": "John", "lastname": "Lennon" }
    ]
  }
}'

the_beatles <- fromJSON(example.json)
print(the_beatles)
---------------------
$thebeatles
$thebeatles$formed
[1] 1960

$thebeatles$members
  firstname  lastname
1    George  Harrison
2     Ringo     Starr
3      Paul McCartney
4      John    Lennon
We used the fromJSON function to read in the string. The result is an R list, whose elements can be accessed via the $ operator or the [[ (double square bracket) operator. For example, we can access the year in which The Beatles formed in the following two ways:
the_beatles$thebeatles$formed
the_beatles[["thebeatles"]][["formed"]]
---------
[1] 1960
[1] 1960
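The same accessors reach into nested structures. As a minimal sketch (using a shortened, two-member version of the JSON string, and assuming jsonlite's default behavior of simplifying an array of objects into a data frame), the members element can be handled like any other data frame:

```r
library(jsonlite)

# Shortened version of the example string (two members only)
example_json <- '{"thebeatles": {"formed": 1960, "members": [
  {"firstname": "George", "lastname": "Harrison"},
  {"firstname": "Ringo",  "lastname": "Starr"}]}}'

the_beatles <- fromJSON(example_json)

# The array of member objects arrives as a data frame
members <- the_beatles$thebeatles$members
print(class(members))

# Columns are reached with $, rows with ordinary subsetting
first_names <- members$firstname
starr_first <- members[members$lastname == "Starr", "firstname"]
print(first_names)
print(starr_first)
```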
Now that we have the very basics of handling JSON down, let's move on to using it in a non-trivial manner!
There's a music/social-media-platform called http://www.last.fm that kindly provides a web service API that's free for public use (as long as you abide by their reasonable terms). This API (Application Programming Interface) allows us to query various points of data about musical artists by crafting special URLs. The results of following these URLs are either a JSON or XML payload, which are directly consumable from R.
In this non-trivial example of using web data, we will be building a rudimentary recommendation system. Our system will allow us to suggest new music to a particular person based on an artist that they already like. In order to do this, we have to query the Last.fm API to gather all the tags associated with particular artists. These tags function a lot like genre classifications. The success of our recommendation system will be predicated on the assumption that musical artists with overlapping tags are more similar to each other than artists with disparate tags, and that someone is more likely to enjoy similar artists than an arbitrary dissimilar artist.
Here's an example JSON excerpt of the result of querying the API for tags of a particular artist:
{
  "toptags": {
    "tag": [
      {
        "count": 100,
        "name": "female vocalists",
        "url": "http://www.last.fm/tag/female+vocalists"
      },
      {
        "count": 71,
        "name": "singer-songwriter",
        "url": "http://www.last.fm/tag/singer-songwriter"
      },
      {
        "count": 65,
        "name": "pop",
        "url": "http://www.last.fm/tag/pop"
      }
    ]
  }
}
Here, we only care about the name of the tag—not the URL, or the count of occasions Last.fm users applied each tag to the artist.
Let's first create a function that will construct the properly formatted query URL for a particular artist. The Last.fm developer website indicates that the format is:
http://ws.audioscrobbler.com/2.0/?method=artist.gettoptags&artist=<THE_ARTIST>&api_key=c2e57923a25c03f3d8b317b3c8622b43&format=json
In order to create these URLs based upon arbitrary input, we can use the paste0 function to concatenate the component strings. However, URLs can't handle certain characters such as spaces; in order to convert the artist's name to a format suitable for a URL, we'll use the URLencode function from the (preloaded) utils package.
URLencode("The Beatles")
-------
[1] "The%20Beatles"
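One caveat worth sketching: by default, URLencode leaves reserved characters such as & untouched, and & is also what separates the parameters of a query string. The artist name below is only an illustration (it is not one of the artists used in this section); passing reserved = TRUE escapes such characters as well, which may be the safer choice when the name is embedded in a query parameter.

```r
# Default behavior: the space is escaped, but the reserved "&" is left alone
encoded_default <- URLencode("Simon & Garfunkel")
print(encoded_default)
# [1] "Simon%20&%20Garfunkel"

# With reserved = TRUE, the "&" is percent-encoded too
encoded_reserved <- URLencode("Simon & Garfunkel", reserved = TRUE)
print(encoded_reserved)
# [1] "Simon%20%26%20Garfunkel"
```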
Now we have all the pieces to put this function together:
create_artist_query_url_lfm <- function(artist_name){
  prefix <- "http://ws.audioscrobbler.com/2.0/?method=artist.gettoptags&artist="
  postfix <- "&api_key=c2e57923a25c03f3d8b317b3c8622b43&format=json"
  encoded_artist <- URLencode(artist_name)
  return(paste0(prefix, encoded_artist, postfix))
}

create_artist_query_url_lfm("Depeche Mode")
--------------------
[1] "http://ws.audioscrobbler.com/2.0/?method=artist.gettoptags&artist=Depeche%20Mode&api_key=c2e57923a25c03f3d8b317b3c8622b43&format=json"
Fantastic! Now we make the web request, and parse the resulting JSON. Luckily, the fromJSON function that we've been using can take a URL and automatically make the web request for us. Let's see what it looks like:
fromJSON(create_artist_query_url_lfm("Depeche Mode"))
-----------------------------------------
$toptags
$toptags$tag
  count       name                               url
1   100 electronic http://www.last.fm/tag/electronic
2    87   new wave   http://www.last.fm/tag/new+wave
3    59        80s        http://www.last.fm/tag/80s
4    56  synth pop  http://www.last.fm/tag/synth+pop
........
Neat-o! If you take a close look at the structure, you'll see that the tag names are stored in the name attribute of the tag attribute of the toptags attribute (whew!). This means we can extract just the tag names with $toptags$tag$name. Let's write a function that will take an artist's name and return that artist's tags as a character vector.
get_tag_vector_lfm <- function(an_artist){
  artist_url <- create_artist_query_url_lfm(an_artist)
  json <- fromJSON(artist_url)
  return(json$toptags$tag$name)
}

get_tag_vector_lfm("Depeche Mode")
------------------------------------------
[1] "electronic"  "new wave"    "80s"
[4] "synth pop"   "synthpop"    "seen live"
[7] "alternative" "rock"        "british"
........
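The extraction step can be checked without making a web request. The sketch below feeds fromJSON the three-tag excerpt shown earlier as a literal string (the variable names are illustrative) and applies the same $toptags$tag$name path:

```r
library(jsonlite)

# Excerpt of a top-tags response, copied from the example earlier in this section
sample_response <- '{"toptags": {"tag": [
  {"count": 100, "name": "female vocalists",
   "url": "http://www.last.fm/tag/female+vocalists"},
  {"count": 71, "name": "singer-songwriter",
   "url": "http://www.last.fm/tag/singer-songwriter"},
  {"count": 65, "name": "pop",
   "url": "http://www.last.fm/tag/pop"}
]}}'

# Same extraction path as in get_tag_vector_lfm, minus the network call
parsed <- fromJSON(sample_response)
tag_names <- parsed$toptags$tag$name
print(tag_names)
# [1] "female vocalists"  "singer-songwriter" "pop"
```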
Next, we have to go ahead and retrieve the tags for all artists. Instead of doing this (and probably violating Last.fm's terms of service), we'll just pretend that there are only six musical artists in the world. We'll store all of these artists in a list. This will make it easy to use the lapply function to apply the get_tag_vector_lfm function to each artist in the list. Finally, we'll name all the elements in the list appropriately:
our_artists <- list("Kate Bush", "Peter Tosh", "Radiohead",
                    "The Smiths", "The Cure", "Black Uhuru")
our_artists_tags <- lapply(our_artists, get_tag_vector_lfm)
names(our_artists_tags) <- our_artists

print(our_artists_tags)
--------------------------------------
$`Kate Bush`
[1] "female vocalists"  "singer-songwriter" "pop"
[4] "alternative"       "80s"               "british"
........
$`Peter Tosh`
[1] "reggae"       "roots reggae" "Rasta"
[4] "roots"        "ska"          "jamaican"
........
$Radiohead
[1] "alternative"      "alternative rock"
[3] "rock"             "indie"
........
$`The Smiths`
[1] "indie"       "80s"         "post-punk"
[4] "new wave"    "alternative" "rock"
........
$`The Cure`
[1] "post-punk"   "new wave"    "alternative"
[4] "80s"         "rock"        "seen live"
........
$`Black Uhuru`
[1] "reggae"       "roots reggae" "dub"
[4] "jamaica"      "roots"        "jamaican"
........
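As an aside, the apply-then-name pattern can be collapsed into one call. This is only a sketch of an alternative idiom, not the approach used in this section: when sapply is given a character vector and simplify = FALSE, it returns a list that is already named after its input. The fake_tag_lookup function and its canned tags are hypothetical stand-ins for get_tag_vector_lfm, used so that no web request is made.

```r
# Hypothetical stand-in for get_tag_vector_lfm (no network access needed)
fake_tag_lookup <- function(an_artist) {
  canned_tags <- list(
    "Peter Tosh"  = c("reggae", "roots reggae"),
    "Black Uhuru" = c("reggae", "dub")
  )
  canned_tags[[an_artist]]
}

artists <- c("Peter Tosh", "Black Uhuru")

# simplify = FALSE keeps the result as a list; names are taken from the input
tags_by_artist <- sapply(artists, fake_tag_lookup, simplify = FALSE)
print(names(tags_by_artist))
print(tags_by_artist[["Black Uhuru"]])
```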
Now that we have all the artists' tags stored as a list of vectors, we need some way of comparing the tag lists and judging them for similarity.
The first idea that may come to mind is to count the number of tags each pair of artists has in common. Though this may seem like a good idea at first glance, consider the following scenario:
Artist A and artist B have hundreds of tags each, and they share three tags in common; artist C and D each have two tags, both of which are mutually shared. Our naive metric for similarity suggests that artists A and B are more similar than C and D (by 50%). If your intuition tells you that C and D are more similar, though, we are both in agreement.
To make our similarity measure comport more with our intuition, we will instead use the Jaccard index. The Jaccard index (also Jaccard coefficient) between sets A and B, $J(A, B)$, is given by:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where $A \cap B$ is the set intersection (the common tags), $A \cup B$ is the set union (an unduplicated list of all the tags in both sets), and $|X|$ is the set X's cardinality (the number of elements in that set).
This metric has the attractive property that it is naturally constrained:

$$0 \le J(A, B) \le 1$$
Let's write a function that takes two sets, and returns the Jaccard index. We'll employ the built-in functions intersect and union.
jaccard_index <- function(one, two){
  length(intersect(one, two))/length(union(one, two))
}
Let's try it on The Cure and Radiohead:
jaccard_index(our_artists_tags[["Radiohead"]], our_artists_tags[["The Cure"]])
---------------
[1] 0.3333
Neat! Manual checking confirms that this is the right answer.
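To see how the Jaccard index handles the earlier scenario with artists A, B, C, and D, here is a small sketch with made-up tag sets. The tag names are placeholders, and "hundreds of tags" is assumed to be exactly 100 per artist to keep the arithmetic easy to follow.

```r
jaccard_index <- function(one, two) {
  length(intersect(one, two)) / length(union(one, two))
}

# Hypothetical artists A and B: 100 tags each, three of them shared
shared_ab <- c("rock", "pop", "80s")
artist_a <- c(shared_ab, paste0("a_tag_", 1:97))
artist_b <- c(shared_ab, paste0("b_tag_", 1:97))

# Hypothetical artists C and D: two tags each, both shared
artist_c <- c("dub", "reggae")
artist_d <- c("reggae", "dub")

# Naive metric: raw count of shared tags favors A and B (3 versus 2)
naive_ab <- length(intersect(artist_a, artist_b))
naive_cd <- length(intersect(artist_c, artist_d))

# Jaccard index: 3 / 197 for A and B, 2 / 2 for C and D
jaccard_ab <- jaccard_index(artist_a, artist_b)
jaccard_cd <- jaccard_index(artist_c, artist_d)
print(c(naive_ab = naive_ab, naive_cd = naive_cd))
print(c(jaccard_ab = jaccard_ab, jaccard_cd = jaccard_cd))
```

The union of A's and B's tags has 100 + 100 - 3 = 197 elements, so their index is roughly 0.015, while C and D score a perfect 1, which matches the intuition that C and D are the more similar pair.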
The next step is to construct a similarity matrix. This is an $n \times n$ matrix (where $n$ is the number of artists) that depicts all the pairwise similarity measurements. If this explanation is confusing, look at the code output before reading the following code snippet:
similarity_matrix <- function(artist_list, similarity_fn) {
  num <- length(artist_list)

  # initialize a num by num matrix of zeroes
  sim_matrix <- matrix(0, ncol = num, nrow = num)

  # name the rows and columns for easy lookup
  rownames(sim_matrix) <- names(artist_list)
  colnames(sim_matrix) <- names(artist_list)

  # for each row in the matrix
  for(i in 1:nrow(sim_matrix)) {
    # and each column
    for(j in 1:ncol(sim_matrix)) {
      # calculate that pair's similarity
      the_index <- similarity_fn(artist_list[[i]], artist_list[[j]])
      # and store it in the right place in the matrix
      sim_matrix[i,j] <- round(the_index, 2)
    }
  }
  return(sim_matrix)
}

sim_matrix <- similarity_matrix(our_artists_tags, jaccard_index)
print(sim_matrix)
--------------------------------------------------------------
            Kate Bush Peter Tosh Radiohead The Smiths The Cure Black Uhuru
Kate Bush        1.00       0.05      0.31       0.25     0.21        0.04
Peter Tosh       0.05       1.00      0.02       0.03     0.03        0.33
Radiohead        0.31       0.02      1.00       0.31     0.33        0.04
The Smiths       0.25       0.03      0.31       1.00     0.44        0.05
The Cure         0.21       0.03      0.33       0.44     1.00        0.05
Black Uhuru      0.04       0.33      0.04       0.05     0.05        1.00
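For readers who prefer to avoid explicit loops, the same matrix can be built with outer. This is a sketch of an alternative, not a replacement for the function above; the three tag sets are small made-up examples, and Vectorize is needed because outer passes whole index vectors to its function rather than one pair at a time.

```r
jaccard_index <- function(one, two) {
  length(intersect(one, two)) / length(union(one, two))
}

# Small hypothetical tag sets, for illustration only
tags <- list(
  A = c("x", "y", "z"),
  B = c("y", "z", "w"),
  C = c("q")
)

# Compare every pair of list positions (i, j)
idx <- seq_along(tags)
pair_similarity <- Vectorize(function(i, j) {
  jaccard_index(tags[[i]], tags[[j]])
})
sim <- outer(idx, idx, pair_similarity)
dimnames(sim) <- list(names(tags), names(tags))

print(round(sim, 2))
```

Here A and B share two of four distinct tags, so their entry is 0.5, while C shares nothing with either and scores 0 against both.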
If you're familiar with some of these bands, you'll no doubt see that the similarity matrix in the preceding output makes a lot of prima facie sense—it looks like our theory is sound!
Notice that the values along the diagonal (from the upper-left point to the lower-right) are all 1. This is because the Jaccard index of two identical sets is always 1, so an artist's similarity with itself is always 1. Additionally, all the values are symmetric with respect to the diagonal; whether you look up Peter Tosh and Radiohead by column and then row, or vice versa, the value will be the same (0.02). This property means that the matrix is symmetric. This is a property of all similarity matrices using symmetric (commutative) similarity functions.
A similar (and perhaps more common) concept is that of a distance matrix (or dissimilarity matrix). The idea is the same, but now the values that are higher will refer to more musically distant pairs of artists. Also, the diagonal will be zeroes, since an artist is less musically different from themselves than from any other artist. If all the values of a similarity matrix are between 0 and 1 (as is often the case), you can easily make it into a distance matrix by subtracting every element from 1. Subtracting from 1 again will yield the original similarity matrix.
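As a brief sketch of that conversion, the snippet below takes three artists' similarity values from the output above, computes 1 minus each element, and confirms that repeating the operation recovers the original. The as.dist call at the end is an optional extra that wraps the result in base R's distance object.

```r
artists <- c("The Smiths", "The Cure", "Black Uhuru")

# Similarity values for these three artists, taken from the matrix above
sim <- matrix(c(1.00, 0.44, 0.05,
                0.44, 1.00, 0.05,
                0.05, 0.05, 1.00),
              nrow = 3, byrow = TRUE,
              dimnames = list(artists, artists))

# Similarity to distance: subtract every element from 1
dist_matrix <- 1 - sim
print(dist_matrix)

# Doing it again gives back the similarity matrix
round_trip <- 1 - dist_matrix
print(isTRUE(all.equal(round_trip, sim)))

# Optional: the same values as a base R "dist" object
print(as.dist(dist_matrix))
```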
Recommendations can now be furnished for listeners of one of the bands by sorting that artist's column of the matrix in descending order. For example, suppose a user likes The Smiths but is unsure what other bands she should try listening to:
# The Smiths are the fourth column
sim_matrix[order(sim_matrix[,4], decreasing=TRUE), 4]
----------------------------------------------
 The Smiths    The Cure   Radiohead   Kate Bush Black Uhuru
       1.00        0.44        0.31        0.25        0.05
 Peter Tosh
       0.03
Of course, a recommendation of The Smiths for this user is nonsensical. Going down the list, it looks like a recommendation of The Cure is the safest bet, though Radiohead and Kate Bush may also be fine recommendations. Black Uhuru and Peter Tosh are unsafe bets if all we know about the user is a fondness for The Smiths.
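One way to avoid the nonsensical self-recommendation is to drop the artist's own entry before sorting. The helper below is a hypothetical sketch (the name recommend and its n argument are not from this section), and it looks the column up by name rather than by position. To stay self-contained it uses a four-artist subset of the similarity values shown earlier.

```r
artists <- c("Kate Bush", "Radiohead", "The Smiths", "The Cure")

# Subset of the similarity matrix computed earlier in this section
sim_matrix <- matrix(c(1.00, 0.31, 0.25, 0.21,
                       0.31, 1.00, 0.31, 0.33,
                       0.25, 0.31, 1.00, 0.44,
                       0.21, 0.33, 0.44, 1.00),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(artists, artists))

# Hypothetical helper: top n most similar artists, excluding the artist itself
recommend <- function(sim_matrix, artist, n = 3) {
  scores <- sim_matrix[, artist]
  scores <- scores[names(scores) != artist]
  head(sort(scores, decreasing = TRUE), n)
}

top_two <- recommend(sim_matrix, "The Smiths", n = 2)
print(top_two)
```

For The Smiths, this returns The Cure (0.44) followed by Radiohead (0.31), in line with the ordering discussed above.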