Using JSON

JavaScript Object Notation (JSON) is a standardized, human-readable data format that plays an enormous role in communication between web browsers and web servers. JSON was originally born out of a need to represent arbitrarily complex data structures in JavaScript (a web scripting language), but it has since grown into a language-agnostic data serialization format.

It is a common need to import and parse JSON in R, particularly when working with web data. For example, it is very common for websites to offer web services that take an arbitrary query from a web browser, and return the response as JSON. We will see an example of this very use case later in this section.

For our first look into JSON parsing in R, we'll use the jsonlite package to read a small JSON string, which serializes some information about the best musical act in history, The Beatles:

library(jsonlite)

example.json <- '
{
  "thebeatles": {
    "formed": 1960,
    "members": [
      {
        "firstname": "George",
        "lastname": "Harrison"
      },
      {
        "firstname": "Ringo",
        "lastname": "Starr"
      },
      {
        "firstname": "Paul",
        "lastname": "McCartney"
      },
      {
        "firstname": "John",
        "lastname": "Lennon"
      }
    ]
  }
}'

the_beatles <- fromJSON(example.json)
print(the_beatles)
---------------------
$thebeatles
$thebeatles$formed
[1] 1960

$thebeatles$members
  firstname  lastname
1    George  Harrison
2     Ringo     Starr
3      Paul McCartney
4      John    Lennon

We used the fromJSON function to read in the string. The result is an R list, whose elements can be accessed via the $ operator or the [[ (double square bracket) operator. For example, we can access the year in which The Beatles formed in the following two ways:

the_beatles$thebeatles$formed
the_beatles[["thebeatles"]][["formed"]]
---------
[1] 1960
[1] 1960

Note

In R, a list is a data structure that is kind of like a vector, but allows elements of differing data types. A single list may contain numerics, strings, vectors, or even other lists!
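To make the note concrete, here is a minimal sketch in base R. The band list is built by hand to mirror the structure that fromJSON returned above; the name and misc elements are made up for illustration and are not part of the original JSON.

```r
# A hand-built list mirroring the parsed structure shown earlier.
# (Illustrative stand-in; "name" and "misc" are hypothetical additions.)
band <- list(
  formed  = 1960,                      # a numeric
  name    = "The Beatles",             # a string
  members = data.frame(                # a data frame
    firstname = c("George", "Ringo", "Paul", "John"),
    lastname  = c("Harrison", "Starr", "McCartney", "Lennon"),
    stringsAsFactors = FALSE
  ),
  misc    = list(active = FALSE)       # even another list
)

# $ and [[ ]] both return the element itself
band$formed
band[["formed"]]

# single brackets return a one-element list, not the element
class(band["formed"])

# nested access: one column of the members data frame
band$members$lastname
```

Note the difference between single and double brackets: band["formed"] is still a list, which is a common source of confusion when digging into parsed JSON.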

Now that we have the very basics of handling JSON down, let's move on to using it in a non-trivial manner!

There's a music and social media platform called Last.fm (http://www.last.fm) that kindly provides a web service API that's free for public use (as long as you abide by their reasonable terms). This API (Application Programming Interface) allows us to query various data points about musical artists by crafting special URLs. Following one of these URLs returns either a JSON or an XML payload, both of which are directly consumable from R.

In this non-trivial example of using web data, we will be building a rudimentary recommendation system. Our system will allow us to suggest new music to a particular person based on an artist that they already like. In order to do this, we have to query the Last.fm API to gather all the tags associated with particular artists. These tags function a lot like genre classifications. The success of our recommendation system will be predicated on the assumption that musical artists with overlapping tags are more similar to each other than artists with disparate tags, and that someone is more likely to enjoy similar artists than an arbitrary dissimilar artist.

Here's an example JSON excerpt of the result of querying the API for tags of a particular artist:

{
  "toptags": {
    "tag": [
      {
        "count": 100,
        "name": "female vocalists",
        "url": "http://www.last.fm/tag/female+vocalists"
      },
      {
        "count": 71,
        "name": "singer-songwriter",
        "url": "http://www.last.fm/tag/singer-songwriter"
      },
      {
        "count": 65,
        "name": "pop",
        "url": "http://www.last.fm/tag/pop"
      }
    ]
  }
}

Here, we only care about the name of the tag—not the URL, or the count of occasions Last.fm users applied each tag to the artist.

Let's first create a function that will construct the properly formatted query URL for a particular artist. The Last.fm developer website indicates that the format is:

http://ws.audioscrobbler.com/2.0/?method=artist.gettoptags&artist=<THE_ARTIST>&api_key=c2e57923a25c03f3d8b317b3c8622b43&format=json

In order to create these URLs based upon arbitrary input, we can use the paste0 function to concatenate the component strings. However, URLs can't handle certain characters such as spaces; in order to convert the artist's name to a format suitable for a URL, we'll use the URLencode function from the (preloaded) utils package.

URLencode("The Beatles")
-------
[1] "The%20Beatles"
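One caveat may be worth a quick illustration. By default, URLencode leaves reserved characters such as & and / untouched, because they carry meaning in a complete URL. When encoding a single query value, such as an artist's name, passing reserved = TRUE is likely the safer choice. The artist names below are picked only to demonstrate the difference.

```r
# URLencode comes from utils, which is attached by default

# default: the space is encoded, but the ampersand is left alone
default_encoding <- URLencode("Simon & Garfunkel")

# reserved = TRUE: the ampersand is encoded too, so it cannot be
# mistaken for a separator between query parameters
strict_encoding <- URLencode("Simon & Garfunkel", reserved = TRUE)

# the same applies to a forward slash
slash_encoding <- URLencode("AC/DC", reserved = TRUE)

print(default_encoding)
print(strict_encoding)
print(slash_encoding)
```

For the artists used in the rest of this section the two settings give the same result, since none of their names contain reserved characters.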

Now we have all the pieces to put this function together:

create_artist_query_url_lfm <- function(artist_name){
  prefix <- "http://ws.audioscrobbler.com/2.0/?method=artist.gettoptags&artist="
  postfix <- "&api_key=c2e57923a25c03f3d8b317b3c8622b43&format=json"
  encoded_artist <- URLencode(artist_name)
  return(paste0(prefix, encoded_artist, postfix))
}

create_artist_query_url_lfm("Depeche Mode")
--------------------
[1] "http://ws.audioscrobbler.com/2.0/?method=artist.gettoptags&artist=Depeche%20Mode&api_key=c2e57923a25c03f3d8b317b3c8622b43&format=json"

Fantastic! Now we make the web request, and parse the resulting JSON. Luckily, the fromJSON function that we've been using can take a URL and automatically make the web request for us. Let's see what it looks like:

fromJSON(create_artist_query_url_lfm("Depeche Mode"))

-----------------------------------------

$toptags
$toptags$tag
   count              name                                      url
1    100        electronic        http://www.last.fm/tag/electronic
2     87          new wave          http://www.last.fm/tag/new+wave
3     59               80s               http://www.last.fm/tag/80s
4     56         synth pop         http://www.last.fm/tag/synth+pop
  ........

Neat-o! If you take a close look at the structure, you'll see that the tag names are stored in the name element of the tag element of the toptags element (whew!). This means we can extract just the tag names with $toptags$tag$name. Let's write a function that will take an artist's name and return that artist's tags as a character vector.

get_tag_vector_lfm <- function(an_artist){
  artist_url <- create_artist_query_url_lfm(an_artist)
  json <- fromJSON(artist_url)
  return(json$toptags$tag$name)
}

get_tag_vector_lfm("Depeche Mode")

------------------------------------------

 [1] "electronic"        "new wave"          "80s"              
 [4] "synth pop"         "synthpop"          "seen live"        
 [7] "alternative"       "rock"              "british"          
  ........
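If a response does not have the expected shape, $toptags$tag$name quietly evaluates to NULL. The following sketch separates the extraction step from the web request so that it can be tried offline. The parsed object is a hand-built stand-in shaped like the output above, and extract_tag_names is a hypothetical helper, not part of jsonlite or of the Last.fm API.

```r
# Hand-built stand-in for a parsed response, shaped like the output above
parsed <- list(toptags = list(tag = data.frame(
  count = c(100, 87, 59),
  name  = c("electronic", "new wave", "80s"),
  url   = c("http://www.last.fm/tag/electronic",
            "http://www.last.fm/tag/new+wave",
            "http://www.last.fm/tag/80s"),
  stringsAsFactors = FALSE
)))

# Hypothetical helper: pull out the tag names, returning an empty
# character vector when the expected fields are missing
extract_tag_names <- function(parsed) {
  tag_names <- parsed$toptags$tag$name
  if (is.null(tag_names)) character(0) else as.character(tag_names)
}

extract_tag_names(parsed)

# an arbitrary list without a toptags element yields character(0)
extract_tag_names(list(unexpected = TRUE))
```

Returning an empty vector rather than NULL keeps later set operations, such as intersect and union, well behaved.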

Next, we have to go ahead and retrieve the tags for all artists. Instead of doing this (and probably violating Last.fm's terms of service), we'll just pretend that there are only six musical artists in the world. We'll store all of these artists in a list. This will make it easy to use the lapply function to apply the get_tag_vector_lfm function to each artist in the list. Finally, we'll name all the elements in the list appropriately:

our_artists <- list("Kate Bush", "Peter Tosh", "Radiohead",
                     "The Smiths", "The Cure", "Black Uhuru")
our_artists_tags <- lapply(our_artists, get_tag_vector_lfm)
names(our_artists_tags) <- our_artists

print(our_artists_tags)

--------------------------------------

$`Kate Bush`
 [1] "female vocalists"  "singer-songwriter" "pop"              
 [4] "alternative"       "80s"               "british"          
  ........
$`Peter Tosh`
 [1] "reggae"            "roots reggae"      "Rasta"            
 [4] "roots"             "ska"               "jamaican"         
  ........
$Radiohead
 [1] "alternative"           "alternative rock"     
 [3] "rock"                  "indie"                
  ........
$`The Smiths`
 [1] "indie"             "80s"               "post-punk"        
 [4] "new wave"          "alternative"       "rock"             
  ........
$`The Cure`
 [1] "post-punk"        "new wave"         "alternative"     
 [4] "80s"              "rock"             "seen live"       
  ........
$`Black Uhuru`
 [1] "reggae"            "roots reggae"      "dub"              
 [4] "jamaica"           "roots"             "jamaican"         
  ........
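As a small variation on the lapply approach, sapply with simplify = FALSE names the result after a character vector automatically, which saves the separate names() assignment. The lookup function below is a made-up offline stand-in for get_tag_vector_lfm; it returns only the first three tags shown above and makes no web request.

```r
# Offline stand-in for get_tag_vector_lfm (hypothetical; canned data only)
fake_tag_lookup <- function(an_artist) {
  canned <- list(
    "Kate Bush"  = c("female vocalists", "singer-songwriter", "pop"),
    "Peter Tosh" = c("reggae", "roots reggae", "Rasta")
  )
  canned[[an_artist]]
}

artists <- c("Kate Bush", "Peter Tosh")

# simplify = FALSE keeps the result a list;
# USE.NAMES = TRUE names each element after its artist
tags_by_artist <- sapply(artists, fake_tag_lookup,
                         simplify = FALSE, USE.NAMES = TRUE)

names(tags_by_artist)
tags_by_artist[["Peter Tosh"]]
```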

Now that we have all the artists' tags stored as a list of vectors, we need some way of comparing the tag vectors and judging their similarity.

The first idea that may come to mind is to count the number of tags that each pair of artists has in common. Though this may seem like a good idea at first glance, consider the following scenario:

Artist A and artist B have hundreds of tags each, and they share three tags in common; artists C and D each have two tags, both of which are shared. Our naive metric for similarity suggests that artists A and B are more similar than C and D (by 50%). If your intuition tells you that C and D are more similar, though, then we are in agreement.

To make our similarity measure comport more with our intuition, we will instead use the Jaccard index. The Jaccard index (also Jaccard coefficient) between sets A and B, $J(A, B)$, is given by:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where $A \cap B$ is the set intersection (the common tags), $A \cup B$ is the set union (an unduplicated list of all the tags in both sets), and $|X|$ is the set X's cardinality (the number of elements in that set).

This metric has the attractive property that it is naturally constrained:

$$0 \le J(A, B) \le 1$$

Let's write a function that takes two sets, and returns the Jaccard index. We'll employ the built-in functions intersect and union.

jaccard_index <- function(one, two){
  length(intersect(one, two))/length(union(one, two))
}

Let's try it on The Cure and Radiohead:

jaccard_index(our_artists_tags[["Radiohead"]],
              our_artists_tags[["The Cure"]])

---------------

[1] 0.3333

Neat! Manual checking confirms that this is the right answer.
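Returning to the hypothetical artists A through D from earlier, a short sketch shows how the two measures disagree. The tag sets are synthetic, and A and B get only one hundred tags each here to keep the example small.

```r
jaccard_index <- function(one, two){
  length(intersect(one, two))/length(union(one, two))
}

# Synthetic tag sets (made up for illustration)
shared_ab <- c("rock", "indie", "80s")
artist_a <- c(shared_ab, paste0("a_tag_", 1:97))   # 100 tags in total
artist_b <- c(shared_ab, paste0("b_tag_", 1:97))   # 100 tags in total
artist_c <- c("reggae", "dub")
artist_d <- c("dub", "reggae")

# Naive measure: count of shared tags (A and B come out ahead)
naive_ab <- length(intersect(artist_a, artist_b))
naive_cd <- length(intersect(artist_c, artist_d))

# Jaccard index: shared tags relative to all distinct tags
jaccard_ab <- jaccard_index(artist_a, artist_b)   # 3 shared out of 197
jaccard_cd <- jaccard_index(artist_c, artist_d)   # 2 shared out of 2

print(c(naive_ab, naive_cd))
print(c(jaccard_ab, jaccard_cd))
```

The naive count ranks A and B first (3 versus 2), whereas the Jaccard index gives C and D a perfect score of 1 and A and B a score of roughly 0.015.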

The next step is to construct a similarity matrix. This is an $n \times n$ matrix (where $n$ is the number of artists) that depicts all the pairwise similarity measurements. If this explanation is confusing, look at the code output before reading the following code snippet:

similarity_matrix <- function(artist_list, similarity_fn) {
    num <- length(artist_list)
    
    # initialize a num by num matrix of zeroes
    sim_matrix <- matrix(0, ncol = num, nrow = num)
    
    # name the rows and columns for easy lookup
    rownames(sim_matrix) <- names(artist_list)
    colnames(sim_matrix) <- names(artist_list)

    # for each row in the matrix
    for(i in seq_len(nrow(sim_matrix))) {
        # and each column
        for(j in seq_len(ncol(sim_matrix))) {
            # calculate that pair's similarity
            the_index <- similarity_fn(artist_list[[i]],
                                       artist_list[[j]])
            # and store it in the right place in the matrix
            sim_matrix[i,j] <- round(the_index, 2)
        }
    }
    return(sim_matrix)
}


sim_matrix <- similarity_matrix(our_artists_tags, jaccard_index)
print(sim_matrix)

--------------------------------------------------------------


            Kate Bush Peter Tosh Radiohead The Smiths The Cure Black Uhuru
Kate Bush        1.00       0.05      0.31       0.25     0.21        0.04
Peter Tosh       0.05       1.00      0.02       0.03     0.03        0.33
Radiohead        0.31       0.02      1.00       0.31     0.33        0.04
The Smiths       0.25       0.03      0.31       1.00     0.44        0.05
The Cure         0.21       0.03      0.33       0.44     1.00        0.05
Black Uhuru      0.04       0.33      0.04       0.05     0.05        1.00

If you're familiar with some of these bands, you'll no doubt see that the similarity matrix in the preceding output makes a lot of prima facie sense—it looks like our theory is sound!

Notice that the values along the diagonal (from the upper left to the lower right) are all 1. This is because the Jaccard index of two identical sets is always 1, so an artist's similarity with itself is always 1. Additionally, the values are mirrored across the diagonal; whether you look up Peter Tosh and Radiohead by column and then row, or vice versa, the value will be the same (.02). This property means that the matrix is symmetric, which is true of all similarity matrices built with symmetric (commutative) similarity functions.
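Because the matrix is symmetric, roughly half of the similarity calls can be skipped. The sketch below is one possible variant, under the assumption that the similarity function really is symmetric; similarity_matrix_sym is a hypothetical name, and the toy tag vectors are truncated subsets of the real tags, so the values differ from the full output above.

```r
jaccard_index <- function(one, two){
  length(intersect(one, two))/length(union(one, two))
}

# Hypothetical variant: compute each pair once and mirror the value.
# Only valid when similarity_fn(a, b) equals similarity_fn(b, a).
similarity_matrix_sym <- function(artist_list, similarity_fn) {
  num <- length(artist_list)
  sim_matrix <- matrix(0, ncol = num, nrow = num)
  rownames(sim_matrix) <- names(artist_list)
  colnames(sim_matrix) <- names(artist_list)

  for(i in seq_len(num)) {
    # start at the diagonal and walk right
    for(j in i:num) {
      the_index <- round(similarity_fn(artist_list[[i]],
                                       artist_list[[j]]), 2)
      sim_matrix[i, j] <- the_index   # upper triangle (and diagonal)
      sim_matrix[j, i] <- the_index   # mirrored lower triangle
    }
  }
  return(sim_matrix)
}

# Toy data: truncated tag vectors, for illustration only
toy_tags <- list(
  "The Smiths" = c("indie", "80s", "post-punk", "new wave"),
  "The Cure"   = c("post-punk", "new wave", "alternative", "80s"),
  "Peter Tosh" = c("reggae", "roots reggae", "Rasta")
)

toy_matrix <- similarity_matrix_sym(toy_tags, jaccard_index)
print(toy_matrix)

# check the two properties discussed above
isSymmetric(toy_matrix)
all(diag(toy_matrix) == 1)
```

For six artists the saving is negligible, but the number of pairs grows quadratically with the number of artists.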

Note

A similar (and perhaps more common) concept is that of a distance matrix (or dissimilarity matrix). The idea is the same, but now higher values refer to more musically distant pairs of artists. Also, the diagonal will be all zeroes, since an artist is less musically different from themselves than from any other artist. If all the values of a similarity matrix are between 0 and 1 (as is often the case), you can easily make it into a distance matrix by subtracting every element from 1. Subtracting from 1 again will yield the original similarity matrix.
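The conversion described in this note is a one-liner in R, because arithmetic on a matrix is applied element by element. The small matrix below is hand-copied from three of the artists in the earlier output.

```r
# Three-artist excerpt of the similarity matrix shown earlier
artists <- c("The Smiths", "The Cure", "Peter Tosh")
sim <- matrix(c(1.00, 0.44, 0.03,
                0.44, 1.00, 0.03,
                0.03, 0.03, 1.00),
              nrow = 3, byrow = TRUE,
              dimnames = list(artists, artists))

# similarity -> distance: subtract every element from 1
dist_matrix <- 1 - sim
print(dist_matrix)

# the diagonal is now all zeroes
diag(dist_matrix)

# subtracting from 1 again recovers the similarity matrix
all.equal(1 - dist_matrix, sim)
```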

Recommendations can now be furnished, for listeners of one of the bands, by sorting that artist's column in the matrix in descending order. For example, suppose a user likes The Smiths but is unsure what other bands she should try listening to:

# The Smiths are the fourth column
sim_matrix[order(sim_matrix[,4], decreasing=TRUE), 4]

----------------------------------------------

The Smiths    The Cure   Radiohead   Kate Bush Black Uhuru 
       1.00       0.44        0.31        0.25        0.05 
 Peter Tosh 
       0.03

Of course, a recommendation of The Smiths for this user is nonsensical. Going down the list, it looks like a recommendation of The Cure is the safest bet, though Radiohead and Kate Bush may also be fine recommendations. Black Uhuru and Peter Tosh are unsafe bets if all we know about the user is a fondness for The Smiths.
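The lookup can be wrapped in a small function that selects the column by name instead of by position and drops the nonsensical self-recommendation. This is only a sketch: recommend is a hypothetical helper, and the matrix is hand-copied from the output above so that the example runs without any web requests.

```r
artists <- c("Kate Bush", "Peter Tosh", "Radiohead",
             "The Smiths", "The Cure", "Black Uhuru")

# Similarity matrix copied from the earlier output
sim_matrix <- matrix(c(1.00, 0.05, 0.31, 0.25, 0.21, 0.04,
                       0.05, 1.00, 0.02, 0.03, 0.03, 0.33,
                       0.31, 0.02, 1.00, 0.31, 0.33, 0.04,
                       0.25, 0.03, 0.31, 1.00, 0.44, 0.05,
                       0.21, 0.03, 0.33, 0.44, 1.00, 0.05,
                       0.04, 0.33, 0.04, 0.05, 0.05, 1.00),
                     nrow = 6, byrow = TRUE,
                     dimnames = list(artists, artists))

# Hypothetical helper: the n artists most similar to the given one
recommend <- function(sim, artist, n = 3) {
  scores <- sim[, artist]                     # select the column by name
  scores <- scores[names(scores) != artist]   # drop the artist itself
  head(sort(scores, decreasing = TRUE), n)    # keep the top n
}

top_two <- recommend(sim_matrix, "The Smiths", n = 2)
print(top_two)
```

Selecting by name avoids having to remember that The Smiths happen to be the fourth column.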
