Venue graph – where do people go next?

Our next use case on Foursquare data is geared more towards creative data extraction. We will demonstrate how combining a little creativity with the basic data can give rise to unusual datasets. The base Foursquare data is not really suited to extracting a graph-based dataset. A close examination of the APIs, however, reveals an endpoint that returns the next five venues people go to from any given venue. This can be combined with a graph search algorithm, such as a depth-first search, to build a graph in which each venue is linked to the venues people are likely to visit next.

To extract this data, we will use three utility functions:

  • extract_venue_details: This function will get us the venue details of each venue occurring in our traversal
  • extract_next_venue_details: This function will get us information about the next five venues to which users go from a particular venue
  • extract_dfs_data: This is an implementation of a depth-first search in R, which takes a vector of seed venues and the depth to which to search for new venues
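
The implementation of extract_dfs_data is not shown in this section. As a rough illustration of the idea only, here is a minimal sketch of a depth-limited, depth-first crawl. The names crawl_dfs, get_next_venues, and toy_next_venues are hypothetical, and a small in-memory list stands in for the Foursquare endpoint so that the example runs without API access:

```r
# Toy stand-in for the "next venues" endpoint (an assumption for illustration):
# a named list mapping a venue id to the ids people visit next.
# A real crawl would query the API here instead.
toy_next_venues <- list(
  A = c("B", "C"),
  B = c("C", "D"),
  C = c("A"),
  D = character(0)
)

# Hypothetical lookup; returns an empty vector for unknown venues.
get_next_venues <- function(venue_id) {
  nxt <- toy_next_venues[[venue_id]]
  if (is.null(nxt)) character(0) else nxt
}

# Depth-first crawl from a vector of seed venues, expanding venues found at
# depths 0 .. max_depth - 1. Returns one row per (NodeFrom, NodeTo) link.
crawl_dfs <- function(seeds, max_depth, next_fun = get_next_venues) {
  visited <- character(0)
  from <- character(0)
  to <- character(0)

  visit <- function(venue_id, depth) {
    # Stop on venues already expanded or when the depth limit is reached
    if (venue_id %in% visited || depth >= max_depth) return(invisible(NULL))
    visited <<- c(visited, venue_id)
    for (nxt in next_fun(venue_id)) {
      from <<- c(from, venue_id)
      to <<- c(to, nxt)
      visit(nxt, depth + 1)  # go deeper before moving to the next sibling
    }
  }

  for (seed in seeds) visit(seed, 0)
  data.frame(NodeFrom = from, NodeTo = to, stringsAsFactors = FALSE)
}

edges <- crawl_dfs(c("A"), max_depth = 3)
print(edges)
```

The visited vector keeps the crawl from expanding the same venue twice, which matters here because popular venues tend to point back at each other.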

For the data extraction, we will use three starting points in New York: John F. Kennedy Airport, Central Park, and the Statue of Liberty. From these starting points, we will use our utility functions to arrive at two DataFrames and persist them:

  • Edge_list: This DataFrame will contain edges from the source venue to the next venue
  • Venue_detail: This DataFrame will contain details about each of the nodes that occur in our edge list

The crawl itself is started as follows:

# Graph data extraction
# Venue IDs are placeholders; note that they must not contain stray whitespace
jfk_id = "xxxxxxxxxxxxxxxxxxxxxxxx"
statue_ofLiberty_id = "xxxxxxxxxxxxxxxxxxxxxxxx"
central_park_id = "xxxxxxxxxxxxxxxxxxxxxxxx"
venues_to_crawl = c(jfk_id, statue_ofLiberty_id, central_park_id)
extract_dfs_data(venues_to_crawl, depth = 10)

Once we have extracted the graph data, we can load it and enhance it by finding the distance between the two nodes of each link. This information can serve as the edge weight in any graph-based analysis that we perform. So we will load the two DataFrames that we persisted, perform a little cleanup, and then call our utility function to find the distance between each pair of nodes in the edge list:

# Load data sets
library(readr)  # provides read_delim
edges_list_final <- read_delim("edges_list_final.csv", ";", escape_double = FALSE, trim_ws = TRUE)
venue_details_final <- read_delim("venue_details_final.csv", ",", escape_double = FALSE, trim_ws = TRUE)
# Drop duplicate venue records
venue_details_final <- venue_details_final[!duplicated(venue_details_final$venue.id), ]
# Clean up the edge list to remove nodes for which information was not extracted
edges_list_final <- edges_list_final[edges_list_final$NodeFrom %in% venue_details_final$venue.id, ]
edges_list_final <- edges_list_final[edges_list_final$NodeTo %in% venue_details_final$venue.id, ]
# Add the distance between the two venues of each edge
edges_list_final$distance <- apply(edges_list_final, 1, get_distances_between_nodes)
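
The helper get_distances_between_nodes is not listed in this section. One common way to compute a straight-line distance from latitude and longitude is the haversine formula; the following is a minimal sketch under that assumption, not necessarily what the helper does. The function name haversine_km and the coordinates used in the example are illustrative approximations:

```r
# Great-circle distance in kilometres between two points given in decimal
# degrees, using the haversine formula on a sphere of radius radius_km.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Approximate, illustrative coordinates for JFK Airport and Central Park
jfk_to_central_park <- haversine_km(40.6413, -73.7781, 40.7829, -73.9654)
print(jfk_to_central_park)
```

Because the function is vectorised, it could be applied to whole columns of coordinates at once, once the latitude and longitude of both venues have been joined onto the edge list.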

The distance that we add between two nodes is not the typical driving distance between them. We use the longitude and latitude of the venues to find the distance as the crow flies. With the distance information added, we can straight away do a little descriptive analysis. We can find the venues in New York that Foursquare users most often go to next after visiting another venue, along with the average distance that users cover to reach them. The following snippet finds the 10 venues that most often end up being a target node:

# Most visited locations in New York
library(dplyr)

prominent_next_venues <- edges_list_final %>%
  group_by(NodeTo) %>%
  summarise(avg_distance_from_last_venue = mean(distance),
            num_targets = n()) %>%
  arrange(desc(num_targets)) %>%
  top_n(10, num_targets)  # rank explicitly by how often a venue is a target
colnames(prominent_next_venues)[1] <- "venue.id"

# Attach the venue names from the venue details
prominent_next_venues <- prominent_next_venues %>%
  inner_join(venue_details_final, by = "venue.id") %>%
  select(venue.id, venue.name, avg_distance_from_last_venue, num_targets)

The result is not very surprising: the popular places of New York are the ones people usually go to from other venues. The resulting DataFrame is as follows:

Top venues in New York

We learned a great deal about analyzing graph data in the previous chapter, so we will not repeat all of that analysis on this dataset. The point of extracting a graph dataset out of Foursquare was to illustrate how we can get imaginative with our data extraction process as well. This is the kind of creativity that makes social media analytics an exciting field of study.

Note

We have skipped a great deal of analysis that can be done on the extracted venue graph data. The data is in the same format that we used in the previous chapters, so readers are encouraged to replicate the results obtained there on the current dataset. This will give you a real flavor of working on some of these problems.
