Foursquare is essentially a database of real-life locations combined with a record of how actual users interact with those locations. These interactions encapsulate a lot of interesting information, both about the venues and the users. For our opening analysis of the Foursquare data, we will try to answer some questions about the choices of users in different cities across the world. We will learn how to extract data relevant to our analysis, how to ask the relevant questions and answer them using visualizations, and lastly, how to fit a usable analytics use case around the data. So let's dive in!
We want to get check-in data for some important cities across the globe and then use that data to find out what the category trends are in those cities. Then we will proceed further with that data and try to build a recommender system which will tell us which restaurant category to venture into if we want to make it big in any of those cities.
The first step in any analytical process is identifying the data that we will need to perform the necessary analysis. In our particular case, the data that we will need is the check-in data for all the venues in the identified cities. Once we have this outline of the data required, we will check how to actually extract that data.
The first point in our data extraction process is always identifying the required API end point using the API documentation. After digging through the API documentation, we found the following end point to be of interest to us:
https://api.foursquare.com/v2/venues/explore?v=20131016&ll=40.00%2C%20-74.8
This end point will give us the venue information around the latitude and longitude supplied (here latitude = 40.00 and longitude = -74.8). So this is the end point that we can use for getting data about locations of a city. The catch is that a single call only covers the area around one point, so how to get all the data for a whole city is still an open question.
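As a minimal sketch of how such a request URL could be assembled in R, the following snippet builds the same URL from a latitude and longitude pair. The helper name build_explore_url is hypothetical and is not one of the utility functions used later. A real request would also need the credentials that the API requires; they are left out here, as they are in the URL shown above.

```r
# Hypothetical helper (not from the original text): builds the explore
# end point URL for a latitude/longitude pair.
build_explore_url <- function(lat, lon, version = "20131016") {
  # The ll parameter is "lat, lon"; URL-encode the comma and the space
  ll <- utils::URLencode(paste(lat, lon, sep = ", "), reserved = TRUE)
  paste0("https://api.foursquare.com/v2/venues/explore",
         "?v=", version, "&ll=", ll)
}

# Passing the coordinates as strings keeps the formatting of the example URL
explore_url <- build_explore_url("40.00", "-74.8")
print(explore_url)
```

The version string is kept as a parameter so that the same helper can be reused if a different API version date is needed.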
Extracting venue data for a whole city is tricky. As the API will only give you the location data for a circular area around a point, the idea must involve moving that circle across the city and collecting the corresponding data points.
To extract data for a city we will use the following strategy:
The scheme described in these steps is visually represented in the following figure. The idea is to start with C1 (city center) and then use it to find C2 (new centers):
So this is the theoretical idea.
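To make the idea of generating new centers concrete, here is a rough sketch of one way candidate centers could be placed on an arc around an existing center. This is not the implementation of the utility functions described in this section: the function name span_arc_sketch, the radius_km parameter, and the flat-earth approximation of about 111.32 km per degree of latitude are all assumptions made for illustration.

```r
# Hypothetical sketch: place candidate centers on an arc around a center.
# center is a (lon, lat) pair; angles are measured in degrees.
span_arc_sketch <- function(center, radius_km = 1, start_deg = 0,
                            end_deg = 300, degree_step = 60) {
  lon <- as.numeric(center[1])
  lat <- as.numeric(center[2])
  angles <- seq(start_deg, end_deg, by = degree_step) * pi / 180
  # Approximation: one degree of latitude is about 111.32 km, and a
  # degree of longitude shrinks with the cosine of the latitude
  dlat <- (radius_km / 111.32) * sin(angles)
  dlon <- (radius_km / (111.32 * cos(lat * pi / 180))) * cos(angles)
  cbind(lon = lon + dlon, lat = lat + dlat)
}

# Six candidate centers around the Istanbul city center
candidate_centers <- span_arc_sketch(c(28.9784, 41.0082))
print(candidate_centers)
```

Each row of the result is a new (lon, lat) center whose surrounding circle can be explored in turn. Calling the sketch with start_deg = -25, end_deg = 25 and degree_step = 25 gives three centers fanning out in one direction.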
Let's look at the code that will achieve this data extraction. We have written three utility functions for this data extraction:
explore_around_point: This is the most basic function for our data extraction. It takes a longitude, latitude pair and gives data about that center in a fixed radius.
span_a_arc: This function takes the starting center and then spans an arc around that center. As an output, it generates the next pair of candidate centers.
get_data_for_points: This function invokes the necessary function to extract data for a collection of centers.
Armed with these functions, let's see how we can extract data for a particular city, say Istanbul:
# City data extraction for Istanbul

# Get longitude latitude for Istanbul
istanbul_city_center = cbind(28.9784, 41.0082)
colnames(istanbul_city_center) = c("lon", "lat")

# Name the output csv file
file_level1 = "istanbul_data_l1.csv"
file_level2 = "istanbul_final_data.csv"

# Function to repeatedly generate the new centres and get data for these centres
# Takes the number of levels we want to traverse as an argument
get_data_for_city <- function(city_center, levels_to_traverse = 2){
  # Data for the city center and for the first ring of centres around it
  new_centers_l1 = span_a_arc(city_center)
  out_df = data.frame()
  df_cent = explore_around_point(city_center)
  out_df = rbind(out_df, df_cent)
  df_l1 = get_data_for_points(new_centers_l1)
  out_df = rbind(out_df, df_l1)
  # Running count of the new centres explored
  sum_points = 0
  for(j in 1:levels_to_traverse){
    new_centers_lu = c()
    df_l2 = data.frame()
    # Span a narrow arc around every centre of the previous level
    for(i in 1:nrow(new_centers_l1)){
      new_cent = span_a_arc(new_centers_l1[i,], start_deg = -25,
                            end_deg = 25, degree_step = 25)
      new_centers_lu = rbind(new_centers_lu, new_cent)
    }
    # Get data for the new centres and make them the starting points
    # of the next level
    df_l2 = get_data_for_points(new_centers_lu)
    sum_points = sum_points + nrow(new_centers_lu)
    new_centers_l1 = new_centers_lu
    out_df = rbind(out_df, df_l2)
    # Persist the data collected so far
    write.csv(out_df, file = file_level2, row.names = FALSE)
  }
  write.csv(out_df, file = file_level2, row.names = FALSE)
  return(out_df)
}

# Function call for final data collection
final_data_istanbul = get_data_for_city(istanbul_city_center, levels_to_traverse = 2)
Once we execute this snippet of code, we will get the data about Istanbul in a neatly formatted DataFrame. These steps can be repeated for different city centers by supplying the necessary longitudes and latitudes. The extracted DataFrame for the Istanbul data looks like this:
We note that some venues' data have been repeated. This is the side effect of the crude algorithm that we have developed. We will fix it up in the analysis step.
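The repeats arise because the circles around neighbouring centers overlap, so the same venue can be returned by more than one call. As a small illustration of how such repeats can be dropped on the venue.id column, consider the following sketch. The rows and the venue.name column are made up for the example.

```r
# Made-up rows: venues a1 and b2 were returned by two overlapping circles
toy_venues <- data.frame(
  venue.id = c("a1", "b2", "a1", "c3", "b2"),
  venue.name = c("Cafe A", "Bar B", "Cafe A", "Museum C", "Bar B"),
  stringsAsFactors = FALSE
)

# duplicated() flags every repeat of an id after its first occurrence,
# so negating it keeps exactly one row per venue
unique_venues <- toy_venues[!duplicated(toy_venues$venue.id), ]
print(unique_venues)
```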
Now that we have gone through the nitty gritty process of data collection, it is time for the fun part: the analysis of the data. As part of our illustrative data collection and analysis process, we have collected data for a total of seven cities. The first step in performing any kind of analysis on all that data is to combine the data together (we had persisted all the data that we extracted in the previous step).
This can be achieved with the following snippet of code:
# Getting data for all the cities we have extracted data for
city_names <- c("ny", "istanbul", "paris", "la", "seattle", "london", "chicago")
all_city_data_with_category <- data.frame()

# Combine all cities' data and remove duplicates
for (city in city_names){
  city_data <- read.csv(file = paste(city, "_final_data.csv", sep = ''),
                        stringsAsFactors = FALSE)
  # Removing duplicated data points
  city_data <- city_data[!duplicated(city_data$venue.id),]
  # Combining with the category ids
  city_data_with_category <- join_city_category(city_data)
  city_data_with_category["cityname"] <- city
  all_city_data_with_category <- rbind(all_city_data_with_category,
                                       city_data_with_category)
}
Now we have all the data in the combined DataFrame, all_city_data_with_category. This DataFrame will form the basis of all our future analysis. The customary head function on this DataFrame returns the following information:
Now that we have our data all extracted and processed in a neat tabular format, we can get started with some visualizations to get some more knowledge about the categories and their distribution across all the cities.
Let's start by finding out which cities among these have the most data of interest to us (all through our analysis, we will use the venue.stats as our metric of interest as it represents the total check-in at the venue):
# Packages used for summarization and plotting
library(dplyr)
library(ggplot2)

# Total check-ins, users and tips for each city
summary_df <- all_city_data_with_category %>%
  group_by(cityname) %>%
  summarise(total_checkins = sum(venue.stats),
            city_user = sum(venue.usersCount),
            city_tips = sum(venue.tipCount))

# Bar plot of the total check-ins per city
ggplot(summary_df, aes(x=cityname, y=total_checkins)) +
  geom_bar(stat="identity") +
  ggtitle("City wise check-ins") +
  theme(plot.title = element_text(lineheight=.8, face="bold",hjust = 0.5))
The preceding code snippet produces the distribution of venue.stats for each of the cities. The bar graph gives us important information about which city is the most popular when it comes to total check-in counts:
With a single look at the graph, we have our first hint of an insight. Istanbul is the city with the most check-ins after New York, which straight away seems different from the general perception. It highlights the importance of making data-based decisions instead of perception-based decisions: analytics 101 in our first plot.
Now let's find out how we are doing in these cities in terms of the major categories that we have. This will tell us which city has the most widespread representation in terms of the total categories, so we can assume this to be indicative of being a diverse city. Going by the general perception, we would expect New York to be one of the most diverse cities. Let's find out if the data supports this perception or not:
# Total category count for different cities
cat_rep_summary <- all_city_data_with_category %>%
  group_by(cityname) %>%
  summarise(city_category_count = n_distinct(cat_name))

# Bar plot of the number of distinct categories per city
ggplot(cat_rep_summary, aes(x=cityname, y=city_category_count)) +
  geom_bar(stat="identity") +
  ggtitle("Total categories for each city") +
  theme(plot.title = element_text(lineheight=.8, face="bold",hjust = 0.5))
We get the following graph as the output of the preceding code. This graph tells us which city offers the most diverse category choices:
So our perception in this case is backed up by the data that we have. New York indeed is the city with the most diverse range of categories.
Now we want to focus on this data a bit; we want to see how the total categories represented in a city are distributed across the various major categories that Foursquare provides. We will plot the percentage distribution of total check-ins across all the major categories and try to come up with a visual representation of the most dominant category for each city:
# Distribution of city check-ins across the major categories
# After summarise() the data is still grouped by city, so the
# percentages are computed within each city
super_cat_summary_detail <- all_city_data_with_category %>%
  group_by(cityname, super_cat_name) %>%
  summarise(city_category_count = sum(venue.stats)) %>%
  mutate(category_percentage = city_category_count/sum(city_category_count))

# For brevity we will only plot the two cities with most check-ins
p5 <- ggplot(subset(super_cat_summary_detail, cityname %in% c("ny", "istanbul")),
             aes(x=super_cat_name, y=category_percentage)) +
  geom_bar(stat="identity") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
  facet_wrap(~cityname, ncol = 1) +
  ggtitle("category distribution for NY and Istanbul") +
  theme(plot.title = element_text(lineheight=.8, face="bold",hjust = 0.5))

# Display the plot
p5
The graph generated by the code is given here. This graph informs us about the differences/similarities that exist between the two cities we have selected:
A detailed inspection of the accompanying plot again reveals a lot of information. Food is the dominant major category in both cities, which is not really a huge surprise.
From the plot, you can determine that Istanbul is a city you can easily associate with Arts & Entertainment and Travel & Transport, whereas New York is the city for Nightlife Spot and Shop & Service. You can also easily spot that Istanbul has no representation in the College & Universities category. So either the students are not checking in or they are not really attending the educational institutes. This is a question the present data cannot answer. Analytics can only deduce insights for which we have accompanying data.
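The percentages compared here are shares within each city, not shares of all check-ins across cities. A minimal base R sketch of that computation, on invented check-in totals rather than the extracted data, might look like this:

```r
# Invented check-in totals, for illustration only
toy <- data.frame(
  cityname = c("ny", "ny", "ny", "istanbul", "istanbul"),
  super_cat_name = c("Food", "Nightlife Spot", "Shop & Service",
                     "Food", "Travel & Transport"),
  checkins = c(50, 30, 20, 60, 40),
  stringsAsFactors = FALSE
)

# ave() computes the sum within each city and returns it once per row,
# so the division gives each category's share of its own city's total
toy$category_percentage <- toy$checkins /
  ave(toy$checkins, toy$cityname, FUN = sum)
print(toy)
```

Because each city's shares sum to one, the bars of two cities with very different check-in volumes can be compared directly.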
Before we move on to build a recommendation engine on top of our categories data, we will draw one more plot. Until now we have been focusing on the super categories only, but those are the umbrella categories. Now we want to see how the check-ins break down among the top child categories.
For drawing this plot, we will do a city-wise summarization on the category name using the following code snippet:
# Top 5 category distribution for each city
# top_n() is told explicitly to rank by the check-in percentage
cat_summary_detail <- all_city_data_with_category %>%
  group_by(cityname, cat_name) %>%
  summarise(city_category_count = sum(venue.stats)) %>%
  mutate(category_percentage = city_category_count/sum(city_category_count)) %>%
  top_n(5, category_percentage)

p5 <- ggplot(cat_summary_detail, aes(x=cat_name, y=category_percentage)) +
  geom_bar(stat="identity") +
  ylab("Check-in percentage") +
  xlab("Categories") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) +
  facet_wrap(~cityname, ncol = 1) +
  ggtitle("category distribution for all cities") +
  theme(plot.title = element_text(lineheight=.8, face="bold",hjust = 0.5))

# Display the plot
p5
The resultant plot of this code section highlights the top five categories of all the cities. This plot again throws up some obvious answers and some puzzling information. Spend a moment studying the plot:
The most obvious information from the plot is that only one category is common across all the cities which is, no prize for guessing, Bar. It highlights the universality of alcohol, even when we have Istanbul in the mix. The second highlight is the prominence of the Transport category for New York and Chicago, which when investigated further can be traced to their iconic stations. The next genuine surprise comes in the form of the most dominant category in Istanbul, which is a road, but when you look inside the data you find out one of the most important tourist attractions in Istanbul is İstiklal Avenue.
This section helps us in uncovering a lot of surprises and obvious information using just simple bar plots and hence it helps us establish an important caveat of data analytics: