Category trend analysis

Foursquare is essentially a database of real-life locations and the interactions of actual users with those locations. How users interact with these venues encapsulates a lot of interesting information, both about the venues and about the users. For our opening analysis of the Foursquare data, we will try to answer some questions about the choices users make in different cities across the world. We will learn how to extract the data relevant to our analysis, how to ask the relevant questions and answer them using visualizations, and lastly, how to fit a usable analytics use case around the data. So let's dive in!

Getting the data – the usual hurdle

We want to get check-in data for some important cities across the globe and then use that data to find out what the category trends are in those cities. Then we will take that data further and try to build a recommender system which will tell us which restaurant category to venture into if we want to make it big in any of those cities.

The first step in any analytical process is identifying the data that we will need to perform the necessary analysis. In our particular case, the data that we will need is the check-in data for all the venues in the identified cities. Once we have this outline of the data required, we will work out how to actually extract that data.

The required end point

The first step in our data extraction process is always identifying the required API end point from the API documentation. After digging through the documentation, we found the following end point to be of interest to us:

https://api.foursquare.com/v2/venues/explore?v=20131016&ll=40.00%2C%20-74.8

This end point will give us the venue information around the latitude and longitude supplied (here latitude = 40.00 and longitude = -74.8). So this is the end point that we can use for getting data about locations in a city. The twist is that how to actually get all the data for a whole city is still a mystery to us.
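To get a feel for this end point, here is a minimal sketch of how such a request could be built from R. The helper name `build_explore_url` and the placeholder credentials are our own illustrative assumptions; Foursquare requires valid API keys, so the actual call (via the httr and jsonlite packages) is left commented out:

```r
# Hypothetical credentials -- replace with your own Foursquare API keys
client_id     <- "YOUR_CLIENT_ID"
client_secret <- "YOUR_CLIENT_SECRET"

# Build the explore URL for a given latitude/longitude pair
build_explore_url <- function(lat, lon, version = "20131016"){
  sprintf(paste0("https://api.foursquare.com/v2/venues/explore",
                 "?v=%s&ll=%.2f,%.2f&client_id=%s&client_secret=%s"),
          version, lat, lon, client_id, client_secret)
}

url <- build_explore_url(40.00, -74.8)
# With the httr and jsonlite packages installed, the call itself would be:
# response <- jsonlite::fromJSON(httr::content(httr::GET(url), as = "text"))
```

Note how the latitude/longitude pair ends up in the `ll` query parameter, exactly as in the end point shown above.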

Getting data for a city – geometry to the rescue

Extracting venue data for a whole city is a tricky problem. As the API will only give us the location data within a circle around a point, the idea must involve tracing that circle across the city and collecting the corresponding data points.

To extract data for a city we will use the following strategy:

  1. Start with the city's central latitude and longitude.
  2. Get the venue details around that center.
  3. Move a radial distance out from the city center to get new centers.
  4. Extract data for these new centers.
  5. Repeat the process a sufficient number of times to cover an approximately large area around the city center.

The scheme described in these steps is visually represented in the following figure. The idea is to start with C1 (city center) and then use it to find C2 (new centers):

City data extraction

So this is the theoretical idea.
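The geometry behind generating new centers can be sketched with a small amount of trigonometry. The function below is a simplified, hypothetical stand-in for the arc-spanning utility used in our extraction: it shifts a center by a fixed radial distance (in kilometers) along a set of bearings, using the equirectangular approximation, which is adequate at city scale. The radius and bearing defaults here are illustrative, not the values used in our actual extraction code:

```r
# Simplified sketch: span an arc of new centers around a (lon, lat) point
span_arc_sketch <- function(center, radius_km = 2,
                            start_deg = 0, end_deg = 315, degree_step = 45){
  lon <- center[1]; lat <- center[2]
  bearings <- seq(start_deg, end_deg, by = degree_step) * pi / 180
  km_per_deg_lat <- 111.32                        # ~km in one degree of latitude
  km_per_deg_lon <- 111.32 * cos(lat * pi / 180)  # shrinks away from the equator
  new_lat <- lat + (radius_km / km_per_deg_lat) * cos(bearings)
  new_lon <- lon + (radius_km / km_per_deg_lon) * sin(bearings)
  cbind(lon = new_lon, lat = new_lat)
}

# Eight candidate centers, one every 45 degrees around the Istanbul city center
centers <- span_arc_sketch(c(28.9784, 41.0082))
nrow(centers)  # 8
```

Each new center can in turn be fed back into the same function, which is exactly how the extraction covers ever larger rings around the city.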

Let's look at the code that will achieve this data extraction. We have written three utility functions for this data extraction:

  • explore_around_point: This is the most basic function in our data extraction. It takes a longitude/latitude pair and returns data about venues within a fixed radius of that center.
  • span_a_arc: This function takes a starting center and spans an arc around it. As output, it generates the next set of candidate centers.
  • get_data_for_points: This function invokes the necessary function to extract data for a collection of centers.
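Putting the pieces together, get_data_for_points can be pictured as a simple loop that calls explore_around_point for each center and row-binds the results. In the sketch below, explore_around_point is replaced by a stub that returns a one-row data frame, so only the control flow is shown; the real function issues the API call and parses the response:

```r
# Stub standing in for the real explore_around_point API call
explore_around_point_stub <- function(center){
  data.frame(lon = center[1], lat = center[2], venue.id = NA)
}

# Sketch of get_data_for_points: extract data for a collection of centers
get_data_for_points_sketch <- function(centers){
  out_df <- data.frame()
  for(i in 1:nrow(centers)){
    # One API call per center, accumulated into a single data frame
    out_df <- rbind(out_df, explore_around_point_stub(centers[i, ]))
  }
  out_df
}

centers <- cbind(lon = c(28.97, 28.99), lat = c(41.00, 41.02))
df <- get_data_for_points_sketch(centers)
nrow(df)  # one row per center with the stub
```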

Armed with these functions, let's see how we can extract data for a particular city, say Istanbul:

#City data extraction for Istanbul
#Get longitude latitude for Istanbul
istanbul_city_center = cbind(28.9784, 41.0082)
colnames(istanbul_city_center) = c("lon", "lat")

#Name the output csv file
file_level2 = "istanbul_final_data.csv"

#Function to repeatedly generate new centres and get data for them.
#Takes the number of levels we want to traverse as an argument.
get_data_for_city <- function(city_center, levels_to_traverse = 2){
 out_df = data.frame()
 # Level 0: data around the city center itself
 df_cent = explore_around_point(city_center)
 out_df = rbind(out_df, df_cent)
 # Level 1: centers spanned around the city center
 new_centers_l1 = span_a_arc(city_center)
 df_l1 = get_data_for_points(new_centers_l1)
 out_df = rbind(out_df, df_l1)
 # Subsequent levels: span a narrower arc around each center of the previous level
 for(j in 1:levels_to_traverse){
  new_centers_lu = c()
  for(i in 1:nrow(new_centers_l1)){
   new_cent = span_a_arc(new_centers_l1[i,], start_deg = -25, end_deg = 25, degree_step = 25)
   new_centers_lu = rbind(new_centers_lu, new_cent)
  }
  df_lu = get_data_for_points(new_centers_lu)
  out_df = rbind(out_df, df_lu)
  new_centers_l1 = new_centers_lu
  # Persist intermediate results so a failed API call does not lose everything
  write.csv(out_df, file = file_level2, row.names = FALSE)
 }
 write.csv(out_df, file = file_level2, row.names = FALSE)
 return(out_df)
}

# Function call for final data collection
final_data_istanbul = get_data_for_city(istanbul_city_center, levels_to_traverse = 2)

Once we execute this snippet of code, we will get the data about Istanbul in a neatly formatted DataFrame. These steps can be repeated for different city centers by supplying the necessary longitudes and latitudes. The extracted DataFrame for the Istanbul data looks like this:

Istanbul data

We note that some venues' data is repeated. This is a side effect of the crude algorithm we have developed: overlapping circles return the same venues more than once. We will fix this in the analysis step.

Analysis – the fun part

Now that we have gone through the nitty-gritty process of data collection, it is time for the fun part: the analysis of the data. As part of our illustrative data collection and analysis process, we have collected data for a total of seven cities. The first step in performing any kind of analysis on all that data is to combine it together (we persisted all the data that we extracted in the previous step).

This can be achieved with the following snippet of code:

# Getting data for all the cities we have extracted data for
city_names <- c("ny", "istanbul", "paris", "la", "seattle", "london", "chicago")
all_city_data_with_category <- data.frame()
for (city in city_names){
    # Read the persisted data for this city
    city_data <- read.csv(file = paste(city, "_final_data.csv", sep = ''), stringsAsFactors = FALSE)
    # Removing duplicated data points
    city_data <- city_data[!duplicated(city_data$venue.id),]
    # Combining with the category ids
    city_data_with_category <- join_city_category(city_data)
    city_data_with_category["cityname"] <- city
    all_city_data_with_category <- rbind(all_city_data_with_category, city_data_with_category)
}

Now we have all the data in the combined DataFrame, all_city_data_with_category. This DataFrame will form the basis of all our future analysis. The customary head function on this DataFrame returns the following information:

All city data

Basic descriptive statistics – the usual

Now that we have our data extracted and processed into a neat tabular format, we can get started with some visualizations to learn more about the categories and their distribution across the cities.

Let's start by finding out which of these cities have the most data of interest to us (throughout our analysis, we will use venue.stats as our metric of interest, as it represents the total check-ins at a venue):

summary_df <- all_city_data_with_category %>% 
    group_by(cityname) %>% 
    summarise(total_checkins = sum(venue.stats),
              city_user = sum(venue.usersCount),
              city_tips = sum(venue.tipCount))
ggplot(summary_df, aes(x=cityname, y=total_checkins)) + 
    geom_bar(stat="identity") +
    ggtitle("City wise check-ins") + 
    theme(plot.title = element_text(lineheight=.8, face="bold", hjust = 0.5))

The preceding code snippet produces the distribution of venue.stats for each of the cities. The bar graph gives us important information about which city is the most popular when it comes to total check-in counts:

Total check-ins

With a single look at the graph, we get our first little hint of an insight: Istanbul has the most check-ins after New York, which straight away seems at odds with the general perception. It highlights the importance of making data-based decisions instead of perception-based ones: analytics 101 in our first plot.

Now let's find out how these cities fare in terms of the major categories that we have. This will tell us which city has the most widespread representation across the total categories, which we can take as indicative of how diverse the city is. Going by the general perception, we would expect New York to be one of the most diverse cities. Let's find out whether the data supports this perception:

# Total category count for different cities
cat_rep_summary <- all_city_data_with_category %>% 
    group_by(cityname) %>% 
    summarise(city_category_count = n_distinct(cat_name))
ggplot(cat_rep_summary, aes(x=cityname, y=city_category_count)) +  
    geom_bar(stat="identity") +
    ggtitle("Total categories for each city") + 
    theme(plot.title = element_text(lineheight=.8, face="bold", hjust = 0.5))

We get the following graph as the output of the preceding code. This graph tells us which city offers the most diverse category choices:

Total categories represented in cities

So our perception in this case is backed up by the data that we have: New York is indeed the city with the most diverse range of categories.

Now we want to focus on this data a bit; we want to see how the total categories represented in a city are distributed across the various major categories that Foursquare provides. We will plot the percentage distribution of total check-ins across all the major categories and try to come up with a visual representation of the most dominant category for each city:

# Distribution of city check-ins across the major categories
# Note: summarise() drops the innermost grouping (super_cat_name), so
# mutate() computes each category's percentage within its city
super_cat_summary_detail <- all_city_data_with_category %>% 
    group_by(cityname, super_cat_name) %>% 
    summarise(city_category_count = sum(venue.stats)) %>%
    mutate(category_percentage = city_category_count/sum(city_category_count))

# For brevity we will only plot the two cities with the most check-ins
p5 <- ggplot(subset(super_cat_summary_detail, cityname %in% c("ny", "istanbul")),
             aes(x=super_cat_name, y=category_percentage))
p5 + geom_bar(stat="identity") +
    theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5)) +
    facet_wrap(~cityname, ncol = 1) +
    ggtitle("Category distribution for NY and Istanbul") +
    theme(plot.title = element_text(lineheight=.8, face="bold", hjust = 0.5))

The graph generated by the code is given here. This graph informs us about the differences/similarities that exist between the two cities we have selected:

Category distribution for NY and Istanbul

A detailed inspection of the accompanying plot again reveals a lot of information. Food is the dominant major category in both cities, which is not really a huge surprise.

From the plot, you can see that Istanbul is a city easily associated with Arts & Entertainment and Travel & Transport, whereas New York is the city for Nightlife Spot and Shop & Service. You can also easily spot that Istanbul has no representation in the College & Universities category. So either the students are not checking in, or they are not really attending the educational institutes; this is a question the present data cannot answer. Analytics can only deduce insights for which we have accompanying data.

Before we move on to build a recommendation engine on top of our categories data, we will draw one more plot. Until now we have been focusing only on the super categories, but those are umbrella categories. Now we want to see how the breakup looks among the top child categories.

For drawing this plot, we will do a city-wise summarization on the category name using the following code snippet:

# Top 5 category distribution for each city
cat_summary_detail <- all_city_data_with_category %>% 
    group_by(cityname, cat_name) %>% 
    summarise(city_category_count = sum(venue.stats)) %>% 
    mutate(category_percentage = city_category_count/sum(city_category_count)) %>%
    top_n(5, category_percentage)
p5 <- ggplot(cat_summary_detail, aes(x=cat_name, y=category_percentage))
p5 + geom_bar(stat="identity") + ylab("Check-in percentage") + xlab("Categories") +
    theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5)) +
    facet_wrap(~cityname, ncol = 1) +
    ggtitle("Category distribution for all cities") + 
    theme(plot.title = element_text(lineheight=.8, face="bold", hjust = 0.5))

The resultant plot of this code section highlights the top five categories for each city. This plot again throws up some obvious answers and some puzzling information. Spend a moment studying the plot:

Category distribution for all cities

The most obvious piece of information from the plot is that only one category is common across all the cities, which is, no prizes for guessing, Bar. It highlights the universality of alcohol, even with Istanbul in the mix. The second highlight is the prominence of the Transport category for New York and Chicago, which on further investigation can be traced to their iconic stations. The next genuine surprise comes in the form of the most dominant category in Istanbul, which is a road; when you look inside the data, you find that one of the most important tourist attractions in Istanbul is İstiklal Avenue.

This section helped us uncover a lot of surprising and some obvious information using just simple bar plots, and it establishes an important lesson of data analytics:

Note

Never underestimate the power of basic descriptive statistics.
