In Chapter 2, we ingested On-time Performance Data from the US Bureau of Transportation Statistics (BTS) so as to be able to model the arrival delay given various attributes of an airline flight—the purpose of the analysis is to cancel a meeting if the probability of the flight arriving within 15 minutes of the scheduled arrival time is less than 70%.
Before we delve into building statistical and machine learning models, it is important to explore the dataset and gain an intuitive understanding of the data—this is called exploratory data analysis , and it’s covered in more detail in Chapter 5. You should always carry out exploratory data analysis for any dataset that will be used as the basis for decision making. In this chapter, though, I talk about a different aspect of depicting data—of depicting data to end users and decision makers so that they can understand the recommendation that you are making. The audience of these visual representations, called dashboards , that we talk about in this chapter are not other data scientists, but are instead the end users. Keep the audience in mind as we go through this chapter, especially if you come from a data science background—the purpose of a dashboard is to explain an existing model, not to develop it. A dashboard is an end-user report that is interactive, tailored to end users, and continually refreshed with new data. See Table 3-1.
For Decision Makers | For Data scientists | |
Usage pattern | Dashboards | Exploratory data analysis |
Kinds of depictions | Current status, pie charts, trendlines | Model fits with error bars, kernel density estimates |
What does it explain? | Model recommendations and confidence | Input data, feature importance, model performance, etc. |
Data represented | Subset of dataset, tailored to user’s context | Aggregate of historical data |
Typical tools | Data Studio, Tableau, Qlik, Looker, etc. | Jupyter, python, R Studio, S-plus, matplotlib, seaborn, Matlab, etc. |
Mode of interaction | GUI-driven | Code-driven |
Update | Real time | Not real time |
Covered in | Chapter 3, Chapter 4 | Chapter 5 |
Example | From AAA fuel-gauge report | From AAA safety and educational foundation |
Very often, this step of creating end-user visual depictions goes by the anodyne name of “visualization,” as in visualizing data. However, I have purposefully chosen not to call it by that name, because there is more to this than throwing together a few bar graphs and charts. Dashboards are highly visual, interactive reports that have been designed to depict data and explain models.
All of the code snippets in this chapter are available in the GitHub repository at https://github.com/GoogleCloudPlatform/data-science-on-gcp/ in the folder 03_sqlstudio. See the README.md file in that directory for instructions on how to do the steps described in this chapter.
The purpose of this step in the modeling process is not simply to depict the data, but to improve your users’ understanding of how the model behaves. Whenever you are designing the display of a dataset, evaluate the design in terms of three aspects:
Does it accurately and honestly depict the data? This is important when the raw data itself can be a basis for decision making.
How well does it help envision not just the raw data, but the information content embodied in the data? This is crucial for the cases when you are relying on human pattern recognition and interaction to help reveal insights about the environment in which the data was collected.
Is it constructed in such a way that it explains the model being used to provide recommendations?
You want to build displays that are always accurate and honest. At the same time, the displays need to be interactive so as to provide viewers with the ability to play with the data and gain insights. Insights that users have gained should be part of the display of that information going forward in such a way that those insights can be used to explain the data.
The last point, that of explanatory power, is very important. The idea is to disseminate data understanding throughout your company. A statistical or machine learning model that you build for your users will be considered a black box, and while you might get feedback on when it works well and when it doesn’t, you will rarely get pointed suggestions on how to actually improve that model in the field. In many cases, your users will use your model at a much more fine-grained level than you ever will, because they will use your model to make a single decision, whereas in both training and evaluation, you would have been looking at model performance as a whole.
Although this holistic overview is useful for statistical rigor, you need people taking a close look at individual cases, too. Because users are making decisions one at a time, they are analyzing the data one scenario at a time. If you provide your users not just with your recommendation, but with an explanation of why you are recommending it, they will begin to develop insights into your model. However, your users will only be able to develop such insights into the problem and your recommendations if you give them ways to observe the data that went into your model. Give enough users ways to view and interact with your data, and you will have unleashed a never-ending wave of innovation.
Your users have other activities that require their attention. Why would they spend their time looking at your data? One of the ways to entice them to do that is by making the depiction of the information compelling. In my experience, 1 the most compelling displays are displays of real-time information in context. You can show people the average airline delay at JFK on January 12, 2012, and no one will care. But show a traveler in Chicago the average airline delay at ORD, right now and you will have their interest—the difference is that the data is in context (O’Hare Airport, or ORD, for a traveler in Chicago) and that it is real time.
In this chapter, therefore, we will look at building dashboards that combine accurate depictions with explanatory power and interactivity in a compelling package. This seems to be a strange time to be talking about building dashboards—shouldn’t the building of a dashboard wait until after we have built the best possible predictive model?
Building a dashboard when building a machine learning model is akin to building a form or survey tool to help you build the machine learning model. To build powerful machine learning models, you need to understand the dataset and devise features that help with prediction. By building a dashboard, you get to rope in the eventual users of the predictive model to take a close look at your data. Their fine-grained look at the data (remember that everyone is looking at the data corresponding to their context) will complement your more overarching look at it. As they look at the data and keep sending suggestions and insights about the data to your team, you will be able to incorporate them into the machine learning model that you actually build.
In addition, when presented with a dataset, you should be careful that the data is the way you imagine it to be. There is no substitute for exposing and exploring the data to ensure that. Doing such exploratory data analysis with an immediately attainable milestone—building a dashboard from your dataset—is a fine way to do something real with the data and develop awareness of the subtleties of your data. Just as you often understand a concept best when you explain it to someone, building an explanatory display for your data is one of the best ways to develop your understanding of a dataset. The fact that you have to visualize the data in order to do basic preprocessing such as outlier detection makes it clear that building visual representations is work you will be doing anyway. If you are going to be doing it, you might as well do it well, with an eye toward its eventual use in production.
Eventual use in production is the third reason why you should develop the dashboard first instead of leaving it as an afterthought. Building explanatory power should be constantly on your mind as you develop the machine learning model. Giving users just the machine learning model will often go down poorly—they have no insight into why the system is recommending whatever it does. Adding explanations to the recommendations is more likely to succeed. For example, if you accompany your model prediction with five of the most salient features presented in an explanatory way, it will help make the model output more believable and trustworthy. 2
Even for cases for which the system performs poorly, you will receive feedback along the lines of “the prediction was wrong, but it is because Feature #3 was fishy. I think maybe you should also look at Factor Y.” In other words, shipping your machine learning model along with an explanation of its behavior gets you more satisfied users, and users whose criticism will be a lot more constructive. It can be tempting to ship the machine learning model as soon as it is ready, but if there is a dashboard already available (because you were building it in parallel), it is easier to counsel that product designers consider the machine learning model and its explanatory dashboard as the complete product.
Where should these dashboards be implemented? Find out the environment that gets the largest audience of experts and eventual users and build your dashboard to target that environment.
Your users might already have a visualization interface with which they are familiar. Especially when it comes to real-time data, your users might spend their entire work-day facing a visualization program that is targeted toward power users—this is true of weather forecasts, air traffic controllers, and options traders. If that is the case, look for ways to embed your visualizations into that interface. In other cases, your users might prefer that your visualizations be available from the convenience of their web browser. If this is the case, look for a visualization tool that lets you share the report as an interactive, commentable document (not just a static web page). In many cases, you might have to build multiple dashboards for different sets of users (don’t shoehorn everything into the same dashboard).
Because the explanatory power of a good dashboard is why we are building visualizations, it is important to ensure that our explanations are not misleading. In this regard, it is best not to do anything too surprising. Although modern-day visualization programs are chock-full of types of graphs and palettes, it is best to pair any graphic with the idiom for which it is appropriate. For example, some types of graphics are better suited to relational data than others, and some graphics are better suited to categorical data than to numerical data.
Broadly, there are four fundamental types of graphics: relational (illustrating the relationship between pairs of variables), time–series (illustrating the change of a variable over time), geographical maps (illustrating the variation of a variable by location), and narratives (to support an argument). Narrative graphics are the ones in magazine spreads, which win major design awards. The other three are more worker-like.
You have likely seen enough graphical representations to realize intuitively that the graph is somehow wrong when you violate an accuracy, honesty, or aesthetic principle,3 but this section of the book lists a few of the canonical ones. For example, it is advisable to choose line graphs or scatter plots for relational graphics, and to ensure that autoscaling of the axes doesn’t portray a misleading story about your data. A good design principle is that your time–series graphs should be more horizontal than vertical, and that it is the data lines and not the graph’s accoutrements (grid lines, labels, etc.) that ought to dominate your graphics. Maximizing the ratio of data to space and ink is a principle that will stand you in good stead when it comes to geographical data—ensure that the domain is clipped to the region of interest, and go easy on place names and other text labels.
Just as you probably learned to write well by reading good writers, one of the best ways to develop a feel for accurate and compelling graphics is to increase your exposure to good exemplars. The Economist newspaper4 has a Graphic Detail blog that is worth following—they publish a chart, map, or infographic every weekday, and these run the gamut of the fundamental graphics types. Figure 3-1 shows a graphic from the blog.5
The graphic depicts the increase in the number of coauthors on scientific papers over the past two decades. The graphic itself illustrates several principles of good design. It is a time–series, and as you’d expect of this type of graphic, the time is on the horizontal axis and the time-varying quantity (number of authors per article or the number of articles per author) is on the vertical axis. The vertical axis values start out at zero, so that the height of the graphs is an accurate indicator of magnitude. Note how minimal the chart junk is—the axes labels and gridlines are very subtle and the title doesn’t draw attention to itself. The data lines, on the other hand, are what pop out. Note also the effective use of repetition—instead of all the different disciplines (Economics, Engineering, etc.) being on the same graph, each discipline is displayed on its own panel. This serves to reduce clutter and makes the graphs easy to interpret. Each panel has two graphs, one for authors per article and the other for articles per author. The colors remain consistent across the panels for easy comparison, and the placement of the panels also encourages such comparisons. We see, for example, that the increase in number of authors per article is not accompanied by an increase in articles per author in any of the disciplines, except for Physics & Astronomy. Perhaps the physicists and astronomers are gaming the system?
The graphic does, however, subtly mislead viewers who are in a hurry. Take a moment and try to critique the graphic—figure out how a viewer might have been misled. It has to do with the arrangement of the panels. It appears that the creator of the graphic has arranged the panels to provide a pleasing upward trend between the panels, but this upward trend is misleading because there is no relationship between the number of authors per article in Economics in 2016 and the same quantity in Engineering in 1996. This misdirection is concerning because the graph is supposed to support the narrative of an increasing number of authors, but the increase is not from one author to six authors over two decades—the actual increase is much less dramatic (for example, from four to six in medicine). However, a viewer who only glances at the data might wrongly believe that the increase in number of authors is depicted by the whole graph, and is therefore much more than it really is.
To create dashboards to allow interactive analysis of the data, we will need to store the data in a manner that permits fast random access and aggregations. Because our flight data is tabular, SQL is a natural choice, and if we are going to be using SQL, we should consider whether a relational database meets our needs. Relational databases are a mature technology and remain the tool of choice for many business problems. Relational database technology is well known and comes with a rich ecosystem of interoperable tools. The problem of standardized access to relational databases from high-level programming languages is pretty much solved.
PostgreSQL is a very popular, open-source relational database that is used in production at many enterprises. In addition to its high performance, PostgreSQL is easy to program against—it supports ANSI-SQL, GIS functionality, client libraries in a variety of programming languages, and standard connector technologies such as Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC).
Google Cloud SQL offers a managed database service that supports PostgreSQL, MySQL and SQL Server. Cloud SQL manages backups, patches, updates, and even replication while providing for global availability, automatic failover, and high uptime. For best performance, choose a machine whose RAM is large enough to hold your largest table in memory—as of this writing, available machine types range from a single CPU with less than 4 GB of memory all the way to a 96 CPU machine with 624 GB of memory. Balance this desire for speed with the monthly cost of a machine, of course.
Let’s configure a Cloud SQL instance, create a database table in it, and load the table with the data we ingested into Cloud Storage. You can do all these things on the command line using gcloud, but let’s begin by using the SQL section of Cloud Platform Console and select Create Instance. Choose PostgreSQL and then fill out the form as follows:
Call the instance “flights”
Generate a strong password by clicking on the GENERATE button
Choose the default PosgreSQL version
Choose the region where your bucket of CSV data exists.
Choose a single zone instance since we are just trying it out. We won’t take this to production.
Choose a Standard machine type with 2 vCPU.
Click Create Instance, accepting all the other defaults
In order to import data into a Postgres table, we first have to create an empty database and a table with the correct schema.
In the GCP web console, navigate to the databases section of Cloud SQL and create a new database called bts
. This will be where we load our data.
Next, we have to creating a file with the following syntax, to create a column for every field in the CSV file:
drop table if exists flights; CREATE TABLE flights ( "Year" TEXT, "Quarter" TEXT, "Month" TEXT, "DayofMonth" TEXT, "DayOfWeek" TEXT, "FlightDate" DATE, "Reporting_Airline" TEXT, "DOT_ID_Reporting_Airline" TEXT, "IATA_CODE_Reporting_Airline" TEXT, …
For your convenience, the file I created is already in the Git repository, so just go to Cloud Shell, change into the 03_sqlstudio directory
Stage the file into Google Cloud Storage (changing the bucket to one that you own):
gsutil cp create_table.sql gs://cloud-training-demos-ml/flights/ch3/create_table.sql
In the web console, navigate to the flights instance of Cloud SQL and select IMPORT. In the form, specify the location of create_table.sql and specify that you want to create a table in the database
bts
(see Figure 3-2).
A few seconds later, the empty table will be created.
We can now load the CSV files into this table. Start by loading the January data by browsing to 201501.csv in your bucket and specifying CSV as the format, bts as the database, and flights as the table (see Figure 3-3).
Note that the user interface doesn’t provide a way to skip the first line, so the header will also get loaded as a row in the table. Fortunately, our schema calls all the fields as text, so this doesn’t pose a problem – after loading the data, we can delete the row corresponding to the header. If we have a more realistic schema, we will have to remove the header line before loading the file.
We can connect to the Cloud SQL instance from Cloud Shell using: 6
gcloud sql connect flights --user=postgres
In the prompt that comes up, we connect to the bts database:
c bts; Then, we can run a query to obtain the 5 busiest airports: SELECT "Origin", COUNT(*) AS num_flights FROM flights GROUP BY "Origin" ORDER BY num_flights DESC LIMIT 5;
While this is performant because the dataset is relatively small (only January!), as I added more months, the database started to slow down.
Relational databases are particularly well suited to smallish datasets on which we wish to do ad hoc queries. Even for larger datasets, we can tune the performance of a relational database by indexing the columns of interest. Further, because relational databases typically support transactions and guarantee strong consistency, they are an excellent choice for data that will be updated often.
However, a relational database is a poor choice if your data is primarily read-only, if your dataset sizes go into the terabyte range, or if your data streams in at high rates. This describes our flight delay use case. So, let’s switch from a relational database to an analytics data warehouse – BigQuery. The analytics data warehouse will allow us to use SQL and is much more capable of dealing with large datasets and ad hoc queries (i.e. doesn’t need the columns to be indexed).
If you are following along with me, delete the Cloud SQL instance. We won’t need it any further in this book.
In Chapter 2, we loaded the CSV data into BigQuery into a table named
flights_raw
in the dataset
dsongcp
. Let’s explore that dataset a bit – this is not a full exploratory analysis, which I will do in Chapter 5.
My goal here is to do “just enough” analysis on the data and then quickly pivot to building my first model. Once I have the model, I will be able to build a dashboard to explain that model. The idea is to get a first iteration out in front of users as quickly as possible. Going from ingested data to minimum viable outputs (model, dashboard, etc) quickly is what agile development in data science looks like.
Teams that wait until they build a fully capable model before incorporating it into decision tools often build the wrong model (i.e. they solve the wrong problem because of misunderstanding how the decision will be used) or choose unviable technology (that is hard to get into production). Avoid these traps by testing your work with real users as quickly as possible!
Navigate to the BigQuery section of the Google Cloud web console, select the flights_raw table. On the right side of the window, select Schema (see Figure 3-4). Which fields do you think are relevant to predicting flight arrival delays?
Just looking at the schema is not enough. For example, do we really need the Year, Month, DayOfMonth, and so on? Isn’t the FlightDate enough? It’s best to not have duplicative data – the more columns we have, the more work we have to do to keep analysis consistent.
Similarly, which of the various Airline columns do we need? For the Airline columns, we did read the description on the BTS website in Chapter 2, and will probably follow their recommendation that Reporting_Airline be the one that we use. Still, it’s worth verifying why that is.
To make decisions like this, we can use two features – the Preview tab and the Table Explorer tab (see Figure 3-4).
Looking at the preview (see Figure 3-5), the Year, Month, etc. columns do seem to be redundant. (If you are following along with me, you may see different rows, because the Preview just picks whatever is most handy.)
Let’s check whether we can resurrect the FlightDate from the other columns and extract the date pieces from the FlightDate. We can do that with SQL:
SELECT FORMAT("%s-%02d-%02d", Year, CAST(Month AS INT64), CAST(DayofMonth AS INT64)) AS resurrect, FlightDate, CAST(EXTRACT(YEAR FROM FlightDate) AS INT64) AS ex_year, CAST(EXTRACT(MONTH FROM FlightDate) AS INT64) AS ex_month, CAST(EXTRACT(DAY FROM FlightDate) AS INT64) AS ex_day, FROM dsongcp.flights_raw LIMIT 5
The result appears to bear this out:
Row | resurrect | FlightDate | ex_year | ex_month | ex_day |
1 | 2015-02-19 | 2015-02-19 | 2015 | 2 | 19 |
2 | 2015-02-20 | 2015-02-20 | 2015 | 2 | 20 |
3 | 2015-02-22 | 2015-02-22 | 2015 | 2 | 22 |
4 | 2015-02-23 | 2015-02-23 | 2015 | 2 | 23 |
5 | 2015-02-25 | 2015-02-25 | 2015 | 2 | 25 |
But we have to be sure. Let’s print out rows where the extracted data from FlightDate are not identical:
WITH data AS ( SELECT FORMAT("%s-%02d-%02d", Year, CAST(Month AS INT64), CAST(DayofMonth AS INT64)) AS resurrect, FlightDate, CAST(EXTRACT(YEAR FROM FlightDate) AS INT64) AS ex_year, CAST(EXTRACT(MONTH FROM FlightDate) AS INT64) AS ex_month, CAST(EXTRACT(DAY FROM FlightDate) AS INT64) AS ex_day, FROM dsongcp.flights_raw ) SELECT * FROM data WHERE resurrect != CAST(FlightDate AS STRING)
This query returns an empty resultset, and so we are sure that we can safely keep only the FlightDate column.
How about the Airline code? Switch to the Table Explorer tab and select the three airline columns as shown in Figure 3-6.
BigQuery analyzes the full dataset and shows the unique values in the table, as shown in Figure 3-6.
It is clear from the Table Explorer that we want to use either the Reporting_Airline or the IATA_CODE_Reporting_Airline. As before checking to see if there are rows where these are different indicates that Reporting_Airline is sufficient.
Based on such analysis on the remaining fields, I came up with the following sets of operations I want to do to the raw data to make it more usable. For example, the Departure Delay should be a floating point number and not a string. The cancellation code should be a boolean and not “1.00”:
CREATE OR REPLACE VIEW dsongcp.flights AS
SELECT FlightDate AS FL_DATE, Reporting_Airline AS UNIQUE_CARRIER, OriginAirportSeqID AS ORIGIN_AIRPORT_SEQ_ID, Origin AS ORIGIN, DestAirportSeqID AS DEST_AIRPORT_SEQ_ID, Dest AS DEST, CRSDepTime AS CRS_DEP_TIME, DepTime AS DEP_TIME, CAST(DepDelay AS FLOAT64) AS DEP_DELAY, CAST(TaxiOut AS FLOAT64) AS TAXI_OUT, WheelsOff AS WHEELS_OFF, WheelsOn AS WHEELS_ON, CAST(TaxiIn AS FLOAT64) AS TAXI_IN, CRSArrTime AS CRS_ARR_TIME, ArrTime AS ARR_TIME, CAST(ArrDelay AS FLOAT64) AS ARR_DELAY, IF(Cancelled = '1.00', True, False) AS CANCELLED, IF(Diverted = '1.00', True, False) AS DIVERTED, DISTANCE FROM dsongcp.flights_raw;
In order to avoid repeating these casts in all queries from here on out, I am creating a view that consists of the SELECT statement (see the first line in the listing above). A view is a virtual table – we can query the view just as if it were a table:
SELECT ORIGIN, COUNT(*) AS num_flights FROM dsongcp.flights GROUP BY ORIGIN ORDER BY num_flights DESC LIMIT 5
Any queries that happen on the view are rewritten by the database engine to happen on the original table – conceptually, a view works as if the SQL corresponding to the view was to be inserted into every query that uses the view.
What if the view includes a WHERE clause so that the number of rows is much less? In such cases, it would be far more efficient to export the results into a table and query that table instead:
CREATE OR REPLACE TABLE dsongcp.flights AS SELECT
But what if you export the results into a table and then the original table has a new month of data added to it? We’d have to rerun the table creation statement to make the extracted table up-to-date. In the case of a view, we wouldn’t have to do anything special – all new queries would automatically be querying the entire raw table, and so include the new month of data.
Can we have our cake and eat it too? Can we get the “live” nature of a view, but the query efficiency of a table? Yes. It’s called a materialized view :
CREATE MATERIALIZED VIEW dsongcp.flights AS SELECT
The view is materialized into a table, and kept up to date by BigQuery. While views are free, materialized views carry an extra cost because of the extra storage and compute overhead they involve.
In this book, I’ll use a regular view during development, since it’s easy to come back and add new columns, etc.. Later on, once we go to production, it’s quite simple to change it over to a materialized view – none of the client code will need to change.
Intuitively, we feel that if the flight is delayed by 15 minutes, it will also tend to arrive 15 minutes late. So, our model could be that we cancel the meeting if the departure delay of the flight is 15 minutes or more. Of course, there is nothing here about the probability (recall that we wanted to cancel if the probability of an arrival delay of 15 minutes was greater than 30%). Still, it will be a quick start and give us something that we can ship now and iterate upon.
Suppose that we need to know how often we will be making the right decision if our decision rule is the following:
If DEP_DELAY ≥ 15, cancel the meeting; otherwise, go ahead.
There are four possibilities in the contingency table or the confusion matrix , which you can see in Table 3-2.
Arrival delay < 15 minutes | Arrival delay ≥ 15 minutes | |
We did not cancel meeting | Correct (true negative) | False negative |
We cancel meeting | False positive | Correct (true positive) |
If we don’t cancel the meeting and it turns out that the flight arrived more than 15 minutes late (let’s call that a “positive”), it is clear that we made a wrong decision. It is arbitrary whether we refer to it as a false negative (treating the late arrival as a positive event) or as a false positive (treating the late arrival as a negative event). The common approach is to term the rarer event, or the event we are looking for, as the positive. Here, delayed flights are (hopefully) rarer, and we are deciding whether to cancel the meeting. How do we find out how often the decision rule of thresholding the departure delay at 15 minutes will tend to be correct? We can evaluate the first box in the confusion matrix using BigQuery:
SELECT COUNT(*) AS true_negatives FROM dsongcp.flights WHERE dep_delay < 15 AND arr_delay < 15
There are 4430885 such flights.
To compute all four values in a single statement, move the WHERE clause into the SELECT itself:
SELECT COUNTIF(dep_delay < 15 AND arr_delay < 15) AS true_negatives, COUNTIF(dep_delay < 15 AND arr_delay >= 15) AS false_negatives, COUNTIF(dep_delay >= 15 AND arr_delay < 15) AS false_positives, COUNTIF(dep_delay >= 15 AND arr_delay >= 15) AS true_positives, COUNT(*) AS total FROM dsongcp.flights WHERE arr_delay IS NOT NULL AND dep_delay IS NOT NULL
Each of the COUNTIF statements counts maintains a count of the number of rows that match the given criterion, and COUNT(*) counts all rows. This way, we get to scan the table just once, and still manage to collect the four numbers that form the confusion matrix:
Row | true_negatives | false_negatives | false_positives | true_positives | total |
1 | 4430885 | 232701 | 219684 | 830738 | 5714008 |
Recall that these numbers assume that we are making a decision by thresholding the departure delay at 15 minutes. But is that the best threshold?
Ideally, we want to try out different values of the threshold and pick the one that provides the best results. To do so, we can declare a variable called THRESH and use it in the query. This way, there is just one number to change when we want to try out a different threshold.:
DECLARE THRESH INT64; SET THRESH = 15;SELECT COUNTIF(dep_delay < THRESH AND arr_delay < 15) AS true_negatives, COUNTIF(dep_delay < THRESH AND arr_delay >= 15) AS false_negatives, COUNTIF(dep_delay >= THRESH AND arr_delay < 15) AS false_positives, COUNTIF(dep_delay >= THRESH AND arr_delay >= 15) AS true_positives, COUNT(*) AS total FROM dsongcp.flights WHERE arr_delay IS NOT NULL AND dep_delay IS NOT NULL
Still, I’d rather not run the query several times, once for each threshold. It’s not about the drudgery of it – I could avoid the manual work by using a for loop in a script. What I’m objecting to is scanning the table four times. The better way to do this in SQL is to declare an array of possible thresholds and then group by them:
SELECT THRESH, COUNTIF(dep_delay < THRESH AND arr_delay < 15) AS true_negatives, COUNTIF(dep_delay < THRESH AND arr_delay >= 15) AS false_negatives, COUNTIF(dep_delay >= THRESH AND arr_delay < 15) AS false_positives, COUNTIF(dep_delay >= THRESH AND arr_delay >= 15) AS true_positives, COUNT(*) AS total FROM dsongcp.flights, UNNEST([5, 10, 11, 12, 13, 15, 20]) AS THRESH WHERE arr_delay IS NOT NULL AND dep_delay IS NOT NULL GROUP BY THRESH
This way, we get to run a single query, which scans the table just once, and still manage to create contingency tables for all the thresholds we want to try. The result consists of the four contingency table values for each of the seven values of the threshold:
Row | THRESH | true_negatives | false_negatives | false_positives | true_positives | total |
1 | 5 | 3931979 | 144669 | 718590 | 918770 | 5714008 |
2 | 10 | 4242286 | 184944 | 408283 | 878495 | 5714008 |
3 | 11 | 4288279 | 193912 | 362290 | 869527 | 5714008 |
4 | 12 | 4329146 | 203068 | 321423 | 860371 | 5714008 |
5 | 13 | 4366641 | 212498 | 283928 | 850941 | 5714008 |
6 | 15 | 4430885 | 232701 | 219684 | 830738 | 5714008 |
7 | 20 | 4542475 | 291791 | 108094 | 771648 | 5714008 |
Learn SQL. You’ll thank me later.
This is all well-and-good, but recall that our goal (see Chapter 1) is to cancel the client meeting if the probability of arriving 15 minutes late is 30% or more. How close do we get with each of these thresholds?
To know this, we need to compute the fraction of times a decision is wrong. We can do this by calling the above result the contingency table, and then computing the necessary ratios:
WITH contingency_table AS ( SELECT THRESH, COUNTIF(dep_delay < THRESH AND arr_delay < 15) AS true_negatives, COUNTIF(dep_delay < THRESH AND arr_delay >= 15) AS false_negatives, COUNTIF(dep_delay >= THRESH AND arr_delay < 15) AS false_positives, COUNTIF(dep_delay >= THRESH AND arr_delay >= 15) AS true_positives, COUNT(*) AS total FROM dsongcp.flights, UNNEST([5, 10, 11, 12, 13, 15, 20]) AS THRESH WHERE arr_delay IS NOT NULL AND dep_delay IS NOT NULL GROUP BY THRESH ) SELECT ROUND((true_positives + true_negatives)/total, 2) AS accuracy, ROUND(false_positives/(true_positives+false_positives), 2) AS fpr, ROUND(false_negatives/(false_negatives+true_negatives), 2) AS fnr, * FROM contingency_table
The result now includes the accuracy, false positive rate, and false negative rate:
Row | accuracy | fpr | fnr | THRESH | true_negatives | false_negatives | false_positives | true_positives | total |
1 | 0.85 | 0.44 | 0.04 | 5 | 3931979 | 144669 | 718590 | 918770 | 5714008 |
2 | 0.9 | 0.32 | 0.04 | 10 | 4242286 | 184944 | 408283 | 878495 | 5714008 |
3 | 0.9 | 0.29 | 0.04 | 11 | 4288279 | 193912 | 362290 | 869527 | 5714008 |
4 | 0.91 | 0.27 | 0.04 | 12 | 4329146 | 203068 | 321423 | 860371 | 5714008 |
5 | 0.91 | 0.25 | 0.05 | 13 | 4366641 | 212498 | 283928 | 850941 | 5714008 |
6 | 0.92 | 0.21 | 0.05 | 15 | 4430885 | 232701 | 219684 | 830738 | 5714008 |
7 | 0.93 | 0.12 | 0.06 | 20 | 4542475 | 291791 | 108094 | 771648 | 5714008 |
We want to cancel the meeting if the probability of the flight being late is more than 30%. If we are going to make the decision based on the departure delay, we have to choose a departure delay that is such that 30% of flights with that departure delay are late. Because we defined on-time arrival as the positive event, this implies that the false positive rate is 30%.
It is clear from the table above that if we want our decision to have a false positive rate of 30%, the departure delay threshold needs to be 10 or 11 minutes (in the dataset, departure delay is an integer, so an intermediate threshold like 10.3 minutes does not make sense).
If we choose a threshold of 10 minutes, we will make the correct decision 96% of the time when we don’t cancel the meeting and 68% of the time when we cancel the meeting. Overall, we are correct 90% of the time.
Note that 10 minutes is not the threshold that maximizes the overall accuracy. Had we chosen a threshold of 20 minutes, we’d cancel far fewer meetings (108k vs. 408k) and be correct more often overall (93%). However, that would be very conservative. Since it is not our goal to be correct 88% of the times we cancel the meeting – we only want to be correct 70% of the time – 10 minutes is the right threshold.
However, we could also consider that if we can increase the threshold to 20 minutes, we would be correct far more often with very little impact on the false negative rate. Until we looked at the data, we didn’t know what was achievable, and it is possible that the original target was set in a fog of uncertainty. It might be worthwhile asking our stakeholders whether they are really wedded to the 30% false negative rate, and whether we have leeway to change the tradeoffs available to users of our application – a dashboard that shows the impact of a threshold is an excellent way to gauge this. If the stakeholders don’t know, it might be worth doing an A/B test with a focus group, and that’s what we are about to do next.
Even this simple model is enough for us to begin getting feedback from end users. Recall that my gut instinct at the beginning of the previous section was that I needed to use a 15-minute threshold on the departure delay. Analysis of the contingency table, however, indicated that the right threshold to use was 10 minutes. I’m satisfied with this model as a first step, but will our end users be? Let’s go about building a dashboard that explains the model recommendations to end users. Doing so will also help clarify what I mean by explaining a model to end users.
There is a large number of business intelligence and visualization tools available, and many of them connect with data sources like BigQuery and Cloud SQL on Google Cloud Platform. In this chapter, we build dashboards using Data Studio, which is free and comes as part of Google Cloud Platform, but you should be able to do similar things with Tableau, QlikView, Looker, and so on.
To work with Data Studio, navigate to http://datastudio.google.com/. There are two key concepts in Data Studio: reports and data sources. A report is a set of charts and commentary that you create and share with others. The charts in the report are built from data that is retrieved from a data source. The first step, therefore, is to set up a data source. Because our data is in BigQuery, the data source we need to set up is for Data Studio to connect to BigQuery.
On the Data Studio home page, click on the Create button, click the Data source menu item, and choose the BigQuery button, as illustrated in Figure 3-8.7
Select your project, the
dsongcp
dataset and
flights
as the table. Then, click on the Connect button. Recall that
flights
is the view that we have set up with the streamlined set of fields.
A list of fields in the table displays, with Data Studio inferring something about the fields based on a sampling of the data in that table. We’ll come back and correct some of these, but for now, just click Create Report, accepting all the prompts.
On the top ribbon, select the scatter chart icon from the “Add a chart” pulldown (see Figure 3-10) and draw a rectangle somewhere in the main window; Data Studio will draw a chart. The data that is rendered is pulled from some rather arbitrary columns.
Ignoring the Date Range Dimension for now, there are three columns being used: the Dimension is the quantity being plotted; Metric X is along the x-axis; and Metric Y is along the y-axis. Change (if necessary) Dimension to UNIQUE_CARRIER, Metric X to DEP_DELAY, Metric Y to ARR_DELAY, and change the aggregation metric for both Metric X and Metric Y to Average. Ostensibly, this should give us the average departure delay and arrival delay of different carriers. Click the Style tab and add in a linear trendline and show the data labels. Figure 3-11 depicts the resulting chart.
So far, our chart is static—there is nothing for end users to interact with. They get to see a pretty picture, but do not get to change anything about our graph. To permit the end user to change something about the graph, we should add controls to our graph.
Let’s give our end users the ability to set a date range. On the top icon ribbon, click the “Date range” button, as illustrated in Figure 3-11.
On your chart, place the rectangle where you’d like the control to appear. Change the time window to be Fixed and set the Start Date to Jan 1, 2015 and end date to Dec 31, 2019 8 . This is how the report will initially appear to users.
In the upper-right corner, change the toggle to switch to the View mode. This is the mode in which users interact with your report. Change the data range to Jan 1, 2015 to May 31, 2015 (see Figure 3-12) and you should see the chart immediately update.
Pause a bit here and ask yourself what kind of a model the chart in Figure 3-12 would explain. Because there is a line, it strongly hints at a linear model. If we were to recommend meeting cancelations based on this chart, we’d be suggesting, based on the linear trend of arrival delay with departure delay, that departure delays of more than 20 minutes lead to arrival delays of more than 15 minutes. That, of course, was not our model—we did not do linear regression, and certainly not airline by airline. Instead, we picked a departure threshold based on a contingency table over the entire dataset. So, we should not use the graph above in our dashboard—it would be a misleading description of our actual model.
How would you explain our contingency table–based thresholds to end users in a dashboard? Recall that the choice comes down to the proportion of flights that arrive more than 15 minutes after their scheduled time. That is what our dashboard needs to show.
One of the best ways to show a proportion is to use a pie chart. 9 Switch back to the Edit mode, and from the pull-down menu, select the “donut chart” button (this is a type of pie chart), and then, on your report, draw a square where you’d like the donut chart to appear (it is probably best to delete the earlier scatter plot from it). As we did earlier, we need to edit the dimensions and metrics to fit what it is that we want to display. Perhaps things will be clearer if you see what the end product ought to look like. Figure 3-13 gives you a glimpse.
In this chart, we are displaying the proportion of flights that arrived late versus those that arrived on time. The labeled field ON TIME versus LATE is the Dimension. The number of flights is the metric that will be apportioned between the labels. So, how do you get these values from the BigQuery view?
It is clear that there is no column in the database that indicates the total number of flights. However, Data Studio has a special value Record Count that we can use as the metric, after making sure to change the aggregate from the default Sum to Count.
The “islate” value, though, will have to be computed as a formula. Conceptually, we need to add a new calculated field to the data that looks like this:
CASE WHEN (ARR_DELAY < 15) THEN "ON TIME" ELSE "LATE" END
Click on the current Dimension column and click on “Create Field”. Give the field the name
is_late
, enter the above formula, and change the type to Text (see Figure 3-14).
The pie chart is now complete and reflects the proportion of flights that are late versus those that are on time. You can switch over to the Style tab if you’d like to change the appearance of the pie chart to be similar to mine.
Because the proportion of flights that end up being delayed is the quantity on which we are trying to make decisions, the pie chart translates quite directly to our use case. However, it doesn’t tell the user what the typical delay would be. To do that, let’s create a bar (column) chart that looks like the one shown in Figure 3-15.
Here, the labeled quantity (or Dimension) is the Carrier. There are two metrics being displayed: the DEP_DELAY and ARR_DELAY, both of which are aggregated to their averages over the dataset. Figure 3-16 shows the specifications.
Note the Sort column at the end—it is important to have a reliable sort order in dashboards so that users become accustomed to finding the information they want in a known place. Also, the default is to use different axes for the two variables.
Switch over to the Style tab and change this to use a single axis. Finally, Data Studio defaults to 10 bars. In the Style tab, change this to reflect that we expect to have up to 20 unique carriers (Figure 3-17).
Of course, we can now add in a date control as we did earlier to end up with the report in Figure 3-18 (the All Flights in the diagram is just a text label that I added).
It appears that, on average, about 80% of flights are on time and that the typical arrival delay varies between airlines but lies in a range of 0 to 15 minutes.
Even though the dashboard we just created shows users the decision-making criterion (proportion of flights that will be late) and some characteristics of that decision (the typical arrival delay), it doesn’t actually show our model. Recall that our model involved a threshold on the departure delay. We need to show that. Figure 3-19 shows what we want the dashboard to look like.
In other words, we want to show the same two charts, but for the decision thresholds that we considered—departure delays of 10, 15, and 20 minutes or more.
To get there, we need to change our data source. No longer can we populate the chart from the entire table. Instead, we should populate it from a query that pulls only those flights whose departure delay is greater than the relevant threshold. In BigQuery, we can create the views we need 10 and use those views as data sources. Here’s how:
CREATE OR REPLACE VIEW dsongcp.delayed_10 AS SELECT * FROM dsongcp.flights WHERE dep_delay >= 10; CREATE OR REPLACE VIEW dsongcp.delayed_15 AS SELECT * FROM dsongcp.flights WHERE dep_delay >= 15; CREATE OR REPLACE VIEW dsongcp.delayed_20 AS SELECT * FROM dsongcp.flights WHERE dep_delay >= 20;
Looking at the resulting pie chart for a 10 minute-threshold (Figure 3-19), we see that it comes quite close to our target of 30% on-time arrivals. The bar chart for the 10-minute delay explains why the threshold is important. Hint: it is not about the 10 minutes. It is about what the 10-minute delay is indicative of. Can you decipher what is going on?
Although the typical departure delay of a flight is only about 5 minutes (see the chart corresponding to all flights that we created earlier), flights that are delayed by more than 10 minutes fall into a separate statistical regime. The typical departure delay of an aircraft that departs more than 10 minutes late is around 50 minutes! A likely explanation is that a flight that is delayed by 10 minutes or more typically has a serious issue that will not be resolved quickly. If you are sitting in an airplane and it is more than 10 minutes late in departing, you might as well cancel your meeting—you are going to be sitting at the gate for a while. 11
At this point, we have created a very simple model and created dashboards to explain the model to our end users. Our end users have a visual, intuitive way to see how often our model is correct and how often it is wrong. The model might be quite simple, but the explanation of why the model works is a satisfying one.
There is one teeny, tiny thing missing, though. Context. The dashboard that we have built so far is all about historical data, whereas real dashboards need to be timely. Our dashboard shows aggregates of flights all over the country, but our users will probably care only about the airport from which they are departing and the airport to which they are going. We have a wonderfully informative dashboard, but without such time and location context, few users would care. In Chapter 4, we look at how to build real-time, location-aware dashboards—unfortunately, however, there is a problem with our dataset that prevents us from doing so immediately.
In this chapter, we discussed the importance of bringing the insights of our end users into our data modeling efforts as early as possible. Bringing their insights is possible only if you make it a point to explain your models in context from the get-go.
We tried using Cloud SQL, a transactional, relational database whose management is simplified by virtue of it running on the cloud and being managed by Google Cloud Platform. However, it stopped scaling once we got to millions of flights. Transactional databases are not built for queries that involve scanning the entire table. For such queries, we want to use an analytics data warehouse. Hence, we switched to using BigQuery.
Within BigQuery, we previewed the table, selected a subset of columns and created a view to make downstream analysis simpler.
The first model that we built was to suggest that our road warriors cancel their immediately scheduled meeting if the departure delay of the flight was more than 10 minutes. This would enable them to make 70% of their meetings with 15 minutes to spare.
We then built a dashboard in Data Studio to explain the contingency table model. Because our choice of threshold was driven by the proportion of flights that arrived late given a particular threshold, we illustrated the proportion using a pie chart for two different thresholds. We also depicted the average arrival delay given some departure delay—this gives users an intuitive understanding of why we recommend a 10-minute threshold.
1 When I worked on developing machine learning algorithms for weather prediction, nearly every one of the suggestions and feature requests that I received emanated when the person in question was looking at the real-time radar feed. There would be a storm, my colleague would watch it go up on radar, observe that the tracking of the storm was patchy, and let me know what aspect of the storm made it difficult to track. Or, someone would wake up, look at the radar image, and discover that birds leaving to forage from their roost had been wrongly tagged a mesocyclone. It was all about the real-time data. No matter how many times I asked, I never ever got anyone to look at how the algorithms performed on historical data. It was also often about Oklahoma (where our office was) because that’s what my colleagues would concentrate on. Forecasters from around the country would derisively refer to algorithms that had been hypertuned to Oklahoma supercells.
2 Explanations can be a double-edged sword because humans are not fully rational beings. Explanations can be the result of apophenia – the tendency of humans to see meaningful patterns even when there are none. This can lead to motivated reasoning – the tendency of humans to create justifications for decisions that are more desirable at an emotional level. The combination of apophenia and motivated reasoning can lead to just-so stories that attempt to justify whatever the state of the world is on the basis of spurious explanations. As data scientists, we should realize that we too are human. We need to be careful to set aside these biases, consider counterfactuals, and be willing to revise our initial judgements. Easier said than done, of course.
3 If you are not familiar with design principles, I recommend The Visual Display of Quantitative Information by Edward Tufte.
4 The Economist is published weekly as an 80-page booklet stapled in the center, and each page is about the size of a letter paper. However, for historical reasons, the company refers to itself as a newspaper rather than a magazine.
5 http://www.economist.com/blogs/graphicdetail/2016/11/daily-chart-18, published on Nov. 25, 2016.
6 If your organization has set up a security policy to allow access only from authorized networks, you might have to use a SQL proxy to connect to the instance. At the time of writing, this is available only in the beta version, so do: gcloud beta sql connect flights --user=postgres
7 Graphical user interfaces are often the fastest-changing parts of any software. So, if the user interface has changed from these screenshots by the time this book gets into your hands, please hunt around a bit. There will be some way to add a new data source.
8 Or whatever the last month that you downloaded is. Just so that you don’t have to wait a long time for the data to be available in your Google Cloud project, the ingest.sh script in Chapter 2, by default, downloads only 2015 data. Change the YEAR loop in that script to download 2015 to 2019.
9 An alternative way to show proportions, especially of a time-varying whole, is a stacked column chart.
10 Data Studio does support a BigQuery query as a data source, but it is preferable to read from a view because views are more reusable.
11 Road warriors know this well. Ten minutes in, and they whip out their phones to try to get on a different flight.
44.200.94.150