As a data scientist, you must solve essential problems in an economical fashion, taking into account all constraints and functional requirements. The biggest mistake is to get emotionally attached to a particular technique and go hunting for “suitable” problems to apply it to; this is pseudo-science, at best. A similar issue arises when you apply a convoluted approach where a simpler and more elegant method exists. I’ve seen these two recurring themes countless times. For example, logistic regression (an intuitive technique) can often outperform kernelized support vector machines in classification tasks, and a bit of cleverness in feature engineering may completely obviate the need for an opaque neural network–oriented approach.

Start simple and keep it simple is perhaps the best advice for any data scientist. To achieve simplicity, you must invest an enormous amount of energy and time, but the end result will be more than rewarding (especially when you include the long-term benefits). Hand in hand with simplicity goes pragmatism: good enough is usually better than perfect in most scenarios.

This chapter will exemplify a bias toward simplicity and pragmatism as well as acquaint you with the notions of complexity and heuristics. It contains many small use cases, each highlighting a particular topic pertaining to the succinctness and effectiveness of a data product.
A good starting point is George Pólya’s classic four-step problem-solving method (see reference [1]):

1. Understanding the problem
2. Devising a plan
3. Carrying out the plan
4. Reviewing and reflecting on the outcome
Without context awareness, you would encounter disorder and make wrong judgments. There is also a concomitant consequence of applying the Cynefin framework: to know your current domain, you must think about measurements and metrics. For example, attaining a proper performance characteristic could move you from the Simple domain into the Complicated or Complex domain. However, you will know why you are there and what options you have at your disposal. In other words, you will need to make a deliberate decision about crossing a decision-making boundary.
In the center of Figure 11-1 is a special area called Disorder; it represents the situation in which you don’t yet know the right domain and hence must explore further. Each quadrant is linked to a separate problem-solving approach. The Simple domain calls for a best practice, emphasizing that this is well-charted territory. The Complicated domain demands a good practice, which may sometimes be an optimal solution. In the Complex domain, solutions are emergent, and you may apply various heuristics to reach a suitable resolution. The Chaos domain requires you to take immediate action and only later analyze the cause-and-effect relationships. Notice that this framework reverses the terms Complicated and Complex compared to the Zen of Python (“Complex is better than complicated,” as shown in Figure 3-2 in Chapter 3).
The base class BaseEstimator implements auxiliary methods for getting and setting the parameters of an estimator instance. The fit, fit_predict, predict, and predict_proba methods are part of that unifying API, which allows you to treat different estimators in the same way. MyEstimator follows Python’s duck-typing convention to deliver the accustomed interface.
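The following is a minimal sketch of such an estimator (not the chapter’s original MyEstimator listing); the toy fit/predict logic is purely illustrative, while inheriting from BaseEstimator is what supplies get_params and set_params for free:

```python
import numpy as np
from sklearn.base import BaseEstimator

class MyEstimator(BaseEstimator):
    """Toy estimator that always predicts the (shifted) mean of the targets."""

    def __init__(self, shift=0.0):
        # Store hyperparameters verbatim so the inherited get_params and
        # set_params methods work out of the box.
        self.shift = shift

    def fit(self, X, y):
        # The "model" is just the mean of the training targets.
        self.mean_ = np.mean(y)
        return self  # fit returns self by convention, enabling chaining

    def predict(self, X):
        # Duck typing: any object exposing fit/predict can stand in for an
        # estimator, e.g., inside a cross-validation loop.
        return np.full(np.asarray(X).shape[0], self.mean_ + self.shift)
```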
From Simple to Complicated
This section presents three major strategies for coping with the Complicated domain: improving the asymptotic running time of an algorithm, applying randomization and sampling, and using parallel/distributed computing. All of them should be potential options when crafting your next data science product. It is very important to start with the simplest solution and gradually enhance it. For this you need a complete performance test harness, which we are going to showcase here. I have selected the following examples because they are as straightforward as possible while still conveying the main point.
Counting the Occurrences of a Digit
Dumb Algorithm That Works Reasonably Well If n Is Under 10⁷
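Listing 11-1 itself is not reproduced here; below is a minimal brute-force sketch, assuming the task is to count how many times a given digit appears in the decimal representations of the numbers 1 through n (the function name and default digit are assumptions):

```python
def count_digit_naive(n, digit=1):
    """Count occurrences of digit in the decimal forms of 1..n (brute force)."""
    target = str(digit)
    total = 0
    for i in range(1, n + 1):
        # str(i) has about log10(i) characters, so the cost of the loop body
        # itself grows with the input size (see the note that follows).
        total += str(i).count(target)
    return total
```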
Performance Test Harness to Track Time Measurements and Visualize Them
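The chapter’s harness is likewise omitted; here is a minimal stand-in that times a function over growing input sizes and plots the measurements on log-log axes (function and variable names are illustrative):

```python
import timeit
import matplotlib.pyplot as plt

def measure(func, sizes, repeat=3):
    # Take the best of several runs to dampen timing noise.
    return [min(timeit.repeat(lambda: func(n), number=1, repeat=repeat))
            for n in sizes]

sizes = [10**k for k in range(3, 8)]  # the largest size is already painful
times = measure(count_digit_naive, sizes)  # from the previous sketch

plt.plot(sizes, times, 'o-')
plt.xscale('log'); plt.yscale('log')
plt.xlabel('n'); plt.ylabel('best running time [s]')
plt.show()
```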
The O(n) model is a slight underestimation, since the body of the loop in Listing 11-1 also depends on the input size (the number of digits grows proportionally to log n). Nevertheless, for very large input sizes (when the running time would be better modeled as O(n log n)), we cannot even wait for the test to finish.
Improved Logarithmic Version That Uses an Additional O(log n) Memory Space
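Again as a stand-in for the actual listing, here is one classic way to count digit occurrences position by position in O(log n) time; the original reportedly uses O(log n) auxiliary space (perhaps for a digit decomposition), whereas this iterative sketch gets by with O(1):

```python
def count_digit_fast(n, digit=1):
    """Count occurrences of digit (1-9) in 1..n, one decimal place at a time."""
    total = 0
    place = 1  # current power of ten: 1, 10, 100, ...
    while place <= n:
        higher, rest = divmod(n, place * 10)
        current, lower = divmod(rest, place)
        # Full cycles contributed by the higher digits...
        total += higher * place
        # ...plus the partial cycle governed by the current digit.
        if current > digit:
            total += place
        elif current == digit:
            total += lower + 1
        place *= 10
    return total

# Sanity check against the brute-force version from the earlier sketch.
assert count_digit_fast(10**6) == count_digit_naive(10**6)
```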
This case study has demonstrated a situation in which speedup is possible without jeopardizing accuracy. Moreover, the improved code doesn’t require any additional infrastructural support. Trying to find a better algorithm should be your top priority (before thinking about Hadoop, Spark, etc.).
Estimating the Edge Betweenness Centrality
For the sake of terseness, only the first part of the code is repeated, which slightly differs from the original version. Notice the section that estimates est_main_conns by using 40% of the nodes. The karate club network is very small, and sampling is truly beneficial only with huge networks; nonetheless, the principle is the same.
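As a hedged illustration of the same principle (not the chapter’s listing), NetworkX lets you estimate edge betweenness from a sample of pivot nodes via the k parameter:

```python
import networkx as nx

G = nx.karate_club_graph()
k = int(0.4 * G.number_of_nodes())  # use 40% of the nodes as pivots

# Exact computation considers shortest paths from every node; the sampled
# version estimates betweenness from a random subset of k source nodes.
exact = nx.edge_betweenness_centrality(G)
est_main_conns = nx.edge_betweenness_centrality(G, k=k, seed=10)

# With luck, the sampled run identifies the same most central edge.
print(max(exact, key=exact.get), max(est_main_conns, key=est_main_conns.get))
```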
This case study has illustrated how you may trade accuracy for better performance. This sort of randomization belongs to the class of Monte Carlo methods, which run in bounded time but may return an approximate result. By contrast, the Las Vegas model always produces an accurate result, but in some rare instances it may run slower (quicksort is a good example). At any rate, the new code runs faster without resorting to a computing cluster (see Exercise 11-2).
The Count of Divisible Numbers
- There is an enormous number of triples (a, b, c) that you need to process.
- They cannot all fit in memory simultaneously.
These conditions are good telltale signs that you should consider a shift in technology, where applying parallel and distributed computing is indeed justifiable. You can leverage the Dask framework to handle use cases associated with the previous constraints. Listing 11-5 illustrates the core idea behind leveraging a Dask array. The example is pared down to its essential ingredients so that you can more easily comprehend the underlying mechanism (for example, in practice you would work with larger chunks).
Dask provides its own version of array and dataframe data structures, whose APIs mimic those found in Numpy and Pandas.
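In that spirit, here is a minimal sketch (not Listing 11-5 itself), assuming the task is, for each triple (a, b, c), to count the numbers in [a, b] divisible by c; the random triples and parameter values are purely illustrative:

```python
import dask.array as da

m = 10**8        # number of triples
chunks = 10**6   # deliberately small; in practice you would use larger chunks

# Random triples stand in for real input; only a few chunks live in memory
# at any given moment.
a = da.random.randint(1, 10**6, size=m, chunks=chunks)
b = a + da.random.randint(0, 10**6, size=m, chunks=chunks)
c = da.random.randint(1, 100, size=m, chunks=chunks)

# Closed form for the count of multiples of c inside [a, b]; the element-wise
# expression mirrors the Numpy API but is evaluated lazily and in parallel.
counts = b // c - (a - 1) // c
print(counts.sum().compute())  # .compute() triggers the parallel computation
```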
This case study has demonstrated that parallel and distributed computing is a powerful choice, once you really know that you need it.
From Disorder to Complex
The Complicated domain is characterized by known unknowns, while the Complex domain is about unknown unknowns. Disorder should be a temporary state while you explore the problem space to understand where it belongs. In this respect, Disorder is the catalyst for exploratory data analysis.
High-dimensional datasets are the most probable cause of initial confusion. Classical visualization doesn’t work well here, since you don’t even know how to represent all those dimensions. High-dimensional data contains lots of duplicated and uninformative features (for example, those with extremely low or zero variance); these should be dropped from the corpus. Furthermore, many features will turn out to be unrelated to the target variable, and as such should also be removed as irrelevant. The primary aim is to reduce dimensionality so that you can see the forest for the trees. The gold standard is principal component analysis (PCA), which applies feature extraction to describe your data in terms of orthogonal principal components.
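As a minimal illustration (the synthetic data and component count here are assumptions), PCA projects a dataset onto a handful of orthogonal components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(10)
X = rng.normal(size=(500, 50))  # stand-in for a 50-dimensional dataset

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)  # shape (500, 2): ready for a scatter plot
print(pca.explained_variance_ratio_)  # variance retained by each component
```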
Note
Any approach (SVD, PCA, spectral methods, etc.) that ends up relying on eigenvalues and eigenvectors inevitably falls into the Complex domain. The main difficulty with these practices is that their output is usually hard to interpret. You can surmise what each eigenvector means, but you surely cannot decipher it with certainty.
In this section, we will focus our attention solely on a special visualization technique called t-SNE to glimpse into high-dimensional data; see reference [2] for a nice interactive introduction to this topic as well as how to avoid common pitfalls. In a sense, this is a meta application of the Cynefin framework, since we must consider visualization as Complex/Complicated instead of treating it as Simple.
As a side note, you should also try to peek into the dataset using standard techniques, such as calling the describe method on a Pandas dataframe. Columns whose variance is near zero are good candidates for removal. Moreover, you can visualize columns pairwise using the pairplot function in Seaborn; highly correlated columns are redundant, and you should avoid them. These actions are all associated with feature selection. Part of this can also be automated. For example, you can use the sklearn.feature_selection module, which implements feature selection algorithms; for pruning low-variance features, you may use the VarianceThreshold class. At any rate, all these methods are OK for low-dimensional datasets (those with fewer than ten columns/features).
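Here is a short sketch of those steps on a small example dataset (the Iris data and the variance threshold are assumptions, chosen only for illustration):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import VarianceThreshold

df = sns.load_dataset('iris')  # small, low-dimensional example dataset
print(df.describe())  # quick look at ranges, means, and spread

numeric = df.drop(columns='species')
selector = VarianceThreshold(threshold=0.1)  # prune near-constant features
kept = numeric.columns[selector.fit(numeric).get_support()]
print('Kept features:', list(kept))

sns.pairplot(df, hue='species')  # spot highly correlated (redundant) columns
plt.show()
```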
Feature extraction is complementary to feature selection in that new features are created. For example, arranging instances into bins (a process called binning) is a way of creating new categorical features. We already encountered this in Chapter 2, where we grouped users by age range.
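A tiny binning sketch in the spirit of the Chapter 2 example (the bin edges and labels here are assumptions):

```python
import pandas as pd

ages = pd.Series([5, 17, 24, 33, 41, 58, 72])
# pd.cut turns a numeric column into an ordered categorical feature.
age_range = pd.cut(ages, bins=[0, 18, 35, 60, 120],
                   labels=['child', 'young', 'middle-aged', 'senior'])
print(age_range)
```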
Exploring the KDD Cup 1999 Data
This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, The Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between “bad” connections, called intrusions or attacks, and “good” normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.
Code to Showcase the Power of t-SNE
Be patient and try out many different parameter settings (learning rate, random state, perplexity, number of t-SNE components, etc.).
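The chapter’s full listing is not reproduced; the following sketch conveys the gist, with the column indices, sample size, and parameter values all being assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_kddcup99
from sklearn.manifold import TSNE

# Fetch 10% of the KDD Cup 1999 data; columns 1-3 (protocol_type, service,
# flag) are symbolic, the rest numeric.
data = fetch_kddcup99(percent10=True)
df = pd.DataFrame(data.data)
protocol_type = df[1].str.decode('utf-8')

# t-SNE is slow, so embed only a small random sample of the numeric features.
sample = df.drop(columns=[1, 2, 3]).astype(float).sample(2000, random_state=10)
labels = protocol_type.loc[sample.index]

emb = TSNE(n_components=2, perplexity=30,
           random_state=10).fit_transform(sample.to_numpy())

# Color by protocol_type, even though t-SNE never saw that feature.
for proto in labels.unique():
    mask = (labels == proto).to_numpy()
    plt.scatter(emb[mask, 0], emb[mask, 1], s=5, label=proto)
plt.legend(); plt.show()
```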
The protocol_type symbolic variable is used to mark various groups (observe the green icmp cluster on the right), and there is a good separation in behavior between them. Once again, t-SNE was able to squeeze out these patterns without looking at the protocol_type feature.
This case study has illuminated the importance of exploratory data analysis as part of figuring out to which Cynefin domain your data science problem belongs. Once you know that you need to perform feature extraction and related dimensionality reduction, you can proceed by applying PCA or a similar technique. Reducing dimensionality aids your machine learning efforts, since the reduced dataset doesn’t require huge storage or processing time, and you lower the danger of overfitting your model.
Cynefin and Data Science
Simple: Nonpersonalized recommenders offer products based on their overall popularity or perform various types of association analysis depending on what the current user is doing. For example, if the user is browsing products of a particular type, then the recommender may offer similar products. If a user has put an item into her basket, then the recommender may offer other products commonly bought together with the one inside the basket. This is the typical case behind the “Customers who bought this also bought...” section on a web page.
Complicated: Personalized recommenders utilizing both collaborative and content-based filtering are more intricate and more useful than their nonpersonalized counterparts. They try to model customers by developing their profiles. Nowadays, most recommendation engines belong at least in this domain.
Complex: Multidimensional and context-aware hybrid recommenders are even more powerful. For example, these recommenders may suggest content based on a user’s current mood. This domain calls for out-of-the-box solutions and frequently requires the application of heuristics and approximation algorithms.
Exercise 11-1. Understanding an Algorithm
Try to fully understand how this program works. Admittedly, aggressive performance tuning thwarts the maintainability of your code base. Nonetheless, if you can exponentially speed up your algorithm, that may eliminate the need for expensive and convoluted distributed solutions.
Bonus: Can you generalize the program to handle numbers in an arbitrary base (for example, base 16)? This shouldn’t be hard once you have figured out the preceding code. This is the point where the earlier naive algorithm would immediately break (the search space dramatically increases in bases larger than 10).
Exercise 11-2. Data Streaming
Suppose you need to produce a histogram of a data stream (i.e., depict the frequency distribution of values in a stream). This sounds like just another counting task. After all, it is simply a matter of using Numpy’s histogram function for an array of values. The only catch is that the stream is infinite, and values are coming continuously; there is no way to store them all in memory. This is another scenario in which you need to move from the Simple to the Complicated Cynefin domain.
One very popular algorithm for achieving this is called Count Sketch. This is a probabilistic Monte Carlo–style data structure. The idea is to use a 2D array and a set of universal hash functions to count items. Based on the values of the counters, you can estimate the frequencies. For more information, read http://bit.ly/count-sketch.
Implement a Count-Sketch data structure or find an open-source project and try out this approach. You can simulate a stream using Python’s generator function.
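To get you started, here is one possible minimal implementation (the salted-hash construction below is a simplistic stand-in for a true universal hash family):

```python
import random
import statistics

class CountSketch:
    def __init__(self, width=1024, depth=5, seed=10):
        rnd = random.Random(seed)
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        # Per-row salts stand in for a family of universal hash functions.
        self.salts = [(rnd.getrandbits(64), rnd.getrandbits(64))
                      for _ in range(depth)]

    def _bucket_sign(self, row, item):
        a, b = self.salts[row]
        bucket = hash((a, item)) % self.width
        sign = 1 if hash((b, item)) & 1 else -1  # +/-1 sign hash
        return bucket, sign

    def add(self, item, count=1):
        for row in range(self.depth):
            bucket, sign = self._bucket_sign(row, item)
            self.table[row][bucket] += sign * count

    def estimate(self, item):
        # The median over rows cancels the noise from colliding items.
        return statistics.median(
            sign * self.table[row][bucket]
            for row in range(self.depth)
            for bucket, sign in [self._bucket_sign(row, item)])

# Simulate an infinite stream with a generator, as the exercise suggests.
def stream(seed=10):
    rnd = random.Random(seed)
    while True:
        yield rnd.choice('aaaabbbcc')

cs = CountSketch()
s = stream()
for _ in range(10_000):
    cs.add(next(s))
print({ch: cs.estimate(ch) for ch in 'abc'})  # roughly 4444/3333/2222
```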
Exercise 11-3. Examining Chaos
The Chaos domain requires instantaneous actions. A good example is an anomaly detection system that must react as soon as it detects abnormal behavior. The KDD Cup 1999 Data is perfect for practicing building an automated intrusion detection facility. In this exercise you will do something lighter, but equally instructive.
Hook’s law in physics models the force instigated by a spring with some elasticity k when you stretch it by some distance. The formula is F = − k∆x, where the minus sign denotes an opposite force in relation to the direction of stretching. The spring constant k is unique for each spring.
Suppose that you have collected a set of force readings for various ∆x displacements. Our model of the spring will be correct as long as the spring is operational. Unfortunately, if you stretch the spring too much, you can damage it; from that moment on, you will no longer be able to use the model.
Implement an anomaly detection application for a spring. Your system should prerecord data under regular operating conditions to be able to discern anomalies. You may want to take a look at the sklearn.neighbors.LocalOutlierFactor class. When the spring is broken, it will not generate data points near those from the normal regime; this is the opportunity for your classifier to trigger a warning. You should understand that false positives (invalid alarms) will occur. After all, maybe during normal usage nobody ever pushed the spring to its limits; even though nothing bad happened, your system may think that something went wrong. This is the essential property of the Chaos domain: you don’t have much time to think before you act.
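A hedged sketch of such a detector follows (the spring constant, noise level, and test readings are all assumptions, synthesized purely for illustration):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(10)
k = 4.0  # assumed spring constant

# Prerecorded readings from the normal regime: F = -k * dx plus sensor noise.
dx_train = rng.uniform(0.0, 1.0, size=500)
F_train = -k * dx_train + rng.normal(0, 0.05, size=500)
X_train = np.column_stack([dx_train, F_train])

# novelty=True lets us score unseen readings against the normal regime.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)

# First reading obeys Hooke's law; the second comes from a broken spring.
X_new = np.array([[0.5, -k * 0.5], [0.9, -0.1]])
print(lof.predict(X_new))  # 1 = normal, -1 = anomaly (trigger a warning)
```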
Summary
Knowing when a particular approach is proper and admissible is the cornerstone of being a successful data scientist. You must resist the hype and the temptation to build your product around a novel technology just for the sake of trying things out. Algorithms, together with mathematical statistics, are at the heart of dealing with Big Data. This chapter has illustrated how the Cynefin framework provides context for better decision-making and for choosing the right solution.
References
1. George Pólya, How to Solve It: A New Aspect of Mathematical Method, Expanded Princeton Science Library Edition, Princeton University Press, 2004.
2. Martin Wattenberg et al., “How to Use t-SNE Effectively,” Distill, Oct. 3, 2016, https://distill.pub/2016/misread-tsne.