© Ervin Varga 2019
E. Varga, Practical Data Science with Python 3, https://doi.org/10.1007/978-1-4842-4859-1_11

11. Complexity and Heuristics

Ervin Varga, Kikinda, Serbia

As a data scientist, you must solve essential problems in an economical fashion, taking into account all constraints and functional requirements. The biggest mistake is to get emotionally attached to some particular technique and try to find “suitable” problems to which to apply it; this is pseudo-science, at best. A similar issue arises when you attempt to apply some convoluted approach where a simpler and more elegant method exists. I’ve seen these two recurring themes myriad times. For example, logistic regression (an intuitive technique) often can outperform kernelized support vector machines in classification tasks, and a bit of cleverness in feature engineering may completely obviate the need for an opaque neural network–oriented approach. Start simple and keep it simple is perhaps the best advice for any data scientist. In order to achieve simplicity, you must invest an enormous amount of energy and time, but the end result will be more than rewarding (especially when including long-term benefits). Something that goes along with simplicity is pragmatism—good enough is usually better than perfect in most scenarios. This chapter will exemplify a bias toward simplicity and pragmatism as well as acquaint you with the notion of complexity and heuristics. It contains many small use cases, each highlighting a particular topic pertaining to succinctness and effectiveness of a data product.

Addressing problems while also fully satisfying customers demands two things: a general problem-solving ability and a set of rules and principles that puts searching for a solution into a broader context. The former revolves around the next four steps (see reference [1]), which usually constitute a never-ending cycle if new features are later added to a product:
  1. Understanding the problem
  2. Devising a plan
  3. Carrying out the plan
  4. Reviewing and reflecting on the outcome
This strategy is well known and understood by many practitioners. What is usually less well known is the importance of defining and accepting a framework that provides systematic guidance in problem-solving endeavors. I prefer the Cynefin framework (shown in Figure 11-1), which is a conceptual model to aid decision-making. There is a tight association between problem solving and decision-making, so forming an ensemble to cope with both topics is vital. Knowing the current domain of the framework helps you select an optimal tactic; you would likely be comfortable choosing a heuristic inside the Complex domain, and reluctant to apply one in the Simple domain. A heuristic is a technique that trades an absolutely deterministic and accurate solution for something that does the job adequately. Surprisingly, approximate solutions frequently perform as well as fully worked-out methods. Sometimes, the input itself is so noisy that accuracy above some threshold makes no sense.
Figure 11-1

The four quadrants of the Cynefin framework (along with Disorder in the center): Simple, Complicated, Complex, and Chaos

Without context-awareness, you would encounter disorder and make wrong judgments. There is also a concomitant consequence of applying the Cynefin framework: to know your current domain, you must think about measurements and metrics. For example, attaining a proper performance characteristic could move you from the Simple domain into the Complicated or Complex domain. However, you will know why you are there and what options you have at your disposal. In other words, you will need to make a deliberate decision about crossing a decision-making boundary.

In the center of Figure 11-1 is a special area called Disorder; this represents the situation in which you don’t know the right domain yet, and hence must explore further. Each quadrant is linked to a separate problem-solving method instance. The Simple domain calls for a best practice, emphasizing the fact that this is well-charted territory. The Complicated domain demands a good practice, which may sometimes be an optimal solution. The Complex domain designates an emergent property of a solution, where you may apply various heuristics to reach a suitable resolution. The Chaos domain requires you to take immediate action and only later analyze the cause and effect relationships. Notice that this framework reverses the terms Complicated and Complex compared to the Zen of Python (“Complex is better than complicated,” as shown in Figure 3-2 in Chapter 3).

The situation may change over time, and this could cause shifts from one domain into another. New requirements and constraints may demand a different solution, so transitioning happens in a counterclockwise manner. As you gain more knowledge and experience, you could devise simpler solutions, and thus move toward other domains in a clockwise direction. Hiding complexity by introducing powerful abstractions in the form of an application programming interface is a viable tactic to simplify things for clients. This means that different stakeholders may look at the whole problem-solution space from different viewpoints. Perhaps the best example of this is the generic estimator API of scikit-learn. You may manage many machine learning algorithms in a unified fashion, without worrying about the underlying implementation details, as shown in Figure 11-2.
Figure 11-2

An estimator is an object that handles training and usage of a model

The base class BaseEstimator implements auxiliary methods for getting and setting parameters of an estimator instance. The fit, fit_predict, predict, and predict_proba methods are part of that unifying API, which allows you to treat different estimators in the same way. The MyEstimator class follows Python's duck-typing convention to deliver the accustomed interface.
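To make the duck-typing idea concrete, here is a minimal toy estimator (my own illustrative example, not the MyEstimator class from Figure 11-2) that reuses BaseEstimator and therefore plugs into the usual scikit-learn workflow:
import numpy as np
from sklearn.base import BaseEstimator
class MeanRegressor(BaseEstimator):
    """Toy estimator that predicts the mean of the training targets for every sample."""
    def __init__(self, shift=0.0):
        self.shift = shift  # stored verbatim, so get_params/set_params work out of the box
    def fit(self, X, y):
        self.mean_ = np.mean(y) + self.shift
        return self
    def predict(self, X):
        return np.full(len(X), self.mean_)
model = MeanRegressor().fit([[1], [2], [3]], [10, 20, 30])
print(model.predict([[4], [5]]))  # [20. 20.]
print(model.get_params())         # {'shift': 0.0}
Because the class exposes the expected fit/predict pair, it can be combined with scikit-learn utilities such as Pipeline; mixing in sklearn.base.RegressorMixin would additionally provide a default score method.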

From Simple to Complicated

This section presents three major strategies to cope with the Complicated domain: improving the asymptotic running time of an algorithm, applying randomization and sampling, and using parallel/distributed computing. All of them should be potential options in crafting your next data science product. It is very important to start with the simplest solution and gradually enhance it. For this you would need a complete performance test harness, which we are going to showcase here. I have selected the next couple of examples because they are as straightforward as possible while still conveying the main point.

Counting the Occurrences of a Digit

Suppose you need to count how many times a particular digit k appears in the numbers inside an interval [1, n], where n ∈ ℕ ∧ n ≤ 10^50. You may start with the naive approach depicted in Listing 11-1. Don't underestimate the utility of such a brute-force program. It may serve well as a test generator (source of truth) for more advanced solutions.
def count_occurrences_digit_naive(k, n):
    k = str(k)
    count = 0
    for i in range(n + 1):
        count += str(i).count(k)
    return count
# Guard the driver so that importing this module (as in Listing 11-2) doesn't trigger input().
if __name__ == '__main__':
    k, n = tuple(map(int, input().split()))
    print(count_occurrences_digit_naive(k, n))
Listing 11-1

Dumb Algorithm That Works Reasonably Well If n Is Under 10^7

According to the nonfunctional requirements (users should not wait for decades to get the result), we definitely cannot accept this program. It is simply too slow, with roughly linear running time, so we must move beyond the Simple case, as suggested by the Cynefin framework. This is the moment to build the proper infrastructure around performance testing. A similar job is needed to test feeding data into your product, which is part of the data engineering effort. We will use visualization to track time measurements. Listing 11-2 shows the facility to comparatively test different program variants. The measure, execute, attach_model, and report functions comprise the reusable pieces of this script. The driver code is located inside the conditional block at the end; all you need to do is provide a proper configuration. The attach_model function finds the constant factor that best aligns the base model with the measurements.
import time
import numpy as np
import pandas as pd
def measure(f, num_repetitions=5):
    measurements = np.array([])
    for _ in range(num_repetitions):
        start = time.perf_counter()
        f()
        measurements = np.append(measurements, time.perf_counter() - start)
    return measurements.mean()
def execute(config):
    execution_times = {}
    for config_name in config['functions']:
        execution_times[config_name] = np.array([])
    for x in config['span']:
        for config_name in config['functions']:
            execution_times[config_name] = np.append(
                    execution_times[config_name],
                    measure(lambda: config['functions'][config_name](x)))
    return execution_times
def attach_model(execution_times, config, function_name, model_name):
    model_vals = np.vectorize(config['models'][model_name])(config['span'])
    c = np.mean(execution_times[function_name] / model_vals)
    execution_times[model_name] = c * model_vals
def report(execution_times, x_vals, **plot_kwargs):
    df = pd.DataFrame(execution_times)
    df.index = x_vals
    ax = df.plot.line(
        figsize=(10, 8),
        title='Performance Test Report',
        grid=True,
        **plot_kwargs
    )
    ax.set_xlabel('Span')
    ax.set_ylabel('Time [s]')
    return df
if __name__ == '__main__':
    from count_occurrences_digit_naive import count_occurrences_digit_naive
    config = {
        'functions': {
            'naive(k=0)': lambda n: count_occurrences_digit_naive(0, n)
        },
        'models': {
            'O(n)': lambda n: n
        },
        'span': np.geomspace(10**2, 10**7, num=14, dtype=int)
    }
    execution_times = execute(config)
    attach_model(execution_times, config, 'naive(k=0)', 'O(n)')
    print(report(execution_times, config['span'], logx=True, style=['-ro', ':r^']))
Listing 11-2

Performance Test Harness to Track Time Measurements and Visualize Them

After executing this code you will get the following report as well as the line plot, as shown in Figure 11-3:
          naive(k=0)      O(n)
100         0.000073  0.000050
242         0.000111  0.000121
587         0.000251  0.000293
1425        0.000608  0.000712
3455        0.001539  0.001725
8376        0.003793  0.004183
20309       0.009700  0.010141
49238       0.023221  0.024587
119377      0.060661  0.059611
289426      0.145676  0.144526
701703      0.361976  0.350397
1701254     0.872631  0.849525
4124626     2.182770  2.059641
10000000    5.334004  4.993522
Figure 11-3

The line plot of running times for the naive algorithm and its associated O(n) model. The x axis is on a logarithmic scale.

The O(n) model is a slight underestimation, since the body of the loop in Listing 11-1 does depend on the input size. Nevertheless, for very large input sizes (when the running time would be better modeled as O(n log n)), we cannot even wait for the test to finish.

Listing 11-3 shows a much faster O(log n) algorithm that is the result of analyzing the recurring pattern for the number of digits in ranges [1, 10^i − 1], ∀ i ∈ ℕ ∧ i ≤ 50. Digits from one to nine produce the same pattern, while zero is a bit different (we need to ignore leading zeros, so the formula is somewhat more challenging). After changing the driver code to include this faster variant, we get the line plot shown in Figure 11-4. The difference in execution time is indeed startling. You may modify the test run to cover the full range (you should omit the naive algorithm, as it cannot perform acceptably when n > 10^7). It is not hard to generalize the code to handle numbers in an arbitrary base (like 16).
Figure 11-4

The execution times for the new algorithm are so small that they are all represented as zero at this scale. The O(log n) model is fully aligned with the measured values.

def setup():
    MAX_EXPONENT = 50
    # Holds the number of occurrences of digit k in [0, (10**i) - 1], i > 0.
    # If the range is partial (first part of the composite key is False), then
    # leading zeros are omitted (this is a special case when k == 0).
    table_of_occurrences = {(False, 0): 0, (False, 1): 1,
                             (True, 0): 0, (True, 1): 1}
    for i in range(2, MAX_EXPONENT + 1):
        table_of_occurrences[(True, i)] = i * 10**(i - 1)
        table_of_occurrences[(False, i)] = \
            10**(i - 1) + 10 * table_of_occurrences[(False, i - 1)] - 10
    return table_of_occurrences
def count_occurrences_digit(k, n, table_of_occurrences=setup()):
    digits = str(n)
    num_digits = len(digits)
    count = 0
    is_first_digit = num_digits > 1
    for digit in map(int, digits):
        span = 10**(num_digits - 1)
        count += (digit - 1) * table_of_occurrences[(True, num_digits - 1)]
        count += table_of_occurrences[(k != 0 or not is_first_digit, num_digits - 1)]
        if digit > k:
            if k > 0 or not is_first_digit:
                count += span
        elif digit == k:
            count += (n % span) + 1
        num_digits -= 1
        is_first_digit = False
    return count
if __name__ == '__main__':
    k, n = tuple(map(int, input().split()))
    print(count_occurrences_digit(k, n))
Listing 11-3

Improved Logarithmic Version That Uses an Additional O(log n) Memory Space.

This case study has demonstrated a situation in which speedup is possible without jeopardizing accuracy. Moreover, the improved code doesn’t require any additional infrastructural support. Trying to find a better algorithm should be your top priority (before thinking about Hadoop, Spark, etc.).

Estimating the Edge Betweenness Centrality

In Chapter 10, as part of graph analysis, we examined some centrality metrics for a social network. Among them was the edge betweenness centrality. Calculating the exact value of this metric is CPU intensive. When the network is large and you are interested only in getting a good enough result, sampling is a viable option. The idea is very simple: you calculate the metric based on N sampled nodes instead of over the whole network. This usually gives you nearly identical results in a much shorter time. NetworkX has built-in support for this approach. Listing 11-4 shows the code for visually comparing the absolutely accurate outcome with the one obtained from sampling. You may want to play around with the code and increase the number of samples. Obviously, as the sample size increases, you will get more accurate figures, but the running time will also increase. You need to experiment to find the sweet spot.
import operator
import networkx as nx
import matplotlib.pyplot as plt
G = nx.karate_club_graph()
node_colors = ['orange' if props['club'] == 'Officer' else 'blue'
               for _, props in G.nodes(data=True)]
node_sizes = [180 * G.degree(u) for u in G]
plt.figure(figsize=(10, 10))
pos = nx.kamada_kawai_layout(G)
nx.draw_networkx(G, pos,
                 node_size=node_sizes,
                 node_color=node_colors, alpha=0.8,
                 with_labels=False,
                 edge_color='.6')
# Calculating the absolute edge betweenness centrality.
main_conns = nx.edge_betweenness_centrality(G, normalized=True)
main_conns = sorted(main_conns.items(), key=operator.itemgetter(1), reverse=True)[:5]
main_conns = tuple(map(operator.itemgetter(0), main_conns))
nx.draw_networkx_edges(G, pos, edgelist=main_conns, edge_color="green", alpha=0.5, width=6)
# Estimating the edge betweenness centrality by sampling 40% of nodes.
NUM_SAMPLES = int(0.4 * len(G))
est_main_conns = nx.edge_betweenness_centrality(G, k=NUM_SAMPLES, normalized=True, seed=10)
est_main_conns = sorted(est_main_conns.items(), key=operator.itemgetter(1), reverse=True)[:5]
est_main_conns = tuple(map(operator.itemgetter(0), est_main_conns))
nx.draw_networkx_edges(G, pos, edgelist=est_main_conns,
                       edge_color='red', alpha=0.9, width=6, style="dashed")
Listing 11-4

For the sake of terseness, only the first part of the code from Chapter 10 is repeated, with slight differences. Notice the section that estimates est_main_conns by using 40% of the nodes. The karate club network is very small; sampling is truly beneficial with huge networks. Nonetheless, the principle is the same.

Observe that I’ve set the seed parameter. This is crucial for allowing others to reproduce results that involve randomization. Figure 11-5 shows the new graph.
Figure 11-5

The dashed red edges are the ones calculated via sampling. Most of them overlap with the exactly computed edges, and only one is misidentified.

This case study has illustrated how you may trade accuracy for better performance. This sort of randomization belongs to the class of Monte Carlo methods. A Las Vegas algorithm, in contrast, always produces an accurate result, but in rare instances it may run slower (a good example is randomized quicksort). At any rate, the new code runs faster without reaching out for a computing cluster (see Exercise 11-2).
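To make the Las Vegas idea tangible, here is a tiny illustrative sketch (my own example, not from the original text): a randomized quicksort always returns a correctly sorted list, while its running time fluctuates with the random pivot choices.
import random
def randomized_quicksort(items):
    # Always correct; only the number of comparisons depends on the random pivots.
    if len(items) <= 1:
        return list(items)
    pivot = random.choice(items)
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return randomized_quicksort(smaller) + equal + randomized_quicksort(larger)
print(randomized_quicksort([5, 3, 8, 1, 9, 2]))  # always [1, 2, 3, 5, 8, 9]
A Monte Carlo method flips this trade-off: the running time is bounded, but the answer is only approximately correct, as with the sampled edge betweenness centrality above.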

The Count of Divisible Numbers

Suppose you are given the task of counting the natural numbers in the range [a, b] (where 0 < a < b < 10^18) that are divisible by a natural number c ≤ 10^18. Listing 11-5 shows the num_divisible function that solves this problem. It is about as fast as it can be, meaning you cannot considerably speed it up. At this point you could even stop, were there no additional requirements, such as these two:
  • There is an enormous number of triples (a, b, c) that you need to address.

  • These triples cannot all fit in memory at the same time.

These conditions are good telltale signs of the necessity to consider a shift in technology, where applying parallel and distributed computing is indeed justifiable. You can leverage the Dask framework to handle use cases associated with the previous constraints. Listing 11-5 illustrates the core idea behind leveraging a Dask array. The example is pared down to its essential ingredients so that you can more easily comprehend the underlying mechanism (for example, in practice you would work with larger chunks).

Dask delivers constructs (delayed tasks and futures) to parallelize custom applications. Listing 11-5 divides the input array into chunks and processes them as separate blocks, which allows them to be computed in parallel. Dask also handles all the mappings between your array and external data in a transparent fashion. A Dask array is a virtual view into a grid of raw Numpy arrays (the same is true for a Dask dataframe in relation to the underlying Pandas dataframes). You work as if everything were inside the memory of your machine.
import numpy as np
import dask.array as da
def num_divisible(a, b, c):
    r = a % c
    if r == 0:
        start = a
    else:
        start = a + (c - r)
    if start > b:
        return 0
    else:
        return 1 + (b - start) // c
num_divisible_vect = np.vectorize(num_divisible)
x = da.asanyarray([(1, 100, 10), (16789, 445267839, 7), (34, 10**18, 3000), (3, 7, 9)])
x = x.rechunk(chunks=(2, -1))
y = x.map_blocks(lambda block: num_divisible_vect(*block.T),
                 chunks=(-1,),
                 drop_axis=1,
                 dtype='i8')
print(y.compute())
Listing 11-5

Dask provides its own version of array and dataframe data structures, whose APIs mimic those found in Numpy and Pandas.

By running the preceding program, you will receive the following output, which is an array of four values, one for each input triple:
[             10        63607293 333333333333333             0]

This case study has demonstrated that parallel and distributed computing is a powerful choice, once you really know that you need it.

From Disorder to Complex

The Complicated domain is characterized by known unknowns, while the Complex domain is about unknown unknowns. Disorder should be a temporary state while you explore the problem space to understand where it belongs. In this respect, Disorder is the catalyst for exploratory data analysis.

High-dimensional datasets are the most probable reason for initial confusion. Classical visualization doesn’t work well here, since you don’t even know how to represent all those dimensions. High-dimensional data contains lots of duplicated and uninformative features (for example, those that have extremely low or zero variance). These should be dropped from the corpus. Furthermore, many of them will turn out to be unrelated to the target variable, and as such should also be removed as irrelevant. The primary aim is to reduce dimensionality, so that you can see the forest for the trees. The gold standard is the principal component analysis (PCA) method, which applies feature extraction to describe your data in terms of orthogonal principal components.
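As a small sketch of what this looks like in practice (the Iris dataset and the choice of two components are purely illustrative), scikit-learn's PCA can be applied in a couple of lines:
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to differing feature scales
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_)  # fraction of variance retained by each component
The explained variance ratio tells you how much information each orthogonal component retains, which helps in deciding how many components are enough.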

Note

Any approach (SVD, PCA, spectral methods, etc.) that ends up using eigenvalues and eigenvectors inevitably falls into the Complex domain. The main difficulty with these practices is that their output is usually uninterpretable. You can surmise what each eigenvector means, but surely cannot decipher them with certainty.

In this section, we will focus our attention solely on a special visualization technique called t-SNE to glimpse into high-dimensional data; see reference [2] for a nice interactive introduction to this topic as well as how to avoid common pitfalls. In a sense, this is a meta application of the Cynefin framework, since we must consider visualization as Complex/Complicated instead of treating it as Simple.

As a side note, you should also try to peek into the dataset using standard techniques, such as calling the describe method on a Pandas dataframe. Columns whose variance is near zero are good candidates for removal. Moreover, you can visualize pairs of columns using the pairplot function in Seaborn. Highly correlated columns are redundant, and you should avoid them. These actions are all associated with feature selection. Part of this can also be automated. For example, you can use the sklearn.feature_selection module, which implements feature selection algorithms. For pruning low-variance features, you may use the VarianceThreshold class. At any rate, all these methods are OK for low-dimensional datasets (those with fewer than ten columns/features).
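Here is a minimal sketch of that automated pruning; the toy matrix is made up just to show the mechanics:
import numpy as np
from sklearn.feature_selection import VarianceThreshold
X = np.array([[0, 2.0, 1],
              [0, 2.1, 7],
              [0, 1.9, 3]])          # the first column is constant (zero variance)
selector = VarianceThreshold(threshold=0.0)
X_pruned = selector.fit_transform(X)
print(selector.get_support())        # [False  True  True]
print(X_pruned.shape)                # (3, 2)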

Feature extraction complements feature selection by creating new features. For example, arranging instances into bins (a process called binning) is one way to create new categorical features. We already encountered this in Chapter 2, where we grouped users by age range.
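A minimal binning sketch with Pandas, in the spirit of the Chapter 2 age grouping (the column name, bin edges, and labels here are illustrative rather than taken from that chapter):
import pandas as pd
users = pd.DataFrame({'age': [15, 23, 37, 52, 68]})
users['age_group'] = pd.cut(users['age'],
                            bins=[0, 18, 35, 55, 120],
                            labels=['minor', 'young', 'middle-aged', 'senior'])
print(users)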

Exploring the KDD Cup 1999 Data

As stated on the official KDD Cup 1999 Data web site (see https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html ):
  • This is the data set used for The Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99 The Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between “bad” connections, called intrusions or attacks, and “good” normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.

Listing 11-6 shows t-SNE in action for revealing some patterns in the KDD Cup dataset (scikit-learn is packaged with full support for this data). There are 41 features in total. Among them, 34 are numerical, and these are the only ones t-SNE accepts; categorical variables should be deselected or encoded. The t-SNE method is very sensitive to even tiny changes in its parameters. The two auxiliary functions in Listing 11-6 download the column descriptors from the official web site (see also Figure 11-6).
from sklearn.datasets import fetch_kddcup99
from sklearn.manifold import TSNE
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
def retrieve_column_desc():
    import requests
    r = requests.get('https://kdd.ics.uci.edu/databases/kddcup99/kddcup.names')
    column_desc = {}
    for row in r.text.split('\n')[1:]:
        if row.find(':') > 0:
            col_name, col_type = row[:-1].split(':')
            column_desc[col_name] = col_type.strip()
    return column_desc
def get_numeric_columns(column_desc):
    return [name for name in column_desc if column_desc[name] == 'continuous']
column_desc = retrieve_column_desc()
numeric_columns = get_numeric_columns(column_desc)
print('Number of numeric columns:', len(numeric_columns))
X, _ = fetch_kddcup99(subset='SA', random_state=10, return_X_y=True)
X = pd.DataFrame(X, columns=column_desc.keys())
X[numeric_columns] = X[numeric_columns].apply(pd.to_numeric)
# We need to work on a small sample to get results in any reasonable time frame.
X = X.sample(frac=0.05, random_state=10)
m = TSNE(learning_rate=150, random_state=10)
X_tsne = m.fit_transform(X[numeric_columns])
print('First 10 rows of the TSNE reduced dataset:')
print(X_tsne[:10, :])
X['t-sne_1'] = X_tsne[:, 0]
X['t-sne_2'] = X_tsne[:, 1]
sns.set(rc={'figure.figsize': (10, 10)})
sns.scatterplot(x='t-sne_1', y='t-sne_2',
                hue='protocol_type', style="protocol_type",
                data=X[numeric_columns + ['protocol_type', 't-sne_1', 't-sne_2']])
plt.show()
Listing 11-6

Code to Showcase the Power of t-SNE, although you must be patient to try out many different parameter settings (learning rate, random state, perplexity, number of t-SNE components, etc.).

Figure 11-6

Remarkably, t-SNE has been able to condense 34 features into 2 components, which clearly highlight pertinent patterns in data

The protocol_type symbolic variable is used to mark various groups (observe the green icmp cluster on the right), and there is a good separation in behavior between them. Once again, t-SNE was able to squeeze out these patterns without looking at the protocol_type feature.

This case study has illuminated the importance of doing exploratory data analysis as part of figuring out into which Cynefin domain your data science problem belongs. Once you know that you need to perform feature extraction and related dimensionality reduction, then you can proceed further by applying PCA or a similar technique. Reducing dimensionality aids your machine learning efforts, since the reduced dataset doesn’t require huge storage or processing time, and you reduce the danger of overfitting your model.

Cynefin and Data Science

Some of the previous examples may seem abstract to you, so this section illustrates the Cynefin domains through a concrete data science product. Chapter 8 already introduced recommender systems, so we will use them here, too. The following list exemplifies the type of recommendation engine suitable for the first three Cynefin quadrants:
  • Simple: Nonpersonalized recommenders offer products based on their overall popularity or perform various types of association analysis depending on what the current user is doing. For example, if the user is browsing products of a particular type, then the recommender may offer similar products. If a user has put an item into her basket, then the recommender may offer other products commonly bought together with the one inside the basket. This is the typical case behind the “Customers who bought this also bought...” section you see on web pages (a minimal co-occurrence sketch follows this list).

  • Complicated: Personalized recommenders utilizing both collaborative and content-based filtering are more intricate and useful than their nonpersonalized counterparts. They try to model customers by developing their profiles. Nowadays, most recommendation engines belong, at least, in this domain.

  • Complex: Multidimensional and context-aware hybrid recommenders are even more powerful. For example, these recommenders may suggest content based on a user’s current mood. This domain seeks out-of-the-box solutions and frequently requires the application of heuristics and approximation algorithms.
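As promised in the Simple bullet, here is a toy nonpersonalized “bought together” recommender based on plain co-occurrence counts; the basket data and the recommend helper are invented for illustration:
from collections import Counter
from itertools import combinations
baskets = [{'milk', 'bread', 'butter'},
           {'bread', 'butter'},
           {'milk', 'bread'},
           {'beer', 'chips'}]
co_occurrence = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_occurrence[(a, b)] += 1
        co_occurrence[(b, a)] += 1
def recommend(item, top_n=2):
    # Items most often bought together with the given item.
    scores = {b: cnt for (a, b), cnt in co_occurrence.items() if a == item}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
print(recommend('bread'))  # e.g., ['butter', 'milk']
Personalized (Complicated) and context-aware (Complex) recommenders replace these global counts with per-user profiles and additional contextual signals.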

Exercise 11-1. Understanding an Algorithm

For a given closed interval [a, b], where 0 <= a <= b <= 10^8, the following naive algorithm finds all the numbers with nonrepeating digits and outputs their count:
a, b = tuple(map(int, input().split()))
count = 0
for i in range(a, b + 1):
    s = str(i)
    if len(set(s)) == len(s):
        count += 1
print(count)
If you provide numbers 0 and 1000, you will get the count of 739. This is a simple but inefficient program with O((b − a) log b) behavior. The following is a much faster O((log b)^3) algorithm:
import math
a, b = tuple(map(int, input().split()))
def variation_without_repetition(n, k):
    return math.factorial(n) // math.factorial(n - k)
# Finds how many numbers with nonrepeating digits are present in [0, k].
def count_numbers_with_non_repeating_digits(k):
    if k < 0:
        return 0
    if k == 0:
        return 1
    # We can find most numbers using combinatorics.
    digits = str(k)
    num_digits = len(digits)
    first_digit = int(digits[0])
    span = 10 ** (num_digits - 1)
    s = (first_digit - 1) * variation_without_repetition(9, num_digits - 1)
    # We must take care of a lower interval regarding leading zeros.
    s += count_numbers_with_non_repeating_digits(span - 1)
    # We continue our search for the upper part.
    used_digits = {first_digit}
    t = num_digits == 1
    for i in range(1, num_digits):
        first_digit = int(digits[i])
        allowed_digits = set(range(first_digit + 1)) - used_digits
        v = variation_without_repetition(9 - i, num_digits - 1 - i)
        used_digits.add(first_digit)
        if first_digit not in allowed_digits:
            if len(allowed_digits) == 0 and i == 1:
                t = 0
            else:
                t += len(allowed_digits) * v
            break
        else:
            t += (len(allowed_digits) - (i != num_digits - 1)) * v
    return s + t
print(count_numbers_with_non_repeating_digits(b) -
      count_numbers_with_non_repeating_digits(a - 1))

Try to fully understand how this program works. Clearly, aggressive performance tuning can hurt the maintainability of your code base. Nonetheless, if you can exponentially speed up your algorithm, that may eliminate the need for expensive and convoluted distributed solutions.

Bonus: Can you generalize the program to handle numbers in an arbitrary base (for example, base 16)? This shouldn’t be hard once you have figured out the preceding code. This is the point where the former naive algorithm would immediately break (the search space dramatically increases in bases larger than 10).

Exercise 11-2. Data Streaming

Suppose you need to produce a histogram of a data stream (i.e., depict the frequency distribution of values in a stream). This sounds like just another counting task. After all, it is simply a matter of using Numpy’s histogram function for an array of values. The only catch is that the stream is infinite, and values are coming continuously. There is no way to store them in memory. This is another scenario in which you need to change a Cynefin domain from Simple to Complicated.

One very popular algorithm for achieving this is called Count Sketch. This is a probabilistic Monte Carlo–style data structure. The idea is to use a 2D array and a set of universal hash functions to count items. Based on the values of counters, you can estimate the frequencies. For more information, read http://bit.ly/count-sketch .

Implement a Count-Sketch data structure or find an open-source project and try out this approach. You can simulate a stream using Python’s generator function.
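If you get stuck, the following minimal illustrative sketch may help as a starting point. It shows only the counter layout described above: a 2D array, per-row bucket and sign hashes, and a median-based estimate. The width, depth, and hash construction are my own simplifications; a production implementation would use a proper universal hash family.
import random
import statistics
class CountSketch:
    def __init__(self, width=256, depth=5, seed=10):
        rng = random.Random(seed)
        self.width = width
        self.counters = [[0] * width for _ in range(depth)]
        # One (bucket seed, sign seed) pair per row; hash() mixes them with the item.
        self.seeds = [(rng.getrandbits(32), rng.getrandbits(32)) for _ in range(depth)]
    def _bucket(self, row, item):
        return hash((self.seeds[row][0], item)) % self.width
    def _sign(self, row, item):
        return 1 if hash((self.seeds[row][1], item)) % 2 else -1
    def add(self, item, count=1):
        for row in range(len(self.counters)):
            self.counters[row][self._bucket(row, item)] += self._sign(row, item) * count
    def estimate(self, item):
        # The median across rows cancels out most collisions with other items.
        return statistics.median(
            self._sign(row, item) * self.counters[row][self._bucket(row, item)]
            for row in range(len(self.counters)))
cs = CountSketch()
for item in ['a'] * 1000 + ['b'] * 300 + ['c'] * 10:
    cs.add(item)
print(cs.estimate('a'), cs.estimate('b'), cs.estimate('c'))  # roughly 1000 300 10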

Exercise 11-3. Examining Chaos

The Chaos domain requires instantaneous actions. A good example is an anomaly detection system that must react as soon as it detects an abnormal behavior. The KDD Cup 1999 Data is perfect to practice building an automated intrusion detection facility. In this exercise you will do something lighter, but equally instructive.

Hooke’s law in physics models the force exerted by a spring with some elasticity constant k when you stretch it by some distance. The formula is F = −kΔx, where the minus sign denotes a force opposite to the direction of stretching. The spring constant k is unique for each spring.

Suppose that you have collected a set of force readings for various displacements Δx. Our model of the spring will be correct as long as the spring is operational. Unfortunately, if you stretch the spring too much, you can damage it. From that moment on, you will no longer be able to use the model.

Implement an anomaly detection application for a spring. Your system should prerecord data for regular operational conditions to be able to discern anomalies. You may want to take a look at the sklearn.neighbors.LocalOutlierFactor class. When the spring is broken, it will not generate data points near those from the normal regime. This is the opportunity for your classifier to trigger a warning. You should understand that false positives (invalid alarms) will occur. After all, maybe during normal usage nobody ever tried to push the spring to its limits. Even though nothing bad happened, your system may think that something went wrong. This is the essential property of the Chaos domain: you don’t have much time to think before you act.
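One possible starting point (not a full solution; the spring constant, noise level, and test readings below are made-up values) is LocalOutlierFactor in novelty mode, trained only on readings from a healthy spring:
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
rng = np.random.default_rng(10)
k = 4.2                                          # hypothetical spring constant
dx = rng.uniform(0.0, 0.5, size=(500, 1))        # displacements within normal use
force = -k * dx + rng.normal(0, 0.05, size=dx.shape)
normal_readings = np.hstack([dx, force])
detector = LocalOutlierFactor(n_neighbors=20, novelty=True)
detector.fit(normal_readings)
# A damaged spring no longer obeys F = -k * dx; predict() returns -1 for outliers.
suspicious = np.array([[0.3, -1.26], [0.3, -0.2]])
print(detector.predict(suspicious))              # e.g., [ 1 -1]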

Summary

Knowing when a particular approach is proper and admissible is the cornerstone of being a successful data scientist. You must resist the hype and temptation to build your product around a novel technology for the sake of trying things out. Algorithms are at the heart of dealing with Big Data, together with mathematical statistics. This chapter has illustrated how the Cynefin framework provides context for better decision-making and choosing the right solution.

References

  1. George Pólya, How to Solve It: A New Aspect of Mathematical Method, Expanded Princeton Science Library Edition, Princeton University Press, 2004.
  2. Wattenberg, et al., “How to Use t-SNE Effectively,” Distill, https://distill.pub/2016/misread-tsne, Oct. 3, 2016.