1 Introduction

This chapter covers

  • What this book is about and its structure
  • What makes this book different from other books on algorithms
  • How massive datasets shape the design of algorithms and data structures
  • How this book can help you design practical algorithms at a workplace
  • Computer and system architecture fundamentals that make large amounts of data challenging for today’s systems

Since you picked up this book, you might be wondering what the algorithms and data structures for massive datasets are and what makes them different from “normal” algorithms you might have encountered thus far. Does the title of this book imply that the classical algorithms (e.g., binary search, merge sort, quicksort, depth-first search, breadth-first search, and many other fundamental algorithms) as well as canonical data structures (e.g., arrays, matrices, hash tables, binary search trees, heaps) were built exclusively for small datasets?

The answer to this question is neither short nor simple, but if it had to be short and simple, it would be “yes.” The notion of what constitutes a massive dataset is relative and depends on many factors, but the fact of the matter is that most bread-and-butter algorithms and data structures that we know about and work with on a daily basis have been developed with an implicit assumption that all data fits in the main memory, or random-access memory (RAM), of a computer. So, once you load all your data into RAM, it is relatively fast and easy to access any element of it, at which point the ultimate goal, from the efficiency point of view, becomes doing the most useful work in the fewest CPU cycles. This is what the good old Big-O analysis (O(·)) teaches us: it commonly expresses the worst-case number of basic operations the algorithm has to perform in order to solve a problem. These unit operations can be comparisons, arithmetic, bit operations, memory cell reads/writes/copies, or anything else that directly translates into a small number of CPU cycles.

However, if you are a data scientist, a developer, or a backend engineer working for a company that collects data from its users, storing all data into the working memory of your computer is often infeasible. Many applications today, such as banking, e-commerce, scientific applications, and the Internet of Things (IoT), routinely manipulate datasets of terabyte (TB) or petabyte (PB) sizes (i.e., you don’t have to work for Facebook or Google to encounter massive data at work).

You might be asking yourself how large the dataset has to be for someone to benefit from the techniques shown in this book. We deliberately avoid putting a number on what constitutes a massive dataset or what a “big-data company” is, as it depends on the problem being solved, the computational resources available to the engineer, system requirements, and so forth. Some companies with enormous datasets also have copious resources and can afford to delay thinking creatively about scalability issues by investing in the infrastructure (e.g., by buying tons of RAM). A developer working with moderately large datasets, but with a limited budget for the infrastructure, and extremely high system performance requirements from their client, can benefit from the techniques shown in this book as much as anyone else. Yet, as we will see, even the companies with virtually infinite resources choose to fill that extra RAM with clever space-efficient data structures.

The problem of massive data has been around for much longer than social networks and the internet. One of the first papers [1] to introduce external-memory algorithms (a class of algorithms that neglect the computational cost of the program in favor of optimizing far more time-consuming data-transfer cost) appeared back in 1988. As the practical motivation for the research, the authors use the example of large banks having to sort 2 million checks daily, about 800 MB worth of checks to be sorted overnight before the next business day, using the working memories of that time (~2-4 MB). Figuring out how to sort all the checks while being able to sort only 4 MB worth of checks at one time, and figuring out how to do so with the smallest number of trips to disk, was a relevant problem back then, and it has only grown in relevance since. Data has grown tremendously since then, but more importantly, it has grown at a much faster rate than the average size of RAM memory.

The main consequence of the rapid growth of data, and the main idea motivating the algorithms in this book, is that most applications today are data intensive. Data intensive (in contrast to CPU intensive) means that the bottleneck of the application comes from transferring and accessing data, rather than from doing computations on that data once it’s available. This fact is central to designing algorithms for large datasets, and it is from there that the ideas of succinct data structures and external-memory algorithms stem. In section 1.4, we will delve into more detail about why data access in a computer is so much slower than computation.

The picture only gets more complex as we zoom out of the view of a single computer. Most applications today are distributed and complex data pipelines, with thousands of computers exchanging data over networks. Databases and caches are distributed, and many users simultaneously add and query large amounts of content. Data formats have become diverse, multidimensional, and dynamic. In order to be effective, the applications need to respond to changes very quickly.

In streaming applications [2], data effectively flies by without ever being stored, and the application needs to capture the relevant features of the data with a degree of accuracy that renders it relevant and useful, without scanning it again. This new context calls for a new generation of algorithms and data structures, a new application builder’s toolbox that is optimized to address many challenges specific to massive data systems. The intention of this book is to teach you exactly that—the fundamental algorithmic techniques and data structures for developing scalable applications.

1.1 An example

To illustrate the main themes of this book, consider the following example: you are working for a media company on a project related to news article comments. You are given a large repository of comments with the following associated basic metadata information:

{
    comment-id: 2833908010
    article-id: 779284
    user-id: 9153647
    text: this recipe needs more butter   
    views: 14375
    likes: 43 
}

You are looking at approximately 3 billion user comments totaling 600 GB in data size. Some of the questions you would like to answer about the dataset include determining the most popular comments and articles, classifying articles according to themes and common keywords occurring in the comments, and so on. But first we need to address the issue of duplicates that accrued over multiple instances of scraping and ascertain the total number of distinct comments in the dataset.

1.1.1 An example: How to solve it

A common way to store unique elements in a data structure is to create a key-value dictionary where each distinct element’s unique ID is mapped to its frequency. There are many libraries that implement key-value dictionaries, such as map in C++, HashMap in Java, dict in Python, and so on. Key-value dictionaries are commonly implemented either as balanced binary search trees (e.g., the red-black tree behind C++’s map) or as hash tables (e.g., Python’s dict).

Red-black-tree vs. hash-table implementations

Tree-based dictionary implementations, apart from lookup/insert/delete operations that run in fast logarithmic time, offer equally fast predecessor/successor operations, that is, the ability to traverse the data back and forth efficiently in lexicographical order. Most hash table implementations lack the ability to efficiently traverse the items in lexicographical order; in exchange, hash tables offer fast constant-time performance on the most common operations of lookup/insert/delete.

For simplicity of our example, let’s assume we are working with Python’s dict, a hash table. Using comment-id as the key and the number of occurrences of that comment-id as the value will help us effectively eliminate duplicates (see the (comment-id -> frequency) dictionary on the left side of figure 1.1).
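
To make the idea concrete, here is a minimal sketch of this exact-counting approach in Python. The iterable comment_ids and the sample IDs are assumptions made for illustration; in practice the IDs would be read from the comment repository.

from collections import Counter

# A minimal sketch of the exact-counting approach: comment_ids is assumed
# to be an iterable of comment-id values read from the repository.
def count_comments(comment_ids):
    frequencies = Counter()           # behaves like a (comment-id -> frequency) dict
    for comment_id in comment_ids:
        frequencies[comment_id] += 1  # duplicates only increment the count
    return frequencies

# Example usage with a few made-up IDs:
freqs = count_comments([2833908010, 384793, 2833908010, 384793, 384793])
print(len(freqs))     # 2 distinct comments
print(freqs[384793])  # comment-id 384793 appears 3 times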

However, we might need up to 24 GB just to store the <comment-id, frequency> pairs for 3 billion comments, using 8 bytes per pair (4 bytes for the comment-id and 4 bytes for the frequency). Depending on the method used to implement the underlying hash table, the data structure will need 1.5 to 2 times the space taken by the elements themselves for bookkeeping (empty slots, pointers, etc.), bringing us close to 40 GB.
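
The back-of-envelope arithmetic behind these estimates is simple; the snippet below just restates the numbers from the text (the 1.7 overhead factor is an assumed midpoint of the 1.5-2x range):

n_comments = 3_000_000_000               # ~3 billion comments
bytes_per_pair = 4 + 4                   # 4-byte comment-id + 4-byte frequency

raw_bytes = n_comments * bytes_per_pair  # raw key-value data
with_overhead = raw_bytes * 1.7          # assumed hash table bookkeeping factor
print(raw_bytes / 10**9)                 # 24.0 (GB)
print(with_overhead / 10**9)             # ~40.8 (GB)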

If we also want to classify articles according to certain topics of interest, we can again employ dictionaries (other methods are possible as well) by constructing a separate dictionary for each topic (e.g., sports, politics, science, etc.), as shown on the right side of figure 1.1. The role of the (article-id -> keyword_frequency) dictionaries here is to count the number of occurrences of topic-related keywords in all the comments; for example, the article with the article-id 745 has 23 politics-related keywords in its associated comments. We pre-filter each comment-id using the large (comment-id -> frequency) dictionary so that only distinct comments are counted. A single table of this sort can contain tens of millions of entries, totaling close to 1 GB, and maintaining such hash tables for, say, 30 topics can cost up to 30 GB for the data alone, and approximately 50 GB in total.
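
A sketch of this topic-counting step might look as follows. The keyword lists, field names, and helper function are made up for illustration and are not a reference implementation:

from collections import defaultdict

# Hypothetical topic -> keywords mapping; real keyword lists would be much larger.
TOPIC_KEYWORDS = {
    "sports":   {"soccer", "player", "goal"},
    "politics": {"election", "senate", "vote"},
}

# One (article-id -> keyword_frequency) dictionary per topic.
topic_tables = {topic: defaultdict(int) for topic in TOPIC_KEYWORDS}

def process_comment(comment, seen_comment_ids):
    """Count topic keywords for one comment, skipping duplicate comment-ids."""
    if comment["comment-id"] in seen_comment_ids:   # pre-filter duplicates
        return
    seen_comment_ids.add(comment["comment-id"])
    words = comment["text"].lower().split()
    for topic, keywords in TOPIC_KEYWORDS.items():
        hits = sum(1 for word in words if word in keywords)
        if hits:
            topic_tables[topic][comment["article-id"]] += hits

seen = set()
process_comment({"comment-id": 2833908010, "article-id": 779284,
                 "text": "what a goal by that player"}, seen)
print(topic_tables["sports"][779284])   # 2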


Figure 1.1 In this example, we build a (comment-id, frequency) hash table to help us store distinct comment-ids with their frequency count. An incoming comment-id 384793 is already contained in the table, so its frequency count is incremented. We also build topic-related hash tables, where for each article, we count the number of times associated keywords appeared in its comments (e.g., in the sports theme, keywords might be “soccer,” “player,” “goal,” etc.). For a large dataset of 3 billion comments, these data structures may require from tens of gigabytes up to a hundred gigabytes of RAM.

We hope this small example demonstrates that we can start from a fairly common and naïve problem, and before we know it, catch ourselves with a number of clunky data structures that we can’t fit into memory.

You might think to yourself, can’t we multiply a couple of numbers beforehand and easily predict how large the data structures are going to become? Well, in real life it often does not work like that. Rarely do people start designing their systems from scratch having massive data in mind. Companies often start out by trying to create a system that works, and can later become victims of their own success, where the user base grows rapidly in a short amount of time, and the old system, designed by developers who have left, needs to grapple with this new demanding workload. Most often, parts of the system get redesigned as the need arises.

When the number of items in a dataset becomes large, every additional bit per item adds to the burden on the system. Common data structures that are the bread and butter of every software developer can become too large to work with efficiently, and we need more succinct alternatives (see figure 1.2).


Figure 1.2 Most common data structures, including hash tables, become difficult to store and manage with large amounts of data.

1.1.2 How to solve it, take two: A book walkthrough

With daunting dataset sizes, we are faced with a number of choices. It turns out that if we settle for a small margin of error, we can build a data structure similar to a hash table in functionality, only far more compact. A family of such succinct data structures makes up part 1 of the book. These are data structures that use a tiny amount of space to approximate the answers to these common questions:

  • Membership—Does a comment/user X exist?

  • Frequency—How many times did user X comment? What is the most popular keyword?

  • Cardinality—How many truly distinct comments/users do we have?

These data structures use much less space to process a dataset of n items than a hash table (think 1 byte per item or less, versus 8-16 bytes per item in a hash table).

A Bloom filter, which we will discuss in chapter 3, will use eight times less space than the (comment-id -> frequency) hash table and will answer membership queries with about a 2% false positive rate. In this introductory chapter, we avoid getting into the gritty mathematical details of how we arrive at these numbers, but the difference between Bloom filters and hash tables that is worth emphasizing is that Bloom filters do not store the keys (e.g., comment-id) themselves. Bloom filters compute hashes of keys and use them to modify the data structure. Thus, the size of the Bloom filter mainly depends on the number of keys inserted, not on their size (or on whether the key is a string, a small integer, or a large integer).
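
As a preview of chapter 3, here is a deliberately simplified Bloom filter sketch. The bit-array size, the number of hash functions, and the use of SHA-256 slices as hash functions are assumptions made purely for illustration, not tuned or production choices:

import hashlib

class SimpleBloomFilter:
    """A minimal Bloom filter sketch: a bit array plus k derived hash positions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive k bit positions from a single strong hash of the key.
        digest = hashlib.sha256(str(key).encode()).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[4 * i: 4 * i + 4], "big")
            yield chunk % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # Never returns False for an added key; may rarely return True for others.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bf = SimpleBloomFilter(num_bits=8 * 1024 * 1024, num_hashes=6)   # ~1 MB of bits
bf.add(2833908010)
print(bf.might_contain(2833908010))   # True
print(bf.might_contain(123456789))    # False (with high probability)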

Another data structure, which we will learn about in chapter 4, the Count-min sketch, will use more than 24 times less space than the (comment-id -> frequency) hash table to estimate the frequency of each comment-id, keeping the overestimate of the frequency small in over 99% of cases. We can also use the Count-min sketch to replace the (article-id -> keyword_frequency) hash tables, using about 3 MB per topic table, which costs about 20 times less than the original scheme.
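
The following is a minimal Count-min sketch along the same lines, again with assumed (untuned) width and depth parameters and SHA-256-based row hashes chosen purely for illustration:

import hashlib

class SimpleCountMinSketch:
    """A minimal Count-min sketch: d rows of w counters, one hash per row."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key):
        # The minimum over rows never underestimates the true count.
        return min(self.table[row][self._index(key, row)] for row in range(self.depth))

cms = SimpleCountMinSketch(width=2000, depth=5)
for comment_id in [384793, 384793, 2833908010]:
    cms.add(comment_id)
print(cms.estimate(384793))   # 2 (possibly a slight overestimate)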

Lastly, the HyperLogLog data structure from chapter 5 can estimate the cardinality of the set using only 12 KB, with an error of less than 1% of the true cardinality.
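
A deliberately simplified HyperLogLog sketch is shown below. It uses 2^14 registers (roughly the 12 KB configuration mentioned above, if each register were packed into a few bits) and omits the small- and large-range corrections of the full algorithm:

import hashlib

class SimpleHyperLogLog:
    """A simplified HyperLogLog sketch (no small/large-range corrections)."""

    def __init__(self, precision=14):
        self.p = precision
        self.m = 1 << precision                   # number of registers (2^14 = 16384)
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, key):
        x = int.from_bytes(hashlib.sha256(str(key).encode()).digest()[:8], "big")
        idx = x >> (64 - self.p)                  # first p bits pick the register
        rest = x & ((1 << (64 - self.p)) - 1)     # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        harmonic = sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m / harmonic

hll = SimpleHyperLogLog()
for i in range(100_000):
    hll.add(i)
print(round(hll.estimate()))   # close to 100,000 (typically within ~1%)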

If we further relax the requirements on accuracy for each of these data structures, we can get away with even less space. Because the original dataset still resides on disk, there is also a way to deal with the occasional error so that we are not stuck with false positives; we just need a little extra effort to verify them against the original data.

Comment data as a stream

Quite likely, we will encounter the problem of news comments and articles in the context of a fast-moving event stream rather than as a static dataset. Assume that the event here constitutes any modification to the dataset, such as clicking Like or inserting/deleting a comment or an article, and the events arrive in real time as streaming data to our system. You will learn more about this streaming data context in chapter 6.

Note that in this setup we can also encounter duplicates of comment-id, but for a different reason: every time someone clicks Like on a particular comment, we receive the event with the same comment-id but with an amended count on the likes attribute. Given that events arrive rapidly and on a 24/7 basis and that we cannot afford to store all of them, for many problems of interest we can only provide approximate solutions. Mainly, we are interested in computing basic statistics on data in real time (e.g., the average number of likes per comment in the past week), and without the ability to store the like count for each comment, we can resort to random sampling.

We could draw a random sample from the data stream as it arrives using the Bernoulli sampling algorithm that we cover in chapter 7. To illustrate, if you have ever plucked flower petals in the love-fortune game “(S)he loves me, (s)he loves me not” in a random manner, you could say that you probably ended up with “Bernoulli-sampled” petals in your hand (do not use this on a date). This sampling scheme lends itself conveniently to the one-pass-over-the-data context.
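
A Bernoulli sampler is short enough to sketch here; the stand-in likes_stream and the 1% sampling probability are assumptions for illustration:

import random

def bernoulli_sample(stream, p):
    """Keep each stream item independently with probability p (a single pass)."""
    for item in stream:
        if random.random() < p:
            yield item

# Example: estimate the average number of likes per comment from a ~1% sample.
likes_stream = (like_count for like_count in range(10_000))  # stand-in for real events
sample = list(bernoulli_sample(likes_stream, p=0.01))
print(len(sample))                 # roughly 100 items kept
print(sum(sample) / len(sample))   # close to the true average (~5000 here)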

Answering some more granular questions about the comments data, such as how many likes a comment needs in order to be in the top 10% of most-liked comments, will also trade accuracy for space. We can maintain a type of dynamic histogram (see chapter 8) of the complete viewed data within a limited, realistic amount of fast memory. This sketch, or summary of the data, can then be used to answer queries about any quantile of our complete data with some error.
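
Chapter 8 covers proper quantile sketches such as q-digest and t-digest; purely to illustrate the kind of question being asked, here is a naive sample-based quantile estimate (the stand-in data and the 1% sample rate are assumptions):

import random

def quantile_from_sample(sample, q):
    """Estimate the q-th quantile (0 < q < 1) from a random sample."""
    ordered = sorted(sample)
    return ordered[min(int(q * len(ordered)), len(ordered) - 1)]

# Stand-in per-comment like counts and a ~1% Bernoulli sample of them.
like_counts = [random.randint(0, 500) for _ in range(100_000)]
sample = [x for x in like_counts if random.random() < 0.01]
print(quantile_from_sample(sample, 0.9))   # ~likes needed to reach the top 10%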

Comment data in a database

Lastly, we might want to store all comment data in a persistent format (e.g., a database on disk/cloud), and build a system on top that would enable the fast insertion, retrieval, and modification of live data over time. In this kind of setup, we favor accuracy over speed, so we are comfortable storing tons of data on disk and retrieving it in a slower manner, as long as we can guarantee queries will have 100% accuracy.

Storing data on remote storage and organizing it so that it lends itself to efficient retrieval is a topic of the algorithmic paradigm called external-memory algorithms, which we will begin to explore in chapter 9. External-memory algorithms address some of the most relevant problems of modern applications, such as the design and implementation of database engines and their indices. In our particular comments data example, we need to ask whether we are building a system with mostly static data, but that is constantly queried by users (i.e., read optimized), or a system where users frequently add new data and modify it, but query it only occasionally (i.e., write optimized). Or, perhaps, it is a combination, where both fast inserts and fast queries are equally important (i.e., read-write optimized).

Very few engineers implement their own storage engines, but almost all of them use one. To knowledgeably choose between the different alternatives, we need to understand what data structures power them underneath. The insert/lookup tradeoff is inherent in databases, and it is reflected in the design of the data structures that run underneath MySQL, TokuDB, LevelDB, and many other storage engines out there. Some of the most popular data structures for building databases include B-trees, Bε-trees, and LSM-trees, and each serves a different sort of workload. We will discuss these different types of performance and their tradeoffs in chapter 10. We may also be interested in solving other problems with data sitting on disk, such as ordering comments lexicographically or by number of occurrences. To do that, we need a sorting algorithm that can efficiently sort data in a database or in a file on disk. You will learn how to do that in the last chapter of our book, chapter 11.
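
To preview the flavor of such disk-friendly algorithms, here is a toy external merge sort: sort memory-sized chunks, write each sorted run to a temporary file, then combine the runs with a single sequential merge pass. The chunk size, file format (one integer per line), and paths are assumptions for illustration:

import heapq
import os
import tempfile

def external_sort(input_path, output_path, chunk_size=1_000_000):
    """Sort a file of one integer per line using a limited amount of memory."""
    run_paths = []
    with open(input_path) as infile:
        while True:
            # Read up to chunk_size lines; this is the part that must fit in RAM.
            chunk = [int(line) for _, line in zip(range(chunk_size), infile)]
            if not chunk:
                break
            chunk.sort()                                   # in-memory sort of one run
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run_file:
                run_file.writelines(f"{x}\n" for x in chunk)
            run_paths.append(path)

    run_files = [open(p) for p in run_paths]
    with open(output_path, "w") as out:
        merged = heapq.merge(*((int(line) for line in f) for f in run_files))
        out.writelines(f"{x}\n" for x in merged)           # one sequential merge pass
    for f in run_files:
        f.close()
    for p in run_paths:
        os.remove(p)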

1.2 The structure of this book

As the earlier section outlines, this book revolves around three main themes, divided into three parts.

Part 1 (chapters 2-5) deals with hash-based sketching data structures. This part begins with a review of hash tables and specific hashing techniques developed for the massive-data setting. Even though the hashing chapter is planned as a review chapter, we suggest you use it as a refresher on hashing and as an opportunity to learn about modern hashing techniques devised to deal with large datasets. Chapter 2 also serves as good preparation for chapters 3-5, considering that the sketches are hash based. The data structures we present in chapters 3-5, such as Bloom filters, Count-min sketch, HyperLogLog, and their alternatives, have found numerous applications in databases, networking, and so on.

Part 2 (chapters 6-8) introduces data streams. From classical techniques, such as Bernoulli and reservoir sampling, to more sophisticated methods, such as sampling from a moving window, we introduce a number of sampling algorithms suitable for different streaming-data models. The resulting samples are then used to estimate totals, averages, and so on. We also introduce algorithms for computing ε-approximate quantiles, such as q-digest and t-digest.

Part 3 (chapters 9-11) covers algorithmic techniques for scenarios when data resides on SSD/disk. First we introduce the external-memory model and then present optimal algorithms for fundamental problems such as searching and sorting, illuminating key algorithmic tricks in this model. This part of the book also covers data structures that power modern databases such as B-trees, Bε-trees, and LSM-trees.

1.3 What makes this book different and whom it is for

There are a number of great books on classical algorithms and data structures, including The Algorithm Design Manual (3rd ed.) by Skiena (Springer, 2020); Introduction to Algorithms (4th ed.) by Cormen, Leiserson, Rivest, and Stein (The MIT Press, 2022); Algorithms (4th ed.) by Sedgewick and Wayne (Addison-Wesley, 2011); and, for a more introductory and friendly take on the subject, Grokking Algorithms by Bhargava (Manning, 2016). The algorithms and data structures for massive datasets are slowly making their way into mainstream textbooks, but the world is moving fast, and our hope is that this book can provide a compendium of state-of-the-art algorithms and data structures that can help a data scientist or a developer handling large datasets at work.

The book is intended to offer a good balance of theoretical intuition, practical use cases, and code snippets in Python. We assume that a reader has some fundamental knowledge of algorithms and data structures, so if you have not studied the basic algorithms and data structures, you should cover that material before embarking on this subject. Massive data algorithms are a very broad subject, and this book is meant to serve as a gentle introduction.

The majority of the books on massive data focus on a particular technology, system, or infrastructure. This book does not focus on the specific technology; neither does it assume familiarity with any particular technology. Instead, it covers underlying algorithms and data structures that play a major role in making these systems scalable.

Often, the books that do cover algorithmic aspects of massive data focus on machine learning. However, an important aspect of handling large data has often been neglected in the literature: not inferring meaning from the data, but coping with its sheer size and processing it efficiently, whatever the data may be. This book aims to fill that gap.

There are some excellent books that address specialized aspects of massive datasets [3]. With this book, we intend to present these different themes in one place, often citing the cutting-edge research and technical papers on the relevant subjects. Lastly, our hope is that this book will teach more advanced algorithmic material in a down-to-earth manner, providing mathematical intuition instead of the technical proofs that characterize most resources on this subject. Illustrations play an important role in communicating some of the more advanced technical concepts, and we hope you enjoy them (and learn from them).

Now that the introductory remarks are out of the way, let’s discuss the central issue that motivates topics from this book.

1.4 Why is massive data so challenging for today’s systems?

There are many parameters in computer and distributed systems architecture that can shape the performance of a given application. Some of the main challenges that computers face in processing large amounts of data stem from hardware and general computer architecture. Now, this book is not about hardware, but in order to design efficient algorithms for massive data, it is important to understand some of the physical constraints that make data transfer such a big challenge. The main issues we discuss in this chapter include the large asymmetry between CPU and memory speeds, the different levels of memory and the speed-size tradeoff at each level, and the issue of latency versus bandwidth.

1.4.1 The CPU memory performance gap

The first important asymmetry that we will discuss is between the speeds of CPU operations and memory access operations in a computer, also known as the CPU memory performance gap [4]. Figure 1.3 shows, starting from 1980, the average gap between the speeds of processor memory access and main memory access (DRAM memory), expressed in the number of memory requests per second (the inverse of latency).


Figure 1.3 The CPU memory performance gap, adapted from Hennessy and Patterson’s Computer Architecture. The graph shows the widening gap between processor and main memory (DRAM) access speeds, measured as the average number of memory accesses per second over time. The vertical axis is on a log scale. Processors improved by about 1.5 times per year up to 2005, while access to main memory improved by only about 1.1 times per year. Processor speedup has somewhat flattened since 2005, but this is being alleviated by the use of multiple cores and parallelism.

Intuitively, this gap shows that performing computations is much faster than accessing data. If we are stuck with a mindset that only cares about optimizing CPU computation, then in many cases our analyses will not jibe well with reality.

1.4.2 Memory hierarchy

Aside from the CPU memory gap, there is a hierarchy of different types of memory built into a computer, each with different characteristics. The overarching tradeoff has been that memory that is fast is also small (and expensive), and memory that is large is also slow (but cheap). As shown in figure 1.4, starting from the smallest and fastest, the memory hierarchy usually contains the following levels: registers, level 1 cache, level 2 cache, level 3 cache, main memory, solid-state drive (SSD), and/or hard disk drive (HDD). The last two are persistent (nonvolatile) memories, meaning the data is preserved if we turn off the computer, and as such they are suitable for storage.


Figure 1.4 Different types of memory in the computer. Starting from registers in the bottom left corner, which are blindingly fast but also very small, we move up (getting slower) and to the right (getting larger) through level 1 cache, level 2 cache, level 3 cache, and main memory, all the way to SSD and/or HDD. Combining different memories in the same computer creates the illusion of having both speed and storage capacity, with each level serving as a cache for the next, larger one.

In figure 1.4, we can see the access times and capacities for each level of the memory in a sample architecture [5]. The numbers vary across architectures and are more useful when observed in terms of ratios between different access times rather than the specific values. For example, pulling a piece of data from cache is roughly 1 million times faster than doing so from the disk.

The hard disk and its read/write arm, among the few remaining mechanical parts of a computer, work a lot like a record player and its needle. Placing the arm on the right track is the time-consuming part of accessing disk data. Once the arm is on the right track, the data transfer can be very fast, depending on how fast the disk spins.

1.4.3 Latency vs. bandwidth

A similar phenomenon, in which “latency lags bandwidth” [6], holds for different types of memory. The bandwidth of various systems, ranging from microprocessors to main memory, hard disk, and network, has improved tremendously over the past few decades, but latency has not improved at the same rate, even though latency is the more important measure in many scenarios where common user behavior involves many small random accesses rather than one large sequential one.

To offset the cost of the expensive initial call, data transfer between different levels of memory is done in chunks of multiple items. These chunks are called cache lines, pages, or blocks, depending on the memory level we are working with, and their size is proportional to the size of the corresponding level of memory; for cache they are in the range of 8-64 bytes, and for disk blocks they can be up to 1 MB [7]. Thanks to spatial locality, the expectation that a program will soon access memory locations close to those it has just accessed, transferring data in sequential blocks effectively prefetches the items we are likely to need in the near future.
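
The payoff of transferring data in blocks is easy to feel even from Python. The rough micro-benchmark below reads the same file one byte at a time and then in 1 MB chunks; the file size, path, and buffering choices are assumptions, and absolute timings vary widely across machines, but the gap between the two approaches is typically dramatic:

import os
import time

PATH = "sample.bin"                              # throwaway test file
with open(PATH, "wb") as f:
    f.write(os.urandom(4 * 1024 * 1024))         # 4 MB of random data

def read_all(path, block_size):
    """Read the whole file block_size bytes at a time; return elapsed seconds."""
    with open(path, "rb", buffering=0) as f:     # unbuffered, so block_size matters
        start = time.perf_counter()
        while f.read(block_size):
            pass
        return time.perf_counter() - start

print("1-byte reads:", read_all(PATH, 1), "s")   # millions of tiny requests
print("1 MB reads:  ", read_all(PATH, 1024 * 1024), "s")
os.remove(PATH)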

1.4.4 What about distributed systems?

Most applications today run on multiple computers, and having data sent from one computer to another adds yet another level of delay. Data transfer between computers can be from hundreds of milliseconds to a couple of seconds long, depending on the system load (e.g., number of users accessing the same application), number of hops to destination, and other details of the architecture (see figure 1.5).


Figure 1.5 Cloud access times can be high due to network load and complex infrastructure. Accessing the cloud can take hundreds of milliseconds or even seconds. We can view this as yet another level of memory that is even larger and slower than the hard disk. Improving performance in cloud applications can also be hard because the time to read or write data in the cloud is unpredictable.

1.5 Designing algorithms with hardware in mind

After looking at some crucial aspects of modern computer architecture, the first important takeaway is that, although technology improves constantly (e.g., SSDs are a relatively new development, and they do not share many of the issues of hard disks), some of the issues, such as the tradeoff between the speed and the size of memories, are not going away any time soon. Part of the reason is purely physical: to store a lot of data, we need a lot of space, and the speed of light sets the physical limit on how fast data can travel from one part of a computer to another, or from one part of a network to another. To extend this to a network of computers, consider an example [8] showing that, for two computers 300 meters apart, the lower physical limit on a one-way data exchange is about 1 microsecond, the time light needs to cover that distance.

Hence, we need to design algorithms that can work around hardware limitations. Designing succinct data structures (or taking data samples) that can fit into small and fast levels of memory helps because we avoid expensive disk seeks. In other words, reducing space saves time.

Yet, in many applications we still need to work with data on disk. Here, designing algorithms with optimized patterns of disk access and caching mechanisms that enable the smallest number of memory transfers is important, and this is further linked to how we lay out and organize data on disk (say, in a relational database). Disk-based algorithms prefer smooth scanning of the disk to random hopping; this way, we get to make use of good bandwidth and avoid poor latency, so one meaningful direction is transforming an algorithm that does many random reads/writes into one that does sequential reads/writes. Throughout this book, we will see how classical algorithms can be transformed, and new ones designed, with these space-related concerns in mind.

However, it is also important to keep in mind that modern systems have many performance metrics other than scalability: security, availability, maintainability, and so on. Real production systems need an efficient data structure and algorithm running under the hood, along with a lot of bells and whistles on top to make all the other stuff work for their customers (see figure 1.6). Still, with ever-increasing amounts of data, designing efficient data structures and algorithms has become more important than ever before, and we hope that in the coming pages you will learn how to do exactly that.


Figure 1.6 An efficient data structure with bells and whistles

Summary

  • Applications today generate and process large amounts of data at a rapid rate. Traditional data structures, such as key-value dictionaries, can grow too big to fit in RAM, which can lead to an application choking due to the I/O bottleneck.

  • To process large datasets efficiently, we can design space-efficient hash-based sketches, do real-time analytics with the help of random sampling and approximate statistics, or deal with data on disk and other remote storage more efficiently.

  • This book serves as a natural continuation of a basic algorithms and data structures book or course because it teaches you how to transform the fundamental algorithms and data structures into ones that scale well to large datasets.

  • The key reasons why large data is a major issue for today’s computers and systems are that CPU (and multiprocessor) speeds improve at a much faster rate than memory speeds, and that the tradeoff between speed and size across the different types of memory, together with the latency-versus-bandwidth phenomenon, leads to applications accessing data far more slowly than they can compute on it. These trends are not likely to change soon, so the algorithms and data structures that address the I/O cost and issues of space are only going to increase in importance over time.

  • In data-intensive applications, optimizing for space means optimizing for time.
