Chapter 2. Data Engineering: Fundamentals

The rise of machine learning in recent decades is tightly coupled with the rise of big data. Big data systems, even without machine learning, are complex. If you haven't spent years and years working with them, it's easy to get lost in acronyms. These systems generate many challenges and many possible solutions. Industry standards, if there are any, evolve quickly as new tools come out and the needs of the industry expand, creating a dynamic and ever-changing environment. If you look into the data stack for different tech companies, it might seem like each is doing its own thing.

In this chapter, we'll cover the basics of data engineering that will, hopefully, give you a steady piece of land to stand on as you explore the landscape for your own needs. We'll start with the question of how important data is for building intelligent systems, then cover the fundamentals of handling data. Knowing how to collect, handle, and process a growing amount of data is essential for people who want to build ML systems in production. If you're already familiar with data engineering fundamentals, you might want to move directly to Chapter 3 to learn more about how to sample and generate labels to create training data.

Mind vs. Data

Progress in the last decade shows that the success of an ML system depends largely on the data it was trained on. Instead of focusing on improving ML algorithms, most companies focus on managing and improving their data1.

Despite the success of models using massive amounts of data, many are skeptical of the emphasis on data as the way forward. In the last three years, at every academic conference I attended, there were always debates among famous academics on the power of mind vs. data. Mind might be disguised as inductive biases or intelligent architectural designs. Data might be grouped together with computation, since more data tends to require more computation.

In theory, you can both pursue intelligent design and leverage large data and computation, but spending time on one often takes time away from the other2.

In the mind-over-data camp, there's Dr. Judea Pearl, a Turing Award winner best known for his work on causal inference and Bayesian networks. The introduction to his book The Book of Why is entitled "Mind over Data," and in it he emphasizes: "Data is profoundly dumb." In one of his more controversial posts on Twitter, he expressed his strong opinion against ML approaches that rely heavily on data and warned that data-centric ML people might be out of a job in 3-5 years.

“ML will not be the same in 3-5 years, and ML folks who continue to follow the current data-centric paradigm will find themselves outdated, if not jobless. Take note.”3

There's also a milder opinion from Dr. Christopher Manning, Director of the Stanford Artificial Intelligence Laboratory, who argued that huge computation and a massive amount of data with a simple learning device make for incredibly bad learners. Structure allows us to design systems that can learn more from less data4.

Many people in ML today are on the data over mind camp. Dr. Richard Sutton, a distinguished research scientist at DeepMind and a professor of computing science at the University of Alberta, wrote a great blog post in which he claimed that researchers who chose to pursue intelligent designs over methods that leverage computation will eventually learn a bitter lesson.

“The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. … Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.”5

When asked how Google search was doing so well, Peter Norvig, Google's Director of Search, emphasized the importance of having a large amount of data over intelligent algorithms in their success: "We don't have better algorithms. We just have more data."6

Dr. Monica Rogati, former VP of Data at Jawbone, argued that data lies at the foundation of data science. If you want to use data science, a discipline of which machine learning is a part, to improve your products or processes, you need to start by building out your data, both in terms of quality and quantity. Without data, there's no data science.

Figure 2-1. The data science hierarchy of needs (Monica Rogati, 20177)

The debate isn't about whether finite data is necessary, but whether it's sufficient. The term finite here is important, because if we had infinite data, we could just look up the answer. Having a lot of data is different from having infinite data.

Regardless of which camp will prove to be right eventually, no one can deny that data is essential, for now. Both research and industry trends in recent decades show that the success of machine learning relies more and more on the quality and quantity of data. Models are getting bigger and using more data. Back in 2013, people were getting excited when the One Billion Words Benchmark for Language Modeling was released, which contains 0.8 billion tokens8. Six years later, OpenAI's GPT-2 used a dataset of 10 billion tokens. And another year later, GPT-3 used 500 billion tokens.

Table 2-1. The size of the datasets used for language models over time
Dataset           Year    Tokens (M)
Penn Treebank     1993    1
Text8             2011    17
One Billion       2013    800
BookCorpus        2015    985
GPT-2 (OpenAI)    2019    10,000
GPT-3 (OpenAI)    2020    500,000
Figure 2-2. The size of the datasets used for language models over time (log scale)

Even though much of the progress in deep learning in the last decade was fueled by an increasingly large amount of data, more data doesn’t always lead to better performance for your model. More data at lower quality might even hurt your model’s performance.

Data Sources

An ML system works with data from many different sources. They have different characteristics with different access patterns, can be used for different purposes, and require different processing methods. Understanding the sources your data comes from can help you use your data more efficiently. This section aims to give a quick overview of different data sources to those unfamiliar with data in production. If you’ve already worked with ML in production for a while, feel free to skip this section.

One source is user input data, data explicitly input by users, which is often the input on which ML models make predictions. User input can be texts, images, videos, uploaded files, etc. If there is a wrong way for humans to input data, humans are going to do it, and as a result, user input data can easily be malformatted. If user input is supposed to be text, it might be too long or too short. If it's supposed to be numerical, users might accidentally enter text. If you expect users to upload files, they might upload files in the wrong formats. User input data requires more heavy-duty checking and processing. Users also have little patience. In most cases, when we input data, we expect to get results back immediately. Therefore, user input data tends to require fast processing.
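
As a rough illustration, here is a minimal sketch of the kind of heavy-duty checking user input might need before it reaches a model. The field names and limits are hypothetical, not from any particular system.

# A minimal sketch of validating user input before it reaches an ML model.
# The field names and limits here are hypothetical.
def validate_user_input(payload: dict) -> dict:
    """Return a cleaned copy of the payload or raise ValueError."""
    cleaned = {}

    text = payload.get("review_text", "")
    if not isinstance(text, str) or not (1 <= len(text) <= 5_000):
        raise ValueError("review_text must be a string of 1-5,000 characters")
    cleaned["review_text"] = text.strip()

    age = payload.get("age")
    try:
        age = int(age)
    except (TypeError, ValueError):
        raise ValueError("age must be a number")
    if not (0 <= age <= 120):
        raise ValueError("age out of range")
    cleaned["age"] = age

    return cleaned

# Example: a user typed text into a numeric field.
try:
    validate_user_input({"review_text": "Great ride!", "age": "twelve"})
except ValueError as e:
    print(e)  # age must be a number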

Another source is system-generated data. This is the data generated by different components of your systems, which include various types of logs and system outputs such as model predictions.

Logs can record the state of the system and significant events in the system, such as memory usage, number of instances, services called, packages used, etc. They can also record the results of different jobs, including large batch jobs for data processing and model training. These types of logs provide visibility into how the system is doing, and the main purpose of this visibility is debugging and possibly improving the application. Most of the time you don't have to look at this type of log, but logs are essential when something is on fire.
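
As a sketch of what this kind of logging might look like, here's a small example using Python's standard logging module; the job name and metric values are made up.

# A minimal sketch of system-generated logs, using Python's logging module.
# The job name and metric values here are invented for illustration.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("batch_trainer")

logger.info("training job started, num_instances=4, package=torch")
logger.info("memory_usage_mb=10240, gpu_util=0.87")
logger.warning("checkpoint upload retried, attempt=2")
logger.error("training job failed: out of memory")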

Because logs are system generated, they are much less likely to be malformatted the way user input data is. Most logs don't need to be processed as fast as user input data. However, you might still want to process your logs quickly to be notified whenever something interesting happens9.

Because debugging ML systems is hard, it's a common practice to log everything you can. This means that your volume of logs can grow very, very quickly. This leads to two problems. The first is that it can be hard to know where to look because signals are lost in the noise. There are many services that process and analyze logs, such as Logstash, DataDog, Logz, etc. Logs are used to monitor ML systems, but ML is also used to process and analyze logs. In some use cases, logs are used as data to train ML systems to optimize systems' resource usage.

The second problem is how to store a rapidly growing amount of logs. Luckily, in most cases, you only have to store logs for as long as they are useful, and can discard them when they are no longer relevant for you to debug your current system. If you don’t have to access your logs frequently, they can also be stored in low-access storage that costs much less than higher-frequency-access storage.

Systems also generate data to record users' behaviors, such as clicking, choosing a suggestion, scrolling, zooming, ignoring a popup, or spending an unusual amount of time on certain pages. Even though this is system-generated data, it's still considered part of user data10 and might be subject to privacy regulations. This kind of data can also be used for ML systems to make predictions and to train their future versions.

There are also internal databases, generated by various services and enterprise applications in a company. These databases manage company assets such as inventory, customer relationships, users, and more. This kind of data can be used by ML models directly or by various components of an ML system. For example, when users enter a search query on Airbnb, one or more ML models will return a list of properties that match that query, then Airbnb will need to check its internal databases for the ratings of these properties to rank the results.

Then there's the wonderfully weird world of third-party data that, to many, is riddled with privacy concerns. First-party data is the data that your company already collects about your users or customers. Second-party data is the data collected by another company on their own customers. Third-party data companies collect data on the public who aren't their customers.

The rise of the Internet and smartphones has made it much easier for all types of data to be collected. It’s especially easy with smartphones since each phone has a Mobile Advertiser ID, which acts as a unique ID to aggregate all activities on a phone. Data from apps, websites, check-in services, etc. are collected and (hopefully) anonymized to generate activity history for each person.

You can buy all types of data, such as social media activities, purchase history, web browsing habits, car rentals, and political leanings, for demographic groups as granular as men, aged 25-34, working in tech, living in the Bay Area. From this data, you can infer information such as people who like brand A also like brand B. This data can be especially helpful for systems such as recommendation systems to generate results relevant to users' interests. Third-party data is usually sold as structured data after being cleaned and processed by vendors.

Data Formats

Once you have data, you might want to store it (or "persist" it, in technical terms). Since your data comes from multiple sources with different access patterns, storing it isn't always straightforward and can be costly. Some of the questions you might want to consider are: How do I store multimodal data, e.g., when each sample might contain both images and text? Where do I store my data so that it's cheap and still fast to access? How do I store complex models so that they can be loaded and run correctly on different hardware?

The process of converting a data structure or object state into a format that can be stored or transmitted and reconstructed later is data serialization. There are many, many data serialization formats. Table 2-2 lists just a few of the common formats that you might work with. For a more comprehensive list, check out the wonderful Wikipedia page Comparison of data-serialization formats.

Table 2-2. Common data formats and where they are used
Format      Binary/Text         Human-readable?    Example use cases
JSON        Text                Yes                Everywhere
CSV         Text                Yes                Everywhere
Parquet     Binary              No                 Hadoop, Amazon Redshift
Avro        Primarily binary    No                 Hadoop
Protobuf    Primarily binary    No                 Google, TensorFlow (TFRecord)
Pickle      Text, binary        No                 Python, PyTorch serialization

We’ll go over a few of these formats, starting with JSON.

JSON

JSON, JavaScript Object Notation, is everywhere. Even though it was derived from JavaScript, it’s language-independent—most modern programming languages can generate and parse JSON. It’s human-readable. Its key-value pair paradigm is simple but powerful, capable of handling data of different levels of structuredness. For example, your data can be stored in a structured format like the following.

{
  "firstName": "Boatie",
  "lastName": "McBoatFace",
  "isVibing": true,
  "age": 12,
  "address": {
    "streetAddress": "12 Ocean Drive",
    "city": "Port Royal",
    "postalCode": "10021-3100"
  }
}

The same data can also be stored in an unstructured blob of text like the following.

{
  "text": "Boatie McBoatFace, aged 12, is vibing, at 12 Ocean Drive, Port Royal, 10021-3100"
}
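
Either record can be generated and parsed with the JSON library of most languages. Here's a quick sketch using Python's built-in json module on a slightly trimmed version of the structured record above.

# A sketch of parsing and re-serializing JSON with Python's built-in json module.
import json

record = json.loads("""
{
  "firstName": "Boatie",
  "lastName": "McBoatFace",
  "isVibing": true,
  "age": 12,
  "address": {"city": "Port Royal", "postalCode": "10021-3100"}
}
""")

print(record["address"]["city"])      # Port Royal
print(json.dumps(record, indent=2))   # serialize back to a JSON string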

Row-major vs. Column-major Format

The two formats that are common and represent two distinct paradigms are CSV and Parquet. CSV is row-major, which means consecutive elements in a row are stored next to each other in memory. Parquet is column-major, which means consecutive elements in a column are stored next to each other.

Because modern computers process sequential data more efficiently than nonsequential data, accessing a row-major table by rows is expected to be faster than accessing it by columns, and accessing a column-major table by columns is expected to be faster than accessing it by rows.

Imagine we have a dataset of 1000 examples, each example has 10 features. If we consider each example as a row and each feature as a column, then the row-major formats like CSV are better for accessing examples, e.g. accessing all the examples collected last week. Column-major formats like Parquet are better for accessing features, e.g. accessing the timestamps of all your examples.
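
To make the two layouts concrete, NumPy lets you choose the memory order of an array explicitly, so you can see both paradigms side by side ('C' order is row-major, 'F' order is column-major). This illustration is mine, not from the formats themselves.

# A small sketch of row-major vs. column-major layout using NumPy.
import numpy as np

data = np.arange(10_000 * 10).reshape(10_000, 10)

row_major = np.asarray(data, order="C")   # rows are contiguous in memory
col_major = np.asarray(data, order="F")   # columns are contiguous in memory

print(row_major.flags["C_CONTIGUOUS"])  # True
print(col_major.flags["F_CONTIGUOUS"])  # True

# Reading one example (a row) touches contiguous memory in the row-major
# array; reading one feature (a column) touches contiguous memory in the
# column-major array.
example = row_major[0, :]   # fast on the row-major layout
feature = col_major[:, 0]   # fast on the column-major layout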

Figure 2-3. Row-major vs. column-major formats

Column-major formats allow flexible data reads, especially if your data is large with thousands, if not millions, of features. Consider ride-sharing transaction data that has 1,000 features, but you only want 4 of them: time, location, distance, and price. With column-major formats, you can read the 4 columns corresponding to these 4 features directly. However, with row-major formats, if you don't know the sizes of the rows, you will have to read in all columns and then filter down to these 4. Even if you know the sizes of the rows, it can still be slow as you'll have to jump around in memory, unable to take advantage of caching.
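
Here's a hedged sketch of that selective read with pandas, assuming hypothetical rides.parquet and rides.csv files that contain these four columns among many others (reading Parquet requires pyarrow or fastparquet).

# A sketch of reading only a few features out of many.
import pandas as pd

wanted = ["time", "location", "distance", "price"]

# Column-major (Parquet): only the four requested columns are read from disk.
rides = pd.read_parquet("rides.parquet", columns=wanted)

# Row-major (CSV): even with usecols, the file still has to be scanned row by
# row to pull the four fields out of each record.
rides_csv = pd.read_csv("rides.csv", usecols=wanted)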

Row-major formats allow faster data writes. Consider the situation when you have to keep adding new individual examples to your data. For each individual example, it'd be much faster to append it to a file whose data is already in a row-major format.

Overall, row-major formats are better when you have to do a lot of writes, whereas column-major ones are better when you have to do a lot of reads.

Text vs. Binary Format

CSV and JSON are text files, whereas Parquet files are binary files. Text files are files in plain text, which usually means they are human-readable. Binary files, as the name suggests, are files that contain 0s and 1s and are meant to be read or used by programs that know how to interpret the raw bytes. A program has to know exactly how the data inside the binary file is laid out to make use of the file. If you open text files in your text editor (e.g. VSCode, Notepad), you'll be able to read the text in them. If you open a binary file in your text editor, you'll see blocks of numbers, likely in hexadecimal values, corresponding to the bytes of the file.

Binary files are more compact. Here's a simple example to show how binary files can save space compared to text files. Consider you want to store the number 1000000. If you store it in a text file, it'll require 7 characters, and if each character is 1 byte, it'll require 7 bytes. If you store it in a binary file as an int32, it'll take only 32 bits, or 4 bytes.
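
You can check this arithmetic directly in Python with the struct module.

# The number 1,000,000 as text vs. as a 32-bit integer.
import struct

as_text = str(1_000_000).encode("ascii")    # b'1000000'
as_int32 = struct.pack("<i", 1_000_000)     # 4 raw bytes, little-endian

print(len(as_text))   # 7 bytes
print(len(as_int32))  # 4 bytes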

As an illustration, I use interviews.csv, which is a CSV file (text format) of 17,654 rows and 10 columns. When I converted it to a binary format (Parquet), the file size went from 14MB to 6MB.
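
A sketch of that conversion with pandas follows; the exact sizes you get will depend on your data and on the Parquet compression settings, and pyarrow or fastparquet must be installed.

# Convert a CSV file to Parquet and compare file sizes on disk.
import os
import pandas as pd

df = pd.read_csv("interviews.csv")
df.to_parquet("interviews.parquet")

print(os.path.getsize("interviews.csv") / 1e6, "MB")
print(os.path.getsize("interviews.parquet") / 1e6, "MB")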

AWS recommends using the Parquet format because “the Parquet format is up to 2x faster to unload and consumes up to 6x less storage in Amazon S3, compared to text formats.”12

Data Processing

In this section, we will cover the basics of data processing, starting with two major types of processing: transaction processing and analytical processing, and their uses. We will then cover the basics of the ETL process that you will inevitably encounter when building an ML system in production. When dealing with a large amount of data, a question that often comes up is whether you want to store your data as structured or unstructured. In the last part of this section, we will discuss the pros and cons of both formats.

OLTP vs. OLAP

Systems in production generate data. To process online data, there are two types of processing: OnLine Transaction Processing (OLTP) and OnLine Analytical Processing (OLAP). Even though their acronyms look similar, they have distinct characteristics and distinct underlying architectures. This section gives a shallow overview of OLTP and OLAP. If you're already familiar with them, feel free to skip it.

Imagine you're running a consumer application that generates many short transactions within a short amount of time, such as food ordering, online shopping, ridesharing, and money transfers. You want to process and store these transactions as they are generated. They need to be processed fast, on the order of milliseconds. The processing method needs to have extremely high availability because, without a way to record transactions, you won't be able to serve your users. On top of that, the processing needs to satisfy the ACID (Atomicity, Consistency, Isolation, Durability) requirements (a minimal sketch of atomicity follows the list):

  • Atomicity: to guarantee that all the steps in a transaction are completed successfully as a group. If any step in the transaction fails, all other steps must fail also. For example, if a user's payment fails, you don't want to still assign a driver to that user.

  • Consistency: to guarantee that all the transactions coming through must follow predefined rules. For example, a transaction must be made by a valid user.

  • Isolation: to guarantee that two transactions that happen at the same time behave as if they were isolated. Two users accessing the same data won't change it at the same time. For example, you don't want two users to book the same driver at the same time.

  • Durability: to guarantee that once a transaction has been committed, it will remain committed even in the case of a system failure. For example, after you’ve ordered a ride and your phone dies, you still want your ride to come.
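
To make atomicity concrete, here is a minimal sketch using SQLite, which ships with Python; the ride-hailing tables and the simulated payment failure are hypothetical.

# A minimal sketch of atomicity: if the payment step fails, the rollback also
# undoes the driver assignment, so no step is applied on its own.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (user_id INTEGER, amount REAL);
    CREATE TABLE assignments (user_id INTEGER, driver_id INTEGER);
""")

def book_ride(user_id: int, driver_id: int, amount: float) -> None:
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            conn.execute("INSERT INTO assignments VALUES (?, ?)", (user_id, driver_id))
            if amount <= 0:
                raise ValueError("payment failed")
            conn.execute("INSERT INTO payments VALUES (?, ?)", (user_id, amount))
    except ValueError:
        pass  # neither the assignment nor the payment was persisted

book_ride(user_id=1, driver_id=7, amount=0)  # simulated payment failure
print(conn.execute("SELECT COUNT(*) FROM assignments").fetchone()[0])  # 0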

OLTP databases are designed to process online transactions and satisfy all those requirements. Most of the operations they perform will be inserting, deleting, or updating an existing transaction, and each of these operations touches a whole record. This is why most OLTP databases are more row-oriented.

Because OLTP databases are more row-oriented, they aren’t good for questions such as “What’s the average price for all the rides in September in San Francisco?”. This kind of analytical question requires aggregating data in columns across multiple rows of data. OLAP databases are designed for this purpose. They are efficient with queries that allow you to look at data from different viewpoints.
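
As a tiny illustration of such an analytical query, here's the average-price question expressed over a hypothetical in-memory rides table with pandas.

# Aggregate a column across many rows: average ride price in September in SF.
import pandas as pd

rides = pd.DataFrame({
    "city": ["San Francisco", "San Francisco", "New York"],
    "price": [25.0, 31.0, 18.0],
    "pickup_time": pd.to_datetime(["2023-09-01", "2023-09-15", "2023-09-02"]),
})

september_sf = rides[
    (rides["city"] == "San Francisco") & (rides["pickup_time"].dt.month == 9)
]
print(september_sf["price"].mean())  # 28.0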

ETL: Extract, Transform, Load

OLTP databases can be processed and aggregated to generate OLAP databases through a process called ETL (extract, transform, load).

Extract is extracting the data you want from data source(s). Your data will likely come from multiple sources in different formats. Some of them will be corrupted or malformatted. In the extracting phase, you need to validate your data and reject the data that doesn’t meet your requirements. For rejected data, you might have to notify the sources. Since this is the first step of the process, doing it correctly can save you a lot of time downstream.

Transform is the meaty part of the process, where most of the data processing is done. You might want to join data from multiple sources and clean it. You might want to standardize the value ranges (e.g. one data source might use "Male" and "Female" for genders, but another uses "M" and "F" or "1" and "2"). You can apply operations such as transposing, deduplicating, sorting, aggregating, deriving new features, more data validating, etc.

Load is deciding how and how often to load your transformed data into the target destination, which can be a file, a database, or a data warehouse.
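
Here's a minimal ETL sketch under hypothetical file names and columns: it extracts user records from two CSV sources, transforms them (standardizing the gender encoding from the example above and dropping duplicates), and loads the result into a Parquet file.

# A minimal ETL sketch; file names and column names are hypothetical.
import pandas as pd

# Extract: read raw data and reject rows that fail a basic check.
users_a = pd.read_csv("source_a_users.csv")   # uses "Male"/"Female"
users_b = pd.read_csv("source_b_users.csv")   # uses "M"/"F"
users_a = users_a[users_a["age"].between(0, 120)]
users_b = users_b[users_b["age"].between(0, 120)]

# Transform: join the sources and standardize value ranges.
users = pd.concat([users_a, users_b], ignore_index=True)
users["gender"] = users["gender"].replace({"Male": "M", "Female": "F"})
users = users.drop_duplicates(subset=["user_id"])

# Load: write the transformed data to the target destination.
users.to_parquet("users_clean.parquet")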

The idea of ETL sounds simple but powerful, and it’s the underlying structure of the data layer at many organizations.

Figure 2-4. An overview of the ETL process

Structured vs. Unstructured Data

Structured data is data that follows a predefined data model, also known as a data schema. For example, the data model might specify that each data item consists of two values: the first value, "name", is a string of at most 50 characters, and the second value, "age", is an 8-bit integer in the range 0 to 200. The predefined structure makes your data easier to analyze. If you want to know the average age of people in the database, all you have to do is extract all the age values and compute their mean.
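
As an illustration of that data model (my sketch, not a prescribed schema), here it is expressed as a Python dataclass with the two constraints enforced, along with the average-age computation.

# A sketch of the data model above: name (<= 50 chars), age (8-bit, 0-200).
from dataclasses import dataclass

@dataclass
class Person:
    name: str   # at most 50 characters
    age: int    # fits in 8 bits, restricted to 0-200

    def __post_init__(self):
        if len(self.name) > 50:
            raise ValueError("name must be at most 50 characters")
        if not (0 <= self.age <= 200):
            raise ValueError("age must be between 0 and 200")

people = [Person("Lisa", 43), Person("Jack", 23)]
print(sum(p.age for p in people) / len(people))  # average age: 33.0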

The disadvantage of structured data is that you have to commit your data to a predefined schema. If your schema changes, you'll have to retrospectively update all your data, and/or the changes will cause mysterious bugs. For example, if you've never kept your users' email addresses before but now you do, you have to retrospectively add email information for all previous users. One of the strangest bugs a colleague of mine encountered was when their team could no longer use users' ages with their transactions: their data schema replaced all the null ages with 0, and their ML model thought the transactions were made by people who were 0 years old.

Because business requirements change over time, committing to a predefined data schema can become too restricting. Or you might have data from multiple data sources, many of which are beyond your control, and it's impossible to make them follow the same schema. This is where unstructured data becomes appealing. Unstructured data is data that doesn't adhere to a predefined data schema. It's usually text but can also be numbers, dates, etc. For example, a text file of logs generated by your ML model is unstructured data.

Even though unstructured data doesn't adhere to a schema, it might still contain intrinsic patterns that help you extract structure. For example, the following text is unstructured, but you can notice the pattern that each line contains two values separated by a comma: the first value is textual and the second is numerical. However, there is no guarantee that all lines must follow this format. You can add a new line to the text even if that line doesn't follow it.

“Lisa, 43
Jack, 23
Nguyen, 59”
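
A small sketch of extracting that structure while tolerating lines that break the pattern (the extra, non-conforming line here is mine):

# Parse lines of "name, number"; keep aside any line that doesn't fit.
raw = """Lisa, 43
Jack, 23
Nguyen, 59
this line does not follow the format"""

records, leftovers = [], []
for line in raw.splitlines():
    parts = [p.strip() for p in line.split(",")]
    if len(parts) == 2 and parts[1].isdigit():
        records.append({"name": parts[0], "age": int(parts[1])})
    else:
        leftovers.append(line)

print(records)    # [{'name': 'Lisa', 'age': 43}, ...]
print(leftovers)  # ['this line does not follow the format']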

Unstructured data also allows for more flexible storage options. For example, if your storage follows a schema, you can only store data following that schema. But if your storage doesn't follow a schema, you can store any type of data. You can convert all your data, regardless of type and format, into bytestrings and store them together.

A repository for storing structured data is called a data warehouse. A repository for storing unstructured data is called a data lake. Data lakes are usually used to store raw data before processing. Data warehouses are used to store data that have been processed into formats ready to be used.

ETL to ELT

When the Internet first became ubiquitous and hardware had just become so much more powerful, collecting data suddenly became so much easier. The amount of data grew rapidly. Not only that, but the nature of data also changed. The number of data sources expanded, and data schemas evolved.

Finding it difficult to keep data structured, some companies had this idea: “Why not just store all data in a data lake so we don’t have to deal with schema changes? Whichever application needs data can just pull out raw data from there and process it.” This process of loading data into storage first then processing it later is sometimes called ELT (extract, load, transform). This paradigm allows for the fast arrival of data since there’s little processing needed before data is stored.

However, as data keeps on growing, this idea becomes less attractive. It’s expensive to store everything, and it’s inefficient to search through a massive amount of raw data for the piece of data that you want. At the same time, as companies switch to running applications on the cloud and infrastructures become standardized, data structures also become standardized. Committing data to a predefined schema becomes more feasible.

Here is a summary of the key differences between structured and unstructured data.

Table 2-3. The key differences between structured and unstructured data
Structured data                                 Unstructured data
Schema clearly defined                          Data doesn't have to follow a schema
Easy to search and analyze                      Fast arrival
Can only handle data with a specific schema     Can handle data from any source
Schema changes will cause a lot of trouble      No need to worry about schema changes
Stored in data warehouses                       Stored in data lakes

Summary

In this chapter, we started with the question about the role of data in building intelligent systems. There are still many people who believe that having intelligent algorithms will eventually trump having a large amount of data. However, the success of systems such as AlexNet, BERT, and GPT shows that the progress of ML in the last decade has relied on having access to a large amount of data.

Therefore, it's important for ML practitioners to know how to manage and process a large amount of data. This chapter covered the fundamentals of data engineering that I wish I knew when I started my ML career, from handling data from different sources and choosing the right data format to processing structured and unstructured data. These fundamentals will hopefully help readers become better prepared when facing seemingly overwhelming data in production.

1 More data usually beats better algorithms (Anand Rajaraman, Datawocky 2008)

2 The Bitter Lesson (Richard Sutton, 2019)

3 Tweet by Dr. Judea Pearl (2020)

4 Deep Learning and Innate Priors (Chris Manning vs. Yann LeCun debate).

5 The Bitter Lesson (Richard Sutton, 2019)

6 The Unreasonable Effectiveness of Data (Alon Halevy, Peter Norvig, and Fernando Pereira, Google 2009)

7 The AI Hierarchy of Needs (Monica Rogati, 2017)

8 1 Billion Word Language Model Benchmark (Chelba et al., 2013)

9 “Interesting” in production usually means catastrophic, such as a crash or when your cloud bill hits an astronomical amount.

10 An ML engineer once mentioned to me that his team only used users’ historical product browsing and purchases to make recommendations on what they might like to see next. I responded: “So you don’t use personal data at all?” He looked at me, confused. “If you meant demographic data like users’ age, location then no, we don’t. But I’d say that a person’s browsing and purchasing activities are extremely personal.”

11 For more Pandas quirks, check out just-pandas-things (Chip Huyen, GitHub 2020).

12 Announcing Amazon Redshift data lake export: share data in Apache Parquet format (Amazon AWS 2019).
