Chapter 1. Introducing Presto

Every company, whether it’s a startup or a large established business, depends on more data than ever before. As a valuable asset in the company, we collect it, produce it, combine it, and analyze it. We’re now past the big data era. There’s no longer a need for adjectives--it’s just a lot of data. In addition to the growing amount, data is more varied than ever before, with different data types ranging from structured tables to unstructured objects and files. With cloud and edge computing becoming the norm, data is being created in more places. Along with needing to store large amounts of data in many systems comes the need for scalable tools that can query that data quickly, using a common language like Structured Query Language (SQL).

Along with having a lot of data in a myriad of forms from a broad spectrum of locations, we are now expected to use it to become data-driven. By exploring and making sense of all the data with SQL, you can extract valuable insights to make better decisions. Backed by data, companies can innovate new ways of interacting with their customers or create new products to better meet their customers’ needs. As a reader of this book, you’ve probably heard that Presto is a fast, flexible distributed SQL engine created and used by Facebook at scale. In this book, you’ll learn about Presto and how to operate Presto to solve your needs for fast insights on data of any type and any size. In this chapter, you’ll learn what Presto is, where it came from, and how it is different from other data warehousing solutions. You’ll hear about who uses Presto and the ways they solve data challenges with Presto.

Presto Origins

Perhaps when you were younger, “Presto” was what you said when you amazed your family and friends with a magic trick. Today, if you’re a developer, data platform engineer, data analyst, or data scientist, Presto is better known as the powerful query engine that helps derive insights from data.

In 2012, Facebook needed a way to give end users access to enormous data sets for ad hoc analysis. At the time, Facebook was using Apache Hive1 to perform this kind of analysis. As Facebook’s data sets grew, Hive was found to not be as interactive (read: too slow) as desired. This is largely because the foundation of Hive is MapReduce, which, at the time, required intermediate data sets to be persisted to disk, incurring a lot of disk I/O for transient, intermediate result sets. So Facebook developed Presto, a new distributed SQL query engine, designed as an in-memory engine that does not need to persist intermediate result sets within a single query. This approach led to a query engine that processed the same queries orders of magnitude faster, with many completing in sub-second latency. End users such as engineers, product managers, and data analysts found they could interactively query fractions of large data sets to test hypotheses and create visualizations.

In 2013, Facebook open sourced Presto, opening up the “prestodb” GitHub repository under the permissive Apache 2.0 license. Early adopters and collaborators included large San Francisco Bay Area companies like Airbnb, Uber, and Netflix. In 2015, Netflix showed that Presto was, in fact, 10 times faster than Hive--and even faster in some cases. Their cluster supported many dozens of concurrently running queries with varying resource requirements. Plotting their real-world production queries against wall-clock time revealed that half of their queries ran in less than four seconds. The low-latency trend held, with 85% of their queries running in less than a minute. This enabled end users to run ad hoc queries, get answers fast, and perform iterative exploratory data analysis.

Beyond the early users, hundreds of companies have since presented publicly about using Presto in production. As you’ll read further about Presto’s flexibility and extensibility, Presto can support a variety of SQL use cases. Organizations typically start with a Presto cluster for their interactive, ad hoc queries and then add other diverse use cases, which we’ll cover later in this chapter. For example, at Facebook, Presto started with interactive ad hoc analytics, with latencies of less than one second for many hundreds of concurrent queries. Then they added another use case: A/B testing. After that, they found Presto was so efficient that they could apply it to batch and Extract-Transform-Load (ETL) workloads as well.

What Is Presto?

Presto is a distributed SQL query engine written in Java. It takes any query written in SQL, analyzes the query, creates and schedules a query plan across a cluster of worker machines that are connected to data sources, and then returns the query results. The query plan may have a number of execution stages, depending on the query. For example, if your query joins together many large tables, it may need multiple stages that combine and aggregate the intermediate results. After each execution stage there may be intermediate data sets; you can think of those intermediate answers as your scratchpad for a long calculus problem.

In the past, distributed query engines like Hive were designed to persist intermediate results to disk. As Figure 1-1 illustrates, Presto saves time by executing queries in the memory of the worker machines, performing operations on intermediate data sets there instead of persisting them to disk. The data can reside in HDFS, any database, or any data lake, and Presto performs execution in memory across your workers, shuffling data between them as needed. Avoiding the writes to and reads from disk between stages ultimately speeds up query execution.

Figure 1-1. Hive intermediate data sets are persisted to disk. Presto executes tasks in-memory.
Note

If this distributed in-memory model sounds familiar, that’s because Apache Spark uses the same basic concept to effectively replace MapReduce-based technologies. However, Spark and Presto manage stages differently. In Spark, data needs to be fully processed before passing to the next stage. Presto uses a pipeline processing approach and doesn’t need to wait for an entire stage to finish.

Presto was developed with the following design considerations:

  • High performance with in-memory execution

  • High scalability from 1 to 1000s of workers

  • Flexibility to support a wide range of SQL use cases

  • Highly pluggable architecture that makes it easy to extend Presto with custom integrations for security, event listeners, etc.

  • Federation of data sources via Presto connectors

  • Seamless integration with existing SQL systems by adhering to the ANSI SQL standard

Decoupled Storage and Compute for the Data Analytics Stack

A traditional data warehouse has multiple layers required for data processing vertically integrated into it. At a high level, this can be broken down into compute and storage. Data warehouses also control how data is written, where that data resides, and how it is read, as shown in Figure 1-2.

Figure 1-2. Vertically integrated database / data warehouse architecture with compute and storage in one system.

Operational databases and other external data sources are ingested into the data warehouse server’s storage layer, making a second copy that is usually transformed and written in an optimized, but proprietary, format for more efficient analytical processing by the middle tier. One issue with the vertically integrated data warehouse is the non-linear increase in costs as the amounts of ingested data and analytical processing grow. In addition to requiring all data to be ingested into one system, the data warehouse may not be designed to handle every analytical processing requirement. In contrast, today’s disaggregated systems are designed to scale to handle much larger amounts of data and analytic computation. Presto is one of the most popular distributed computing engines for this approach. With it, your data platform team can decouple data storage from data processing, as illustrated in Figure 1-3.

Figure 1-3. Data storage using HDFS separated from compute for systems like Apache Hive.

A full deployment of Presto has a coordinator and multiple workers. Queries are submitted to the coordinator by a client like the command line interface (CLI), a BI tool, or a notebook that supports SQL. The coordinator parses, analyzes and creates the optimal query execution plan using metadata and data distribution information. That plan is then distributed to the workers for processing.

The advantage of this decoupled storage model is that Presto is able to provide a single view of all of your data that has been aggregated into a data storage tier such as the Hadoop Distributed File System (HDFS).

Federation of Multiple Backend Data Sources

Federated database architectures are a concept from the late 1980s and early 1990s. The approach is seeing a resurgence due to the growing amount of data; the proliferation of data serialization formats, data lakes, and specialized databases (e.g., NoSQL, time-series, and metrics stores); and end users wanting to access this data on demand without necessarily ingesting it all into a common warehouse. Federated databases provide a virtual database abstraction over connected heterogeneous data sources, which can be queried for business insights.

Note

The term federation is a mouthful, but it’s increasingly important in our data technology world, in which we deal with many different databases, data lakes, and a vast ecosystem of relational and non-relational open source data systems available either on-premises or in the cloud. The adjective “federated” can be added in front of “query engine” or “query,” but the two terms mean slightly different things:

Federated query engine: A query engine that is mapped to multiple data sources, enabling unified access to those systems either for queries against a single data source at a time or for federated queries spanning multiple data sources (including JOINs across tables from different data sources). Most users today use Presto to unify access to multiple data sources but query them one at a time. Why? Because JOINs across sources require the underlying data to be correlated, and those linkages may or may not exist. However, correlation across data sources is the ultimate prize of federation: finding insights and patterns across multiple sources that otherwise would not be possible.

Federated query: A single query that stores or retrieves data from multiple different data sources, rather than from a single data source.

Presto is a federated query engine that supports pluggable connectors to access data from external data sources and write data to those external data sources -- no matter where they reside. Many data sources are available for integration with Presto.

Figure 1-4. The Presto federated query engine with multiple connectors processing data in a disaggregated stack from multiple data sources

Presto can connect to each data source and push some of the query processing down to it, with a filtered set of data pulled back into the workers for further processing. This can even include correlating multiple data sources together with SQL joins. With this approach, SQL can be used to query not only traditional relational data sources but also non-relational sources like NoSQL databases (MongoDB, Elasticsearch) and data lakes (HDFS, Amazon S3). Data read from non-relational sources is flattened into a tabular form and, once pulled into Presto, can then be correlated with other structured data.

Note

These connections are possible because Presto understands SQL, specifically the ANSI SQL standard. Using one standard language helps the project be consistent over time and avoids the need to rewrite queries when you change a backend data source.

As of this writing, there are over two dozen connectors available, and the number is growing as the community writes more. These connectors cover databases, object stores, data lakes, streaming systems--almost any data system. If you have a data source that doesn’t yet have a connector, you can write one yourself and contribute it back.

Figure 1-5. Examples of external data sources you can query with Presto

How Presto Works

We’ve covered what Presto is, but how does it work? Presto is written in Java, and therefore requires a JDK or JRE to be able to start. It is deployed as two main services: a single coordinator and many workers. The coordinator service is effectively the brain of the operation, receiving query requests from clients, parsing the query, building an execution plan, and then scheduling work to be done across the worker services. Each worker processes a part of the overall query in parallel, and you can add worker services to your Presto deployment to fit your demand. Each data source is configured as a catalog, and you can query as many catalogs as you want in each query.
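For example, every table in Presto is addressed with a three-part catalog.schema.table name, so a single query can freely mix tables from different catalogs. The catalog, schema, and table names in this sketch are hypothetical:

```sql
-- List the catalogs (configured data sources) on this cluster
SHOW CATALOGS;

-- Every table is addressed as catalog.schema.table;
-- 'hive', 'sales', and 'orders' are hypothetical names here.
SELECT COUNT(*) FROM hive.sales.orders;
```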

Figure 1-6. How Presto executes a SQL query

As shown in Figure 1-6, Presto is accessed through a JDBC driver and integrates with practically any tool that can connect to databases using JDBC. The Presto CLI is often the starting point when beginning to explore Presto. Either way, the client connects to the coordinator to issue a SQL query. That query is parsed and validated by the coordinator, and transformed into a query execution plan. This plan details how a query is going to be executed by the Presto workers. The query plan typically begins with one or more table scans in order to pull data out of your external data sources. This will be explained more in the next section.

Users can query a single data source with Presto. This approach can be particularly useful when querying data lakes like HDFS, Amazon S3, Google Cloud Storage, and others via the connector that Presto calls the Hive connector. Presto becomes the separated query engine that uses the metadata from an external catalog (configured via the Hive connector) and processes data stored in the data lake.

This approach can also be useful for running analytics on a non-relational data source like MongoDB or Elasticsearch using Presto SQL via BI and reporting tools. This single-stack approach means Presto can be a fast query engine for any database or data lake.

Of course, as a federated engine, we see many Presto deployments that are connected to multiple data sources. This allows end users to query one data source at a time, using the same interface, without having to switch between systems or think of them as different data systems. For example, you could have a single BI dashboard with five graphs pulling data from five different backends. Federating data sources makes easy work of that task.

One of the most popular connectors is the Hive connector, which allows users to query data using the same metadata you would use to interact with HDFS or Amazon S3. Using this connector, Presto is able to read data from the same schemas and tables using the same data formats--ORC, Avro, Parquet, JSON, and more. Because of this connectivity, Presto can be a drop-in replacement for organizations using Hive or Spark SQL.
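As a sketch, suppose a catalog named hive is configured with the Hive connector over a metastore and an S3 data lake; the schema and table names below are hypothetical:

```sql
-- Browse the schemas and tables the metastore already defines
SHOW SCHEMAS FROM hive;
SHOW TABLES FROM hive.web;

-- Query a Parquet- or ORC-backed table exactly as Hive or Spark SQL would
SELECT page, COUNT(*) AS views
FROM hive.web.page_views
GROUP BY page
ORDER BY views DESC
LIMIT 10;
```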

Note

Presto’s Hive connector would have been better named a ‘data lake connector’. Facebook created Presto as a replacement for Hive, so why would there be a connector to the thing it replaced? There isn’t: the Hive connector refers to the Hive metastore. It enables Presto to query any data lake via a metadata catalog, whether that is a Hive metastore or another catalog like AWS Glue. The metadata catalog integration is a very important aspect of a disaggregated computational engine, because it is what maps files stored in data lakes into databases, tables, and columns, and allows SQL to be used to query those files.

In addition to the Hive connector, you’ll find connectors for MySQL, PostgreSQL, Kafka, Elasticsearch, Cassandra, MongoDB, and many others.

Taking federation a step further, a query can combine data from two or more different data sources. For example, you can join data sets between Kafka and MySQL, between MongoDB and PostgreSQL, between Amazon S3 and Pinot (Presto with Pinot is discussed in detail in Chapter 2), or between Elasticsearch, S3, and MySQL--you get the idea. The benefit is that end users can analyze more data without moving or copying it into a single data source. Even though a NoSQL database like MongoDB doesn’t support SQL, Presto can nonetheless query it via the Presto connector for MongoDB, which exposes MongoDB collections as tables in Presto. Instead of moving the data in MongoDB over to another system and maintaining a data pipeline for changes, you can simply query both systems in place with Presto. So Presto has the advantage of querying across disparate systems that may have different data models.
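A hedged sketch of such a federated query, assuming catalogs named mongodb and postgresql with hypothetical schemas and tables:

```sql
-- Join a MongoDB collection (exposed as a table by the MongoDB
-- connector) with a PostgreSQL table in one query.
-- All catalog, schema, and table names here are hypothetical.
SELECT u.name, SUM(o.total) AS lifetime_spend
FROM mongodb.app.users AS u
JOIN postgresql.public.orders AS o
  ON u.user_id = o.user_id
GROUP BY u.name
ORDER BY lifetime_spend DESC;
```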

Connectors are contributed to the Presto project all the time. If you need one that isn’t currently available, you can write your own connector in Java. If you don’t want to write Java, you can use the Apache Thrift connector and write a Thrift service in Go, Python, JavaScript, or another preferred language. Thrift is a software framework for efficient and reliable communication between distributed services. Thus, you can integrate almost anything with Presto. As a result, Presto has the potential to become the primary way to perform analytics on all your data.

To summarize this section, you can work with data in three different ways; the difference between options 2 and 3 is primarily whether or not JOINs are used across systems:

  1. Presto configured with only one data source gives you fast analytics.

  2. Presto configured with multiple data sources, each queried independently, gives you fast, federated analytics (Figure 1-7, left diagram).

  3. Presto configured with multiple data sources, correlated and queried together, gives you fast, federated, unified analytics (Figure 1-7, right diagram).

Figure 1-7. How federated query engines like Presto allow you to run single data source or multi-data source queries (federated queries).

Presto Query Processing Explained

Once a SQL query is received by the coordinator via one of its interfaces (CLI, JDBC, etc.), the coordinator parses the query into an internal representation called a syntax tree. The analyzer uses this tree to break the query down into consumable parts, producing a logical plan.

Figure 1-8. Syntax tree

The query optimizer takes the logical plan and converts it into a physical plan that includes an efficient execution strategy for the query. Presto has both a rule-based and a cost-based optimizer.

The data sources connected to Presto may vary in type and complexity: they could be simple filesystems, object stores, or highly optimized databases. Presto applies several optimization rules, including well-known techniques such as predicate and limit pushdown, column pruning, and decorrelation. It can make intelligent choices about how much of the query processing to push down into the data sources, depending on each source’s abilities. For example, some data sources can evaluate predicates, aggregations, and functions; by pushing these operations closer to the data, Presto minimizes disk I/O and network transfer and achieves significantly improved performance. The remainder of the query, e.g., joining data across different data sources, is processed by Presto on its workers, and the final result set is transferred to the client via the coordinator.
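For instance, EXPLAIN lets you inspect how Presto plans a query, including what it intends to push down to a connector; the catalog and table names below are hypothetical:

```sql
-- With a connector that supports predicate pushdown, this WHERE clause
-- can be evaluated inside the source database, so only matching rows
-- cross the network into the Presto workers.
SELECT customer_id, total
FROM mysql.shop.orders
WHERE order_date = DATE '2021-06-01';

-- EXPLAIN prints the query plan without running the query,
-- showing which operations land in which stage.
EXPLAIN
SELECT customer_id, total
FROM mysql.shop.orders
WHERE order_date = DATE '2021-06-01';
```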

In addition, the internal engine is columnar, which matches the popularity of columnar storage formats like ORC and Parquet, both of which Presto supports. Presto builds columnar blocks and the engine processes those columnar blocks: columnar execution on top of columnar storage. Columnar storage formats store the data of a single column together (contiguously), and with both the storage format and the execution mechanism oriented toward a columnar approach, analytic workloads run faster on Presto. Different parts of query processing, like scans and aggregations, can be executed faster when columns are grouped together: query operators can loop over all the column values quickly, potentially loading many of them into the CPU with a single memory reference.

Presto Operations

Presto is a distributed system that forms the compute tier and interfaces with both the data layer and the frontend tooling. The operations of such a varied data system can be challenging. Operations of Presto include the configuration, administration, security, tuning, and troubleshooting of clusters. More specific to the cluster itself, we’ll discuss high-availability options for the single coordinator node, sizing and scaling concerns, connecting new data sources, and scaling concurrency. Then there is the manageability side: maintenance, updates, and fault tolerance. These topics are important to understand in order to deploy Presto successfully, and we will cover them in later chapters.

Presto at Scale

The Presto workers we mentioned previously allow Presto to run at any scale. A large infrastructure isn’t a requirement, though. Presto can be used in small settings or for prototyping before tackling a larger data set. Because of its very low latency, there isn’t major overhead for running small queries, as there would be with a batch-oriented system like Hive. You can start by querying megabytes at first, and the same queries can later run on gigabytes, terabytes, or even petabytes as your data size increases over time.

It’s easy to run a demo on a laptop using the Presto sandbox container available on Docker Hub. Once you start, your Presto deployments can scale out horizontally, even to Facebook levels2.

As a distributed worker system, Presto is horizontally scalable, running on commodity servers. If you have a 50-node cluster, you can add another 50 machines and it simply becomes a 100-node cluster with all the benefits therein (higher performance, more concurrency, and so on).

Presto in the Cloud

There are also a number of cloud offerings for running Presto. You can run Presto via Amazon Elastic MapReduce (EMR) and Google Dataproc. Amazon Athena, a serverless, interactive query service for analyzing big data in Amazon S3 using standard SQL, is built on Presto. Other vendors, such as Ahana, offer Presto as a managed service, making it easier to set up and operate multiple Presto clusters for different use cases.

Presto as an Analytics Platform

Presto is the engine for many different types of analytical workloads to address a variety of problems. As we discussed, the disaggregated stack leverages Presto along with one or more data stores, metastores, and SQL tools. Next, we describe some of the popular uses for Presto as an analytics platform.

Ad hoc querying

Presto was originally designed for interactive analytics, or ad hoc querying. In today’s competitive, Internet-era world, with data systems able to collect granular data in near real time, engineers, analysts, data scientists, and product managers all want the ability to quickly analyze their data, becoming data-driven organizations that make superior decisions, innovate quickly, and transform their businesses for the better.

They either type queries by hand or use a range of visualization, dashboarding, and BI tools. Depending on the tools chosen, they can run tens of complex concurrent queries against a Presto cluster. With Presto connectors and in-place execution, platform teams can quickly provide access to the data sets users want. Not only do analysts get access, but they can also run queries in seconds or minutes--instead of hours--with the power of Presto, and they can iterate quickly on innovative hypotheses with interactive exploration of any data set, residing anywhere.

Reporting and dashboarding

Because of its design and architecture and its ability to query across multiple sources, Presto is a great backend for reporting and dashboarding. Unlike first-generation static reporting and dashboarding, today’s interactive reports and dashboards are very different. Analysts, data scientists, product managers, marketers, and other users not only want to look at KPIs, product statistics, telemetry data, and other data, but they also want to drill down into specific areas of interest or areas where opportunity may lie. This requires the backend--the underlying system--to be able to process data fast, wherever it may sit. To support this type of self-service analytics, platform teams would otherwise have to either consolidate data into one system via expensive pipelining approaches or test and support every reporting tool with every database, data lake, and data system their end users want to access. Presto gives data scientists, analysts, and other users the ability to query data across sources on their own, so they’re not dependent on data platform engineers. It also greatly simplifies the task of the data platform engineers by absorbing the integration testing and giving them a single abstraction and endpoint for a range of reporting and dashboarding tools.

ETL using SQL

Analysts can aggregate terabytes of data across multiple data sources and run efficient ETL queries against that data with Presto. Instead of legacy batch processing systems, Presto can be used to run resource-efficient, high-throughput queries. ETL can process all the data in the warehouse; it generates tables that are used for interactive analysis or for feeding various downstream products and systems.
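A common pattern for SQL-based ETL with Presto is a CREATE TABLE AS (CTAS) statement that reads from one catalog and writes an aggregated result to a data lake table in a columnar format. This is a sketch; all catalog, schema, and table names are hypothetical:

```sql
-- Read from an operational database catalog and materialize a daily
-- aggregate into the data lake as an ORC table.
CREATE TABLE hive.analytics.daily_revenue
WITH (format = 'ORC')
AS
SELECT order_date,
       SUM(total) AS revenue,
       COUNT(*)   AS order_count
FROM mysql.shop.orders
GROUP BY order_date;
```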

Presto as an engine is not an end-to-end ETL system, nor is Hive or Spark. Additional tools can easily be added to coordinate and manage the numerous ongoing time-based jobs (a.k.a. cron jobs) that take data from one system and move it into another store, usually in a columnar format. Teams can use a workflow management system like the open source Apache Airflow or Azkaban. These tools automate tasks that would otherwise have to be run manually by a data engineer. Airflow is an open source project that programmatically authors, schedules, and monitors ETL workflows; it was built by Airbnb engineers who were former Facebook employees. Azkaban, another open source project, is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs.

The queries in batch ETL jobs are much more expensive in terms of data volume and CPU than interactive jobs, so the clusters tend to be much bigger. Some companies therefore run separate Presto clusters: one for ETL and another for ad hoc queries. This is operationally advantageous, since both use the same Presto technology and require the same skills. For ETL, the throughput of the entire system matters much more than the latency of any individual query.

Data lake analytics

Data lakes have grown in popularity along with the rise of Amazon S3-compatible object storage. A data lake enables you to store all your structured and unstructured data as-is and run different types of analytics on it.

A data warehouse is optimized to analyze relational data, typically from transactional systems and business applications, where the data is cleaned, enriched, and transformed. A data lake is different from a data warehouse in that it can store all your data--the relational data as well as non-relational data from a variety of sources, such as mobile apps, social media, and time-series data--and you can derive new insights from analyzing that broader data set, without necessarily needing to process the data beforehand.

Presto is used to query data directly on a data lake without the need for transformation. You can query any type of data in your data lake, including both structured and unstructured data. As companies become more data-driven and need to make faster, better-informed decisions, analytics on ever-larger amounts of data has become a business priority.

Real-time analytics with real-time databases

Real-time analytics is becoming increasingly used in conjunction with consumer-facing websites and services. This usually involves combining data that is being captured in real time with historical or archived data. Imagine if an e-commerce site had a history of your activity archived in an object store like S3, but your current session activity is getting written to a real-time database like Apache Pinot. Your current session activity may not make it into S3 for hours until the next snapshot. By using Presto to unite data across both systems, that website could provide you with real-time incentives so you don’t abandon your cart, or it could determine if there’s possible fraud happening earlier and with greater accuracy. In Chapter 2, we’ll show you how to do just that with Presto connected to Pinot, a real-time database.
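A hedged sketch of such a query, assuming a Pinot catalog for live session events and a Hive-connector catalog over S3 for archived history (all catalog, schema, and table names are hypothetical):

```sql
-- Combine today's live activity from Pinot with archived user history
-- in S3 in a single federated query.
SELECT s.user_id,
       COUNT(*)                  AS clicks_today,
       MAX(h.last_purchase_date) AS last_purchase
FROM pinot.default.session_events AS s
LEFT JOIN hive.archive.user_history AS h
  ON s.user_id = h.user_id
WHERE s.event_time >= CURRENT_DATE
GROUP BY s.user_id;
```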

Open Source Community

Presto was released under the permissive open source Apache 2.0 license in 2013; however, it is not an Apache Software Foundation (ASF) project. Presto is instead governed by the Presto Foundation under the auspices of The Linux Foundation, who happen to know a lot about open source projects! It was and continues to be developed in the open on the project’s public GitHub repository, and Facebook continues to contribute its new mainstream features upstream. This means Facebook is running the same version of code that you can download and use yourself for free. You can depend on the fact that Facebook and many other companies are constantly testing and improving Presto, using the collective power of a community of talented individuals working together to deliver quick development and troubleshooting. As an open source project, Presto has a number of offshoots, also known as forks of the codebase. PrestoDB is the main project, and we encourage all new users to use the main upstream PrestoDB project.

As someone learning more about Presto, helping others learn about this open source project can be very satisfying. As you become more familiar with the project, becoming a contributor can be a lot of fun, even if it is quite challenging at the beginning. If you have your own software project, you can make all the decisions and act very quickly, whereas open source tends to involve much more discussion. Keep in mind that the best projects attract the best people: when you contribute, you are getting in touch with some of the most talented engineers and architects out there. While it may be difficult to convince them, when your code is merged it will likely give you a solid sense of satisfaction. Then, of course, there is the reputation that comes with a strong GitHub profile. Your contributions highlight your expertise and can be very beneficial for advancing your career.

It is easy to get involved with the Presto community and become part of the global network of passionate people improving Presto. New learners can start out by fixing documentation or beginner issues. You can join the Presto Foundation Slack channel, engage via GitHub Discussions, attend the virtual meetups, follow @prestodb on Twitter, or request a new feature or file a bug at github.com/prestodb/presto.

Presto Foundation

The Presto Foundation is a non-profit organization that was formed with The Linux Foundation in September 2019 to support the development and community processes of the Presto open source project. As a part of The Linux Foundation, the Presto Foundation ensures the open governance, transparency, and continued success of the project. Presto has experienced steady growth in popularity over the years, with several companies emerging to support that growth. Members of the Presto Foundation provide essential financial support for the collaborative development process, including tooling, infrastructure, and community conferences. Presto Foundation membership also requires membership in The Linux Foundation. The current premier members are Facebook, Uber, Twitter, Alibaba, Alluxio, and Ahana. All are welcome, including non-profits and academic institutions at a special membership rate.

Conclusion

In this chapter we provided an overview of the Presto open source project: where it came from, how it differs from other data systems, how it works, and its common use cases.

On a high level, we saw how Presto was initially developed for interactive, ad hoc analytics requiring low-latency queries, enabling end users to run exploratory analysis on data sets of any size. We looked at the key design considerations that make Presto flexible and suitable for a wide spectrum of use cases: the in-memory architecture; compliance with the ANSI SQL standard, which makes it immediately accessible without additional integration; the federation of data sources; and the coordinator-worker scalability.

We briefly compared Presto with Hive and with a traditional data warehouse. We saw how Presto is part of the disaggregated compute and storage stack, with the storage tier supporting multiple backends via the Presto connector architecture.

We then briefly looked at how Presto handles numerous SQL-based use cases:

  • Interactive, ad hoc querying

  • Reporting and dashboarding

  • ETL

  • Data lake analytics

  • Real-time analytics with real-time databases

As Presto is a fast-growing open source project, you heard about ways to join the community and even gained some perspective on the open governance provided by the non-profit Presto Foundation.

Although this first chapter can’t make you an instant Presto expert, it has hopefully armed you with enough context and a high-level understanding of Presto.

We’ve learned from experience, and that is why this book is focused on both “Learning and Operating Presto.” We hope this book will help you become adept at operating Presto for your organization. So let’s conclude this chapter and start jumping into getting some hands-on experience with Presto!

1 Hive was also originally developed and open sourced by Facebook.

2 When Facebook donated PrestoDB to the Linux Foundation and established the Presto Foundation, Facebook announced the scale Presto was being used: “At Facebook alone, over a thousand employees use Presto, running several million queries and processing petabytes of data per day.”
