Perform fast interactive analytics against different data sources using the Trino high-performance distributed SQL query engine. With this practical guide, you'll learn how to conduct analytics on data where it lives, whether it's Hive, Cassandra, a relational database, or a proprietary data store. Analysts, software engineers, and production engineers will learn how to manage, use, and even develop with Trino.

Initially developed by Facebook, open source Trino is now used by Amazon, Google, LinkedIn, Lyft, Netflix, Pinterest, Salesforce, Shopify, and many other companies. Matt Fuller, Manfred Moser, and Martin Traverso show you how a single Trino query can combine data from multiple sources to allow for analytics across your entire organization.

  • Get started: Explore Trino's use cases and learn about tools that will help you connect to Trino and query data
  • Go deeper: Learn Trino's internal workings, including how to connect to and query data sources with support for SQL statements, operators, functions, and more
  • Put Trino in production: Secure Trino, monitor workloads, tune queries, and connect more applications; learn how other organizations apply Trino

Table of Contents

  1. Foreword
  2. Preface
    1. About the Book
    2. Conventions Used in This Book
    3. Code Examples, Permissions, and Attribution
    4. O’Reilly Online Learning
    5. How to Contact Us
    6. Acknowledgments
  3. I. Getting Started with Trino
  4. 1. Introducing Trino
    1. The Problems with Big Data
    2. Trino to the Rescue
    3. Designed for Performance and Scale
    4. SQL-on-Anything
    5. Separation of Data Storage and Query Compute Resources
    6. Trino Use Cases
    7. One SQL Analytics Access Point
    8. Access Point to Data Warehouse and Source Systems
    9. Provide SQL-Based Access to Anything
    10. Federated Queries
    11. Semantic Layer for a Virtual Data Warehouse
    12. Data Lake Query Engine
    13. SQL Conversions and ETL
    14. Better Insights Due to Faster Response Times
    15. Big Data, Machine Learning, and Artificial Intelligence
    16. Other Use Cases
    17. Trino Resources
    18. Website
    19. Documentation
    20. Community Chat
    21. Source Code, License, and Version
    22. Contributing
    23. Book Repository
    24. Iris Data Set
    25. Flight Data Set
    26. A Brief History of Trino
    27. Conclusion
  5. 2. Installing and Configuring Trino
    1. Trying Trino with the Docker Container
    2. Installing from Archive File
    3. Java Virtual Machine
    4. Python
    5. Installation
    6. Configuration
    7. Adding a Data Source
    8. Running Trino
    9. Conclusion
  6. 3. Using Trino
    1. Trino Command-Line Interface
    2. Getting Started
    3. Pagination
    4. History
    5. Additional Diagnostics
    6. Executing Queries
    7. Output Formats
    8. Ignoring Errors
    9. Trino JDBC Driver
    10. Downloading and Registering the Driver
    11. Establishing a Connection to Trino
    12. Trino and ODBC
    13. Client Libraries
    14. Trino Web UI
    15. SQL with Trino
    16. Concepts
    17. First Examples
    18. Conclusion
  7. II. Diving Deeper into Trino
  8. 4. Trino Architecture
    1. Coordinator and Workers in a Cluster
    2. Coordinator
    3. Discovery Service
    4. Workers
    5. Connector-Based Architecture
    6. Catalogs, Schemas, and Tables
    7. Query Execution Model
    8. Query Planning
    9. Parsing and Analysis
    10. Initial Query Planning
    11. Optimization Rules
    12. Predicate Pushdown
    13. Cross Join Elimination
    14. TopN
    15. Partial Aggregations
    16. Implementation Rules
    17. Lateral Join Decorrelation
    18. Semi-Join (IN) Decorrelation
    19. Cost-Based Optimizer
    20. The Cost Concept
    21. Cost of the Join
    22. Table Statistics
    23. Filter Statistics
    24. Table Statistics for Partitioned Tables
    25. Join Enumeration
    26. Broadcast Versus Distributed Joins
    27. Working with Table Statistics
    28. Trino ANALYZE
    29. Gathering Statistics When Writing to Disk
    30. Hive ANALYZE
    31. Displaying Table Statistics
    32. Conclusion
  9. 5. Production-Ready Deployment
    1. Configuration Details
    2. Server Configuration
    3. Logging
    4. Node Configuration
    5. JVM Configuration
    6. Launcher
    7. Cluster Installation
    8. RPM Installation
    9. Installation Directory Structure
    10. Configuration
    11. Uninstall Trino
    12. Installation in the Cloud
    13. Cluster Sizing Considerations
    14. Conclusion
  10. 6. Connectors
    1. Configuration
    2. RDBMS Connector Example PostgreSQL
    3. Query Pushdown
    4. Parallelism and Concurrency
    5. Other RDBMS Connectors
    6. Security
    7. Trino TPC-H and TPC-DS Connectors
    8. Hive Connector for Distributed Storage Data Sources
    9. Apache Hadoop and Hive
    10. Hive Connector
    11. Hive-Style Table Format
    12. Managed and External Tables
    13. Partitioned Data
    14. Loading Data
    15. File Formats and Compression
    16. MinIO Example
    17. Non-Relational Data Sources
    18. Trino JMX Connector
    19. Black Hole Connector
    20. Memory Connector
    21. Other Connectors
    22. Conclusion
  11. 7. Advanced Connector Examples
    1. Connecting to HBase with Phoenix
    2. Key-Value Store Connector Example: Accumulo
    3. Using the Trino Accumulo Connector
    4. Predicate Pushdown in Accumulo
    5. Apache Cassandra Connector
    6. Streaming System Connector Example: Kafka
    7. Document Store Connector Example: Elasticsearch
    8. Overview
    9. Configuration and Usage
    10. Query Processing
    11. Full-Text Search
    12. Summary
    13. Query Federation in Trino
    14. Extract, Transform, Load and Federated Queries
    15. Conclusion
  12. 8. Using SQL in Trino
    1. Trino Statements
    2. Trino System Tables
    3. Catalogs
    4. Schemas
    5. Information Schema
    6. Tables
    7. Table and Column Properties
    8. Copying an Existing Table
    9. Creating a New Table from Query Results
    10. Modifying a Table
    11. Deleting a Table
    12. Table Limitations from Connectors
    13. Views
    14. Session Information and Configuration
    15. Data Types
    16. Collection Data Types
    17. Temporal Data Types
    18. Type Casting
    19. SELECT Statement Basics
    20. WHERE Clause
    21. GROUP BY and HAVING Clauses
    22. ORDER BY and LIMIT Clauses
    23. JOIN Statements
    24. UNION, INTERSECT, and EXCEPT Clauses
    25. Grouping Operations
    26. WITH Clause
    27. Subqueries
    28. Scalar Subquery
    29. EXISTS Subquery
    30. Quantified Subquery
    31. Deleting Data from a Table
    32. Conclusion
  13. 9. Advanced SQL
    1. Functions and Operators Introduction
    2. Scalar Functions and Operators
    3. Boolean Operators
    4. Logical Operators
    5. Range Selection with the BETWEEN Statement
    6. Value Detection with IS (NOT) NULL
    7. Mathematical Functions and Operators
    8. Trigonometric Functions
    9. Constant and Random Functions
    10. String Functions and Operators
    11. Strings and Maps
    12. Unicode
    13. Regular Expressions
    14. Unnesting Complex Data Types
    15. JSON Functions
    16. Date and Time Functions and Operators
    17. Histograms
    18. Aggregate Functions
    19. Map Aggregate Functions
    20. Approximate Aggregate Functions
    21. Window Functions
    22. Lambda Expressions
    23. Geospatial Functions
    24. Prepared Statements
    25. Conclusion
  14. III. Trino in Real-World Uses
  15. 10. Security
    1. Authentication
    2. Password and LDAP Authentication
    3. Authorization
    4. System Access Control
    5. Connector Access Control
    6. Encryption
    7. Encrypting Trino Client-to-Coordinator Communication
    8. Creating Java Keystores and Java Truststores
    9. Encrypting Communication Within the Trino Cluster
    10. Certificate Authority Versus Self-Signed Certificates
    11. Certificate Authentication
    12. Kerberos
    13. Prerequisites
    14. Kerberos Client Authentication
    15. Cluster Internal Kerberos
    16. Data Source Access and Configuration for Security
    17. Kerberos Authentication with the Hive Connector
    18. Hive Metastore Thrift Service Authentication
    19. HDFS Authentication
    20. Cluster Separation
    21. Conclusion
  16. 11. Integrating Trino with Other Tools
    1. Queries, Visualizations, and More with Apache Superset
    2. Performance Improvements with RubiX
    3. Workflows with Apache Airflow
    4. Embedded Trino Example: Amazon Athena
    5. Starburst Enterprise
    6. Other Integration Examples
    7. Custom Integrations
    8. Conclusion
  17. 12. Trino in Production
    1. Monitoring with the Trino Web UI
    2. Cluster-Level Details
    3. Query List
    4. Query Details View
    5. Tuning Trino SQL Queries
    6. Memory Management
    7. Task Concurrency
    8. Worker Scheduling
    9. Scheduling Splits per Task and per Node
    10. Local Scheduling
    11. Network Data Exchange
    12. Concurrency
    13. Buffer Sizes
    14. Tuning Java Virtual Machine
    15. Resource Groups
    16. Resource Group Definition
    17. Scheduling Policy
    18. Selector Rules Definition
    19. Conclusion
  18. 13. Real-World Examples
    1. Deployment and Runtime Platforms
    2. Cluster Sizing
    3. Hadoop/Hive Migration Use Case
    4. Other Data Sources
    5. Users and Traffic
    6. Conclusion
  19. 14. Conclusion
  20. Index