Chapter 5. Production-Ready Deployment

Following the installation of Trino from the tar.gz archive in Chapter 2, and your new understanding of the Trino architecture from Chapter 4, you are now ready to learn more about the details of installing a Trino cluster. You can then take that knowledge and work toward a production-ready deployment of a Trino cluster with a coordinator and multiple worker nodes.

Configuration Details

The Trino configuration is managed in multiple files discussed in the following sections. By default, they are all located in the etc directory within the installation directory.

The default location of this folder, as well as of each individual configuration file, can be overridden with parameters passed to the launcher script, discussed in “Launcher”.
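For example, a fully configured etc directory, assembled over the following sections, might contain files such as these, with catalog being a nested directory for the catalog configuration files discussed in Chapter 6:

$ ls etc
catalog  config.properties  jvm.config  log.properties  node.properties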

Server Configuration

The file etc/config.properties provides the configuration for the Trino server. A Trino server can function as a coordinator, or a worker, or both at the same time. Dedicating a single server to perform only coordinator work, and adding a number of other servers as dedicated workers, provides the best performance and creates a Trino cluster.

The contents of the file are of critical importance, since they determine the role of the server as a coordinator or worker, which in turn affects resource usage and configuration.

Tip

All worker configurations in a Trino cluster should be identical.

The following are the basic allowed Trino server configuration properties. In later chapters, as we discuss features such as authentication, authorization, and resource groups, we cover additional optional properties.

coordinator=true|false

Allows this Trino instance to function as a coordinator and therefore accept queries from clients and manage query execution. Defaults to true. Setting the value to false dedicates the server as a worker.

node-scheduler.include-coordinator=true|false

Allows scheduling work on the coordinator. Defaults to true. For larger clusters, we suggest setting this property to false. Processing work on the coordinator can impact query performance, because the server resources are not available for the critical tasks of scheduling, managing, and monitoring query execution.

http-server.http.port=8080 and http-server.https.port=8443

Specifies the ports used by the server for HTTP/HTTPS connections. Trino uses HTTP for all internal and external communication.

query.max-memory=5GB

The maximum amount of distributed memory that a query may use. This is described in greater detail in Chapter 12.

query.max-memory-per-node=1GB

The maximum amount of user memory that a query may use on any one machine. This is described in greater detail in Chapter 12.

query.max-total-memory-per-node=2GB

The maximum amount of user and system memory that a query may use on any one server. System memory is the memory used during execution by readers, writers, network buffers, etc. This is described in greater detail in Chapter 12.

discovery-server.enabled=true

Trino uses the discovery service to find all the nodes in the cluster. Every Trino instance registers with the discovery service on startup. To simplify deployment and avoid running an additional service, the Trino coordinator can run an embedded version of the discovery service. It shares the HTTP server with Trino and thus uses the same port. Typically set to true on the coordinator. It must be disabled on all workers by removing the property.

discovery.uri=http://localhost:8080

The URI to the discovery server. When running the embedded version of discovery in the Trino coordinator, this should be the URI of the Trino coordinator, including the correct port. This URI must not end in a slash.

Logging

The optional Trino logging configuration file, etc/log.properties, allows setting the minimum log level for named logger hierarchies. Every logger has a name, which is typically the fully qualified name of the Java class that uses the logger. Loggers use the Java class hierarchy. The packages used for all components of Trino can be seen in the source code, discussed in “Source Code, License, and Version”.

For example, consider the following log levels file:

io.trino=INFO
io.trino.plugin.postgresql=DEBUG

The first line sets the minimum level to INFO for all classes inside io.trino, including nested packages such as io.trino.spi.connector and io.trino.plugin.hive. The default level is INFO, so the first line does not actually change the logging for any packages; having the default level in the file just makes the configuration more explicit. The second line overrides the logging configuration for the PostgreSQL connector to debug-level logging.

There are four levels, DEBUG, INFO, WARN, and ERROR, sorted by decreasing verbosity. Throughout the book, we may refer to setting logging when discussing topics such as troubleshooting in Trino.

Warning

When setting the logging levels, keep in mind that DEBUG levels can be verbose. Only set DEBUG on specific lower-level packages that you are actually troubleshooting, to avoid creating large numbers of log messages that negatively impact the performance of the system.

After starting Trino, you find the various log files in the var/log directory within the installation directory, unless you specified another location in the etc/node.properties file:

launcher.log

This log, created by the launcher (see “Launcher”), is connected to the standard output (stdout) and standard error (stderr) streams of the server. It contains a few log messages from the server initialization and any errors or diagnostics produced by the JVM.

server.log

This is the main log file used by Trino. It typically contains the relevant information if the server fails during initialization, as well as most information concerning the actual running of the application, connections to data sources, and more.

http-request.log

This is the HTTP request log, which contains every HTTP request received by the server. This includes all usage of the Web UI and the Trino CLI, as well as the JDBC and ODBC connections discussed in Chapter 3, since all of them operate over HTTP. It also includes authentication and authorization logging.

All log files are automatically rotated and can also be configured in more detail in terms of size and compression.
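For example, assuming the default data directory, you can follow the main log file while troubleshooting server startup or connectivity issues:

$ tail -f var/log/server.log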

Node Configuration

The node properties file, etc/node.properties, contains configuration specific to a single installed instance of Trino on a server—a node in the overall Trino cluster.

The following is a small example file:

node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/trino/data

The following are the allowed node configuration properties:

node.environment=demo

The required name of the environment. All Trino nodes in a cluster must have the same environment name. The name shows up in the Trino Web UI header.

node.id=some-random-unique-string

An optional unique identifier for this installation of Trino. This must be unique for every node. The identifier should remain consistent across reboots and upgrades of Trino, and should therefore be specified explicitly. If omitted, a random identifier is created with each restart.

node.data-dir=/var/trino/data

The optional filesystem path of the directory where Trino stores log files and other data. Defaults to the var folder inside the installation directory.

JVM Configuration

The JVM configuration file, etc/jvm.config, contains a list of command-line options used for starting the JVM running Trino.

The format of the file is a list of options, one per line. These options are not interpreted by the shell, so options containing spaces or other special characters should not be quoted.

The following provides a good starting point for creating etc/jvm.config:

-server
-Xmx16G
-XX:-UseBiasedLocking
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
-XX:ReservedCodeCacheSize=512M
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000
-Djdk.nio.maxCachedBufferSize=2000000
-Djdk.attach.allowAttachSelf=true

Because an OutOfMemoryError typically leaves the JVM in an inconsistent state, we write a heap dump for debugging and forcibly terminate the process when this occurs.

The -Xmx option is an important property in this file. It sets the maximum heap space for the JVM. This determines how much memory is available for the Trino process.

The configuration allowing the JVM to attach to itself is required for Trino since the update to Java 11.

More information about memory and other JVM settings is discussed in Chapter 12.

Launcher

As mentioned in Chapter 2, Trino includes scripts to start and manage Trino in the bin directory. These scripts require Python.

The run command can be used to start Trino as a foreground process, which is useful for testing and development:
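$ bin/launcher run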

In a production environment, you typically start Trino as a background daemon process:

$ bin/launcher start
Started as 48322

The number 48322 you see in this example is the assigned process ID (PID). It differs at each start.

You can stop a running Trino server, which causes it to shut down gracefully:

$ bin/launcher stop
Stopped 48322

When a Trino server process is locked or experiences other problems, it can be useful to forcefully stop it with the kill command:

$ bin/launcher kill
Killed 48322

You can obtain the status and PID of Trino with the status command:

$ bin/launcher status
Running as 48322

If Trino is not running, the status command returns that information:

$ bin/launcher status
Not running

Besides the mentioned commands, the launcher script supports numerous options that can be used to customize the configuration file locations and other parameters. The --help option can be used to display the full details:

$ bin/launcher --help
Usage: launcher [options] command

Commands: run, start, stop, restart, kill, status

Options:
  -h, --help                show this help message and exit
  -v, --verbose             Run verbosely
  --etc-dir=DIR             Defaults to INSTALL_PATH/etc
  --launcher-config=FILE    Defaults to INSTALL_PATH/bin/launcher.properties
  --node-config=FILE        Defaults to ETC_DIR/node.properties
  --jvm-config=FILE         Defaults to ETC_DIR/jvm.config
  --config=FILE             Defaults to ETC_DIR/config.properties
  --log-levels-file=FILE    Defaults to ETC_DIR/log.properties
  --data-dir=DIR            Defaults to INSTALL_PATH
  --pid-file=FILE           Defaults to DATA_DIR/var/run/launcher.pid
  --launcher-log-file=FILE  Defaults to DATA_DIR/var/log/launcher.log (only in
                            daemon mode)
  --server-log-file=FILE    Defaults to DATA_DIR/var/log/server.log (only in
                            daemon mode)
  -D NAME=VALUE             Set a Java system property

Other installation methods use these options to modify paths. For example, the RPM package, discussed in “RPM Installation”, adjusts the path to better comply with Linux filesystem hierarchy standards and conventions. You can use them for similar needs, such as complying with enterprise-specific standards, using specific mount points for storage, or simply using paths outside the Trino installation directory to ease upgrades.
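For example, the following is a minimal sketch of starting Trino with the configuration in /etc/trino and the data directory in /var/trino/data; both paths are assumptions for illustration:

$ bin/launcher start --etc-dir=/etc/trino --data-dir=/var/trino/data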

Cluster Installation

In Chapter 2, we discussed installing Trino on a single machine, and in Chapter 4, you learned more about how Trino is designed and intended to be used in a distributed environment.

For any real use, other than for demo purposes, you need to install Trino on a cluster of machines. Fortunately, the installation and configuration are similar to installing on a single machine. It requires a Trino installation on each machine, either by installing manually or by using a deployment automation system like Ansible.

So far, you’ve deployed a single Trino server process to act as both a coordinator and a worker. For the cluster installation, you need to install and configure one coordinator and multiple workers.

Simply copy the downloaded tar.gz archive to all machines in the cluster and extract it.
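For example, the following is a minimal sketch of distributing and extracting the archive with scp and ssh, assuming worker hostnames worker1 through worker3 and the version 354 archive in the current directory:

$ for host in worker1 worker2 worker3; do
    scp trino-server-354.tar.gz ${host}:
    ssh ${host} tar xzf trino-server-354.tar.gz
  done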

As before, you have to add the etc folder with the relevant configuration files. A set of example configuration files for the coordinator and the workers is available in the cluster-installation directory of the support repository of the book; see “Book Repository”. The configuration files need to exist on every machine you want to be part of the cluster.

The configurations are the same as the simple installation for the coordinator and workers, with some important differences:

  • The coordinator property in config.properties is set to true on the coordinator and set to false on the workers.

  • The node-scheduler.include-coordinator property is set to false to exclude the coordinator from processing work.

  • The discovery.uri property has to point to the IP address or hostname of the coordinator on all workers and on the coordinator itself.

  • The embedded discovery server has to be disabled on the workers by removing the discovery-server.enabled property.

The following is the main configuration file, etc/config.properties, suitable for the coordinator:

coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://<coordinator-ip-or-host-name>:8080

Note the differences in the configuration file, etc/config.properties, suitable for the workers:

coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://<coordinator-ip-or-host-name>:8080

With Trino installed and configured on a set of nodes, you can use the launcher to start Trino on every node. Generally, it is best to start the Trino coordinator first, followed by the Trino workers:

$ bin/launcher start

As before, you can use the Trino CLI to connect to the Trino server. In the case of a distributed setup, you need to specify the address of the Trino coordinator using the --server option. If you are running the Trino CLI on the Trino coordinator node directly, you do not need to specify this option, as it defaults to localhost:8080:

$ trino --server <coordinator-ip-or-host-name>:8080

You can now verify that the Trino cluster is running correctly. The nodes system table contains the list of all the active nodes that are currently part of the cluster. You can query it with a simple SQL statement:

trino> SELECT * FROM system.runtime.nodes;
 node_id |        http_uri        | node_version | coordinator | state
---------+------------------------+--------------+-------------+--------
 c00367d | http://<http_uri>:8080 | 354          | true        | active
 9408e07 | http://<http_uri>:8080 | 354          | false       | active
 90dfc04 | http://<http_uri>:8080 | 354          | false       | active
(3 rows)

The list includes the coordinator and all connected workers in the cluster. The coordinator and each worker expose status and version information by using the REST API at the endpoint /v1/info; for example, http://worker-or-coordinator-host-name/v1/info.
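For example, assuming a coordinator hostname of coordinator.example.com and the default port, you can retrieve this information with curl:

$ curl http://coordinator.example.com:8080/v1/info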

You can also confirm the number of active workers using the Trino Web UI.

RPM Installation

Trino can be installed using the RPM Package Manager (RPM) on various Linux distributions such as CentOS, Red Hat Enterprise Linux, and others.

The RPM package is available on the Maven Central Repository at https://repo.maven.apache.org/maven2/io/trino/trino-server-rpm. Locate the RPM in the folder with the desired version and download it.

You can download the archive with wget; for example, for version 354:

$ wget https://repo.maven.apache.org/maven2/io/trino/\
trino-server-rpm/354/trino-server-rpm-354.rpm

With administrative access, you can install Trino with the archive in single-node mode:

$ sudo rpm -i trino-server-rpm-*.rpm

The RPM installation creates the basic Trino configuration files and a service control script to control the server. The script is configured with chkconfig, so that the service is started automatically on operating system boot. After installing Trino from the RPM, you can manage the Trino server with the service command:

service trino [start|stop|restart|status]

Installation Directory Structure

When using the RPM-based installation method, Trino is installed in a directory structure more consistent with the Linux filesystem hierarchy standards. This means that not everything is contained within the single Trino installation directory structure as we have seen so far. The service is configured to pass the correct paths to Trino with the launcher script:

/usr/lib/trino/

The directory contains the various libraries needed to run the product. Plug-ins are located in a nested plugin directory.

/etc/trino

This directory contains the general configuration files such as node.properties, jvm.config, and config.properties. Catalog configurations are located in a nested catalog directory.

/etc/trino/env.sh

This file sets the Java installation path used.

/var/log/trino

This directory contains the log files.

/var/lib/trino/data

This is the data directory.

/etc/rc.d/init.d/trino

This is the service script for controlling the server process.

The node.properties file requires the following two additional properties, since our directory structure is different from the defaults used by Trino:

catalog.config-dir=/etc/trino/catalog
plugin.dir=/usr/lib/trino/plugin

Configuration

The RPM package installs Trino acting as coordinator and worker out of the box, identical to the tar.gz archive. To create a working cluster, you can update the configuration files on the nodes in the cluster manually, use the trino-admin tool, or use a generic configuration management and provisioning tool such as Ansible.

Uninstall Trino

If Trino is installed using RPM, you can uninstall it the same way you remove any other RPM package:

$ rpm -e trino

When removing Trino, all files and configurations, apart from the logs directory /var/log/trino, are deleted. Create a backup copy if you wish to keep anything.
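For example, assuming the default RPM paths, you can preserve the configuration before uninstalling by copying the directory:

$ sudo cp -a /etc/trino /etc/trino.backup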

Installation in the Cloud

A typical installation of Trino involves running at least one cluster with a coordinator and multiple workers. Over time, the number of workers in the cluster, as well as the number of clusters, can change based on the demand from users.

The number and type of connected data sources, as well as their location, also have a major impact on choosing where to install and run your Trino cluster. Typically, it is desirable for the Trino cluster to have high-bandwidth, low-latency network connectivity to the data sources.

The simple requirements of Trino, discussed in Chapter 2, allow you to run Trino in many situations. You can run it on different machines such as physical servers or virtual machines, as well as in Docker containers.

Trino is known to work on private cloud deployments as well as on many public cloud providers including Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, and others.

Using containers allows you to run Trino on Kubernetes (k8s) clusters such as Amazon Elastic Kubernetes Service (Amazon EKS), Microsoft Azure Kubernetes Service (AKS), Google Kubernetes Engine (GKE), Red Hat OpenShift, and any other Kubernetes deployment.

An advantage of these cloud deployments is the potential for a highly dynamic cluster, where workers are created and destroyed on demand. Tooling for such use cases has been created by different users, including cloud vendors embedding Trino in their offerings and other vendors offering Trino tooling and support.

Tip

The Trino project does not provide a complete set of suitable resources and tooling for running a Trino cluster in a turn-key, hands-off fashion. Organizations typically create their own packages, configuration management setups, container images, k8s operators, or whatever is necessary, and they use tools such as Concord or Terraform to create and manage the clusters. Alternatively, you can consider relying on the support and offerings from a company like Starburst.

Cluster Sizing Considerations

An important part of getting Trino deployed is sizing the cluster. In the longer run, you might work toward multiple clusters for different use cases. Sizing the Trino cluster is a complex task and follows the same patterns and steps as other applications:

  1. Decide on an initial size, based on rough estimates and available infrastructure.

  2. Ensure that the tooling and infrastructure for the cluster are able to scale it.

  3. Start the cluster and ramp up usage.

  4. Monitor utilization and performance.

  5. React to the findings by changing cluster scale and configuration.

The feedback loop around monitoring, adapting, and continued use allows you to get a good understanding of the behavior of your Trino deployment.

Many factors influence your cluster performance, and the combination of these is specific to each Trino deployment:

  • Resources like CPU and memory for each node

  • Network performance within the cluster and to data sources and storage

  • Number and characteristics of connected data sources

  • Queries run against the data sources and their scope, complexity, number, and resulting data volume

  • Storage read/write performance of the data sources

  • Active users and their usage patterns

Once you have your initial cluster deployed, make sure you take advantage of the Trino Web UI for monitoring. Chapter 12 provides more tips.

Conclusion

As you’ve now learned, installing Trino and running a cluster requires just a handful of configuration files and properties. Depending on your actual infrastructure and management system, you can achieve a powerful setup of one or even multiple Trino clusters. Check out real-world examples in Chapter 13.

Of course, you are still missing a major ingredient of configuring Trino. And that is the connections to the external data sources that your users can then query with Trino and SQL. In Chapter 6 and Chapter 7, you get to learn all about the various data sources, the connectors to access them, and the configuration of the catalogs that point at specific data sources using the connectors.
