Chapter 2. Deploying Greenplum

As the world continues to evolve, Greenplum is embracing change. With the trend toward cloud-based computing, Greenplum has added public and private cloud deployments to its original on-premises choices, giving users four options for production deployment:

  • Customer-built clusters

  • Appliance

  • Public cloud

  • Private cloud

Custom(er)-Built Clusters

Initially, Greenplum was a software-only company. It provided a cluster-aware installer but assumed that the customer had correctly built the cluster. This strategy provided a certain amount of flexibility. For example, customers could configure exactly the number of segment hosts they required and could add hosts when needed. They had no restrictions on which network gear to use, how much memory per node, or the number and size of the disks. On the other hand, building a cluster is considerably more difficult than configuring a single server, so Greenplum provides a number of facilities to assist customers in building clusters.

Today, there is much greater collective experience in building MPP clusters; a decade ago this was far less true, and some early deployments performed poorly because of poor cluster design. The gpcheckperf utility checks disk I/O bandwidth, network performance, and memory bandwidth. Verifying a healthy cluster before deploying Greenplum goes a long way toward a performant system.
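
A typical validation pass, adapted from the gpcheckperf documentation, looks like the following sketch; the host file name and data directories are site-specific examples, not fixed values:

    # Disk I/O (d) and memory bandwidth, or "stream" (s), tests on every
    # host listed in the file, using two data directories per host:
    $ gpcheckperf -f hostfile_gpcheckperf -r ds -D -d /data1 -d /data2

    # Parallel-pair network test (N) among the same hosts; -r M runs a
    # full-matrix test on larger clusters:
    $ gpcheckperf -f hostfile_gpcheckperf -r N -d /tmp

Run these tests before initializing the database; poor numbers here will not improve once Greenplum is installed on top.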

Customer-built clusters should be constructed with the following principles in mind:

  • Greenplum wants consistently high read/write throughput. This almost always means servers with internal disks. Some customers have built clusters using large shared storage-area network (SAN) devices. Even though input/output operations per second (IOPS) figures can be impressive for these devices, IOPS is not what matters for Greenplum, which does large sequential reads and writes. A single SAN device attached to a large number of segment servers often falls short in the amount of concurrent sequential I/O it can support.

  • Greenplum wants consistently high network performance. A 10 Gb (gigabit) Ethernet network is required.

  • Greenplum, like virtually every other database, likes plenty of memory. Greenplum recommends at least 256 GB of RAM per segment server. All other things being equal, more memory allows greater concurrency; that is, more analytic queries can run simultaneously.

Some customers doing extensive mathematical analytics find that they are CPU limited. Given the Greenplum segment model, more cores benefit these customers: more cores mean more segments and thus more processing power. When performance is less than expected, it’s important to know the gating factor; in general, it is memory or I/O rather than CPU.
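
The segment count itself is fixed at initialization time: each entry in the DATA_DIRECTORY array of the gpinitsystem_config file becomes one primary segment on every segment host. A minimal sketch, with hypothetical paths, declaring four primaries per host:

    # Fragment of a gpinitsystem_config file. Four array entries mean
    # four primary segments on each segment host; hosts with more cores
    # can support more entries.
    declare -a DATA_DIRECTORY=(/data1/primary /data1/primary /data2/primary /data2/primary)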

The Pivotal Clustering Concepts Guide should be required reading if you are deploying Greenplum on your own hardware.

Appliance

After Greenplum was purchased by EMC, the company provided an appliance called the Data Computing Appliance (DCA), which soon became the predominant deployment option. Many early customers did not have the expertise to build and manage a cluster. They did not want to ensure that the OS version and patches on their servers were in accordance with the database version, nor did they want to upgrade and patch the OS and Greenplum themselves. For them, the DCA was a very good solution. The current DCA v3 is a Dell EMC hardware and software combination that includes the servers, disks, network, control software, and the Greenplum Database itself. It also comes with support and service from Dell EMC for the hardware and OS, and from Pivotal for the Greenplum software.

The DCA has several advantages over a customer-built cluster.

The DCA comes with all the relevant software installed and tested. All that is required at the customer site is a physical installation and site-specific configuration such as external IP address setup. In general, enterprises find this faster than buying hardware, building a cluster, and installing Greenplum. It also ensures that all known security vulnerabilities have been identified and fixed. If enabled, the DCA will “call home” when its monitoring capabilities detect unusual conditions.

The DCA has some limitations. The number of segment hosts must be a multiple of four. No other software can be placed on the nodes without vendor approval. It comes with fixed memory and disk configurations.

Public Cloud

Many organizations have decided to move much of their IT environment away from their own datacenters to the cloud. Public cloud offers the quickest time to deployment: you can configure a functioning Greenplum cluster in less than an hour.

Pivotal now has Greenplum available, tested, and certified on AWS and Microsoft Azure, and plans to have an implementation on Google Cloud Platform (GCP) by the first half of 2017. On AWS, CloudFormation scripts define the nodes in the cluster and the software that resides on them. Pivotal takes a very opinionated view of the configurations it will support in the public cloud marketplaces; for example:

  • Only certain numbers of segment hosts are supported using the standard scripts for cluster builds.

  • Only certain kinds of nodes will be offered for the master, standby, and segment servers.

  • Only high-performance 10 Gb interconnects are offered.

  • For performance and support reasons, only certain Amazon Machine Images (AMIs) are available.

These are not arbitrary choices. Greenplum has tested these configurations and found them suitable for efficient operation in the cloud. That said, customers have built their own clusters on both Azure and AWS without using the Greenplum-specific deployment options. Although such configurations are useful for QA and development, they might not be optimal for a production environment.
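
For illustration, launching one of these opinionated stacks from the command line follows the standard CloudFormation pattern. The template URL, stack name, and parameter below are hypothetical placeholders, not the actual Pivotal artifacts:

    # Hypothetical example: create a Greenplum stack from a vendor-supplied
    # CloudFormation template (all names here are placeholders):
    $ aws cloudformation create-stack \
        --stack-name my-greenplum-cluster \
        --template-url https://s3.amazonaws.com/example-bucket/greenplum.template \
        --parameters ParameterKey=KeyName,ParameterValue=my-ssh-key

    # Poll until the stack reports CREATE_COMPLETE:
    $ aws cloudformation describe-stacks --stack-name my-greenplum-cluster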

Private Cloud

Some enterprises have legal or policy restrictions that prevent data from moving into a public cloud environment. These organizations can deploy Greenplum on private clouds, mostly running VMware infrastructure. Details are available in the Greenplum documentation, but some important principles apply to all virtual deployments:

  • There must be adequate network bandwidth between the virtual hosts.

  • Contention for shared disks hinders performance and must be minimized.

  • Automigration must be turned off.

  • As in physical hardware systems, primary and mirror segments must be separated, not only on different virtual servers, but also on different physical infrastructure (see the sketch below).

  • Adequate memory to prevent swapping is a necessity.

If these shared-resource conditions are met, Greenplum will perform well in a virtual environment.
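
Greenplum’s own tooling handles the segment-level half of that separation. As one hedged example (file names are illustrative), requesting spread mirroring at initialization distributes each host’s mirrors across several other hosts rather than onto a single partner:

    # -S requests spread mirroring: the mirrors of one host's primaries
    # are spread across the remaining hosts, so the loss of one machine
    # never takes down a primary together with its mirror.
    $ gpinitsystem -c gpinitsystem_config -h hostfile_gpinitsystem -S

Keeping the virtual hosts themselves on distinct physical machines is then a matter of anti-affinity rules in the hypervisor.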

Choosing a Greenplum Deployment

The choice of a deployment depends upon a number of factors that must be balanced:

Security
Is a public cloud deployment permissible, or must it first pass security vetting?
Cost
The cost of a public cloud deployment can be large if Greenplum needs to run 24 hours a day, 365 days a year (see the back-of-the-envelope sketch after this list). This operational expense must be balanced against the capital expense of purchasing a DCA or hardware for a custom-built cluster. The operational chargeback cost of running a private cloud, a DCA, or a customer-built cluster varies widely among datacenters. Moving data between enterprise sites and the cloud can also be costly.
Time-to-usability
A public cloud cluster can be built in an hour or so, a private cloud configuration in about a day. A customer-built cluster takes much longer, and a DCA requires some lead time for ordering and installation.
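
As a back-of-the-envelope illustration, with an invented instance count and hourly rate rather than quoted prices:

    # Hypothetical around-the-clock public cloud cost: one master plus
    # eight segment hosts (9 instances) at a made-up $2.00 per hour:
    $ echo '9 * 2.00 * 24 * 365' | bc
    157680.00

At roughly $158,000 per year in this made-up scenario, an always-on cloud cluster can approach the capital cost of owned hardware.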

Greenplum Sandbox

Although not for production use, Greenplum distributes a sandbox in both VirtualBox and VMware formats. This VM contains a small configuration of Greenplum Database with sample data and scripts that illustrate the principles of Greenplum. It is freely available at PivNet, and the sandbox is also available on AWS; a Pivotal blog post provides more detail.

The sandbox has no high-availability features: there is no standby master, nor are there mirror segments. It is built to demonstrate the features of Greenplum, not database performance, and it is invaluable for learning about Greenplum.
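
Once the VM is running, you interact with the sandbox like any other Greenplum instance. A minimal sketch, assuming the sandbox’s customary gpadmin account and default port (check the VM’s welcome screen for the actual credentials):

    # Connect from inside the VM; the user, database, and port shown are
    # the usual sandbox defaults and may differ between releases:
    $ psql -U gpadmin -p 5432 gpadmin

    -- A quick sanity check: one row per segment in the cluster.
    gpadmin=# SELECT * FROM gp_segment_configuration;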

Learning More

You can find a more detailed introduction to the DCA, the Greenplum appliance, at the Dell EMC Greenplum DCA page. The Getting Started Guide contains much more detail about architecture, site planning, and administration. A search for “DCA” at the Dell EMC website will yield a list of additional documentation. To narrow the results to the most recent version of the DCA, click the radio button “Within the Last Year” under “Last Updated.”

Building a cluster for the first time is likely to be a challenge. To minimize time to deployment, Greenplum has published two very helpful resources on clustering: the Clustering Concepts Guide, mentioned in the introductory chapter of this book, and a website devoted to clustering material. Both should be mandatory reading before you undertake building a custom configuration. Pivotal also provides advice for building Greenplum in virtual environments.

Pivotal’s Andreas Scherbaum has produced Ansible scripts for deploying Greenplum. This is not an officially supported endeavor, and it requires basic knowledge of Ansible.
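
Running such scripts follows the usual Ansible pattern. The playbook and inventory file names below are placeholders, not the actual names in the repository:

    # Hypothetical invocation: apply a Greenplum deployment playbook to
    # the hosts in an inventory file (both file names are placeholders):
    $ ansible-playbook -i inventory.yml greenplum.yml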

Those interested in running Greenplum in the public cloud should consult the offerings on AWS and Microsoft Azure. There is a brief video on generating an Azure deployment that walks through the steps to create a Greenplum cluster.
