Technology products evolve over time. Greenplum forked from the mainline branch of PostgreSQL at release 8.2.15, but continued to add new PostgreSQL features. PostgreSQL also evolved over time, and Pivotal began the process of reintegrating Greenplum into PostgreSQL with the goals of introducing the useful new features of later releases of PostgreSQL into Greenplum while also adding Greenplum-specific features into PostgreSQL.
This process began in release 5 of Greenplum in 2017 and continues with release 6 of Greenplum in 2019.
Following is a list of the new features in Greenplum 5. See later sections of the book for more details on some of these features.
PostgreSQL introduced a data file format change in release 8.4. Rejoining the PostgreSQL code line is a gradual process. In Greenplum 5, we achieved parity with PostgreSQL 8.4, which meant migrating the data files. There are too many new features in this release to list them all; here are a few important ones:
These are collections of open source packages that data scientists find useful. They can be used in conjunction with the procedural languages for writing sophisticated analytic routines.
JSON, UUID, and improved XML support.
The GPORCA query optimizer has increased support for more complex queries.
PXF is a framework for accessing data external to Greenplum. This is discussed in Chapter 8.
Critical for good query performance is the understanding of the size and data contents of tables. This utility was enhanced to cover more use cases and provide increased performance.
Python and R are untrusted languages because they contain OS callouts. As a result, only the database superuser can create functions in these languages. Greenplum 5 added the ability to run such functions in a container isolated from the OS proper, so superuser powers are not required.
Enhancements to the tools used to back up and restore Greenplum. These will be discussed in Chapter 7.
The ability to control queries should not be underestimated in an analytic environment. The new resource group mechanism is discussed in Chapter 7.
Kafka has emerged as a leading technology for data dissemination and integration for real-time and near-real-time data streaming. Its integration with Greenplum is discussed in Chapter 8.
Critical for efficient use of Greenplum is the ability to understand what is occurring in Greenplum now and in the past. This is discussed in Chapter 7.
This is a list of the new features in Greenplum 6. Some features are explored in more detail later in the book.
Greenplum 6 continued the integration of later PostgreSQL releases and is now fully compatible with PostgreSQL 9.4. Pivotal is on a quest to add more PostgreSQL compatibility with each new major release.
Pivotal Greenplum has now integrated the PostgreSQL 9.4 code base. This opens up new features and absorbs many performance improvements.
WAL is a PostgreSQL method for assuring data integrity. Though beyond the scope of this book, more information about it is located in the High Availability section of the Administrator Guide.
Prior to Greenplum 6, updates to tables required locking the entire table. The introduction of row-level locks can improve performance by a factor of 50.
The foreign data wrapper API allows Greenplum to access other data sources as though they were PostgreSQL or Greenplum tables. This is discussed in Chapter 8.
The inclusion of PostgreSQL 9.4 code brings along many utilities that depend upon 9.4 functionality in the database. pgaudit is one such contributed tool.
Common table expressions (CTEs) are like temporary tables, but they exist only for the duration of the SQL statement. Recursive CTEs reference themselves and are useful for querying hierarchical data.
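To make the idea of a self-referencing CTE concrete, here is a minimal sketch of a hierarchical query. It is an illustration under stated assumptions, not Greenplum-specific code: it uses Python's built-in sqlite3 module as a stand-in database so that it runs anywhere, and the employees table, its columns, and the reporting_chain helper are all hypothetical names invented for this example. The WITH RECURSIVE form shown is standard SQL, which PostgreSQL also uses, but behavior and performance on a distributed Greenplum cluster may differ.

```python
import sqlite3


def reporting_chain(conn, root_name):
    """Return (name, depth) rows for root_name and everyone reporting below them."""
    query = """
        WITH RECURSIVE chain(id, name, depth) AS (
            -- Anchor member: the starting row of the hierarchy.
            SELECT id, name, 0 FROM employees WHERE name = ?
            UNION ALL
            -- Recursive member: the CTE references itself to find direct reports.
            SELECT e.id, e.name, c.depth + 1
            FROM employees e
            JOIN chain c ON e.manager_id = c.id
        )
        SELECT name, depth FROM chain ORDER BY depth, name
    """
    return conn.execute(query, (root_name,)).fetchall()


# Hypothetical sample data: a small management hierarchy held in memory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Ada", None), (2, "Ben", 1), (3, "Cy", 1), (4, "Dee", 2), (5, "Eve", None)],
)

result = reporting_chain(conn, "Ada")
print(result)
```

The CTE named chain exists only while the statement runs, which is the sense in which it resembles a temporary table. Each pass of the recursive member adds one more level of the hierarchy until no new rows are found.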
These are specialized indexes for multicolumn and text-based searches. They are not discussed in this book.
Greenplum 6 now uses row-level locking for data manipulation language (DML) operations. This has an enormous impact on the performance of these operations, which often occur in ETL and data cleansing.
Replicated tables have long been requested by customers. These are discussed in Chapter 4.
zStandard is a fast lossless compression algorithm.
Cluster expansion, though a rare event, requires computational and disk access resources to redistribute the data. A new algorithm minimizes the time this redistribution takes.
Useful in on-premises cloud deployment; this is discussed in Chapter 3.
Other than performance improvements, these are mostly transparent to the user community.
The Diskquota extension provides disk usage enforcement for database objects. It sets a limit on the amount of disk space that a schema or a role can use.
The Greenplum release notes contain information specific to the Greenplum version:
New features
Changed features
Beta features
Deprecated features
Resolved issues
Known issues and limitations