Foreword

In the mid-1980s, the phrase “data warehouse” was not in use. The concept of collecting data from disparate sources, keeping a historical record, and integrating it all into one repository was barely technically possible. The biggest relational databases in the world did not exceed 50 GB in size. The microprocessor revolution was just getting underway, and two companies stood out: Tandem, which lashed together microprocessors and distributed Online Transaction Processing (OLTP) across the cluster; and Teradata, which clustered microprocessors and distributed data to solve the big data problem. Teradata took its name from the concept of a terabyte of data (1,000 GB), an unimaginable amount at the time.

Until the early 2000s, Teradata owned the big data space, offering its software on a cluster of proprietary servers that scaled beyond its original 1 TB target. The database market seemed set and stagnant, with Teradata at the high end, Oracle and Microsoft’s SQL Server in the OLTP space, and others working to hold on to their diminishing share.

But in 1999, a new competitor, soon to be renamed Netezza, entered the market with a new proprietary hardware design and a new indexing technology, and began to take market share from Teradata.

By 2005, other competitors, encouraged by Netezza’s success, had entered the market. Two of these entrants are noteworthy. In 2003, Greenplum entered the market with a product based on PostgreSQL that used the larger memory of modern servers to good effect with a data flow architecture and that reduced costs by deploying on commodity hardware. In 2005, Vertica was founded on a major reengineering of the columnar architecture first implemented by Sybase. The database world would never again be stagnant.

This book is about Greenplum, and there are several important characteristics of this technology that are worth pointing out.

The concept of flowing data from one step in the query execution plan to the next without writing it to disk was not invented at Greenplum, but Greenplum implemented it effectively. This resulted in a significant performance advantage.

Just as important, Greenplum elected to deploy on standard, nonproprietary hardware. This provided several advantages. First, Greenplum did not need to spend R&D dollars engineering hardware. Next, customers could buy hardware from their favorite providers using any volume purchase agreements they might already have had in place. In addition, Greenplum could take advantage of the fact that hardware vendors tended to leapfrog one another in price and performance every four to six months, so Greenplum was getting a 5 to 15 percent price/performance boost several times a year, for free. Finally, the hardware vendors became a sales channel: big players like IBM, Dell, and HP would push Greenplum over other players if they could make the hardware sale.

Building Greenplum on top of PostgreSQL was also noteworthy. Not only did this allow Greenplum to offer a mature product much sooner, but it also let the company use system administration, backup and restore, and other PostgreSQL assets without incurring the cost of building them from scratch. The architecture of PostgreSQL, which was designed for extensibility by a community, provided a foundation from which Greenplum could continuously grow core functionality.

Vertica was proving that a full implementation of a columnar architecture offered a distinct advantage for complex queries against big data, so Greenplum quickly added a sophisticated columnar capability to its product. Other vendors were much slower to react and then could only offer parts of the columnar architecture in response. The ability to extend the core paid off quickly, and Greenplum’s implementation of columnar still provides a distinct advantage in price and performance.

Further, Greenplum saw an opportunity to make a very significant advance in the way big data systems optimize queries, and thus the ORCA optimizer was developed and deployed.

In the years following 2006, these advantages paid off, and Greenplum’s market share grew dramatically through 2010.

In early 2010, the company decided to focus on the part of the data warehouse space for which sophisticated analytics were the key. This strategy was in place when EMC acquired Greenplum in the middle of that year. The EMC/Greenplum match was odd. First, the niche focus on analytics, away from data warehousing and big data, would not scale to the size required by such a large enterprise. Next, the fundamentally shared-nothing architecture was an odd fit in a company whose primary products were shared storage devices. Despite this, EMC worked diligently to make the combination work and made a significant financial investment in it. In 2011, Greenplum adopted a new strategy and went “all-in” on Hadoop. It was no longer “all-in” on the Greenplum Database.

In 2013, EMC spun the Greenplum division out into a new company, Pivotal Software. Since then, several important decisions have been made with regard to the Greenplum Database. Most importantly, the product is now open source. Like many open source products, the bulk of the work is done by Pivotal, but a community is growing. That growth is fueled by another important decision: to reemphasize PostgreSQL at the core.

The result is a vibrant Greenplum product that retains the core value proposition described above: it runs on hardware from your favorite supplier; it is fast and supports both row-oriented and columnar tables; and it is extensible, with an ambitious yet feasible plan in place at Pivotal.

The bottom line is that the Greenplum Database is capable of winning any fair competition and should be considered every time.

I am a fan.
