C H A P T E R  1

What Is Exadata?

No doubt you already have a pretty good idea what Exadata is or you wouldn’t be holding this book in your hands. In our view, it is a preconfigured combination of hardware and software that provides a platform for running Oracle Database (version 11g Release 2 as of this writing). Since the Exadata Database Machine includes a storage subsystem, new software has been developed to run at the storage layer. This has allowed the developers to do some things that are just not possible on other platforms. In fact, Exadata really began its life as a storage system. If you talk to people involved in the development of the product, you will commonly hear them refer to the storage component as Exadata or SAGE (Storage Appliance for Grid Environments), which was the code name for the project.

Exadata was originally designed to address the most common bottleneck with very large databases: the inability to move sufficiently large volumes of data from the disk storage system to the database server(s). Oracle has built its business by providing very fast access to data, primarily through the use of intelligent caching technology. As the sizes of databases began to outstrip the ability to cache data effectively using these techniques, Oracle began to look at ways to eliminate the bottleneck between the storage tier and the database tier. The solution they came up with was a combination of hardware and software. If you think about it, there are two approaches to minimizing this bottleneck. The first is to make the pipe bigger. While there are many components involved, and it’s a bit of an oversimplification, you can think of InfiniBand as that bigger pipe. The second way to minimize the bottleneck is to reduce the amount of data that needs to be transferred. This they did with Smart Scans. The combination of the two has provided a very successful solution to the problem. But make no mistake: reducing the volume of data flowing between the tiers via Smart Scan is the golden goose.

Kevin Says: The authors have provided an accurate list of approaches for alleviating the historical bottleneck between storage and CPU for DW/BI workloads—if, that is, the underlying mandate is to change as little in the core Oracle Database kernel as possible. From a pure computer science perspective, the list of solutions to the generic problem of data flow between storage and CPU includes options such as co-locating the data with the database instance—the “shared-nothing” MPP approach. While it is worthwhile to point this out, the authors are right not to spend time discussing the options dismissed by Oracle.

In this introductory chapter we’ll review the components that make up Exadata, both hardware and software. We’ll also discuss how the parts fit together (the architecture). We’ll talk about how the database servers talk to the storage servers. This is handled very differently than on other platforms, so we’ll spend a fair amount of time covering that topic. We’ll also provide some historical context. By the end of the chapter, you should have a pretty good feel for how all the pieces fit together and a basic understanding of how Exadata works. The rest of the book will provide the details to fill out the skeleton that is built in this chapter.

Kevin Says: In my opinion, Data Warehousing / Business Intelligence practitioners, in an Oracle environment, who are interested in Exadata, must understand Cell Offload Processing fundamentals before any other aspect of the Exadata Database Machine. All other technology aspects of Exadata are merely enabling technology in support of Cell Offload Processing. For example, taking too much interest, too early, in Exadata InfiniBand componentry is simply not the best way to build a strong understanding of the technology. Put another way, this is one of the rare cases where it is better to first appreciate the whole cake before scrutinizing the ingredients. When I educate on the topic of Exadata, I start with the topic of Cell Offload Processing. In doing so I quickly impart the following four fundamentals:

Cell Offload Processing: Work performed by the storage servers that would otherwise have to be executed in the database grid. It includes functionality like Smart Scan, data file initialization, RMAN offload, and Hybrid Columnar Compression (HCC) decompression (in the case where In-Memory Parallel Query is not involved).

Smart Scan: The most relevant Cell Offload Processing for improving Data Warehouse / Business Intelligence query performance. Smart Scan is the agent for offloading filtration, projection, Storage Index exploitation, and HCC decompression.

Full Scan or Index Fast Full Scan: The access method the query optimizer must choose in order to trigger a Smart Scan.

Direct Path Reads: Required buffering model for a Smart Scan. The flow of data from a Smart Scan cannot be buffered in the SGA buffer pool. Direct path reads can be performed for both serial and parallel queries. Direct path reads are buffered in process PGA (heap).

An Overview of Exadata

A picture’s worth a thousand words, or so the saying goes. Figure 1-1 shows a very high-level view of the parts that make up the Exadata Database Machine.


Figure 1-1. High-level Exadata components

When considering Exadata, it is helpful to divide the entire system mentally into two parts, the storage layer and the database layer. The layers are connected via an InfiniBand network. InfiniBand provides a low-latency, high-throughput switched fabric communications link, with redundancy and bonding of links. The database layer is made up of multiple Sun servers running standard Oracle 11gR2 software. The servers are generally configured in one or more RAC clusters, although RAC is not actually required. The database servers use ASM to map the storage. ASM is required even if the databases are not configured to use RAC. The storage layer also consists of multiple Sun servers. Each storage server contains 12 disk drives and runs the Oracle storage server software (cellsrv). Communication between the layers is accomplished via iDB, a network-based protocol that is implemented using InfiniBand. iDB is used to send requests for data, along with metadata about the request (including predicates), to cellsrv. In certain situations, cellsrv is able to use the metadata to process the data before sending results back to the database layer. When cellsrv is able to do this, the operation is called a Smart Scan, and it generally results in a significant decrease in the volume of data that needs to be transmitted back to the database layer. When Smart Scans are not possible, cellsrv returns the entire Oracle block(s). Note that iDB uses the RDS protocol, which is a low-latency protocol that bypasses kernel calls by using remote direct memory access (RDMA) to accomplish process-to-process communication across the InfiniBand network.
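
To make the Smart Scan idea a little more concrete, here is a minimal sketch of how that metadata shows up in an execution plan. It uses the kso.skew3 demo table that appears in listings later in this chapter; any reasonably large table of your own will do, and the exact output will vary with version and statistics.

EXPLAIN PLAN FOR
SELECT avg(pk_col) FROM kso.skew3 a WHERE col1 > 0;

SELECT * FROM TABLE(dbms_xplan.display);

On an Exadata system the full scan is reported as TABLE ACCESS STORAGE FULL, and the Predicate Information section repeats the filter as a storage() predicate. That storage() entry is the metadata iDB ships to cellsrv so the filtering can be done on the cells rather than on the database servers.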

History of Exadata

Exadata has undergone a number of significant changes since its initial release in late 2008. In fact, one of the more difficult parts of writing this book has been keeping up with the changes in the platform during the project. Here’s a brief review of the product’s lineage and how it has changed over time.

Kevin Says: I’d like to share some historical perspective. Before there was Exadata, there was SAGE—Storage Appliance for Grid Environments, which we might consider V0. In fact, it remained SAGE until just a matter of weeks before Larry Ellison gave it the name Exadata—just in time for the Open World launch of the product in 2008 amid huge co-branded fanfare with Hewlett-Packard. Although the first embodiment of SAGE was a Hewlett-Packard exclusive, Oracle had not yet decided that the platform would be exclusive to Hewlett-Packard, much less the eventual total exclusivity enjoyed by Sun Microsystems—by way of being acquired by Oracle. In fact, Oracle leadership hadn’t even established the rigid Linux Operating System requirement for the database hosts; the porting effort of iDB to HP-UX Itanium was in very late stages of development before the Sun acquisition was finalized. But SAGE evolution went back further than that.

V1: The first Exadata was released in late 2008. It was labeled as V1 and was a combination of HP hardware and Oracle software. The architecture was similar to the current X2-2 version, with the exception of the Flash Cache, which was added to the V2 version. Exadata V1 was marketed as exclusively a data warehouse platform. The product was interesting but not widely adopted. It also suffered from issues resulting from overheating. The commonly heard description was that you could fry eggs on top of the cabinet. Many of the original V1 customers replaced their V1s with V2s.

V2: The second version of Exadata was announced at Open World in 2009. This version was a partnership between Sun and Oracle. By the time the announcement was made, Oracle was already in the process of attempting to acquire Sun Microsystems. Many of the components were upgraded to bigger or faster versions, but the biggest difference was the addition of a significant amount of solid-state based storage. The storage cells were enhanced with 384GB of Exadata Smart Flash Cache. The software was also enhanced to take advantage of the new cache. This addition allowed Oracle to market the platform as more than a Data Warehouse platform, opening up a significantly larger market.

X2: The third edition of Exadata, announced at Oracle Open World in 2010, was named the X2. Actually, there are two distinct versions of the X2. The X2-2 follows the same basic blueprint as the V2, with up to eight dual-CPU database servers. The CPUs were upgraded to hex-core models, where the V2s had used quad-core CPUs. The other X2 model was named the X2-8. It breaks the small 1U database server model by introducing larger database servers with eight 8-core CPUs and a large 1TB memory footprint. The X2-8 is marketed as a more robust platform for large OLTP or mixed workload systems, due primarily to the larger number of CPU cores and the larger memory footprint.

Alternative Views of What Exadata Is

We’ve already given you a rather bland description of how we view Exadata. However, like the well-known tale of the blind men describing an elephant, there are many conflicting perceptions about the nature of Exadata. We’ll cover a few of the common descriptions in this section.

Data Warehouse Appliance

Occasionally Exadata is described as a data warehouse appliance (DW Appliance). While Oracle has attempted to keep Exadata from being pigeonholed into this category, the description is closer to the truth than you might initially think. It is, in fact, a tightly integrated stack of hardware and software that Oracle expects you to run without a lot of changes. This is directly in line with the common understanding of a DW Appliance. However, the very nature of the Oracle database means that it is extremely configurable, which flies in the face of the typical DW Appliance, with its limited number of knobs to turn. Nevertheless, DW Appliances and Exadata share several common characteristics:

Exceptional Performance: The most recognizable characteristic of Exadata and DW Appliances in general is that they are optimized for data warehouse type queries.

Fast Deployment: DW Appliances and Exadata Database Machines can both be deployed very rapidly. Since Exadata comes preconfigured, it can generally be up and running within a week from the time you take delivery. This is in stark contrast to the normal Oracle clustered database deployment scenario, which generally takes several weeks.

Scalability: Both platforms have scalable architectures. With Exadata, upgrading is done in discrete steps. Upgrading from a half rack configuration to a full rack increases the total disk throughput in lock step with the computing power available on the database servers.

Reduction in TCO: This one may seem a bit strange, since many people think the biggest drawback to Exadata is the high price tag. But the fact is that both DW Appliances and Exadata reduce the overall cost of ownership in many applications. Oddly enough, in Exadata’s case this is partially thanks to a reduction in the number of Oracle database licenses necessary to support a given workload. We have seen several situations where multiple hardware platforms were evaluated for running a company’s Oracle application, and Exadata ended up costing less to implement and maintain than the other options evaluated.

High Availability: Most DW Appliances provide an architecture that supports at least some degree of high availability (HA). Since Exadata runs standard Oracle 11g software, all the HA capabilities that Oracle has developed are available out of the box. The hardware is also designed to prevent any single point of failure.

Preconfiguration: When Exadata is delivered to your data center, a Sun engineer will be scheduled to assist with the initial configuration. This will include ensuring that the entire rack is cabled and functioning as expected. But like most DW Appliances, the work has already been done to integrate the components. So extensive research and testing are not required.

Limited Standard Configurations: Most DW Appliances only come in a very limited set of configurations (small, medium, and large, for example). Exadata is no different. There are currently only four possible configurations. This has repercussions with regard to supportability. It means that if you call support and tell them you have an X2-2 Half Rack, the support people immediately know all they need to know about your hardware. This benefits both the support personnel and the customers in terms of how quickly issues can be resolved.

Regardless of the similarities, Oracle does not consider Exadata to be a DW Appliance. Generally speaking, this is because Exadata provides a fully functional Oracle database platform with all the capabilities that have been built into Oracle over the years. That includes the ability to run any application that currently runs on an Oracle database and, in particular, to handle mixed workloads that demand a high degree of concurrency, something DW Appliances are generally not equipped to do.

Kevin Says: Whether Exadata is or is not an appliance is a common topic of confusion when people envision what Exadata is. The Oracle Exadata Database Machine is not an appliance. However, the storage grid does consist of Exadata Storage Server cells—which are appliances.

OLTP Machine

This description is a bit of a marketing ploy aimed at broadening Exadata’s appeal to a wider market segment. While the description is not totally off-base, it is not as accurate as some other monikers that have been assigned to Exadata. It brings to mind the classic quote:

It depends on what the meaning of the word “is” is.

—Bill Clinton

In the same vein, OLTP (Online Transaction Processing) is a bit of a loosely defined term. We typically use the term to describe workloads that are very latency-sensitive and characterized by single-block access via indexes. But there is a subset of OLTP systems that are also very write-intensive and demand a very high degree of concurrency to support a large number of users. Exadata was not designed to be the fastest possible solution for these write-intensive workloads. However, it’s worth noting that very few systems fall neatly into these categories. Most systems have a mixture of long-running, throughput-sensitive SQL statements and short-duration, latency-sensitive SQL statements. Which leads us to the next view of Exadata.

Consolidation Platform

This description pitches Exadata as a potential platform for consolidating multiple databases. This is desirable from a total cost of ownership (TCO) standpoint, as it has the potential to reduce complexity (and therefore costs associated with that complexity), reduce administration costs by decreasing the number of systems that must be maintained, reduce power usage and data center costs through reducing the number of servers, and reduce software and maintenance fees. This is a valid way to view Exadata. Because of the combination of features incorporated in Exadata, it is capable of adequately supporting multiple workload profiles at the same time. Although it is not the perfect OLTP Machine, the Flash Cache feature provides a mechanism for ensuring low latency for OLTP-oriented workloads. The Smart Scan optimizations provide exceptional performance for high-throughput, DW-oriented workloads. Resource Management options built into the platform provide the ability for these somewhat conflicting requirements to be satisfied on the same platform. In fact, one of the biggest upsides to this ability is the possibility of totally eliminating a huge amount of work that is currently performed in many shops to move data from an OLTP system to a DW system so that long-running queries do not negatively affect the latency-sensitive workload. In many shops, simply moving data from one platform to another consumes more resources than any other operation. Exadata’s capabilities in this regard may make this process unnecessary in many cases.

Configuration Options

Since Exadata is delivered as a preconfigured, integrated system, there are very few options available. As of this writing there are four versions available. They are grouped into two major categories with different model names (the X2-2 and the X2-8). The storage tiers and networking components for the two models are identical. The database tiers, however, are different.

Exadata Database Machine X2-2

The X2-2 comes in three flavors: quarter rack, half rack, and full rack. The system is built to be upgradeable, so you can upgrade later from a quarter rack to half rack, for example. Here is what you need to know about the different options:

Quarter Rack: The X2-2 Quarter Rack comes with two database servers and three storage servers. The high-capacity version provides roughly 33TB of usable disk space if it is configured for normal redundancy. The high-performance version provides roughly one third of that or about 10TB of usable space, again if configured for normal redundancy.

Half Rack: The X2-2 Half Rack comes with four database servers and seven storage servers. The high-capacity version provides roughly 77TB of usable disk space if it is configured for normal redundancy. The high-performance version provides roughly 23TB of usable space if configured for normal redundancy.

Full Rack: The X2-2 Full Rack comes with eight database servers and fourteen storage servers. The high-capacity version provides roughly 154TB of usable disk space if it is configured for normal redundancy. The high-performance version provides about 47TB of usable space if configured for normal redundancy.

Note: Here’s how we came up with the rough usable space estimates. We took the actual size of the disk and subtracted 29GB for OS/DBFS space. Assuming the actual disk sizes are 1,861GB and 571GB for high-capacity (HC) and high-performance (HP) drives, that leaves 1,833GB for HC and 543GB for HP. Multiply that by the number of disks in the rack (36, 84, or 168). Divide that number by 2 or 3, depending on whether you are using normal or high redundancy, to get usable space. Keep in mind that the “usable free mb” figure that asmcmd reports takes into account the space needed for a rebalance if a failgroup were lost (req_mir_free_MB). Usable file space from asmcmd's lsdg is calculated as follows:

Free_MB / redundancy - (req_mir_free_MB / 2)
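
For example, plugging the half rack, high-capacity figures into that arithmetic reproduces the estimate quoted earlier:

84 disks x 1,833GB = 153,972GB of raw space
153,972GB / 2 (normal redundancy) = 76,986GB, or roughly 77TB usable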

Half and full racks are designed to be connected to additional racks, enabling multiple-rack configurations. These configurations have an additional InfiniBand switch called a spine switch. It is intended to be used to connect additional racks. There are enough available connections to connect as many as eight racks, although additional cabling may be required depending on the number of racks you intend to connect. The database servers of the multiple racks can be combined into a single RAC database with database servers that span racks, or they may be used to form several smaller RAC clusters. Chapter 15 contains more information about connecting multiple racks.

Exadata Database Machine X2-8

There is currently only one version of the X2-8. It has two database servers and fourteen storage cells. It is effectively an X2-2 Full Rack, but with two large database servers instead of the eight smaller database servers used in the X2-2. As previously mentioned, the storage servers and networking components are identical to those of the X2-2 model. There are no upgrades specific to the X2-8 available. If you need more capacity, the option is to add another X2-8, although it is possible to add additional storage cells.

Upgrades

Quarter racks and half racks may be upgraded to add more capacity. The current price list has two options for upgrades, the Half Rack To Full Rack Upgrade and the Quarter Rack to Half Rack Upgrade. The options are limited in an effort to maintain the relative balance between database servers and storage servers. These upgrades are done in the field. If you order an upgrade, the individual components will be shipped to your site on a big pallet and a Sun engineer will be scheduled to install the components into your rack. All the necessary parts should be there, including rack rails and cables. Unfortunately, the labels for the cables seem to come from some other part of the universe. When we did the upgrade on our lab system, the lack of labels held us up for a couple of days.

The quarter-to-half upgrade includes two database servers and four storage servers along with an additional InfiniBand switch, which is configured as a spine switch. The half-to-full upgrade includes four database servers and seven storage servers. There is no additional InfiniBand switch required, because the half rack already includes a spine switch.

There is also the possibility of adding standalone storage servers to an existing rack. Although this goes against the balanced configuration philosophy, Oracle does allow it. Oddly enough, they do not support placing the storage servers in the existing rack, even if there is space (as in the case of a quarter rack or half rack for example).

There are a couple of other things worth noting about upgrades. Many companies purchased Exadata V2 systems and are now in the process of upgrading those systems. Several questions naturally arise with regard to this process. One has to do with whether it is acceptable to mix the newer X2-2 servers with the older V2 components. The answer is yes, it’s OK to mix them. In our lab environment, for example, we have a mixture of V2 (our original quarter rack) and X2-2 servers (the upgrade to a half rack). We chose to upgrade our existing system to a half rack rather than purchase another standalone quarter rack with X2-2 components, which was another viable option.

The other question that comes up frequently is whether adding additional standalone storage servers is an option for companies that are running out of space but that have plenty of CPU capacity on the database servers. This question is not as easy to answer. From a licensing standpoint, Oracle will sell you additional storage servers, but remember that one of the goals of Exadata was to create a more balanced architecture. So you should carefully consider whether you need more processing capability at the database tier to handle the additional throughput provided by the additional storage. However, if it’s simply lack of space that you are dealing with, additional storage servers are certainly a viable option.

Hardware Components

You’ve probably seen many pictures like the one in Figure 1-2. It shows an Exadata Database Machine Full Rack. We’ve added a few graphic elements to show you where the various pieces reside in the cabinet. In this section we’ll cover those pieces.


Figure 1-2. An Exadata Full Rack

As you can see, most of the networking components, including an Ethernet switch and two redundant InfiniBand switches, are located in the middle of the rack. This makes sense, as it simplifies the cabling. There is also a Sun Integrated Lights Out Manager (ILOM) module and a KVM in the center section. The surrounding eight slots are reserved for database servers, and the rest of the rack is used for storage servers, with one exception. The very bottom slot is used for an additional InfiniBand “spine” switch that can be used to connect additional racks if so desired. It is located in the bottom of the rack based on the expectation that your Exadata will be in a data center with a raised floor, allowing cabling to be run from the bottom of the rack.

Operating Systems

The current generation X2 hardware configurations use Intel-based Sun servers. As of this writing, all the servers come preinstalled with Oracle Linux 5. Oracle has announced that they intend to support two versions of the Linux kernel—the standard Red Hat-compatible version and an enhanced version called the Unbreakable Enterprise Kernel (UEK). This optimized version has several enhancements that are specifically applicable to Exadata. Among these are network-related improvements to InfiniBand using the RDS protocol. One of the reasons for releasing the UEK may be to speed up Oracle’s ability to roll out changes to Linux by avoiding the lengthy process necessary to get changes into the standard open source releases. Oracle has been a strong partner in the development of Linux and has made several major contributions to the code base. The stated direction is to submit all the enhancements included in the UEK version for inclusion in the standard release.

Oracle has also announced that the X2 database servers will have the option of running Solaris 11 Express. And speaking of Solaris, we are frequently asked about whether Oracle has plans to release a version of Exadata that uses SPARC CPUs. At the time of this writing, there has been no indication that this will be a future direction. It seems more likely that Oracle will continue to pursue the X86-based solution.

Storage servers for both the X2-2 and X2-8 models will continue to run exclusively on Oracle Linux. Oracle views these servers as a closed system and does not support installing any additional software on them.

Database Servers

The current generation X2-2 database servers are based on the Sun Fire X4170 M2 servers. Each server has two six-core Intel Xeon X5670 processors (2.93GHz) and 96GB of memory. They also have four internal 300GB 10K RPM SAS drives. They have several network connections, including two 10Gb and four 1Gb Ethernet ports, in addition to the two QDR InfiniBand (40Gb/s) ports. Note that the 10Gb ports come empty; you’ll need to provide the correct SFP+ connectors and cables to attach them to your existing copper or fiber network. The servers also have a dedicated ILOM port and dual hot-swappable power supplies.

The X2-8 database servers are based on the Sun Fire X4800 servers. They are designed to handle systems that require a large amount of memory. Each server is equipped with eight 8-core Intel Xeon X7560 processors (2.26GHz) and 1TB of memory. This gives the full rack system a total of 128 database server cores and 2TB of memory.

Storage Servers

The current generation of storage servers is the same for both the X2-2 and the X2-8 models. Each storage server is based on a Sun Fire X4270 M2 and contains 12 disks. Depending on whether you have the high-capacity version or the high-performance version, the disks will be either 2TB or 600GB SAS drives. Each storage server comes with 24GB of memory and two six-core Intel Xeon X5670 processors running at 2.93GHz. These are the same CPUs as in the X2-2 database servers. Because these CPUs are in the Westmere family, they have built-in AES encryption support, which essentially provides a hardware assist to encryption and decryption. Each storage server also contains four 96GB Sun Flash Accelerator F20 PCIe cards, providing a total of 384GB of flash-based storage on each storage cell. The storage servers come preinstalled with Oracle Linux 5.

InfiniBand

One of the more important hardware components of Exadata is the InfiniBand network. It is used for transferring data between the database tier and the storage tier. It is also used for interconnect traffic between the database servers, if they are configured in a RAC cluster. In addition, the InfiniBand network may be used to connect to external systems for such uses as backups. Exadata provides redundant 36-port QDR InfiniBand switches for these purposes. The switches provide 40Gb/s of throughput. You will occasionally see these switches referred to as “leaf” switches. In addition, each database server and each storage server is equipped with Dual-Port QDR InfiniBand Host Channel Adapters. All but the smallest (quarter rack) Exadata configurations also contain a third InfiniBand switch, intended for chaining multiple Exadata racks together. This switch is generally referred to as a “spine” switch.

Flash Cache

As mentioned earlier, each storage server comes equipped with 384GB of flash-based storage. This storage is generally configured to be a cache. Oracle refers to it as Exadata Smart Flash Cache (ESFC). The primary purpose of ESFC is to minimize the service time for single block reads. This feature provides a substantial amount of disk cache, about 2.5TB on a half rack configuration.
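
If you are curious how much help the cache is providing, the database exposes statistics for it. The following is a rough sketch using 11gR2 statistic names; on non-Exadata storage the cell statistic simply remains at zero.

SELECT name, value
FROM   v$sysstat
WHERE  name IN ('cell flash cache read hits',
                'physical read total IO requests');

Comparing the two values gives a crude, instance-wide sense of how many read requests were satisfied from ESFC.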

Disks

Oracle provides two options for disks. An Exadata Database Machine may be configured with either high-capacity drives or high-performance drives. As previously mentioned, the high-capacity option includes 2TB, 7,200 RPM drives, while the high-performance option includes 600GB, 15,000 RPM SAS drives. Oracle does not allow a mixture of the two drive types. With the large amount of flash cache available on the storage cells, the high-capacity option should be adequate for most read-heavy workloads. The flash cache does a very good job of reducing single-block-read latency in the mixed-workload systems we’ve observed to date.

Bits and Pieces

The package price includes a 42U rack with redundant power distribution units. Also included in the price is an Ethernet switch. The spec sheets don’t specify the model of the Ethernet switch, but as of this writing Oracle is shipping a switch manufactured by Cisco. To date, this is the one piece of the package that Oracle has agreed to allow customers to replace. If you have another switch that you like better, you can remove the included switch and replace it (at your own cost). The X2-2 includes a KVM unit as well. The package price also includes a spares kit with an extra flash card, an extra disk drive, and some extra InfiniBand cables (two extra flash cards and two extra disk drives on full racks). The package price does not include SFP+ connectors or cables for the 10Gb Ethernet ports. These are not standard and will vary based on the equipment used in your network. The ports are intended for external connections of the database servers to the customer’s network.

Software Components

The software components that make up Exadata are split between the database tier and the storage tier. Standard Oracle database software runs on the database servers, while Oracle’s relatively new disk management software runs on the storage servers. The components on both tiers use a protocol called iDB to talk to each other. The next two sections provide a brief introduction to the software stack that resides on both tiers.

Database Server Software

As previously discussed, the database servers run Oracle Linux. There is also the option to run Solaris 11 Express, but as of this writing we have not yet seen an Exadata system running Solaris.

The database servers also run standard Oracle 11g Release 2 software. There is no special version of the database code that is different from the code that is run on any other platform. This is actually a unique and significant feature of Exadata, compared to competing data warehouse appliance products. In essence, it means that any application that can run on Oracle 11gR2 can run on Exadata without requiring any changes to the application. While there is code that is specific to the Exadata platform, iDB for example, Oracle chose to make it a part of the standard distribution. The software is aware of whether it is accessing Exadata storage, and this “awareness” allows it to make use of the Exadata-specific optimizations when accessing Exadata storage.
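
One small piece of supporting evidence is the set of cell-related initialization parameters, which are present even on databases that have never touched Exadata storage. A quick way to list them from any 11gR2 instance:

SELECT name, value
FROM   v$parameter
WHERE  name LIKE 'cell%';

Among them you should see cell_offload_processing, the switch that controls whether offloading is attempted at all when Exadata storage is present.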

ASM (Oracle Automatic Storage Management) is a key component of the software stack on the database servers. It provides file system and volume management capability for Exadata storage. It is required because the storage devices are not visible to the database servers; there is no direct mechanism for processes on the database servers to open or read a file on Exadata storage cells. ASM also provides redundancy to the storage by mirroring data, using either normal redundancy (two copies) or high redundancy (three copies). This is an important feature because the disks are physically located on multiple storage servers. The ASM redundancy allows mirroring across the storage cells, which allows for the complete loss of a storage server without an interruption to the databases running on the platform. There is no form of hardware- or software-based RAID protecting the data on Exadata storage servers; the mirroring protection is provided exclusively by ASM.
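
Because ASM owns both the storage mapping and the mirroring, the redundancy and usable space figures discussed earlier in the chapter are easiest to check from the ASM views. Here is a minimal sketch; the columns are standard 11gR2 V$ASM_DISKGROUP columns, and usable_file_mb is the same figure that asmcmd lsdg reports as usable space.

SELECT name,
       type,                        -- NORMAL or HIGH redundancy
       total_mb,
       free_mb,
       required_mirror_free_mb,     -- reserve needed to re-mirror after a failgroup loss
       usable_file_mb               -- what asmcmd lsdg reports as usable
FROM   v$asm_diskgroup;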

While RAC is generally installed on Exadata database servers, it is not actually required. RAC does provide many benefits in terms of high availability and scalability though. For systems that require more CPU or memory resources than can be supplied by a single server, RAC is the path to those additional resources.

The database servers and the storage servers communicate using the Intelligent Database protocol (iDB). iDB implements what Oracle refers to as a function shipping architecture. This term describes how iDB ships information about the SQL statement being executed to the storage cells and then returns processed data (prefiltered, for example), instead of data blocks, directly to the requesting processes. In this mode, iDB can limit the data returned to the database server to only those rows and columns that satisfy the query. Function shipping is available only when full scans are performed. iDB can also send and retrieve full blocks when offloading is not possible (or not desirable). In this mode, iDB is used like a normal I/O protocol for fetching entire Oracle blocks and returning them to the Oracle buffer cache on the database servers. For completeness, we should mention that it is really not an all-or-nothing scenario; there are cases where we get a combination of these two behaviors. We’ll discuss that in more detail in Chapter 2.
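
A simple way to see from the database side whether function shipping actually took place for your session is to look at the cell-related statistics. This is a hedged sketch using 11gR2 statistic names; the values are cumulative for the current session.

SELECT n.name, s.value
FROM   v$mystat s, v$statname n
WHERE  s.statistic# = n.statistic#
AND    n.name IN ('cell physical IO bytes eligible for predicate offload',
                  'cell physical IO interconnect bytes returned by smart scan');

The gap between the two numbers represents data that never had to cross the InfiniBand network, which is exactly the savings described above.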

iDB uses the Reliable Datagram Sockets (RDS) protocol and of course uses the InfiniBand fabric between the database servers and storage cells. RDS is a low-latency, low-overhead protocol that provides a significant reduction in CPU usage compared to protocols such as UDP. RDS has been around for some time and predates Exadata by several years. The protocol implements a direct memory access model for interprocess communication, which allows it to avoid the latency and CPU overhead associated with traditional TCP traffic.

Kevin Says: RDS has indeed been around for quite some time, although not with the Exadata use case in mind. The history of RDS goes back to the partnering between SilverStorm (acquired by Qlogic Corporation) and Oracle to address the requirements for low latency and high bandwidth placed upon the Real Application Clusters node interconnect (via libskgxp) for DLM lock traffic and, to a lesser degree, for Parallel Query data shipping. The latter model was first proven by a 1TB scale TPC-H conducted with Oracle Database 10g on the now defunct PANTASystems platform. Later Oracle aligned itself more closely with Mellanox.

This history lesson touches on an important point. iDB is based on libskgxp, which enjoyed many years of hardening in its role of interconnect library dating back to the first phase of the Cache Fusion feature in Oracle8i. The ability to leverage a tried and true technology like libskgxp came in handy during the move to take SAGE to market.

It is important to understand that no storage devices are directly presented to the operating systems on the database servers. Therefore, there are no operating-system calls to open files, read blocks from them, or the other usual tasks. This also means that standard operating-system utilities like iostat will not be useful in monitoring your database servers, because the processes running there will not be issuing I/O calls to the database files. Here’s some output that illustrates this fact:

KSO@SANDBOX1> @whoami

USERNAME              SID     SERIAL# PREV_HASH_VALUE SCHEMANAME  OS_PID
--------------- ---------- ---------- --------------- ---------- -------
KSO                    689        771      2334772408 KSO          23922

KSO@SANDBOX1> select /* avgskew3.sql */ avg(pk_col) from kso.skew3 a where col1 > 0;

...

> strace -cp 23922
Process 23922 attached - interrupt to quit
Process 23922 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 49.75    0.004690           0     10902      5451 setsockopt
 29.27    0.002759           0      6365           poll
 11.30    0.001065           0      5487           sendmsg
  9.60    0.000905           0     15328      4297 recvmsg
  0.08    0.000008           1        16           fcntl
  0.00    0.000000           0        59           read
  0.00    0.000000           0         3           write
  0.00    0.000000           0        32        12 open
  0.00    0.000000           0        20           close
  0.00    0.000000           0         4           stat
  0.00    0.000000           0         4           fstat
  0.00    0.000000           0        52           lseek
  0.00    0.000000           0        33           mmap
  0.00    0.000000           0         7           munmap
  0.00    0.000000           0         1           semctl
  0.00    0.000000           0        65           getrusage
  0.00    0.000000           0        32           times
  0.00    0.000000           0         1           semtimedop
------ ----------- ----------- --------- --------- ----------------
100.00    0.009427                 38411      9760 total

In this listing we have run strace on a user’s foreground process (sometimes called a shadow process). This is the process that’s responsible for retrieving data on behalf of a user. As you can see, the vast majority of system calls captured by strace are network-related (setsockopt, poll, sendmsg, and recvmsg). By contrast, on a non-Exadata platform we mostly see disk I/O-related events, primarily some form of the read call. Here’s some output from a non-Exadata platform for comparison:

KSO@LAB112> @whoami

USERNAME              SID     SERIAL# PREV_HASH_VALUE SCHEMANAME  OS_PID
--------------- ---------- ---------- --------------- ---------- -------
KSO                    249      32347      4128301241 KSO          22493

KSO@LAB112> @avgskew

AVG(PK_COL)
-----------
 16093749.8

...

[root@homer ~]# strace -cp 22493
Process 22493 attached - interrupt to quit
Process 22493 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 88.86    4.909365        3860      1272           pread64
 10.84    0.599031          65      9171           gettimeofday
  0.16    0.008766          64       136           getrusage
  0.04    0.002064          56        37           times
  0.02    0.001378         459         3           write
  0.02    0.001194         597         2           statfs
  0.02    0.001150         575         2           fstatfs
  0.02    0.001051         350         3           read
  0.01    0.000385          96         4           mmap2
  0.00    0.000210         105         2           io_destroy
  0.00    0.000154          77         2           io_setup
  0.00    0.000080          40         2           open
  0.00    0.000021          11         2           fcntl64
------ ----------- ----------- --------- --------- ----------------
100.00    5.524849                 10638           total

Notice that the main system call captured on the non-Exadata platform is I/O-related (pread64). The point of the previous two listings is to show that Exadata uses a very different mechanism for accessing data stored on disk.

Storage Server Software

Cell Services (cellsrv) is the primary software that runs on the storage cells. It is a multi-threaded program that services I/O requests from a database server. Those requests can be handled by returning processed data or by returning complete blocks, depending on the request. cellsrv also implements the I/O distribution rules defined by Resource Manager, ensuring that I/O is distributed to the various databases and consumer groups appropriately.

There are two other programs that run continuously on Exadata storage cells. Management Server (MS) is a Java program that provides the interface between cellsrv and the Cell Command Line Interface (cellcli) utility. MS also provides the interface between cellsrv and the Grid Control Exadata plug-in (which is implemented as a set of cellcli commands that are run via rsh). The second utility is Restart Server (RS). RS is actually a set of processes that is responsible for monitoring the other processes and restarting them if necessary. OSWatcher is also installed on the storage cells for collecting historical operating system statistics using standard Unix utilities such as vmstat and netstat. Note that Oracle does not authorize the installation of any additional software on the storage servers.

One of the first things you are likely to want to do when you first encounter Exadata is to log on to the storage cells and see what’s actually running. Unfortunately, the storage servers are generally off-limits to everyone except the designated system administrators or DBAs. Here’s a quick listing showing the output generated by a ps command on an active storage server:

> ps -eo ruser,pid,ppid,cmd

RUSER      PID  PPID CMD
root     12447     1 /opt/oracle/.../cellsrv/bin/cellrssrm -ms 1 -cellsrv 1
root     12453 12447 /opt/oracle/.../cellsrv/bin/cellrsbmt -ms 1 -cellsrv 1
root     12454 12447 /opt/oracle/.../cellsrv/bin/cellrsmmt -ms 1 -cellsrv 1
root     12455 12447 /opt/oracle/.../cellsrv/bin/cellrsomt -ms 1 -cellsrv 1
root     12456 12453 /opt/oracle/.../bin/cellrsbkm
                     -rs_conf /opt/oracle/.../cellsrv/deploy/config/cellinit.ora
                     -ms_conf /opt/oracle/cell
root     12457 12454 /usr/java/jdk1.5.0_15//bin/java -Xms256m -Xmx512m
                     -Djava.library.path=/opt/oracle/.../cellsrv/lib
                     -Ddisable.checkForUpdate=true -jar /opt/oracle/cell11.2
root     12460 12456 /opt/oracle/.../cellsrv/bin/cellrssmt
                     -rs_conf /opt/oracle/.../cellsrv/deploy/config/cellinit.ora
                     -ms_conf /opt/oracle/cell
root     12461 12455 /opt/oracle/.../cellsrv/bin/cellsrv 100 5000 9 5042
root     12772 22479 /usr/bin/mpstat 5 720
root     12773 22479 bzip2 --stdout
root     17553     1 /bin/ksh ./OSWatcher.sh 15 168 bzip2
root     20135 22478 /usr/bin/top -b -c -d 5 -n 720
root     20136 22478 bzip2 --stdout
root     22445 17553 /bin/ksh ./OSWatcherFM.sh 168
root     22463 17553 /bin/ksh ./oswsub.sh HighFreq ./Exadata_vmstat.sh
root     22464 17553 /bin/ksh ./oswsub.sh HighFreq ./Exadata_mpstat.sh
root     22465 17553 /bin/ksh ./oswsub.sh HighFreq ./Exadata_netstat.sh
root     22466 17553 /bin/ksh ./oswsub.sh HighFreq ./Exadata_iostat.sh
root     22467 17553 /bin/ksh ./oswsub.sh HighFreq ./Exadata_top.sh
root     22471 17553 /bin/bash /opt/oracle.cellos/ExadataDiagCollector.sh
root     22472 17553 /bin/ksh ./oswsub.sh HighFreq
                              /opt/oracle.oswatcher/osw/ExadataRdsInfo.sh
root     22476 22463 /bin/bash ./Exadata_vmstat.sh HighFreq
root     22477 22466 /bin/bash ./Exadata_iostat.sh HighFreq
root     22478 22467 /bin/bash ./Exadata_top.sh HighFreq
root     22479 22464 /bin/bash ./Exadata_mpstat.sh HighFreq
root     22480 22465 /bin/bash ./Exadata_netstat.sh HighFreq
root     22496 22472 /bin/bash /opt/oracle.oswatcher/osw/ExadataRdsInfo.sh HighFreq

So as you can see, there are a number of processes with names that start with cellrs. These are the processes that make up the Restart Server. Also notice the java process (PID 12457); this is the program we refer to as Management Server. The cellsrv process itself appears as PID 12461. Finally, you’ll see several processes associated with OSWatcher. Note also that all the processes are started by root. While there are a couple of other semi-privileged accounts on the storage servers, it is clearly not a system that is set up for users to log on to.

Another interesting way to look at related processes is to use the ps -H command, which provides an indented list of processes showing how they are related to each other. You could work this out for yourself by building a tree based on the relationship between the process ID (PID) and parent process ID (PPID) in the previous listing, but the -H option makes that a lot easier. Here’s an edited snippet of output from a ps -H command:

cellrssrm <= main Restart Server
   cellrsbmt
      cellrsbkm
         cellrssmt
   cellrsmmt
      java - .../oc4j/ms/j2ee/home/oc4j.jar <= Management Server
         cellrsomt
            cellsrv

It’s also interesting to see what resources are being consumed on the storage servers. Here’s a snippet of output from top:

top - 18:20:27 up 2 days,  2:09,  1 user,  load average: 0.07, 0.15, 0.16
Tasks: 298 total,   1 running, 297 sleeping,   0 stopped,   0 zombie
Cpu(s):  6.1%us,  0.6%sy,  0.0%ni, 93.30%id,  0.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  24531712k total, 14250280k used, 10281432k free,   188720k buffers
Swap:  2096376k total,        0k used,  2096376k free,   497792k cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
12461 root      18   0 17.0g 4.5g  11m S 105.9 19.2  55:20.45 cellsrv
    1 root      18   0 10348  748  620 S   0.0  0.0   0:02.79 init
    2 root      RT  -5     0    0    0 S   0.0  0.0   0:00.14 migration/0
    3 root      34  19     0    0    0 S   0.0  0.0   0:01.45 ksoftirqd/0
    4 root      RT  -5     0    0    0 S   0.0  0.0   0:00.00 watchdog/0

The output from top shows that cellsrv is using more than one full CPU core. This is common on busy systems and is due to the multi-threaded nature of the cellsrv process.

Software Architecture

In this section we’ll briefly discuss the key software components and how they are connected in the Exadata architecture. There are components that run on both the database and the storage tiers. Figure 1-3 depicts the overall architecture of the Exadata platform.


Figure 1-3. Exadata architecture diagram

The top half of the diagram shows the key components on one of the database servers, while the bottom half shows the key components on one of the storage servers. The top half of the diagram should look pretty familiar, as it is standard Oracle 11g architecture. It shows the System Global Area (SGA), which contains the buffer cache and the shared pool. It also shows several of the key processes, such as Log Writer (LGWR) and Database Writer (DBWR). There are many more processes, of course, and much more detailed views of the shared memory that could be provided, but this should give you a basic picture of how things look on the database server.

The bottom half of the diagram shows the components on one of the storage servers. The architecture on the storage servers is pretty simple. There is really only one process (cellsrv) that handles all the communication to and from the database servers. There are also a handful of ancillary processes for managing and monitoring the environment.

One of the things you may notice in the architecture diagram is that cellsrv uses an init.ora file and has an alert log. In fact, the storage software bears a striking resemblance to an Oracle database. This shouldn’t be too surprising. The cellinit.ora file contains a set of parameters that are evaluated when cellsrv is started. The alert log is used to write a record of notable events, much like an alert log on an Oracle database. Note also that Automatic Diagnostic Repository (ADR) is included as part of the storage software for capturing and reporting diagnostic information.

Also notice that there is a standalone process (DISKMON) that is not attached to any database instance and that performs several tasks related to Exadata storage. Although it is called DISKMON, it is really a network- and cell-monitoring process that checks to verify that the cells are alive. DISKMON is also responsible for propagating Database Resource Manager (DBRM) plans to the storage servers. DISKMON also has a single slave process per instance, which handles communication between ASM and the database it serves.

The connection between the database server and the storage server is provided by the InfiniBand fabric. All communication between the two tiers is carried by this transport mechanism. This includes writes via the DBWR processes and LGWR process and reads carried out by the user foreground (or shadow) processes.

Figure 1-4 provides another view of the architecture, which focuses on the software stack and how it spans multiple servers in both the database grid and the storage grid.


Figure 1-4. Exadata software architecture

As we’ve discussed, ASM is a key component. Notice that we have drawn it as an object that cuts across all the communication lines between the two tiers. This is meant to indicate that ASM provides the mapping between the files and the objects that the database knows about on the storage layer. ASM does not actually sit between the storage and the database, though, and it is not a layer in the stack that the processes must touch for each “disk access.”

Figure 1-4 also shows the relationship between Database Resource Manager (DBRM) running on the instances on the database servers and I/O Resource Manager (IORM), which is implemented inside cellsrv running on the storage servers.

The final major component in Figure 1-4 is LIBCELL, which is a library that is linked with the Oracle kernel. LIBCELL contains the code that knows how to request data via iDB. This provides a very nonintrusive mechanism that allows the Oracle kernel to talk to the storage tier via network-based calls instead of operating system reads and writes. iDB is implemented on top of the Reliable Datagram Sockets (RDS) protocol provided by the OpenFabrics Enterprise Distribution. This is a low-latency, low-CPU-overhead protocol that provides interprocess communications. You may also see this protocol referred to in some of the Oracle marketing material as the Zero-loss Zero-copy Datagram Protocol (ZDP). Figure 1-5 is a basic schematic showing why the RDS protocol is more efficient than a traditional protocol such as UDP.


Figure 1-5. RDS schematic

As you can see from the diagram, using the RDS protocol to bypass much of the conventional network stack processing cuts out a portion of the overhead required to transfer data across the network. Note that the RDS protocol is also used for interconnect traffic between RAC nodes.

Summary

Exadata is a tightly integrated combination of hardware and software. There is nothing magical about the hardware components themselves. The majority of the performance benefits come from the way the components are integrated and the software that is implemented at the storage layer. In the next chapter we’ll dive into the offloading concept, which is what sets Exadata apart from all other platforms that run Oracle databases.
