CHAPTER  17


Big Data for the SQL Server DBA

By Carlos Bossy

By now you've all heard about "Big Data," and the way people talk about it, you might think it's as overwhelmingly large as the universe. If you're like most database administrators, you likely already work with data that is big. You probably spend your professional time managing multiple databases that are terabytes in size or larger, and continue to grow. The growth of data has outpaced Moore's Law, more than doubling every 24 months and making the DBA's job a challenge, to say the least. There's never been anything small about what you do. So, why is data now "big"?

This chapter will explain the meaning and significance of Big Data by comparing and contrasting it to the SQL Server relational database environment. It will also demonstrate the value a Big Data ecosystem can provide to your technology infrastructure by using it to complement your existing data architecture and having it co-exist in effective ways with your SQL Server databases. If hearing about Big Data has mystified you and made you wonder if it is nothing more than irrational exuberance, by the end of this chapter you’ll have a better sense of how to make the best use of it and why it could benefit you and your organization.

For SQL Server–centric DBAs, what exactly does Big Data mean? It’s still a nebulous area, not yet well-defined and without accepted practices or standards. If there wasn’t already enough to know in the demanding job of a DBA, learning, understanding, and managing yet another technology can certainly add to an already stressful workday. Throughout this chapter, I’ll present some definitions and descriptions that gradually introduce Big Data, present several business cases for taking advantage of Big Data, and show some practical working examples of actual code you can execute in Big Data database environments.

If you’re not certain what Big Data is, you’re not alone. Putting a wrapper around it isn’t easy, and there is no precise industry standard definition. So, let’s start by looking at some statements about Big Data that should provide some clarity.

Note Big Data is a class of data that requires more CPU and disk resources than stand-alone computer hardware can provide in order to be retrieved efficiently. Processing it effectively requires multiple distributed computers working in parallel. This data is difficult to process in a traditional relational database management system (RDBMS). Therefore, it is beneficial to define the schema for this data at retrieval time, after it is stored on disk.

The "three Vs" that define Big Data are volume of data measured in at least terabytes and into petabytes; variety of data that includes media files, images, text, streams, logs, sensors, and more; and velocity of data, which is the speed with which it is collected.

For the SQL Server DBA, Big Data means managing and processing data that forces you to think beyond SQL Server's ample set of features. It includes solutions that are difficult to implement even with the core database engine, partitioning, Service Broker, Analysis Services, and the many other tools you commonly use to make SQL Server process high volumes of assorted data types.

A Hadoop implementation of Big Data is a mechanism for handling data that can and must be distributed for massive scalability beyond what SQL Server can do (and more than it was meant to do). Hadoop is also good at handling large data types, data without a well-defined schema, and data that is generally difficult for traditional databases, such as SQL Server, to handle. Big Data is about massive volumes of data, and a Hadoop infrastructure may be in your future if you have to deal with any of the following types of data:

  • Unstructured data such as images, video, and free-form text
  • Data you would have thrown away to save space or maintain performance
  • Data you might need later for analysis, audits, or for an unplanned purpose
  • Anything you don’t want in your OLTP databases or data marts because it doesn’t really belong there
  • Data that doesn’t relate precisely to any other data

Without immense data volumes or special data types, it’s likely that a traditional SQL Server database solution will suit you just fine and will handle your data adequately without the complexity of introducing new technology. SQL Server OLTP relational databases start transactions, get data, lock it, change it, and commit the changes to the database. This is a proven, reliable, and well-understood process that has been developed and fine-tuned since the 1970s.
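To picture that pattern concretely, here is a minimal T-SQL sketch of the lock-change-commit cycle, using the AdventureWorks Product table as a stand-in (the ProductID and price change are illustrative values, not part of any example in this chapter):

BEGIN TRANSACTION;

    -- Read the current price and take an update lock so no other session changes it mid-flight
    SELECT ListPrice
    FROM   Production.Product WITH (UPDLOCK)
    WHERE  ProductID = 776;

    -- Change the data, then make the change durable
    UPDATE Production.Product
    SET    ListPrice = ListPrice * 1.05
    WHERE  ProductID = 776;

COMMIT TRANSACTION;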

In addition to relational database technology, you also have other options for storing data. An Analysis Services cube and Tabular models are great for reporting, ad-hoc analysis, dashboards, and other metrics-based visualization. You can also partition data, use file tables, scale out with Service Broker, and more, making every data architecture uniquely its own. These techniques provide you with a multitude of choices, with the arrival of Big Data being one more weapon in your arsenal.

Big Data is nonrelational, yet the relational data model has proven itself to be very useful for more than 30 years. It’s easy for technologists to understand, and if you understand the model, it’s a good bet that you understand the application. It describes the entities in the database in a very familiar way; as we talk about the entities, we also describe the real world. A customer has an address. A store contains products. A supervisor is an employee. Developers can easily interpret its visual representation, and then use that as a guide for how to handle data within an application. For example, you can scan a data model diagram and immediately get a great understanding of the data. You could even post this diagram on a wall for reference and discussion.

The drawback of implementing data architecture primarily with the relational engine as a comprehensive, one-size-fits-all solution is that you often have difficulty making every data model work the way you would like. Often, you might use the relational engine for data that doesn't quite fit, but you do so because it's what you have available and it's what you know. Using all of SQL Server correctly can require a highly experienced professional with expertise in every feature, property, and setting of the DBMS. For most professionals in the SQL Server world, the relational engine is the typical and default solution for all data types and structures, including logs, archives, BLOBs, text, star schemas, staging data, and streaming data. That has now changed with Big Data. You'll have to choose where data will reside, and dropping it in an OLTP database won't necessarily be the only (or default) choice. This is a new realm that the enterprise DBA will have to work within as they develop this architecture.

Should all of your data be stored in a Big Data cluster? It could be, but in a Microsoft SQL Server environment, the answer to whether it should be is very clear: "Never!" Remember, you still have SQL Server and it meets most (or all) of your needs already. It's a solid, reliable, full-featured package with a quality database engine, and any ACID (atomicity, consistency, isolation, durability) requirements you have in a high-volume environment will still be better served by SQL Server.

Big Data Arrives with Hadoop

Now that we’ve established that Big Data doesn’t replace SQL Server or your database infrastructure but complements it, let’s look at the software and hardware that you will install, configure, and work with, and business cases that can make it useful for you. When you make the leap to Big Data, you are committing to developing a new type of data architecture that typically involves adding a Hadoop cluster to your environment, either internally within your network or in the cloud. This chapter will work with Hadoop on Azure, the cloud-based solution available from Microsoft in 2012. As of this writing, Microsoft has provided a Community Technology Preview (CTP) edition that we’ll make use of to show working code. As a follow-on product, Microsoft will release Hadoop for Windows sometime in the future, but because this edition isn’t publicly available yet, we won’t discuss it here.

Hadoop is a framework, or an ecosystem, not a database. The framework is made up of various pieces of software that allow for the distributed, parallel, and redundant storing of data. The key parts of Hadoop are HDFS and MapReduce.

  • HDFS: The Hadoop Distributed File System is the main file storage system used by the Hadoop framework software and applications.
  • MapReduce: This is a software framework for processing large amounts of data in parallel on a cluster of servers in a reliable, fault-tolerant manner.

Hadoop gets its processing power by being able to write data in parallel to a large set of distributed machines, which in the Hadoop ecosystem are called nodes. When we talk about a large set of machines, consider that a small cluster might have 50 nodes. A cluster of this size would have all 50 machines processing data simultaneously in a coordinated manner. The redundancy of the data is baked into this process, as it works by writing every piece of data to three nodes simultaneously.
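That three-copy redundancy is a configurable HDFS setting rather than anything you code by hand. As a sketch, the property that controls it lives inside the <configuration> element of hdfs-site.xml and defaults to 3:

<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>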

A Hadoop cluster has three types of machines: Clients, Master Nodes, and Slave Nodes. There are two types of Master Nodes: Job Trackers and Name Nodes. The Name Node coordinates the data storage in HDFS, while the Job Tracker organizes the parallel processing of data using MapReduce. The majority of the machines in a Hadoop deployment are Slave Nodes, which do the work of storing data and running computations. Each Slave runs both a Data Node and Task Tracker daemon that communicate with and receive instructions from their Master Nodes. The responsibility of the Client machines is to load data into the cluster, submit MapReduce jobs, and retrieve the results of the jobs when they finish.

Figure 17-1 shows the Client–Master–Slave relationship and how the Name Node works with the data nodes to store and get data in HDFS. In smaller clusters of 50 nodes or fewer, you could have a single physical server playing multiple roles, such as both Job Tracker and Name Node. With medium to large clusters, you will often have each role running on its own dedicated server.


Figure 17-1. A Hadoop cluster with N data nodes

MapReduce: The Nucleus of Hadoop

The reason for Hadoop’s surge in popularity is its ability to retrieve and process large amounts of data faster than any other means available, and MapReduce is the key to making this happen. Consequently, the best way to understand Hadoop is to know how MapReduce works. To do this, you’ll walk through an example of how MapReduce crunches data to solve a high-volume problem. Although writing code for MapReduce programs won’t typically be the DBA’s job, knowing how the data is processed will be as important to them as knowing how transactions are handled by a web application.

Note Hadoop is written in the Java programming language, so MapReduce code is most often written in Java even though there is support for other languages, such as C++ and Python. An interesting development introduced by Microsoft to the Hadoop ecosystem is the addition of JavaScript as a MapReduce programming language. You'll use JavaScript to walk through an example of MapReduce in this section, and later in the chapter you'll execute a MapReduce job with your own code in the "JavaScript" section. For many software and database developers with experience using Microsoft development tools, JavaScript will be familiar and a viable option for MapReduce development.

To see how MapReduce processes data, you’ll now work through a business scenario of an online retailer that is doing more than $2 billion in sales per year and has accumulated a large volume of sales data over the past three years. Assume the number of sales transactions processed over the three-year period has grown to more than 1 billion, and the company has used Hadoop to store a record of each transaction. Some of the data recorded for each transaction is shown in Table 17-1. The full set of data is spread across a cluster of 100 data nodes.

Keeping a record of the data in so much detail allows you to retrieve and analyze it in any way you choose. You know a lot about each sale, including customer details, product characteristics, and promotion information. The data is stored in CSV format by Hadoop in HDFS, so you can analyze this data over the full three-year period by computing sales metrics in numerous ways using the geography, product, promotion, customer, and sales data in your dataset. For your first task, you'll calculate the total sales amount by product line.


MapReduce has two important functions: map and reduce. The map function takes as input a dataset, filters out any data that is dirty or unwanted, and assigns a user-defined key to each row before outputting it. The Hadoop framework groups the data together by the user-defined key and sorts it before sending it to the reduce function. In this example, the key you’ll use is Product Line. The only other data you’ll need is Sale Amount, so ignore the rest. Here’s a quick explanation of the important terms:

  • MapReduce Job: This is a process run by the MapReduce framework that includes input data, custom MapReduce code, and configuration settings. One MapReduce job executes both map tasks and reduce tasks. The MapReduce work flow is shown in Figure 17-2.
  • Map Task: The input to a MapReduce job is split into equal-sized units of work called map tasks by the framework. Each task processes a subset of a whole dataset and ideally executes on a node that contains that subset on its local disk.
  • Reduce Task: Reduce tasks are made up of custom code and configuration settings that take as input the output of the map tasks. The map tasks group and sort the data by key and their output is fed into reduce tasks for further processing.
  • Map Function: This is the custom program code that is executed by a map task whose main purpose is to filter data and identify a key field.
  • Reduce Function: This is the custom program code that is executed by a reduce task that normally performs aggregations and other calculations.

Note The number of map tasks does not have to equal the number of reduce tasks. Often, there will be multiple map tasks and only one reduce task if the Hadoop framework decides that using a single reduce task is the most efficient way to process the data. You can limit the number of map and reduce tasks running in parallel per data node (which by default is set to 2 and 1, respectively) by editing the mapred-site.xml config file under the conf directory of Hadoop. These parameters set only an upper boundary, and the Hadoop framework may use fewer tasks than the maximum if it determines it can complete the job with fewer resources.
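For example, a mapred-site.xml fragment (inside its <configuration> element) that sets those per-node limits might look like the following sketch; adjust the values to suit your own hardware:

<property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
</property>
<property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
</property>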


Figure 17-2. MapReduce work flow

The data is read from the file system as input to the map function, and is passed in as a key-value pair with the key being a file offset for each row. The file offset isn’t important to you here, so ignore the input in this example. Ordinarily, the file offset will never be useful because you’re not interested in the data’s location within the file. However, the value is important here, as it’s the actual data that you want to process. The input to the map function will have complete rows, so it’s the MapReduce developer’s job to identify the data you’re interested in and parse the input so you retrieve only the Product Line and Sale Amount. Table 17-2 has an example of data that is being sent to the map function as input.


At this point in the process, the developer can also look for dirty data by checking for nulls or other inconsistencies and cleanse it or toss it out. This is an example of a simple map function using JavaScript:

var map = function (key, value, context) {
    // Split the comma-delimited input row into an array of column values
    var aryValues = value.split(",");
    var pattern = /[A-Za-z]/;
    var productLine;
    var salesAmount;
    productLine = aryValues[2];   // Product Line is at offset 2
    salesAmount = aryValues[6];   // Sale Amount is at offset 6

    // Keep only rows with an alphabetic, non-null product line; emit Product Line as the key
    if (pattern.test(productLine) && productLine.toString() != "null")
    {
        context.write(productLine.toUpperCase(), salesAmount.toString());
    }
};

Even though it’s a small function, you’ll see how it can be very worthwhile without being large and complex. The value portion of the key-value pair that is sent in as input to the function is a row of text data that you parse by splitting, using a comma as the delimiter because your data is comma-delimited. Then, you get the column values you’re interested in by accessing them using array offsets. So, in our example, Product Line is at offset 2 and Sales Amount is at offset 6.

The map function’s main purpose is to have a way to define your own key-value pairs for later processing. You use the context.write function to return a key-value pair as output, with Product Line being the key and Sale Amount being the value. The identification of the new key-value pair is made by the developer based on the problem being solved, and must be done correctly for MapReduce to get the desired results. Table 17-3 has an example of what this data looks like when it comes out of the map function.


In between the map and reduce functions, the MapReduce framework helps by doing some work for you. It groups and sorts the key-value pairs by key in a process known as the shuffle and passes the results on to the reduce step. Because your data has Product Line as the key, it is input into the reduce function grouped and sorted by Product Line. Each Product Line has every Sale Amount value associated with it. These new key-value pairs are written to local disk on each data node so the reduce function can read them. The key-value pairs look like the four rows shown below.

{M, (3399.99, 54.99, 899.99, 1578.27)}
{R, (699.09, 1777.77, 3528.99)}
{S, (35.99, 120.00)}
{T, (874.99)}

Figure 17-3. MapReduce job with a single reduce task

Now you have something you can work with because the key-value pairs are sorted and grouped by key, but remember that this was done by the map step running map tasks on separate nodes. It’s possible that you could have the same key with different values on every server that executed a map task. The keys need to be brought together in the reduce step, so the MapReduce framework takes care of this by sending them to a common node so they can be processed together, as shown in Figure 17-3.

The reduce step takes this data as input and iterates through it. When it receives a key-value pair, it is up to the developer to process the data accordingly. To satisfy your requirements, the developer would need to iterate through each Sale Amount value and aggregate it. Simply speaking, the reduce step allows you to sum sales for a key and return the key with the aggregated total. When the reduce step finishes, the job is complete, and the output is written to HDFS. Clients must read the output data from HDFS to make it visible to the end user. For this example, the reduce output is a small file with four rows of data, total sales by Product Line.

var reduce = function (key, values, context) {
    var sumsales = 0.0;
    var productLine = key;

    // Add up every Sale Amount for this Product Line
    while (values.hasNext()) {
        sumsales += parseFloat(values.next());
    }
    context.write(productLine, sumsales);
};

Reduce Output

{M, 5933.24}
{R, 6005.85}
{S, 155.99}
{T, 874.99}

MapReduce is the most important part of this chapter, so let’s take the example a step further to emphasize how much we made Hadoop do with a small amount of code. In a real-world scenario, this simple job could be executed on hundreds of computers at the same time, if the data is distributed across that number of data nodes, each of them taking on their own share of the load. While our code snippets have a big role to play in how the data is processed, a lot happens that isn’t visible to us. With so many servers involved, you’re probably thinking that there has to be a substantial amount of coordination, organization, and management going on—and you would be correct.

The MapReduce framework takes care of this, starting MapReduce jobs on the data nodes that contain the data you need, monitoring the jobs for success or failure, and bringing the data together after it is processed independently. If a job fails, the framework can reexecute it, and this is good because you could potentially be running jobs on hundreds of servers at the same time and you wouldn't want a failure on a single machine to stop you from your pursuit. A data node can fail for any number of reasons (such as a hardware problem, memory pressure, or low disk space), yet the MapReduce framework can restart the job on another server that has the same set of data to keep things running.

Figure 17-4 depicts how values that reside on separate data nodes are brought together using their associated keys by MapReduce.


Figure 17-4. A MapReduce process showing how key-value pairs are brought together in the reduce step

Even if you never write a line of MapReduce code, it’s important to know how it works because there are numerous tools available to DBAs for importing, exporting, and manipulating Hadoop data. These tools implement MapReduce to do their work, but hide the details from the user, while the JavaScript code allows you to customize the MapReduce function to suit your needs. In SQL Server terms, the JavaScript code would be T-SQL, while the Hadoop tools would be BCP and SSIS. The Hadoop tools you’ll use in the sample scripts later in the chapter will allow you to set some parameters that affect how they employ MapReduce to do their work, so understanding MapReduce at the code level is important for knowing how to use the tools effectively.

For in-depth coverage of setting up Hadoop and writing MapReduce code, see Pro Hadoop by Jason Venner (Apress, 2009). Venner's book covers MapReduce in much more detail and also discusses debugging, testing, and tuning MapReduce jobs.

Another important factor in how MapReduce works is the hardware configuration. Even though we’ve been talking about a cloud configuration with Hadoop on Azure, it’s important to have some knowledge about the hardware that drives Hadoop.

Hardware

You might have heard that Hadoop can use commodity hardware, suggesting second-rate or low-quality machines, but don’t assume we’re talking about the old servers that have been sitting under your desk for years. The majority of your servers will be Slave Nodes with lots of local disk storage and moderate amounts of CPU and RAM. Some of the machines will be Master Nodes that might have a slightly different configuration with more RAM and CPU, but less local storage. These machines might be cheaper than the machines you normally use for SQL Server, but not necessarily. Many SQL Server DBAs are running their databases on this type of commodity hardware, so the term commodity might be misleading, allowing them to think that they can use even lower-end hardware for Big Data. In addition, for Big Data you’ll need a lot more servers that will still need a minimum level of storage and memory to perform in a satisfactory manner. In this context, commodity hardware refers to a collection of servers that can be purchased off the shelf, not high-end computers that are custom designed and built for super-high performance.

A generic minimum configuration for a Hadoop data node should include two multicore CPUs, four directly attached local hard disks with at least 1 TB of storage each, and 24 GB of RAM. You would need two of these data nodes, giving you 8 TB of storage. If you are expecting to quickly ramp up a production Hadoop cluster to a larger scale, you could put together eight multicore servers, each with 32 GB of RAM and twelve 2-TB disks (24 TB of storage per server). This would give you 192 TB of total storage, but since the data is stored three times for redundancy, you actually have 64 TB to work with. What happens if usage takes off and you need more storage? You simply add more data nodes. One of Hadoop's strengths is the ability to add more data nodes to a cluster seamlessly, so whether you start with the smaller or larger configuration, you can always grow to a very large size. Consider that it would take approximately 130 of these 24-TB data nodes to scale up to 3 PB (petabytes) of raw, redundant storage, while the framework has been designed to handle 10,000 data nodes!

Caution Notice that we didn't talk about the storage being handled by a SAN. There's a good reason for this. Hadoop was designed for locally attached direct storage so that each node controls its own data storage. A SAN could become a bottleneck when multiple data nodes access it simultaneously (or if used by other applications in an enterprise setting), and it's a more expensive way to store data than commodity disk drives. Disk access is often the bottleneck when you process large amounts of data, and not sharing the storage reduces this bottleneck. Using separate storage dedicated to each data node allows the disks to be used in parallel more effectively.

While this high degree of distribution and parallelism provides an ability to store an almost infinite amount of data, an important thing to understand is that HDFS is limited in ways we’re not accustomed to as RDBMS-focused data architects. It supports forward-only parsing, so you are either reading ahead or appending to the end of a file. There is no concept of update or insert, and files can only be written to once. This is what makes HDFS very different from SQL Server, and will even seem crude at times compared to the tables, columns, views, and other database objects you’re familiar with.
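A couple of commands from the Hadoop Command Shell illustrate this file-centric model; the paths here are hypothetical:

REM Create an HDFS directory and load a local file into it;
REM there is no corresponding command that updates the file's contents in place
hadoop fs -mkdir /data/archive
hadoop fs -put C:\temp\weblog.csv /data/archive/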

Caution With such a highly distributed infrastructure running on so many servers, you think it's going to be really fast, don't you? Actually, it's not fast when run on a small amount of data. The overhead of starting up the various MapReduce jobs, getting the data, and assembling the output together is very high, so it makes sense to crunch through a large amount of data when you do this. The process is excessive when working with a small dataset, since spinning up MapReduce jobs each time a data request occurs is so expensive. The code examples later in this chapter will seem slow on the small sets of data you'll use. They will be much faster than traditional means once you're processing data that is terabytes in size, but unlike with SQL Server, you'll get used to measuring query response times from a Big Data system in minutes and hours instead of milliseconds and seconds.

DBA As Data Architect

Previously, we looked at how an online retailer with a large set of sales data could aggregate it by product line with MapReduce. We'll continue with product data, working through more business problems and their solutions.

In this setting, you are a SQL Server DBA at the fast-growing online retailer and the company is looking to track more of what is happening with its product data. One uneventful morning, a developer you work with stops by your desk and starts talking about the new requirements he’s been handed. He wants a large amount of disk space allocated on the database server because he has been tasked with keeping a record of every change made to the Product table. In particular, because of dynamic pricing, the list price of a product changes frequently, an average of 20 times per day. Based on current activity and a total of 500,000 active products, this adds up to 10 million rows per day. Yikes!

Trying to clear your head from the initial shock, you ask the developer how the data is going to be used. He claims that part is a bit unclear because the finance department wants him to capture it just in case they need it later for an audit. Immediately, thoughts race through your head about disk space, memory, and impact to existing databases. Do you create a new database for this data, install a new SQL Server instance, or fire up a new server?  How will you manage it if it starts getting used heavily? It’s a scenario an experienced DBA is all too familiar with.

With Hadoop  added to your DBA toolset, you make a wise decision by implementing a solution that makes the best use of the many features of SQL Server, while using Hadoop for what it does best. To get this done, your main challenge as DBA will be to define, regulate, and manage the flow of data among the various places it can reside. An example of the many ways the data can flow is shown in Figure 17-5.


Figure 17-5. Complicated data flow makes data architecture decisions more varied and complex.

Big Data for Analytics

Many of the demonstrations, webinars, papers, and other presentations you might have seen for using Big Data in a Microsoft environment are targeted at using it for analytics or to enhance a Business Intelligence framework. Microsoft has good reason for this; it’s an opportunity to show off tools such as PowerView  and Excel, which shine in those situations, giving them practically unlimited power as the amount of data an analyst can work with approaches infinity. It isn’t just Microsoft pushing this, as using it for analytics is also the most common reason for building a Big Data infrastructure.

The usual technique for analyzing data with these tools  is a three-step process of selecting the data to work with from within the HDFS data store, extracting a portion of it to a place the analyst can access it, and then analyzing it using Excel or a statistical analysis tool. Based on any findings or discoveries, the analyst repeats the process in an iterative fashion until a conclusion is reached. All of this, other than the analysis itself, is done in a  noninteractive way. Getting the data to a point where an analyst can work with it could take hours because the extract runs as a batch process in the background.

As an example of this, suppose an analyst wants to look at web logs for the past three months to determine which web page a potential customer goes to after viewing the About Us page on the web site. The web logs have been stored in HDFS for the past year and the storage has gotten large compared to the size of data this company is accustomed to, growing to over 100 TB in size. The analyst starts by setting some boundaries for this work, such as the dates of interest and the URL of the web page being targeted, which would be something like www.mywebsite.com/aboutus.aspx.

These criteria set by the analyst become the filters that allow MapReduce to extract just the data the analyst wants. A MapReduce job is fired off to access the web log that has been distributed over 75 servers, potentially grabbing data from all of them and outputting the results to a place, and in a form, the analyst can access. The data is now trimmed down to a size the analyst can work with instead of the initial 100-TB dataset. The analyst is ready to do her job using tools she's familiar with. She creates graphs to better visualize trends and patterns, runs the data through a statistical model for better understanding, and asks what-if questions that provoke more questions that require data the analyst doesn't yet have. The analyst goes back to Hadoop to get more data and is ready to do it again. This iterative process continues until the analyst is satisfied with the results she is getting. This type of data access and analysis is the main reason for Hadoop's existence and the primary use-case DBAs will need to support in a SQL Server + Hadoop environment.
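The map function for an extract like this could be as small as the earlier sales example. The following is only a sketch, and it assumes a comma-delimited log layout with the request date, the page requested, and the referring page in the first three columns (a real log format will differ):

var map = function (key, value, context) {
    var cols = value.split(",");
    var requestDate = cols[0];
    var page = cols[1];
    var referrer = cols[2];

    // Keep only requests in the date range whose referring page was About Us,
    // and emit the page the visitor went to next
    if (requestDate >= "2012-04-01" && requestDate <= "2012-06-30" &&
        referrer.indexOf("aboutus.aspx") != -1)
    {
        context.write(page, "1");
    }
};

A reduce function would then simply count the occurrences of each page, giving the analyst a ranked list of where visitors go after the About Us page.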

Using SQL Server with Hadoop

The best reason for adding a Hadoop data store to work side by side with SQL Server is to let Hadoop be responsible for handling data that isn’t transactional in nature, isn’t required to be ACID compliant, and grows in size quickly. As a DBA, your job in this new era of data management will be to make the right choices for where your data resides.

The DBA’s Role

Some DBAs are data architects and modelers, but many are not. Most DBAs have data protection as their first priority, including backup/restore, disaster recovery, high availability, and security. Others work on improving performance and reducing contention. The roles are muddled and the responsibilities are often unclear, and with Big Data, things are going to get murkier.

In newer software and database environments, such as those seen in technology startups or companies that don't perform intentional data architecture, the role of the DBA as data architect is often nonexistent. Developers decide how and where to store data and the DBA's involvement is that of an order taker, if any at all. In some enterprise environments where the data architect is king, every data-modeling decision is made or approved by his majesty. If the architect is simply another subject, the choices for storing data are made by the DBA in conjunction with the development team. Often, though, there is no data architect to play the role of gatekeeper on the data model and it's developed in an ad-hoc manner. In any case, the DBA has a significant role in storing and maintaining the data to ensure its integrity and availability, and should be prepared to take on the management, safety, and reliability of the Big Data framework.

The lack of control and stewardship will be a challenge to a DBA adding Big Data to their sphere of responsibility. Consider what it will do to your current environment when you introduce the following characteristics of Big Data environments:

  • You have unrelated mounds of data without foreign keys that come in all shapes and sizes.
  • Most data is processed in batch processes.
  • The data model is programmer driven.
  • The server hardware can be built out to a practically endless size.
  • The life cycle of data is undefined and unbounded.
  • Data cleansing is a question mark.
  • Tools for managing Big Data are in their infancy.
  • Bringing together disparate datasets will take expertise, effort, and money.

Big Data in Practice

Let’s explore a few patterns that show how to make use of Hadoop, and take a look at some examples that illustrate some ways that you might implement Big Data. One of the first decisions you’ll have to make, after deciding what data to store using Hadoop, is how you will use it and how it will interact with your SQL Server databases. There are numerous architectural choices, and numerous techniques for moving data back and forth. We’ll start by using Sqoop, a tool that will allow you to transfer data in bulk into and out of SQL Server.

Importing and Exporting

The first example you'll work through is simply moving data into and out of HDFS. Assume that you need to solve a problem for your marketing department. The AdventureWorks Company is implementing a sophisticated new dynamic pricing system, and marketing wants to know the price of a product for any day throughout the product's history. Today, the AdventureWorks database has a ProductListPriceHistory table, but because product prices changed infrequently in the past, it tracked price changes by wide date ranges. You'll want to replace it with something that can store a much higher volume of data. You also know that, in the future, you'll want to track everything about a product, including inventory levels, cost, marketing description, and more. To make sure you don't miss anything, you store every product in HDFS at midnight every night using your new Hadoop infrastructure. Using the AdventureWorks for SQL Azure database, the Product table has the following definition, using two user-defined data types:

CREATE TYPE dbo.Name FROM nvarchar(50) NULL
CREATE TYPE dbo.Flag FROM bit NOT NULL

CREATE TABLE Production.Product(
    ProductID int IDENTITY(1,1) NOT NULL,
    Name dbo.Name NOT NULL,
    ProductNumber nvarchar(25) NOT NULL,
    MakeFlag dbo.Flag NOT NULL,
    FinishedGoodsFlag dbo.Flag NOT NULL,
    Color nvarchar(15) NULL,
    SafetyStockLevel smallint NOT NULL,
    ReorderPoint smallint NOT NULL,
    StandardCost money NOT NULL,
    ListPrice money NOT NULL,
    Size nvarchar(5) NULL,
    SizeUnitMeasureCode nchar(3) NULL,
    WeightUnitMeasureCode nchar(3) NULL,
    Weight decimal(8, 2) NULL,
    DaysToManufacture int NOT NULL,
    ProductLine nchar(2) NULL,
    Class nchar(2) NULL,
    Style nchar(2) NULL,
    ProductSubcategoryID int NULL,
    ProductModelID int NULL,
    SellStartDate datetime NOT NULL,
    SellEndDate datetime NULL,
    DiscontinuedDate datetime NULL,
    rowguid uniqueidentifier NOT NULL,
    ModifiedDate datetime NOT NULL,
 CONSTRAINT PK_Product_ProductID PRIMARY KEY CLUSTERED
(
    ProductID ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)

The tool you’ll use to do this bulk data load work is Sqoop, a component of the Hadoop ecosystem (Hadoop on Azure includes Sqoop as part of its offering, so you don’t have to install it separately). Sqoop is important for DBAs because it’s the best way to perform bulk data transfers  between Hadoop and SQL Server. The first thing to know about Sqoop is that an import moves data into HDFS and an export moves data out of HDFS (and into SQL Server, in this example). This example will show the movement of this data where both the SQL Server AdventureWorks  database and the HDFS data store are in the cloud. Later, you’ll work through an example where both data stores are inside your organization’s network, although any of the code can be made to work with either SQL Server or HDFS inside your network or in the cloud.

Note Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores, such as relational databases.

To start, you’ll need a SQL Azure account. The first thing to do is add the AdventureWorks database to SQL Azure, which you can download from http://msftdbprodsamples.codeplex.com/releases/view/37304. Follow the instructions to get the database uploaded.

Then, you’re going to need access to a Hadoop cluster. For these examples, I allocated a Hadoop cluster in Windows Azure using the Community Technology Preview (CTP) edition at https://www.hadooponazure.com. After it’s allocated, you’re ready to go with a Hadoop cluster in the cloud, as shown in Figure 17-6. Note that because, as of this writing, Microsoft has released only the CTP edition of Hadoop, the figure shows that the cluster will expire soon, but it also displays a set of tiles that will let me use the cluster. You’ll spend most of your time running jobs from the command line that import and export data, so click Remote Desktop to directly log in to the cluster.


Figure 17-6. Hadoop on Azure CTP main menu

Now you can put this cluster to use by putting some data in there. After you RDP and log in to the cluster, go to the desktop and launch the Hadoop Command Shell. Once at the command line, you'll be at the root path for Hadoop and you'll see the c:\apps\dist> prompt. Before you do anything else, set the Hadoop home environment variable by executing the following line:

SET HADOOP_HOME=c:\Apps\dist

The next step is to change the command-line path to c:\Apps\dist\sqoop\bin, which is where you'll run Sqoop jobs to get data from AdventureWorks in SQL Azure into Hadoop and back.

cd sqoop\bin

Then, you need to log in to your AdventureWorks database and add a synonym. It's required because Sqoop puts brackets around the full table name (making it [Production.Product]), so SQL Server needs an object by that exact name when Sqoop accesses it. Go to SQL Azure and execute this SQL command:

CREATE SYNONYM [Production.Product] FOR Production.Product

The setup is done and you’re ready to get some work done! The first Sqoop job will copy the Product table from SQL Server to HDFS. This is akin to using the BCP utility for SQL Server, a tool you’re probably more familiar with. You aren’t specifying any filters, columns, or an explicit query, so the whole table will be copied.

Go back to the Hadoop Command Shell and enter the following script, replacing myserver in the connect argument with the name of your database server, and use an actual username and password that can log in to SQL Server. The --connect argument looks like a familiar connection string, and the import command tells Sqoop to copy data from the database to HDFS. The --target-dir argument is the folder path where the data will be stored in HDFS, and -m tells Sqoop the number of map tasks to run in parallel for the import. A partial list of Sqoop arguments and descriptions is shown in Table 17-4. Execute the script to get Sqoop running.

sqoop import --connect "jdbc:sqlserver://myserver.database.windows.net;
username=mysqluser@myserver;password=mypwd;database=AdventureWorks2012" --table
Production.Product --target-dir /data/ProductTable -m 1


You’ll see a lot of logging messages and then, hopefully, success. At the end of the logging, you should see a couple of messages that tell you the time it took to run this job, the number of bytes transferred, and that all 504 rows from the Product table were imported into HDFS. It should look something like the following:

12/07/18 02:14:58 INFO mapreduce.ImportJobBase: Transferred 102.9434 KB in 43.5506 seconds
(2.3638 KB/sec)
12/07/18 02:14:58 INFO mapreduce.ImportJobBase: Retrieved 504 records.

If you’re like me, the first thing you asked is “Where did the data go and how can I look at it?” Go back to the desktop and launch the HadoopNameNode. The screen will show you a link labeled “Browse the filesystem” (see Figure 17-7). Click it and a web browser will open with a list of directories, one of them being the data folder. Click this folder and you should see another folder called ProductTable, which is what you specified would be the HDFS location of the data in your Sqoop script. Click this folder and you’ll see a file with the name part-m-00000. Sqoop creates files with a file name format of part-m-xxxxx, where xxxxx is a sequential number starting at 0. This is where your data should be.

Now open the file by clicking it and you’ll see what looks like a comma-delimited file of the Product table. I know, not too exciting, but that’s it. This is the way Sqoop loaded the table into HDFS and the way it lives there. Don’t worry too much, though, it’s very functional this way.
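If you prefer the command line to the browser, you can inspect the same file from the Hadoop Command Shell:

hadoop fs -ls /data/ProductTable
hadoop fs -cat /data/ProductTable/part-m-00000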


Figure 17-7. Click the “Browse the filesystem” link to browse the datasets you’re storing in HDFS.

There are a few things to note about this initial data load. First, it didn't make the best use of Hadoop. It was a small amount of data, and you'll get more bang for the buck from your cluster with a larger dataset. Also, we asked Sqoop to do a vanilla load of the table into HDFS, so it loaded the whole table without using parallelism. To force parallelism, you could change the -m operator to a number greater than 1, and you would have to tell it the primary key of the table so it could split the data into multiple import jobs. If you don't tell Sqoop what the primary key of the table is, then the operator -m must have a value of 1 or else it will fail. Take a look at Figure 17-8 to see a diagram of Sqoop importing data in parallel by breaking the work up into multiple map tasks.


Figure 17-8. The Sqoop import data flow using MapReduce

What if you executed the same job again? It would fail because the file part-m-00000 already exists in the ProductTable folder. To succeed, you would have to add the -append operator, which would import the complete Product table again into the same folder, except this time put it in a file called part-m-00001. If you did this on a daily basis, you would store a complete snapshot of the Product table in HDFS with one file per day.
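As a sketch, that daily snapshot job would be the same import command you ran earlier with the append behavior added:

sqoop import --connect "jdbc:sqlserver://myserver.database.windows.net;
username=mysqluser@myserver;password=mypwd;database=AdventureWorks2012" --table
Production.Product --target-dir /data/ProductTable -m 1 -append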

There are many more options for importing data with Sqoop. This example showed a general use of Hadoop, but often you’ll want to run a more targeted import, so next you’ll do one that gets data based on using a filter.

Extract Using a Date Filter

Imagine that your AdventureWorks SQL Server database contains 500,000 unique products, as you’re a huge online retailer. You’re still trying to build a data store that captures the history for product changes, especially pricing. As you saw in the previous section on import and exporting, you can take a complete snapshot of the Product table and capture it in HDFS every day. Instead, you choose to capture all changes to products, so you decide to run a job at the end of the day that grabs only product rows that have changed during the day. You can use Sqoop to do this by adding a few parameters.

First, you’ll issue a SQL UPDATE statement to increase the price of every product in your database by $1 if the price is between $45 and $100, which should be 42 rows in AdventureWorks. This will also update the ModifiedDate column to the time of the update using getdate().

UPDATE  Production.Product
SET     ListPrice = ListPrice + 1,
        ModifiedDate = getdate()
WHERE   ListPrice between 45 and 100

Sqoop Import from Azure

Because we're still using SQL Azure, you'll next get the changes from there and put them in the same HDFS folder you've been using, but you'll embellish the Sqoop import first. You'll notice we added a where clause to the import, so this pulls only the 42 rows that changed on the date 2012-07-19 (note that the date depends on when you execute the sample Update statement). We changed the -m parameter to 4, so four MapReduce tasks can run in parallel (I know, overkill for 42 rows, but it's an example).

Next, because you are using an -m setting of greater than 1, you need to identify the key that the MapReduce jobs can use to split the data up among themselves, so they aren't overlapping each other when grabbing data. This way every row belongs to one and only one job. To indicate the split key, you use --split-by and tell it that the key column is ProductID. Finally, you decide not to get every column from the Product table, so you use the --columns operator and list only the columns that you're interested in.

sqoop import --connect
"jdbc:sqlserver://myserver.database.windows.net;username=mysqluser@myserver;password=mypwd;
database=AdventureWorks2012" --table Production.Product --where "ModifiedDate
between '2012-07-19' and '2012-07-20'" --target-dir /data/ProductTable -m 4 -append
--columns "ProductID,ProductNumber,ListPrice,ModifiedDate" --split-by ProductID

After executing the job, go back to the ProductTable folder in HDFS. What you see might surprise you. There are four new files, one for each MapReduce task. Each file has a different number of rows because Sqoop doesn’t balance the load among the four MapReduce tasks, but simply divides the work by getting the min and max values of the primary key and arithmetically splitting it into four groups. This way, each row that changed on 2012-07-19 ends up in one, and only one, of the four files. Notice too that each file has only the four columns you requested in the Sqoop job.

Just to show that there's more than one way to accomplish anything, the following script gets the same results. What is interesting about this technique is that you can write SQL queries using Sqoop. Notice the use of the $CONDITIONS token in the where clause. Sqoop replaces this token with a unique condition expression so the job can be run in parallel and each task gets a different range of data.

sqoop import --connect  "jdbc:sqlserver://myserver.domain.com; username=mysqluser;
password=mypwd; database=AdventureWorks"
--query "SELECT ProductID,ProductNumber,ListPrice,ModifiedDate FROM Production.Product
where $CONDITIONS and ModifiedDate between '2012-07-19' and '2012-07-20'"
--target-dir /data/ProductTable -m 4 --split-by ProductID -append

This Sqoop job executes several queries in SQL Server. One of the queries gets the minimum and maximum ProductID. Sqoop uses the results of this query to divide the ProductID range into four equal parts so it can execute four separate queries in parallel, each accessing a different set of rows in the table (but not necessarily an equal number of rows).

SELECT MIN(ProductID), MAX(ProductID) FROM
(SELECT ProductID,ProductNumber,ListPrice,ModifiedDate FROM Production.Product where  (1 = 1)
and ModifiedDate between '2012-07-19' and '2012-07-20') AS t1

Sqoop then executes the following four queries in SQL Server. The $CONDITIONS expression is replaced in each query by a range condition on ProductID. Notice how the boundary values for each query come from the MIN and MAX ProductID returned by the previous query, which Sqoop used to split the data being read into four parts.

SELECT ProductID,ProductNumber,ListPrice,ModifiedDate FROM Production.Product where
( ProductID >= 1 ) AND ( ProductID < 251 ) and ModifiedDate between '2012-07-19' and '2012-07-20'

SELECT ProductID,ProductNumber,ListPrice,ModifiedDate FROM Production.Product where
( ProductID >= 251 ) AND ( ProductID < 501 ) and ModifiedDate
between '2012-07-19' and '2012-07-20'

SELECT ProductID,ProductNumber,ListPrice,ModifiedDate FROM Production.Product where
( ProductID >= 501 ) AND ( ProductID < 750 ) and ModifiedDate
between '2012-07-19' and '2012-07-20'

SELECT ProductID,ProductNumber,ListPrice,ModifiedDate FROM Production.Product where
( ProductID >= 750 ) AND ( ProductID <= 999 ) and ModifiedDate between '2012-07-19'
and '2012-07-20'

The result is no different from the prior Sqoop job you executed and the data is again split into four files.

As mentioned previously, Table 17-4 shows a list of the Sqoop arguments we've used with their descriptions. To see a full list of Sqoop parameters, go to http://sqoop.apache.org/.

You've done some good work, but when you show the marketing people, you learn that the requirements have changed. They really want a complete audit trail of every change made to a product. The data load script you've developed with Sqoop gets the last version of a product for the date range you used in the query. If you run the script once per day, changing the date range each time, you'll get the latest row for every product that was changed during that day. You're losing all of the information in between, so you would lose the change detail if a product changed five times in between your jobs running. You want all of the changes made to a product, so when a price is lowered by 50 percent for five minutes, you would know about it because you will capture the change.

Also, your dynamic pricing system has gone crazy. It is now changing prices on products 20 times per day, and since you have 500,000 active products, that’s 10 million rows each day you’ll capture in HDFS. Good thing you’ve got Hadoop ready to go, because now you’re really getting some Big Data! For the future, marketing is already saying they want to see more than just price changes. They also want to know the inventory levels, cost, and other important data about a product every time it changes.

Let’s see how you can make better use of SQL Server to get this work done. SQL Server can identify every change to the Product table with Change Data Capture (CDC). CDC will store an image of every insert, a before and after image of every update, and a before image of every delete. To enable CDC for the AdventureWorks database and the Product table, execute the following:

Use AdventureWorks
Go
exec sys.sp_cdc_enable_db
Go
exec sys.sp_cdc_enable_table @source_schema='Production', @source_name='Product',
     @role_name=NULL, @capture_instance='Production_Product'
Go

Next, change some data so you have data in the CDC tables. To do this, execute the following T-SQL commands for the AdventureWorks database. These inserts, updates, and deletes will generate change rows that are stored in the cdc.Production_Product_CT table.

INSERT INTO Production.Product(Name, ProductNumber, MakeFlag, FinishedGoodsFlag,
SafetyStockLevel, ReorderPoint, StandardCost, ListPrice, DaysToManufacture, SellStartDate)
VALUES ('29 Inch Wheel','WH-8232',0,0,125,80,85.9500,126.9500,1,'2012-08-01 00:00:00')

INSERT INTO Production.Product(Name, ProductNumber, MakeFlag, FinishedGoodsFlag,
SafetyStockLevel, ReorderPoint, StandardCost, ListPrice, DaysToManufacture, SellStartDate)
VALUES ('MH Hybrid','HY-2452',0,0,210,100,865.0000,1295.0000,1,'2012-08-15 00:00:00')

UPDATE Production.Product SET StandardCost = 1799.00 where ProductID = 776
UPDATE Production.Product SET ListPrice = 2195.99 where ProductID = 771
UPDATE Production.Product SET ReorderPoint = 400 where ProductID = 936
UPDATE Production.Product SET SafetyStockLevel = 120 where ProductID = 961

DELETE FROM Production.Product where ProductID = 712
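Before moving the changes into HDFS, you can verify that CDC captured them with a quick query against the change table (this assumes the Production_Product capture instance enabled above; the __$operation column identifies the type of change):

SELECT __$operation, ProductID, Name, ListPrice, StandardCost, ModifiedDate
FROM   cdc.Production_Product_CT
ORDER BY __$start_lsn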

Go back to Sqoop and run the following script. It will read data from the CDC tables and put it in HDFS, this time capturing every single change made to the Product table with CDC helping you out. Make sure you put the correct start and end dates for when you executed the changes so the import actually retrieves data.

sqoop import --connect  "jdbc:sqlserver://myserver.domain.com;username=mysqluser;
password=mypwd;database=AdventureWorks" --query "SELECT * FROM cdc.Production_Product_CT
where $CONDITIONS and ModifiedDate between '2004-03-11 08:00:00' and '2004-03-11 09:00:00'"
--target-dir /data/ProductAudit -m 1

Exporting from HDFS

You're successfully putting every change made to the Product table into HDFS, but what do you do with it? I'm glad you asked, and next we'll develop two different ways to work with it. One approach you can take is to export a subset of the data and put it back into SQL Server. The advantage of this is that your applications, developers, and analysts can use it in ways they are used to, in particular employing tools that they're familiar with, such as T-SQL. An alternative method is to query and analyze the data where it sits in HDFS. I'll demonstrate the former technique first, once again using Sqoop, this time to export data from HDFS into SQL Server. To land it in SQL Server, you need a database and table to target. The latter technique will use Hive to work with the data directly in HDFS.

Now that you've collected billions and trillions of rows, and are able to track changes to your products, the time has come for you to use the audit trail. Your financial team wants to know how a certain product was priced throughout the year and the exact amount of time any price was in effect. You can extract that data from HDFS and put it in a familiar SQL Server table by using Sqoop.

The financial people aren't asking you to retrieve every single piece of data about the product for this report; they say all that will be needed is the product name, ProductNumber, ListPrice, and ModifiedDate. However, since you also imported ProductID into HDFS, you have to export it as well, because Sqoop doesn't allow you to export select columns, and it doesn't give you a chance to filter the data you export. As you can see, it's all or nothing, so you need to create a table in SQL Server to hold all of the data in HDFS.

CREATE TABLE dbo.Product_Changes(
    ProductID int NULL,
    Name nvarchar(50) NULL,
    ProductNumber nvarchar(25) NULL,
    ListPrice money NULL,
    ModifiedDate datetime NULL
)

After you create this table, you are ready to go back to the command line and execute the following job from the c:\Apps\dist\sqoop\bin directory:

sqoop export --connect
"jdbc:sqlserver://myserver.domain.com;username=mysqluser;password=mypwd;database=AdventureWorks"
--table [Product_Changes] --export-dir /data/ProductAudit -m 4

After this job successfully executes, you now have the product data in a familiar place, SQL Server, and can do whatever you want with it. This puts you (and your users) in a comfortable situation, and you can use familiar and powerful tools like Excel, SSRS, and PowerView to analyze the data and create reports. There are good reasons for moving data from Hadoop into SQL Server, including joining Hadoop data to existing SQL Server entities in star schemas, cubes, and other reporting structures. The location of data storage is a decision DBAs will have to make, but in organizations where analysis is already happening in a SQL Server environment, it will be a normal practice to move some data from Hadoop to SQL Server. This doesn't mean that you will move every petabyte of data from Hadoop to SQL Server, but if it were to happen, it would be done in a filtered and/or aggregated manner.
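For example, once the rows are in dbo.Product_Changes, a query like the following sketch (SQL Server 2012 syntax, using LEAD; ProductID 771 is just an example value) answers the finance team's question of how long each price was in effect:

SELECT  ProductID,
        ListPrice,
        ModifiedDate AS PriceStart,
        LEAD(ModifiedDate) OVER (PARTITION BY ProductID ORDER BY ModifiedDate) AS PriceEnd,
        DATEDIFF(MINUTE, ModifiedDate,
            LEAD(ModifiedDate) OVER (PARTITION BY ProductID ORDER BY ModifiedDate)) AS MinutesInEffect
FROM    dbo.Product_Changes
WHERE   ProductID = 771
ORDER BY ModifiedDate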

Take note, however, that this is just one use-case of many, and that bringing data back into SQL Server is not necessary to use visualization tools such as Excel. You’ll see in the next section how you can directly access data stored in HDFS using Excel, and how Hadoop provides you with a tool like Hive that lets you do this.

Hive

What if you don’t want to put the data in SQL Server? Can you use it where it resides? Yes you can, and Hive is a tool you can use to work with it in HDFS. Hive provides you with HiveQL,  a SQL-like language that you can use to query data in HDFS. Apache states that Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.

Image Caution For people experienced with SQL Server and data warehousing, take the statement that Hive is a data warehouse system loosely. It doesn't use a recognized data warehouse model or any data structures that you might be familiar with in data warehousing, but it is useful as a SQL-like query language on top of HDFS.

You’ll need to have some data in HDFS, so if you haven’t done this yet, or have wiped out the previous data loads we’ve worked with, use Sqoop to populate HDFS so you have something to query.

sqoop import --connect "jdbc:sqlserver://myserver.database.windows.net;
username=mysqluser@myserver; password=mypwd; database=AdventureWorks2012"
--table Production.Product --where "ModifiedDate between '2012-07-19' and '2012-07-20'"
--target-dir /data/ProductAudit -m 4 --append
--columns "ProductID,ProductNumber,ListPrice,ModifiedDate" --split-by ProductID

To start using the data, you need to define a Hive table. Use CREATE EXTERNAL TABLE to do so, which provides you with an interface to HDFS. This is good since you already have data there. If you didn’t tell Hive the location, it would create the table in its own default location. In effect, what you are doing is defining a table in the Hive metastore and telling it where the data that it represents resides. For SQL Server people, this might sound backward, since you usually create the table first, then add data. In this case, Hive is simply a front-end to your HDFS data. Hive also allows you to create new tables that you populate with data when you create them, so it can be more than just an interface to existing data.

To run a Hive command, use the user interface provided by the Hadoop on Azure web site (refer back to Figure 17-6). After logging in to your cluster, go to the main screen and click the Interactive Console tile. Then click the Hive button at the top right to change the console to Hive mode. At the bottom, you’ll be able to enter a Hive command, so create the table using this command:

CREATE EXTERNAL TABLE price_audit (
ProductID INT,
Name STRING,
ListPrice DOUBLE,
ModifiedDate STRING
)
COMMENT 'This is my output'
ROW FORMAT DELIMITED FIELDS TERMINATED by ',' STORED AS TEXTFILE
LOCATION '/data/ProductAudit/';

Notice that Hive requires you to specify the name and data type of each column. These should look familiar to you, as I have named the columns with the same names that are used in the SQL Server table. However, you may name the columns whatever you like; they don’t have to have the same name as the SQL Server table that was the source of the data.

Hive allows you to specify a number of arguments when you create a table, some of which are listed in Table 17-5. The arguments ROW FORMAT DELIMITED FIELDS TERMINATED by ',' and STORED AS TEXTFILE define the Hive table as a comma-delimited text file. The LOCATION keyword lets you specify the HDFS folder where the Hive table data is stored. For a complete description of Hive syntax, see the documentation at http://hive.apache.org.

Image

The column delimiter we used in our example was a comma, which we specified with ROW FORMAT DELIMITED FIELDS TERMINATED by ','. We made sure that the LOCATION path was set to the folder where we’ve been putting data into HDFS. You’re now all set to query this data using HiveQL.

Once the table is created, you can execute a HiveQL query. Use this select statement to get the data; thankfully, the syntax looks just like the SQL we’re so familiar with.

select * from price_audit

You’ll see a result set of the data that you previously stored in HDFS:

218 LJ-0192-S 50.99 2012-07-19 03:35:12.0
219 LJ-0192-M 50.99 2012-07-19 03:35:12.0
220 LJ-0192-L 50.99 2012-07-19 03:35:12.0
221 LJ-0192-X 50.99 2012-07-19 03:35:12.0
314 HB-M763 62.92 2012-07-19 04:18:38.12
317 HB-R720 62.92 2012-07-19 04:18:38.12
320 FW-M423 61.745 2012-07-19 04:18:38.12
. . .

If you want to limit the size of the results because you’ve run the sample code using much bigger datasets than I did, then you can use the LIMIT operator, like so:

select * from price_audit limit 5


218 LJ-0192-S 50.99 2012-07-19 03:35:12.0
219 LJ-0192-M 50.99 2012-07-19 03:35:12.0
220 LJ-0192-L 50.99 2012-07-19 03:35:12.0
221 LJ-0192-X 50.99 2012-07-19 03:35:12.0
314 HB-M763 62.92 2012-07-19 04:18:38.12

What happened in the background of these queries? Again, these are MapReduce jobs, and Hive is simply acting as a layer for MapReduce that is more familiar to SQL users. If you go look at the job log for Hadoop on Azure, you’ll see the jobs that are executed for each of these queries. You can click on the Job History tile of the main screen.

Of course, we don’t just want to extract detail data. We could do that with Sqoop. While Hive doesn’t provide all of the functionality we’re used to with T-SQL, it does give us the ability to group and aggregate, filter and sort, and it also has some join capabilities. For example, you can issue the following commands to filter, sort, group, and count; a short join sketch follows these two examples. Once again, notice how much these queries look like the SQL we’re so familiar with.

SELECT Name, ListPrice, ModifiedDate
FROM price_audit
where Name = 'LJ-0192-X'
order by ListPrice

221 LJ-0192-X 50.99 2012-07-19 03:35:12.0
221 LJ-0192-X 54.99 2012-07-20 09:24:51.0
221 LJ-0192-X 52.99 2012-07-20 14:32:18.0

SELECT Name, COUNT(*)
FROM price_audit
GROUP by Name

LJ-0192-X 4
HB-M763 1
HB-R720 1
FW-M423 1
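
Hive also supports joins. As a minimal illustration, and not part of the original example, the following self-join on price_audit pairs each price with any later price recorded for the same product, which is one way to line up old and new prices side by side:

select a.Name, a.ListPrice as old_price, b.ListPrice as new_price, b.ModifiedDate
from price_audit a join price_audit b on (a.ProductID = b.ProductID)
where a.ModifiedDate < b.ModifiedDate

As with the other HiveQL statements, this runs as a MapReduce job behind the scenes.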

Getting data using HiveQL is an important part of the complete solution, one that allows you to make Hadoop data accessible to your end users and analysts via Excel. We’re going to walk through an example of how to use Excel together with Hive to close the loop and make the most of this data.

Hive and Excel

An ODBC driver is provided by Microsoft for Hive so that HiveQL queries can be issued from analysis tools and other applications. Excel can use the Hive ODBC driver, and an add-in is available so data can be pulled in for analysis. To work with Hive and Excel, do the following:

  1. Click on the Open Ports tile in the Hadoop on Azure main menu.
  2. Open the Hive ODBC port by clicking the toggle button so it slides to the right and the status changes to Open.
  3. Back on the main menu, click the Download tile.
  4. Download and install the Hive ODBC driver and Hive Add-in for Excel. Follow the instructions to make sure you install the right versions (32-bit or 64-bit).
  5. Launch Excel and go to the Data menu. Click the Hive Pane and Enter Cluster Details. If it all works, you’ll see the list of Hive tables you’ve created in the table drop-down list.
  6. Now you’re ready to start working in Excel with the product data you put in HDFS. Simply select a table and some columns, add criteria and groupings, and execute the query.
  7. In addition, you can go to Data Sources (ODBC) in Windows and add the Hive ODBC DSN, as shown in Figure 17-9, so you can use it from other applications; a linked server sketch follows the figure.
Image

Figure 17-9. ODBC Hive setup using Data Sources (ODBC) in Windows
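
One way a DBA might surface that same Hive data to T-SQL users is to layer a linked server over the ODBC DSN. The following is a minimal sketch using standard SQL Server linked-server syntax; the DSN name HiveDSN and the linked server name HIVE are hypothetical, and this is an illustration rather than a step from the original walkthrough.

EXEC sp_addlinkedserver
    @server = N'HIVE',        -- arbitrary linked server name (hypothetical)
    @srvproduct = N'Hive',
    @provider = N'MSDASQL',   -- generic OLE DB provider for ODBC
    @datasrc = N'HiveDSN';    -- the ODBC DSN created in step 7 (hypothetical name)

SELECT *
FROM OPENQUERY(HIVE, 'select Name, ListPrice from price_audit limit 10');

The query text inside OPENQUERY is passed through to Hive, so it uses HiveQL syntax, including the LIMIT operator shown earlier.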

While the primary user of Hive via Excel won’t be a DBA, it’s good to understand how the data they administer will be used by others. For example, most DBAs won’t be doing analysis of Hive data using Excel, but it will likely be used by data analysts because it is a software tool they are familiar with. To develop a strong architecture with Big Data, it’s important to know how data will be used and how the tools that use the data work. This helps you make better architectural decisions for managing data.

JavaScript

Remember the JavaScript that you used to work through the MapReduce example earlier in the chapter? Now that you have data in HDFS, you can put that code to work. The JavaScript will read the product data you imported with Sqoop into HDFS, which was stored in the /data/ProductTable directory by the first job you executed.

The product data you’re using isn’t the same as the sample data you worked with, so you’ll change the JavaScript to make use of what you have. The business problem you’ll solve is to calculate the average list price of products by product line. You also must consider that the data is comma-delimited, and that some of the product names have a comma in the text.

var map = function (key, value, context) {
    var aryValues = value.split(",");
    var pattern = /[A-Za-z]/;
    var productLine;
    var listPrice;

    if (aryValues.length == 25)
    {
        productLine = aryValues[15];
        listPrice = aryValues[9];
    }
    else
    {
        productLine = aryValues[16];
        listPrice = aryValues[10];
    }

    if (pattern.test(productLine) && productLine.toString() != "null")
    {
        context.write(productLine.toUpperCase(), listPrice);
    }
};

This map function doesn’t use the key passed into it, as you don’t care about the offset of the row within the file (and normally won’t). The product data you’re interested in is passed into the function in the value parameter, which is split into an array using a comma delimiter. Since it’s possible there’s a comma in the product name, you then check the length of the array to see whether there are more cells than expected. If the length of the array is 25, then there isn’t a comma in the product name, so product line is in cell 15 of the array and list price is in cell 9. If there is a comma in the product name, then the product line and list price are in cells 16 and 10, respectively. Because nulls are allowed in the product line in the Production.Product table that we used as a source for this data, you must check the data to make sure it is an alpha character and that it isn’t null. This will filter out about half of the rows and is exactly the type of cleansing that belongs in a map function. It also saves resources by making sure you do not pass this data on to the reduce function.

var reduce = function (key, values, context) {
    var sumListPrice = 0.0;
    var productLine = key;
    var cnt = 0.0;

    while (values.hasNext()) {
        sumListPrice += parseFloat(values.next());
        cnt++;
        }

    var avgListPrice = sumListPrice / cnt;
    context.write(productLine, avgListPrice);
};

The product line is passed in to the reduce function as the key, which is the way you defined it in the map function, and all of the list prices for each product line are passed in as the values collection. The main thing you do in the reduce function is to iterate through the values to sum all of the list prices and count the number of values so you can calculate the average. When finished, you return the product line and average list price, and it’s these values that are written to HDFS. The resulting data will look like this:

M    827.0639560439561
R    965.3488000000007
S    50.39885714285715
T    840.7621153846153

To run this code, copy both the map function and the reduce function into a file called AverageListPrice.js. You can do this using Notepad, then save it to your local machine. After this is done, upload the file using the Hadoop on Azure web site. Go to the main menu and click the Interactive Console tile. When the console displays, make sure that the JavaScript button is clicked on the top right, and that the js> prompt is visible.

To upload the JS file you just created, type fs.put() in the prompt. A pop-up window will ask you to find the JS file you want to upload, so browse to AverageListPrice.js and select it. Leave the destination box empty and click Upload. This will put the file in your user’s default file location in Hadoop, which is /user/username.

After the JavaScript file is uploaded, you’re ready to run it. To do so, key in the following at the prompt:

runJs("AverageListPrice.js", "/data/ProductTable", "AverageListPrice")

The runJs function will execute the map and reduce functions you wrote in JavaScript. There are three parameters you pass in to runJs: the name of the JavaScript file you uploaded, the name of the directory that contains the input data, and the name of the directory for the output data. In this example, we’re using /data/ProductTable as the input data because we stored the product data there earlier, and the output data will be saved in your /user/username folder in HDFS. Execute runJs and you will see some system messages scroll by as the job runs. When it’s finished, the screen will look like it does in Figure 17-10. Now you can look at the output data.

Image

Figure 17-10. Interactive JavaScript screen in Hadoop on Azure

Using remote desktop again, go to the desktop and launch the HadoopNameNode, like you did in the section on Sqoop (refer back to Figure 17-7). Click the link labeled Browse the filesystem and a web browser will open with a list of directories, one of them being the user folder. Open it and you’ll see another directory named the same as your user name. Click it to see the files and directories under your user name and find the directory named AverageListPrice. This is where your output data is. Open the directory to see a file named part-r-00000 and take a look at the file by clicking it. The file should have four rows that look just like the output we saw earlier in this section.

Pig

Pig is a platform for analyzing large datasets. It consists of a high-level language for expressing data analysis programs, and allows for a high degree of parallelization. Pig includes Pig Latin, which is a procedural scripting language that allows you to manipulate data in more powerful ways than you can with Sqoop. While Sqoop is analogous to BCP in SQL Server, you can consider Pig the ETL tool of Hadoop, so naturally it is highly parallel when extracting, transforming, and loading data.

Pig has a good set of built-in functions for string manipulation, mathematical operations, and for groupings, aggregations, and comparisons. It also has relational operators for joins, unions, and more functionality for controlling the flow of the script and the processing of data. In more sophisticated scripts, Pig developers can use nested data structures, streaming operators, and user-defined functions. The strength of Pig is that it can load data in parallel, apply analytical functions to the data, and then write the results of the analytic process.

Image Note Pig will be an option for your environment when you want to do more than simply move data back and forth between SQL Server and Hadoop. It allows you to inspect, crunch, and convert data as you’re moving it from one place to another.

In today’s version of Hadoop on Azure, the primary use of Pig will be to read data from HDFS, process it procedurally, and write the results back to HDFS. It won’t involve the SQL Server database engine, although in the future it seems sensible to have a way to store the results directly in SQL Server. The following is a simple Pig script that mirrors T-SQL by grouping by ProductID, counting the number of rows, and sorting by the row count:

product_audit = LOAD '/data/ProductAudit/part-m-00001' USING PigStorage(',')
        AS (
ProductID:int,
Name:chararray,
ProductNumber:chararray,
ListPrice:float,
ModifiedDate:chararray
);

product_group = GROUP product_audit by (ProductID);
product_count = FOREACH product_group GENERATE group, COUNT($1) as num_products;
product_sorted = ORDER product_count by num_products desc;

STORE product_sorted INTO '/data/ProductAudit/aggregate_products2' USING PigStorage(',');

This script defines the structure of the data, groups the data by product ID, gets a count of the number of times a product appears in the dataset, sorts that count in descending order, and saves the results back to HDFS under /data/ProductAudit/aggregate_products2. This isn’t very different from what we can do with SQL, so it’s a good example to start with to understand the structure of a Pig script and how to execute it. Table 17-6 describes the Pig commands we’re using; go to http://pig.apache.org for a full reference.

Image

To run the Pig script, go to Remote Desktop again in Hadoop on Azure and launch the Hadoop Command Shell from the desktop. Enter cd pig at the command prompt to go to the pig directory. Launch Notepad from the Start menu, copy and paste the sample Pig script, and save it as product.pig in the c:\Apps\dist\pig folder. Execute it from the command line by typing pig product.pig and pressing Enter. If it completes successfully, you’ll see the output of the script in a new file in HDFS under /data/ProductAudit/aggregate_products2.

Let’s do something that will highlight a few more features of Pig. Assume the finance department now wants to know the inventory level of a product every time you capture a price change. To get some data you can use, load the Product and ProductInventory tables into HDFS by running the following two Sqoop imports. Recall that for Sqoop to work, you created a synonym for the Product table. Now do the same for the ProductInventory table, like so:

CREATE SYNONYM [Production.ProductInventory] FOR Production.ProductInventory

sqoop import --connect "jdbc:sqlserver://myserver.database.windows.net;
username=mysqluser@myserver; password=mypwd; database=AdventureWorks2012"
--table Production.Product --target-dir /data/Product -m 1

sqoop import --connect "jdbc:sqlserver://myserver.database.windows.net;
username=mysqluser@myserver; password=mypwd; database=AdventureWorks2012"
--table Production.ProductInventory --target-dir /data/ProductInventory -m 1

When this completes, you’ll have one file in the /data/Product and /data/ProductInventory folders to work with. Now you can join this data together with Pig, as follows:

product = LOAD '/data/Product/*' USING PigStorage(',')
        AS (
ProductID:int,
Name:chararray,
ProductNumber:chararray );

product_inventory = LOAD '/data/ProductInventory/*' USING PigStorage(',')
        AS (
ProductID:int,
LocationID:int,
Shelf:chararray,
Bin:int,
Quantity:int
);

product_group = GROUP product_inventory by ProductID;
inventory_sum = FOREACH product_group GENERATE group as ProductID,
COUNT(product_inventory.LocationID) as num_locations, SUM(product_inventory.Quantity)
as total_inventory, AVG(product_inventory.Quantity) as avg_inventory;
inventory_filter = FILTER inventory_sum BY total_inventory<1000;
product_join = JOIN product BY ProductID, inventory_filter BY ProductID;
product_sorted = ORDER product_join by total_inventory desc;
STORE product_sorted INTO '/data/ProductOutput/inventory_list3' USING PigStorage(',');

Pig allows for the manipulation of HDFS data via its procedural language, Pig Latin, providing a more advanced tool for transforming data and letting you solve more complicated problems. In this example, you used Pig to get data from HDFS, transform it, and store it in another HDFS data store, and you did so without touching SQL Server at all. This is why Pig is important to the SQL Server DBA. Not everything has to go through SQL Server, and a DBA needs to know when it’s more effective to use HDFS as both the input and output data store. Knowing the best ways to take advantage of Pig and its features will reduce the stress you put on your SQL Server databases.

Big Data for the Rest of Us

The analysis scenarios, while certainly the most popular reason for Big Data’s existence, are not the only use-case. Before we let the data analysts take over everything, let’s look at how Big Data can impact the rest of the world. We’ll also look at scenarios where the work the data analyst is doing can impact the DBA’s architecture, and how the DBA can deal with both RDBMS and Big Data working in conjunction.

Business Intelligence

While Big Data is often used for analytics, it’s not necessarily synonymous with Business Intelligence and what it has come to be. Some experts will even tell you that Big Data means the end of BI, but that’s not likely. A comprehensive BI framework should encompass Big Data and, at the very least, use it as a data source. BI includes a type of analysis and visualization that allows for rapid-fire what-if analysis with a high degree of user interaction. It can also convey the state of the business, relative to key performance indicators, by taking a snapshot of the business at strategic points in time. Under the BI umbrella, we would also include online analytical processing, data mining, predictive modeling, performance management, and more. For planning purposes, you should expect that data you collect in a Hadoop framework will become part of the data used by the BI framework.

Big Data Sources

Where is all of this data coming from at such a high velocity, in such large volume, and of so much variety? Sensors, logs, web content, text, map, media, social computing, and click-stream data are being collected at higher rates than ever, and the ability to capture this data has grown enormously over the past several years. This data can be produced at a high rate of speed, or velocity in Big Data terms.

The most common data sources for Big Data analytical applications are web data in the form of web logs, sensors that continuously capture data (such as movement  and location), unstructured content (including video and images), and social networking data. Web logs are analyzed to optimize site navigation and to understand the behavior of online customers. Sensors are driving the collection of large amounts of data, since near real-time analytics can be performed on data streams being produced by sensors. If a problem is spotted, swift action can be taken to correct the situation. Data from sensors can also be stored for future analysis to discover patterns that indicate persistent problems.

There is no minimum volume of data to be considered Big Data, but typically the volume starts in the terabytes and could grow to petabytes in size. The same is true for variety. While Big Data has made a name for itself by being good at handling unstructured data, it can also handle more well-structured data, and there’s no yardstick as to how much unstructured data qualifies it as Big Data.

Big Data Business Cases

Let’s look at what types of datasets are a good fit for your Big Data framework. Building on the use-case of how web log data can be used for analytics and discovery, there are other types of datasets that can also be handled better by Hadoop than with an RDBMS. As you read on, you’ll start to see how they can work in conjunction with SQL Server.

The Nonrelational Dataset

How many times have you stored data in a flat, nonrelational table because you needed a place to put it? Much of the data we work with is nonrelational, or better said it is pseudo-relational. We treat it nonrelationally because it’s unclear what it’s related to, it can’t be related to anything we have, or relating it to something is unnecessary. Extracts, staging data, logs, and data that originates from external sources can all be put in this category.

Streaming Data from Sensors

Do you capture a stream of data and only keep a small window (e.g., a rolling two weeks) because it accumulates too fast and makes your database very large quickly? Or do you retain only a random sample because its size overwhelms you? If you’re already capturing data from sensors and it’s akin to “drinking from a fire hose,” then you already know that a sensor has the ability to produce an overwhelming amount of data.

Archives

Do you constantly delete or archive data, either manually or using regular jobs, because it clutters your transactional database? This data often gets archived within SQL Server, whether in another table within the same database or in another database whose main purpose is to hold archived data. By keeping it in a SQL Server database, it’s still accessible when the business needs it, but it can add up fast and grow to an unmanageable size that impacts everything else you’re doing with SQL Server.

Unstructured Data

Many uses of Big Data mention it as a good way to handle unstructured data. Unstructured data is inconsistent and undefined data, with no rules about how the data is arranged, and often with no length limit. This is something that you might often store in SQL Server as a BLOB, and consider it a burden on your database.

Data Discovery

Sometimes, you don’t know what you need until you need it. Unfortunately, when the time comes, you often find out that the data you need to move forward isn’t available. Other times, you just want to go look. For example, if you’ve noticed a change in your customers’ buying habits and want to know why, you might think it would be helpful to scour web logs, historical transactions, customer comments, and more. This work might lead you to the answer you’re looking for, but it could also help you learn something else about your customers, something you weren’t initially looking for.

Business Rules Driven by Analytic Outcomes

There are times when your OLTP database needs to perform operational analytic functions that require the output of Big Data calculations. For example, assume a marketing department wants to highlight one of three new movies on their home page: a Brad Pitt drama, an Adam Sandler comedy, or a Bruce Willis action movie. The decision of which movie to show is made programmatically, with many inputs that include customer demographics, location, past purchase history, and more. One of the more important factors in making this decision is whether the actor is trending positively on Twitter. You can capture the relevant Twitter stream using a Big Data infrastructure and, in as near to real time as possible, feed the aggregated sentiment back to your OLTP database, where it can be applied by the algorithm that determines what to show on the home page. The key to this process is to change the home page as soon as possible after the trend changes.

Big Data in the Microsoft Future

The benefit of Microsoft embracing Hadoop and Big Data is that in the future we should see full integration between Hadoop and the many Microsoft tools that provide and support analytic capabilities, including reporting, dashboards, predictive modeling, and ETL (extract, transform, and load). The types of enhancements I would expect to see include the following:

  • You’ve seen how HDFS data can be used in Excel through Hive. I would expect tighter integration of HDFS with Excel, PowerView, SQL Server Reporting Services, and PerformancePoint.
  • I expect better SSIS data flow tasks that allow for data to be input and output from HDFS, and transformation tasks that provide transformation techniques better suited for HDFS.
  • I anticipate tighter coupling between SSAS tabular models and HDFS as a data source.
  • I also would expect the ability to create an HDFS table within SQL Server that implements MapReduce seamlessly.

There’s much more to come in this area, and I expect we’ll see much of it in the on-premises version of Hadoop on Windows.

Conclusion

The implementation of a Hadoop framework in a SQL Server database environment gives the informed database administrator a more effective and powerful data architecture to design and implement. It provides the ability to scale out to a level where the physical size of data is no longer a constraint. This puts the responsibility on DBAs for determining how to build out a data infrastructure, and forces them to make the right decision for a particular dataset; should it reside in SQL Server or HDFS? To make these decisions, DBAs will have to know how the business will use the data, how developers will access it, and how much of it there will be. No longer will the default location be a traditional SQL Server table.

Another important decision DBAs will have to make is whether to store the data on premise or use the cloud. We were able to move data between Hadoop and SQL Server where both were in the cloud. We also moved data to and from a SQL Server instance inside our network and Hadoop in the cloud, and showed how we could use one of the many features of SQL Server, Change Data Capture, to help us do this.

An important process that DBAs will have to support is the movement of data between SQL Server and HDFS. We showed how to do this with Sqoop, which does the job in a rudimentary manner but is easy to use. Hive provides a familiar SQL-like interface for getting data out of HDFS, and the ODBC connector gives users and applications a way to query it. We saw how this works in Excel, and DBAs should expect this type of access in many other forms, including reporting tools, analysis software, .NET applications, and web sites. Pig lets us access HDFS and manipulate the data procedurally. It requires programming skills and expertise, but it is worth learning well to make the most of its data transformation capabilities. The key capability of Sqoop, Pig, and Hive is that they all execute MapReduce jobs and can fully take advantage of the Hadoop framework. It’s important for DBAs to know how to use these tools and to know the right application of their functions.

We showed in the section on exporting data from HDFS how we can impact SQL Server by moving data there from HDFS, so DBAs should account for this. An example of a business case that could be an impetus for this is when trending statistics are being captured on data coming into HDFS, and the outputs of the trends are sent to SQL Server to drive application behaviors. We didn’t show how unstructured data would be stored in HDFS, or how it might be used, but this would require more work with analytic software and algorithms than it does with SQL Server, so we’ll defer that demonstration to the analytics books. To see an example of how Hadoop works with semistructured text data, Microsoft provides a word count example under the Samples gallery on the Hadoop on Azure web site.

The scripts we executed didn’t run at hyper speed like you might have expected. They were slow, often much slower than the queries and bulk-load processes we’re used to with SQL Server. The samples we developed worked with a small amount of data, but don’t forget that this chapter is about Big Data, and the Hadoop framework shines when processing very large datasets. Firing up 128 servers to execute MapReduce jobs to acquire and assemble the data is too much system overhead for small data to be offset by the benefit of massive parallelization. Even using two servers was too much overhead for the amount of data we had. However, you’ll see the benefit when you scale up to billions and trillions of data elements, because the resources used to start and execute MapReduce jobs are small compared to the gain from massive parallelization.
