Chapter 9. Working with Relational Databases

As we saw in the previous chapter, Hive is a great tool that provides a relational database-like view of the data stored in Hadoop. However, at the end of the day, it is not truly a relational database. It does not fully implement the SQL standard, and its performance and scaling characteristics are vastly different (not better or worse, just different) from those of a traditional relational database.

In many cases, you will find a Hadoop cluster sitting alongside, and used with (not instead of), relational databases. Business workflows often require data to be moved from one store to the other; we will now explore such integration.

In this chapter, we will:

  • Identify some common Hadoop/RDBMS use cases
  • Explore how we can move data from RDBMS into HDFS and Hive
  • Use Sqoop as a better solution for such problems
  • Move data with exports from Hadoop into an RDBMS
  • Wrap up with a discussion of how this can be applied to AWS

Common data paths

Back in Chapter 1, What It's All About, we touched on what we believe to be an artificial choice that causes a lot of controversy: whether to use Hadoop or a traditional relational database. As explained there, it is our contention that the focus should be on identifying the right tool for the task at hand, and that this is likely to lead to a situation where more than one technology is employed. It is worth looking at a few concrete examples to illustrate this idea.

Hadoop as an archive store

When an RDBMS is used as the main data repository, issues of scale and data retention often arise. As the volume of new data increases, what is to be done with the older and less valuable data?

Traditionally, there are two main approaches to this situation:

  • Partition the RDBMS to allow higher-performance access to the more recent data; sometimes the technology also allows the older data to be stored on slower and less expensive storage systems
  • Archive the data onto tape or another offline store

Both approaches are valid, and the decision between the two often rests on whether or not the older data is required for timely access. They represent two extremes: the former maximizes accessibility at the cost of complexity and infrastructure expense, while the latter reduces cost but makes the data far less accessible.

The model seen more recently is for the most current data to be kept in the relational database and the older data to be pushed into Hadoop, either onto HDFS as structured files or into Hive to retain an RDBMS-like interface. This gives the best of both worlds: the lower-volume, more recent data remains accessible through high-speed, low-latency SQL queries, while the much larger volume of archived data is accessed from Hadoop. The data therefore remains available for use cases requiring either type of access, though any queries that need to span both the recent and the archived data will require additional integration work.

Because of Hadoop's scalability, this model gives great future growth potential; we know we can continue to increase the amount of archive data being stored while retaining the ability to run analytics against it.
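As a minimal sketch of this archival flow (the database, table, and column names here are purely illustrative, and Sqoop itself is introduced properly later in this chapter), rows older than a chosen cut-off date could be pulled from a MySQL table straight into Hive:

    # Import all transactions older than the cut-off date into a Hive
    # table; once the import is verified, the matching rows can be
    # deleted from the source database
    sqoop import \
        --connect jdbc:mysql://dbserver/sales \
        --username archive_user -P \
        --table transactions \
        --where "tx_date < '2012-01-01'" \
        --hive-import \
        --hive-table archived_transactions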

Hadoop as a preprocessing step

Several times in our Hive discussion, we highlighted opportunities where preprocessing jobs to massage or otherwise clean up the data would be hugely useful. The unfortunate fact is that, in many (most?) big data situations, the large volumes of data arriving from multiple sources mean that dirty data is simply a given. Although most MapReduce jobs require only a subset of the overall data to be processed, we should still expect to find incomplete or corrupt data across the data set. Just as Hive can benefit from data being preprocessed, so can a traditional relational database.

Hadoop can be a great tool here; it can pull data from multiple sources, combine it, perform the necessary transformations, and clean it up before it is inserted into the relational database.
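As a sketch of such a cleanup step (the table and column names here are hypothetical), a single Hive query might copy only the complete, well-formed rows into a staging table that is subsequently exported to the database:

    # Hypothetical cleanup: keep only rows with the mandatory fields
    # present and sane values, writing them to a staging table
    hive -e "
    INSERT OVERWRITE TABLE clean_events
    SELECT event_id, account_id, event_time, amount
    FROM raw_events
    WHERE event_id IS NOT NULL
      AND account_id IS NOT NULL
      AND amount >= 0;"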

Hadoop as a data input tool

Hadoop's value here is not limited to cleaning data so that it is better suited to being ingested into a relational database. Beyond such tasks, Hadoop can also be used to generate additional data sets or data views that are then served from the relational database. A common pattern is the situation where we wish to display not only the primary data for an account but also, alongside it, secondary data generated from the account history, such as summaries of transactions by type of expenditure over the previous months. The history is held within Hadoop, from which the actual summaries can be generated and then pushed back into the database for faster display.
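To make that last step concrete, here is a minimal sketch of pushing such precomputed summaries back into the database with a Sqoop export (covered in detail later in this chapter); the connection string, table name, and HDFS path are illustrative assumptions:

    # Export the tab-delimited summary files generated on HDFS into
    # a MySQL table that the application queries for display
    sqoop export \
        --connect jdbc:mysql://dbserver/reporting \
        --username report_user -P \
        --table monthly_summaries \
        --export-dir /output/monthly_summaries \
        --input-fields-terminated-by '\t'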

The serpent eats its own tail

Reality is often more complex than these well-defined situations, and it is not uncommon for the data flow between Hadoop and the relational database to be described by circles and arcs rather than a single straight line. The Hadoop cluster may, for example, perform the preprocessing step on data that is then ingested into the RDBMS, and also receive frequent transaction dumps from which it builds aggregates that are sent back to the database. Then, once the data grows older than a certain threshold, it is deleted from the database but kept in Hadoop for archival purposes.

Regardless of the situation, the ability to get data from Hadoop to a relational database and back again is a critical aspect of integrating Hadoop into your IT infrastructure. So, let's see how to do it.
