Chapter 7. Scale It Up

So far, we have used Hadoop for a project or two and reaped benefits from it. Now that those benefits are proven, you will be tempted to use Hadoop to solve more big data problems.

In this chapter, I will explain what is involved in scaling your one-off projects into a full organizational Hadoop cluster supporting multiple systems and projects. I will cover the following topics:

  • Why and how to scale your infrastructure
  • More big data finance use cases with brief solutions
  • Enterprise data lake
  • Lambda architecture
  • Big data governance
  • Security

Scale it up – actually horizontally

Determining the appropriate level of business engagement for any big data project is an important consideration to ensure success. If managing expectations across business divisions proves too difficult, there is no harm in starting with your own department-level Hadoop cluster and delivering successful projects; later on, other divisions will surely join your cluster to save costs.

However, if you ask the banking group CIO, they will be very critical of department-level Hadoop clusters; they are more likely to recommend a data lake, a banking-group-level enterprise platform. Although it is difficult to recommend the best route, it is always best to get started as soon as the business case is approved.

Once a few successful Hadoop systems are in production, it will be a good time to leverage the investment made in Hadoop by migrating other data and analytics systems onto the platform.

We need a big data strategy with a portfolio of projects on the Hadoop ecosystem. The key points are as follows:

  • Projects with benefits: Always implement projects with clear benefits, not just because Hadoop is a cool thing to do.
  • Starting with small projects: As Mr. Doug Cutting said, "for big data, start small". Always start with small projects, such as migrating a workload from the cloud or offloading data storage to Hadoop. It is quite common for banks to archive their daily transactions, positions, balances, risks, valuations, and most other data into a central Hadoop system (see the first sketch after this list).
  • Data lake: A centralized enterprise Hadoop platform is becoming very popular. The most common problem nowadays is that, after a few successful projects, all business functions end up with their own Hadoop clusters on different versions and different distributions. A data lake addresses this problem of so-called Hadoop silos.
  • Lambda architecture: This is a simple but powerful architecture pattern that combines batch processing with real-time processing and presents a unified, complete data layer to the business in real time (see the second sketch after this list). I will discuss this in a little more detail later.
  • Data governance: As big data continues to grow within financial organizations, data governance has become difficult. Big data also involves sensitive customer information and confidential records, so the security of this data must be ensured. I will discuss this in a little more detail later, including Apache Falcon, an excellent data management tool for Hadoop.
  • Security and privacy: Data security measures are taken to prevent unauthorized access and corruption, while data privacy deals with unwanted information sharing with internal and external parties. Neither can be compromised in any financial organization. I will discuss this in a little more detail later.
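
To make the archival use case concrete, the following is a minimal PySpark sketch of offloading one day's transactions into a date-partitioned archive on HDFS. The paths, date, and column name are hypothetical assumptions for illustration, not a prescribed layout:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("archive-sketch").getOrCreate()

    # Read one day's transactions from a staging area (hypothetical path).
    daily = spark.read.parquet("hdfs:///staging/transactions/2015-06-30")

    # Append them to a date-partitioned archive on HDFS, so downstream
    # systems can prune by business_date when they query the archive.
    (daily.withColumn("business_date", F.lit("2015-06-30"))
          .write.mode("append")
          .partitionBy("business_date")
          .parquet("hdfs:///lake/archive/transactions"))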
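
The second sketch illustrates the Lambda architecture pattern itself, again in minimal PySpark with assumed paths and column names. The batch layer recomputes account balances from the full history; the speed layer is simplified here to a second batch read over the current day's data, whereas a production speed layer would be a streaming job (for example, Spark Streaming or Storm); the serving layer merges the two views:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

    # Batch layer: recompute balances from the complete transaction history.
    batch_view = (spark.read.parquet("hdfs:///lake/archive/transactions")
                  .groupBy("account_id")
                  .agg(F.sum("amount").alias("balance")))

    # Speed layer (simplified): an incremental view over today's data only.
    speed_view = (spark.read.parquet("hdfs:///staging/transactions/2015-06-30")
                  .groupBy("account_id")
                  .agg(F.sum("amount").alias("balance")))

    # Serving layer: merge both views into one complete, up-to-date answer.
    complete_view = (batch_view.unionByName(speed_view)
                     .groupBy("account_id")
                     .agg(F.sum("balance").alias("balance")))

    complete_view.show()

The value of the pattern is that the batch view can be rebuilt from scratch at any time, while the speed layer keeps the merged result current between batch runs.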