Best practices and standards

Hadoop is not going to replace traditional data systems in the foreseeable future. We will need to maintain a balanced approach with a range of tools and technologies, including Hadoop, traditional grid computing, pushdown optimization in RDBMS and ETL tools, on-premises computing, cloud computing, and even the mainframe.

As discussed in the last chapter, structured and unstructured data can co-exist in a data lake architecture using both traditional and new technologies. The Enterprise Data Warehouse (EDW) still remains the best bet for managing structured data, and will be for the foreseeable future. Even unstructured data, once cleansed, filtered, and analyzed using Hadoop, can make its way into the EDW as the central golden source of data.

Environments

Just as with any non-Hadoop system, you will generally need at least three environments: development, testing, and production. However, if these environments differ in specification, processing performance will differ as well.

As Hadoop is generally about big data, keeping three environments with similar specifications will triple your hardware cost. So, some organizations may choose to keep fewer data nodes in the development and testing environments. In that case, you may be expected to keep a cut-down version of the data in development and testing. It is also common to use sandboxes for development, virtualized clusters for testing, and a bare-metal installation for production.

For example, you may want to keep trading data and risk metrics for only certain dates, rather than the last five years of data held in production.

Integration with the BI and ETL tools

There will always be a few BI and ETL tools that are a step ahead of the others in terms of integration with Hadoop, but this doesn't mean that the BI and ETL tools already purchased in-house cannot be used. I strongly recommend sticking with the existing BI and ETL tools unless there is a really good reason to buy another one. However, if there is an opportunity to reduce the number of BI and ETL tools, and with it the cost of licenses and hardware, then it should be taken.

As long as you can connect your BI tool with an up-to-date, supported ODBC/JDBC driver for Hive/Spark SQL, you can continue using your existing tool.
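
As a quick illustration, the following is a minimal connectivity check from a Java client against HiveServer2 using the Apache Hive JDBC driver. The host name, port, database, table, and credentials are placeholders, and the driver JARs are assumed to be on the classpath; a BI tool will typically hide all of this behind a connection dialog.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Minimal smoke test: can this host reach Hive through HiveServer2 over JDBC?
    // The URL, credentials, and query below are placeholders.
    public class HiveJdbcSmokeTest {
        public static void main(String[] args) throws Exception {
            // Explicitly loading the driver makes classpath problems obvious.
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            String url = "jdbc:hive2://hadoop-gw.example.com:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "report_user", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT trade_date, COUNT(*) FROM trades GROUP BY trade_date")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }

If this query returns rows from the edge node where the BI server runs, the existing reporting tool can usually be pointed at the same endpoint with its vendor-supplied ODBC/JDBC driver.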

Also, most ETL tools offer a connector to Hadoop, which eliminates the hand-coding of MapReduce jobs. Even if this means buying an additional option from the ETL vendor, it is usually more financially viable than buying a new tool.

Tips

These tips are lessons learned from our experience with big data projects, categorized into business, infrastructure, and coding.

Business

The following points will be useful for business users, managers, and analysts:

  • Prefer incremental projects, taken on in increasing order of complexity and decreasing order of ROI.
  • Always share big data analytical successes with internal and external stakeholders. This keeps the Hadoop momentum going within the business and keeps sponsorship coming.
  • Big data projects that use Hadoop work best with an agile methodology. Financial organizations should adapt their existing project methodologies, if required.
  • Build big data prototypes on a public cloud using anonymized data, which can be done almost instantly, and then move to a private cloud, which takes longer to set up.
  • Global financial data comes from a variety of systems, and solutions should be designed to handle the diversity across regions in language, regulations, currency, time zones, and so on.
  • Organizational roles and processes must be amended to include stewardship roles for new types of data, and their compliance and internal control responsibilities must be extended to the ethical and effective stewardship of data assets.
  • Given the option, use tools wherever possible, but only if the tool is widely adopted in enterprise environments and has guaranteed future support. Don't just download anything from the Internet and use it in your project.

Infrastructure

The following points will be useful for administrators and developers:

  • Keep the HDFS block size large; the per-block setup overhead costs more than the extra disk space (see the configuration sketch after this list).
  • Keep the number of threads per node optimal for the type of jobs you run. Refer to your distribution's manuals to find out how to tune it.
  • Data transmission and I/O are among the most expensive operations. Use compression such as LZO or LZ4, and keep the output of the Map phase as small as possible. Use a combiner if that helps.
  • Use MapReduce v2 from day one if you are developing a new project from scratch, and ensure that the infrastructure is in place for v2 and YARN.
  • Don't be swayed by tools claiming 2.5 times or 100 times faster performance; always read the small print, as the claim may hold only under certain circumstances.
  • In our opinion, it's not worth buying any tool that claims it can migrate from a database or other systems at the click of a button. If it is a straight migration of data without transformation, then it is just as straightforward using standard Hadoop components.
  • Don't measure success by the number of clusters and nodes installed in your Hadoop ecosystem.
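
To make the block size, compression, and combiner tips concrete, here is a rough per-job configuration sketch using standard Hadoop 2.x property keys. The job name is a placeholder, the combiner shown is Hadoop's stock LongSumReducer (which assumes the map output is (Text, LongWritable)), and in practice the block size and codec are usually set cluster-wide in hdfs-site.xml and mapred-site.xml rather than per job.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.Lz4Codec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

    // Per-job settings mirroring the infrastructure tips above.
    public class JobTuningSketch {
        public static Job configure(Configuration conf) throws Exception {
            // Larger block size (256 MB) for files written by this job:
            // fewer blocks means less NameNode metadata and fewer map tasks.
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

            // Compress intermediate map output with LZ4 to cut shuffle I/O.
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                    Lz4Codec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "risk-aggregation");  // placeholder name
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(LongWritable.class);
            // A combiner cuts the data shuffled between the map and reduce phases.
            job.setCombinerClass(LongSumReducer.class);
            return job;
        }
    }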

Coding

The following points will be useful for developers:

  • Data must be loaded using an automated and repeatable process that adheres to good naming conventions.
  • Data quality validation and monitoring should be built into the MapReduce, Hive, or Pig code. If not, data quality will eventually become another never-ending project.
  • Apply filtering, cleansing, pruning, conforming, matching, joining, and diagnosing at the earliest touch points possible in the data flow cycle.
  • Unit test all your mappers, reducers, combiners, and partitioners individually, as you would with any other code. Each step must be tested with a small and diverse dataset. Think of functionality first and performance second.
  • There are two MapReduce APIs: org.apache.hadoop.mapred and org.apache.hadoop.mapreduce. Use the new one, org.apache.hadoop.mapreduce, consistently, as the other one exists for backward compatibility only (a minimal mapper using the new API follows this list).
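
As a minimal sketch of the last two points, here is a mapper written against the new org.apache.hadoop.mapreduce API that also surfaces data quality through a job counter instead of silently dropping bad records. The pipe-delimited record layout (trade date in the first field, notional in the third) and the counter name are illustrative assumptions, not a prescribed format.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // New-API mapper: emits (trade date, notional) pairs and counts malformed
    // records, so data quality shows up in the job counters after every run.
    public class TradeNotionalMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {

        // Illustrative counter for data quality monitoring.
        enum Quality { MALFORMED_RECORDS }

        private final Text outKey = new Text();
        private final LongWritable outValue = new LongWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\\|");
            if (fields.length < 3) {
                context.getCounter(Quality.MALFORMED_RECORDS).increment(1);
                return;  // skip the bad record, but keep count of it
            }
            try {
                outKey.set(fields[0]);                          // trade date
                outValue.set(Long.parseLong(fields[2].trim())); // notional
                context.write(outKey, outValue);
            } catch (NumberFormatException e) {
                context.getCounter(Quality.MALFORMED_RECORDS).increment(1);
            }
        }
    }

A mapper like this can be unit tested in isolation, for example with MRUnit or by driving it through a mocked Context with a handful of good and bad records, before any performance tuning begins.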