Hadoop is not going to replace traditional data systems for the foreseeable future. We will need to maintain a balanced approach with a range of tools and technologies, including Hadoop, traditional grid computing, pushdown optimization in RDBMS and ETL tools, on-premises computing, cloud computing, and even your mainframe.
As discussed in the last chapter, structured and unstructured data can co-exist in a data lake architecture using both traditional and new technologies. The Enterprise Data Warehouse (EDW) still remains the best place to manage structured data, and will be for the foreseeable future. Even unstructured data, once cleansed, filtered, and analyzed using Hadoop, can make its way to the EDW as a central golden source of data.
Just like any non-Hadoop system, you will generally need at least three environments: development, testing, and production. However, if these environments differ in specification, processing performance will differ between them as well.
As Hadoop is generally about big data, keeping three environments of similar specification will triple your hardware cost. Some organizations may therefore choose to keep fewer data nodes in the development and testing environments, in which case you may be expected to keep a cut-down version of the data there. It is also common to use sandboxes for development, virtualized clusters for testing, and bare-metal installations for production.
For example, you may keep trading data and risk metrics for only certain dates, rather than the full five years of data held in production.
There will always be a few BI and ETL tools that are a step ahead of the others in terms of Hadoop integration, but this doesn't mean that the BI and ETL tools already purchased in-house cannot be used. I strongly recommend sticking with existing BI and ETL tools unless there is a really good reason to buy another one. However, if there is an opportunity to reduce the number of BI and ETL tools, and with it the cost of licenses and hardware, then it should be taken.
As long as you can connect your BI tool with an up-to-date, supported ODBC/JDBC driver for Hive/Spark SQL, you can continue using your existing tool.
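For example, here is a minimal sketch of querying Hive over JDBC. The HiveServer2 hostname, credentials, and the trades table are hypothetical placeholders; the driver class and default port 10000 are the standard ones for HiveServer2:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Registers the HiveServer2 JDBC driver (requires the hive-jdbc
        // jar and its dependencies on the classpath)
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical endpoint and table; substitute your own host,
        // credentials, and schema
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver.example.com:10000/default",
                 "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT trade_id, notional FROM trades LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
        }
    }
}
```

A BI tool that accepts a generic JDBC data source can typically be pointed at the same driver and connection URL.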
Also, most ETL tools offer a connector to Hadoop, which eliminates the hand coding of MapReduce jobs. Even if this means buying an additional option from the ETL vendor, it would still be more financially viable than buying a new tool.
The following tips are lessons learned from our experience with big data projects, categorized into business, infrastructure, and development.
The following points will be useful for business users, managers, and analysts:
The following infrastructure points will be useful for administrators:
The following development points will be useful for developers:
Hadoop has two MapReduce API packages: org.apache.hadoop.mapred and org.apache.hadoop.mapreduce. Use the new one, org.apache.hadoop.mapreduce, consistently, as the other is kept for backward compatibility only.
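To make the distinction concrete, here is a minimal mapper sketch written against the new API; the TokenMapper class name and its word-splitting logic are illustrative, but the imports show the org.apache.hadoop.mapreduce package in use:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Written against the new API: note the org.apache.hadoop.mapreduce
// import, not the backward-compatibility org.apache.hadoop.mapred package
public class TokenMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (token, 1) for every whitespace-separated token in the line
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```

Mixing the two packages in one job is a common source of compile errors, as their Mapper and Reducer types are not interchangeable.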