Big data governance is essential for maintaining data quality and allowing analysts to make better decisions. It enables financial organizations to avoid the cost of reworking low-quality data and to report in compliance with regulations such as Sarbanes-Oxley and Basel II/Basel III.
Big data governance should include at a minimum:
For any large big data program, I recommend a three-tiered approach to governance, as follows:
Once you have implemented a couple of big data systems on Hadoop, your clusters can have hundreds to thousands of Oozie coordinator jobs and a similarly large number of dataset and process definitions. This becomes too difficult to manage and results in common mistakes, such as duplicate datasets and processes, incorrect job execution, and a lack of audit control and traceability.
Apache Falcon is one of the leading tools that administrators and data stewards can use to address these data governance challenges.
I will provide a brief overview of this tool, but please visit http://falcon.apache.org for more details.
Hadoop distribution vendors also offer their own data governance tools with similar features. For example, Cloudera has its own data governance tool called Navigator. As the Hortonworks distribution is completely open source, it uses Falcon.
Apache Falcon is a data management tool that defines, schedules, and monitors data processing elements.
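It defines three types of entities (cluster, feed, and process) in a simple XML format, which can be combined to describe data management policies. Each entity is submitted with the Falcon command-line client, and feeds and processes are then scheduled: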
falcon entity -type cluster -submit -file <cluster-entity>.xml
falcon entity -type feed -submit -file <feed-entity>.xml
falcon entity -type feed -schedule -name <feed-entity>
falcon entity -type process -submit -file <process-entity>.xml
falcon entity -type process -schedule -name <process-entity>
The configuration parameters allow a rich set of data management policies, including the handling of late-arriving data, replication across clusters, and so on.
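To make these policies more concrete, here is a minimal sketch of a feed entity that declares a late-arrival cut-off, a retention limit, and replication from a source cluster to a target cluster. The feed name, cluster names, dates, HDFS path, and ACL values are all hypothetical placeholders. The element layout reflects my understanding of the Falcon feed schema, so please check it against the entity specification for your Falcon version before using it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only: every name, date, and path below is hypothetical -->
<feed name="tradeFeed" description="Hourly trade data" xmlns="uri:falcon:feed:0.1">
    <!-- A new instance of this dataset is expected every hour -->
    <frequency>hours(1)</frequency>
    <!-- Data arriving up to 6 hours late is still honored -->
    <late-arrival cut-off="hours(6)"/>
    <clusters>
        <!-- The source cluster holds the original data -->
        <cluster name="primaryCluster" type="source">
            <validity start="2015-01-01T00:00Z" end="2016-01-01T00:00Z"/>
            <retention limit="days(90)" action="delete"/>
        </cluster>
        <!-- Declaring a target cluster asks Falcon to replicate the feed to it -->
        <cluster name="backupCluster" type="target">
            <validity start="2015-01-01T00:00Z" end="2016-01-01T00:00Z"/>
            <retention limit="months(12)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <!-- The path is parameterized by the instance time -->
        <location type="data" path="/data/trades/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
    </locations>
    <ACL owner="etl" group="users" permission="0755"/>
    <schema location="/none" provider="none"/>
</feed>
```

A file like this would be submitted and scheduled with the feed commands shown earlier, after the two cluster entities it refers to have been submitted.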
Falcon is a distributed application, and its servers can be deployed across multiple clusters if needed. It transforms entity definitions into repeatable actions, using Oozie as its scheduler. Its high-level architecture is shown next: