Thoughts on Data Auditing

From the perspective of a Data Lake, auditing is an important feature. Data arrives from various sources and departments and under various asset classifications (secret, public, and so on), and because of this variety some of it has special security and handling requirements. Certain data in the lake needs tracking of both the changes it undergoes and who accesses it, for legal and contractual reasons.

In the source system, data is kept only for as long as it is needed to carry out day-to-day activity (the production period). After that, it is usually categorized as non-production and is archived or taken offline. A Data Lake has no real concept of archived data, so the data needs access control and auditing (of the changes it undergoes, such as transformations) at all times. Not all data in the lake requires this, but some does, and it has to be dealt with.

Doing this is a big ask, but it pays off in the long run, especially for data that is categorised as highly secure. Auditing requires capturing both the old data and the changed (new) data, along with metadata such as who made the change and when it was made.
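
The shape of such an audit record can be sketched in a few lines of Python. This is an illustration only: the AuditRecord class, its field names, and the record_change helper are hypothetical and do not come from any particular product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Dict, List


@dataclass(frozen=True)
class AuditRecord:
    """One immutable entry in a (hypothetical) audit trail."""
    dataset: str               # which data set was changed
    changed_by: str            # who made the change
    changed_at: datetime       # when the change was made (UTC)
    old_value: Dict[str, Any]  # state before the change
    new_value: Dict[str, Any]  # state after the change


def record_change(trail: List[AuditRecord], dataset: str, user: str,
                  old: Dict[str, Any], new: Dict[str, Any]) -> AuditRecord:
    """Append an audit entry holding copies of the old and new data."""
    entry = AuditRecord(
        dataset=dataset,
        changed_by=user,
        changed_at=datetime.now(timezone.utc),
        old_value=dict(old),  # copy, so later edits do not alter the trail
        new_value=dict(new),
    )
    trail.append(entry)
    return entry
```

Calling record_change(trail, "customers", "alice", {"email": "a@old"}, {"email": "a@new"}) would append one entry that preserves both versions of the data together with the user and a timestamp.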

The Data Lake, as detailed earlier, could be zoned according to data asset classification (high, medium, low), and auditing can then be enabled for the data that demands it. Once auditing is enabled, the lake should be capable of triggering appropriate alerts to admins according to the configured rules, and of producing reports showing risks and compliance, as the case may be.
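
As a rough sketch of how zoning and rule-based alerting could fit together, consider the Python below. Everything in it is assumed for illustration: the AUDITED_ZONES mapping, the AccessEvent and Rule types, the two sample rules, and the evaluate function are hypothetical names, not the API of any tool.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical zones keyed by asset classification; auditing is
# switched on only for the zones that demand it.
AUDITED_ZONES = {"high": True, "medium": True, "low": False}


@dataclass
class AccessEvent:
    user: str
    dataset: str
    zone: str    # "high", "medium" or "low"
    action: str  # e.g. "read", "write", "delete", "export"


@dataclass
class Rule:
    name: str
    matches: Callable[[AccessEvent], bool]


# Sample rules; a real deployment would load these from configuration.
RULES = [
    Rule("delete in high zone",
         lambda e: e.zone == "high" and e.action == "delete"),
    Rule("export from audited zone",
         lambda e: e.action == "export"),
]


def evaluate(event: AccessEvent, rules: List[Rule]) -> List[str]:
    """Return one alert message per rule that the event matches."""
    # Assumption: zones missing from the mapping are audited by default.
    if not AUDITED_ZONES.get(event.zone, True):
        return []
    return [
        f"ALERT [{rule.name}]: {event.user} {event.action} {event.dataset}"
        for rule in rules
        if rule.matches(event)
    ]
```

With these sample rules, a delete in the high zone yields one alert, while the same action in the low zone yields none because auditing is not enabled there. The collected alert messages could equally feed a periodic risk and compliance report.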

Having all the preceding capabilities completes the auditing requirement. To recap, they are:

  • Appropriate controls to access data
  • Tracking data change
  • The capability to trigger alerts and flag risks based on configured rules

Configuring Apache Atlas (a technology we briefly discussed earlier) along with Apache Ranger (which we discussed briefly in the security section) could give us the data auditing capability that we are looking for.

Atlas performs the necessary auditing function, and Ranger handles authorization for the data in the Data Lake.
