Developing a progression process

Data science is all about experimentation. IoT analytics has great potential. However, no one is certain of how it will develop. Put them both together and an enormous amount of experimentation should be expected. To keep this from becoming an impossible-to-manage mess, there needs to be a way to progress ad hoc datasets from early development into repeatable and stable data products.

Setting up a progression process will help manage this. Data science is highly iterative, which makes it difficult to find a clear point that signals a change in state, as there would be with traditional database development projects.

A way to handle this is by setting up regular review periods where the team determines which datasets are ready to progress to the next stage of development. Decide how often this should happen, and define the stages and the corresponding requirements for each, based on your unique situation.

We will review a suggested progression path that you can tailor to the needs of your unique IoT analytics environment.

Segment your data lake into three general areas:

  • Sandbox: A data scientist has full access to read and write in this area. They may have a sandbox to themselves in addition to one for the team. This is for initial experimentation and model development.
  • Mature: A data scientist has full access to this area but does not have their own mature environment; the team shares it. All code and scripts used to generate datasets kept in this area should be under source code control.
  • Production: A data scientist has full read access but no write capability. Datasets in this area have been fully tested, code and scripts used to generate datasets are under source code control, and a change control process is in place.
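The three areas can be summarized in a small sketch. The paths and the permission model below are illustrative assumptions rather than a prescribed layout; adapt them to whatever storage backs your data lake (S3, HDFS, ADLS, and so on).

```python
# Illustrative sketch of the three data lake areas and the access rules
# described above. Path names and metadata fields are assumptions.

DATA_LAKE_AREAS = {
    "sandbox": {
        # Individual or team sandboxes for initial experimentation
        "path": "/datalake/sandbox/{user_or_team}/",
        "data_scientist_access": {"read", "write"},
        "source_control_required": False,
        "change_control_required": False,
    },
    "mature": {
        # Shared by the team; generating code must be under source control
        "path": "/datalake/mature/",
        "data_scientist_access": {"read", "write"},
        "source_control_required": True,
        "change_control_required": False,
    },
    "production": {
        # Fully tested; read-only for data scientists; change control applies
        "path": "/datalake/production/",
        "data_scientist_access": {"read"},
        "source_control_required": True,
        "change_control_required": True,
    },
}


def can_write(area: str) -> bool:
    """Return True if a data scientist may write to the given area."""
    return "write" in DATA_LAKE_AREAS[area]["data_scientist_access"]
```

Encoding the rules in one place like this makes it easy to enforce them consistently, for example when wiring up storage-level permissions or validating a promotion request.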

The suggested progression process to move datasets between areas is as follows:

  • Establish a regular recurring review of your datasets: This could be every month, every quarter, or semi-annually. Use whichever makes the most sense for how quickly your IoT analytics practice is developing. But the review should be regular and enforced. With the small iterations that occur in analytics, it is too easy to keep delaying it. Avoid this trap and require your team to do it.
  • Review the datasets in all three areas for any that should be deleted: These could be development datasets that never made it far, or old versions of ones that are now in production. Take a cue from Java and have a regular garbage collection event to keep your data lake optimized.
  • Review Sandbox datasets that are ready to move to the Mature area: These are the ones that are either project-specific and ready for team testing, or useful to the team for their general work. The latter should be rebuilt by regularly scheduled jobs. These datasets will help accelerate many projects in the future.
  • Review Mature datasets for ones that are ready to move to Production: These are typically more project-specific and have passed all the testing. Once a dataset is moved into production, future changes should be minimal. At this point, control of the dataset should be handed over to a separate group to maintain and provide service-level support.
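The review steps above can be sketched as a simple decision rule applied to each dataset. The metadata fields and the staleness threshold here are hypothetical, chosen only to illustrate the logic; your own review criteria will differ.

```python
# A minimal sketch of the recurring review decision, assuming each dataset
# carries metadata about its area, source control status, test status, and
# recent use. All field names and thresholds are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    area: str                   # "sandbox", "mature", or "production"
    under_source_control: bool
    tests_passed: bool
    last_used_days_ago: int


def review_action(ds: Dataset, stale_after_days: int = 180) -> str:
    """Decide what happens to a dataset at the recurring review."""
    # Garbage collection: delete abandoned development datasets
    if ds.area != "production" and ds.last_used_days_ago > stale_after_days:
        return "delete"
    # Sandbox -> Mature requires the generating code to be in source control
    if ds.area == "sandbox" and ds.under_source_control:
        return "promote to mature"
    # Mature -> Production additionally requires all testing to have passed
    if ds.area == "mature" and ds.under_source_control and ds.tests_passed:
        return "promote to production and hand over to support group"
    return "keep"
```

Automating even a rough rule like this keeps the review honest: every dataset gets a decision at every review, rather than quietly lingering in the sandbox.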