Quality assurance

With an initial data ingestion capability implemented, and data streaming onto your platform, you will need to decide how much quality assurance is required at the "front door". It's perfectly viable to start with no quality controls at all and build them up over time (retrospectively scanning historical data as time and resources allow). However, it is often prudent to install a basic level of verification from the outset: checks such as file integrity, parity checking, completeness, checksums, type checking, field counting, overdue files, security field pre-population, denormalization, and so on.
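
For illustration, here is a minimal Scala sketch of two such front-door checks: verifying an MD5 checksum against a sidecar file shipped with the data, and validating the field count of a delimited feed. The pipe-delimited layout, the 12-field record, and the .md5 sidecar convention are all assumptions made purely for this example.

    import java.nio.file.{Files, Paths}
    import java.security.MessageDigest

    object FrontDoorChecks {
      // Assumed record layout: pipe-delimited, 12 fields per line.
      val ExpectedFieldCount = 12

      // Compare the file's MD5 digest with the hex digest shipped
      // alongside it in a (hypothetical) .md5 sidecar file.
      def checksumOk(dataPath: String, md5Path: String): Boolean = {
        val bytes    = Files.readAllBytes(Paths.get(dataPath))
        val digest   = MessageDigest.getInstance("MD5").digest(bytes)
        val actual   = digest.map("%02x".format(_)).mkString
        val expected = new String(Files.readAllBytes(Paths.get(md5Path))).trim
        actual.equalsIgnoreCase(expected)
      }

      // Reject the file if any line has an unexpected number of fields;
      // the -1 limit preserves trailing empty fields when splitting.
      def fieldCountOk(dataPath: String): Boolean = {
        val source = scala.io.Source.fromFile(dataPath)
        try source.getLines().forall(_.split("\\|", -1).length == ExpectedFieldCount)
        finally source.close()
      }
    }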

You should take care that your up-front checks do not take too long. Depending on the intensity of your examinations and the size of your data, it's not uncommon to find that there is simply not enough time to complete all processing before the next dataset arrives. You will need to monitor your cluster resources continually and calculate the most efficient use of the time available.

Here are some examples of the types of rough capacity planning calculations you can perform:

Example 1 - Basic quality checking, no contending users

  • Data is ingested every 15 minutes and takes 1 minute to pull from the source
  • Quality checking (integrity, field count, field pre-population) takes 4 minutes
  • There are no other users on the compute cluster

There are 10 minutes of resources available for other tasks (15 - 1 - 4).

As there are no other users on the cluster, this is satisfactory - no action needs to be taken.

Example 2 - Advanced quality checking, no contending users

  • Data is ingested every 15 minutes and takes 1 minute to pull from the source
  • Quality checking (integrity, field count, field pre-population, denormalization, sub dataset building) takes 13 minutes
  • There are no other users on the compute cluster

There is only 1 minute of resource available for other tasks (15 - 1 - 13).

You will probably need to consider one or more of the following:

  • Configuring a resource scheduling policy
  • Reducing the amount of data ingested
  • Reducing the amount of processing you undertake
  • Adding additional compute resources to the cluster

Example 3 - Basic quality checking, 50% utility due to contending users

  • Data is ingested every 15 minutes and takes 1 minute to pull from the source
  • Quality checking (integrity, field count, field pre-population) takes 4 minutes at 100% utility
  • There are other users on the compute cluster, leaving only 50% of the resources available to you

There are 6 minutes of resources available for other tasks (15 - 1 - (4 * (100 / 50))). Since there are other users, there is a danger that, at least some of the time, you will not be able to complete your processing and a backlog of jobs will build up.
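
The arithmetic in all three examples follows the same pattern and can be captured in a small helper; the function name headroomMins and its parameters are hypothetical, introduced only to make the calculation explicit:

    // Rough capacity planning: how many minutes remain in each ingest cycle
    // for other work. cycleMins is the time between ingests, pullMins the
    // time to pull from the source, checkMins the quality-checking time at
    // 100% utility, and utilityPct the share of the cluster available to you.
    def headroomMins(cycleMins: Double, pullMins: Double,
                     checkMins: Double, utilityPct: Double): Double =
      cycleMins - pullMins - (checkMins * (100.0 / utilityPct))

    headroomMins(15, 1, 4, 100)  // Example 1: 10.0 minutes
    headroomMins(15, 1, 13, 100) // Example 2: 1.0 minute
    headroomMins(15, 1, 4, 50)   // Example 3: 6.0 minutes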

When you run into timing issues like these, you have a number of options for avoiding a backlog:

  • Negotiating sole use of the resources at certain times
  • Configuring a resource scheduling policy (see the sketch after this list), including:
    • YARN fair scheduler: allows you to define queues with differing priorities, and to target your Spark jobs by setting the spark.yarn.queue property on start-up so that your job always takes precedence
    • Dynamic resource allocation: allows concurrently running jobs to scale automatically to match their utilization
    • Spark scheduler pool: allows you to define queues when sharing a SparkContext using a multithreading model, and to target your Spark jobs by setting the spark.scheduler.pool property per execution thread so that a given thread takes precedence
  • Running processing jobs overnight when the cluster is quiet
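
As a rough illustration of the scheduling options above, the following Scala sketch shows how you might target a dedicated YARN queue at submission time and route a thread's jobs to a fair-scheduler pool when sharing a SparkContext. The queue name "ingest" and pool name "highPriority" are assumptions; they must already exist in your YARN and fairscheduler.xml configuration.

    // Submit to a dedicated YARN queue so the ingest job always has capacity
    // (the queue name "ingest" is an assumption and must exist in your YARN
    // scheduler configuration):
    //
    //   spark-submit --conf spark.yarn.queue=ingest \
    //                --conf spark.dynamicAllocation.enabled=true ...

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("QualityChecks")
      .config("spark.scheduler.mode", "FAIR") // enable fair scheduling within the application
      .getOrCreate()

    // When sharing one SparkContext across threads, route this thread's jobs
    // to a pool defined in fairscheduler.xml (the pool name is an assumption):
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "highPriority")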

In any case, you will eventually get a good idea of how the various parts of your jobs perform and will then be in a position to calculate what changes could be made to improve efficiency. There is always the option of throwing more resources at the problem, especially when using a cloud provider, but we would certainly encourage the intelligent use of existing resources - it is far more scalable, it is cheaper, and it builds data expertise.
