With an initial data ingestion capability implemented, and data streaming onto your platform, you will need to decide how much quality assurance is required at the "front door". It's perfectly viable to start with no initial quality controls and build them up over time (retrospectively scanning historical data as time and resources allow). However, it may be prudent to install a basic level of verification from the outset: for example, checks such as file integrity, parity checking, completeness, checksums, type checking, field counting, overdue-file detection, security field pre-population, denormalization, and so on.
You should take care that your up-front checks do not take too long. Depending on the intensity of your examinations and the size of your data, it's not uncommon to encounter a situation where there is not enough time to perform all processing before the next dataset arrives. You will always need to monitor your cluster resources and calculate the most efficient use of time.
Here are some examples of the types of rough capacity planning calculations you can perform:

- A dataset arrives every 15 minutes, takes 1 minute to scan, and 4 minutes to process on the full cluster. There are 10 minutes of resources available for other tasks (15 - 1 - 4). As there are no other users on the cluster, this is satisfactory - no action needs to be taken.
- A dataset arrives every 15 minutes, takes 1 minute to scan, and 13 minutes to process on the full cluster. There is only 1 minute of resource available for other tasks (15 - 1 - 13). You probably need to consider either provisioning more cluster resources or making the processing itself more efficient.
- A dataset arrives every 15 minutes, takes 1 minute to scan, and 4 minutes to process, but is only permitted 50% of the cluster. There are 6 minutes of resources available for other tasks (15 - 1 - (4 * (100 / 50))). Since there are other users, there is a danger that, at least some of the time, we will not be able to complete our processing and a backlog of jobs will occur.
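The arithmetic behind these worked examples can be captured in a small helper (a sketch; the function name and parameterization are our own):

```python
def headroom_minutes(arrival_interval, scan_time, process_time,
                     cluster_share_pct=100):
    """Minutes per arrival interval left over for other tasks.

    When the job is restricted to a percentage of the cluster, processing
    time is scaled up to its full-cluster-equivalent cost.
    """
    return arrival_interval - scan_time - process_time * (100 / cluster_share_pct)

headroom_minutes(15, 1, 4)      # 10.0 minutes spare
headroom_minutes(15, 1, 13)     # 1.0 minute spare
headroom_minutes(15, 1, 4, 50)  # 6.0 minutes spare: 15 - 1 - (4 * (100 / 50))
```

A negative result means the job cannot keep up with the arrival rate at all, and a backlog is guaranteed rather than merely possible.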
When you run into timing issues, you have a number of options available to you in order to circumvent any backlog:

- Submit your job to a dedicated YARN queue by setting the spark.yarn.queue property on start-up, so your job always takes precedence.
- Run your jobs within a single SparkContext using a multithreading model, and target your Spark job by setting the spark.scheduler.pool property per execution thread, so your thread takes precedence.
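The first option is applied at submission time; a minimal sketch, in which the queue name and application script are placeholders and the queue is assumed to already exist in your YARN scheduler configuration:

```shell
# Submit the ingestion job to a dedicated, higher-priority YARN queue.
# "ingest" and ingest_job.py are placeholders for your own queue and application.
spark-submit \
  --master yarn \
  --conf spark.yarn.queue=ingest \
  ingest_job.py
```

For the second option, Spark's FAIR scheduler lets each driver thread call `sc.setLocalProperty("spark.scheduler.pool", "<pool-name>")` before submitting work, so that jobs launched from that thread land in the named pool.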
In any case, you will eventually get a good idea of how the various parts of your jobs perform and will then be in a position to calculate what changes could be made to improve efficiency. There's always the option of throwing more resources at the problem, especially when using a cloud provider, but we would certainly encourage the intelligent use of existing resources - this is far more scalable, cheaper, and builds data expertise.