Data collection system requirements

When designing any system, the first step is to list exactly what is required of it. This not only focuses the design effort but also helps surface and resolve differing expectations among stakeholders early.

So, let's list some of the major functional and nonfunctional requirements for the data-collection system:

The system should support both pull-based and push-based data collection

With pull-based data collection, we should be able to pull data from a given source at regular intervals. Most of the systems we collect data from are pull-based: they expose APIs that our data-collection system can call to pull data efficiently.

Push-based data collection requires the data-collection system to expose an API that end users can use to push data into the system. This kind of system is much more complicated to implement if the API must be exposed to the outside world, because you then need to implement aspects such as authentication and authorization, rate limiting, and graceful degradation of service. For this reason, a push-based collection API is usually implemented only for trusted clients, whether internal teams or partners, and it is the exception rather than the norm. It is an exception because the collected data will be used to derive inferences that potentially benefit your business, so you should be in control of what you collect.
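To make the extra work behind a push API concrete, here is a minimal sketch of an ingestion entry point that checks for a trusted client and applies a per-client rate limit. The names `TokenBucket` and `PushEndpoint`, the token-bucket algorithm, and the HTTP-style status codes are assumptions chosen for illustration; the text does not prescribe any of them, and a real endpoint would sit behind an HTTP server.

```python
import time
from typing import Callable, Dict, List, Set


class TokenBucket:
    """Permits bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PushEndpoint:
    """Hypothetical in-process stand-in for an HTTP ingestion endpoint."""

    def __init__(self, trusted_keys: Set[str],
                 bucket_factory: Callable[[], TokenBucket]) -> None:
        self.trusted_keys = trusted_keys
        self.bucket_factory = bucket_factory
        self.buckets: Dict[str, TokenBucket] = {}
        self.accepted: List[dict] = []

    def ingest(self, api_key: str, record: dict) -> int:
        """Return an HTTP-style status code for one pushed record."""
        if api_key not in self.trusted_keys:
            return 401  # unknown client: reject before doing any work
        if api_key not in self.buckets:
            self.buckets[api_key] = self.bucket_factory()
        if not self.buckets[api_key].allow():
            return 429  # client exceeded its rate limit
        self.accepted.append(record)
        return 202  # accepted for downstream processing
```

Injecting the clock keeps the limiter deterministic in tests. Graceful degradation, which the text also mentions, is not covered by this sketch.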

Data collectors should be easy to develop and deploy without disrupting existing execution

You will collect data from various sources at varying frequencies, and the collected data will feed into your business-processing pipeline. You do not want to disturb your existing business processes just because a new collector needs to be deployed. Thus, data collectors should be independent of each other, and each should have its own development, deployment, and release cycle.

The system should be able to schedule tasks reliably on behalf of the end user

This requirement is in some ways an extension of the first requirement, specifically pull-based collection. The collection system should provide a convenient way for the end user not only to schedule collection tasks but also to query the status of each execution along with its associated metadata. Such a component helps end users keep tabs on the collection processes and build an audit trail.
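As a rough illustration of this requirement, the following sketch schedules named tasks at fixed intervals and records every execution so that its status can be queried later. `CollectionScheduler`, `Execution`, and the in-memory history are hypothetical names and simplifications; a reliable scheduler would persist both the schedule and the history.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Execution:
    """One audit-trail entry for a single run of a collection task."""
    task: str
    started_at: float
    status: str        # "succeeded" or "failed"
    detail: str = ""


class CollectionScheduler:
    def __init__(self) -> None:
        # Task name -> [interval in seconds, callable, next run time].
        self._tasks: Dict[str, list] = {}
        self._history: List[Execution] = []

    def schedule(self, name: str, interval_s: float,
                 func: Callable[[], int], first_run: float = 0.0) -> None:
        self._tasks[name] = [interval_s, func, first_run]

    def run_due(self, now: float) -> None:
        """Run every task whose next run time has arrived and record the outcome."""
        for name, entry in self._tasks.items():
            interval_s, func, next_run = entry
            if now < next_run:
                continue
            try:
                count = func()
                self._history.append(
                    Execution(name, now, "succeeded", f"{count} records"))
            except Exception as exc:
                # A failed run is recorded rather than propagated.
                self._history.append(Execution(name, now, "failed", repr(exc)))
            entry[2] = now + interval_s

    def history(self, name: str) -> List[Execution]:
        """Return the recorded executions of one task, oldest first."""
        return [e for e in self._history if e.task == name]
```

In this sketch the caller drives time by passing `now` to `run_due`, which keeps the example deterministic; a production scheduler would run its own loop or delegate to an existing scheduling service.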

The system should support pulling data using various protocols, such as HTTP(S), FTP, REST, SOAP, and AMQP

Since we do not know in advance which protocol a given data source will expose, the system should be easy to extend to any new protocol with minimal redesign.

The system should be technology-agnostic

Anybody should be able to develop a data collector, and it should work seamlessly with the overall system. This becomes important when you have not standardized on a programming language, the team is distributed across the globe, or team members have different skill sets. Instead of molding the developers around the system, the system should be flexible enough to be used by every developer, irrespective of their language skills. This is a very important point, as we will see shortly.
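One possible way to achieve this, shown here purely as an assumption rather than the design this chapter arrives at, is to treat a collector as any executable that writes one JSON record per line to standard output. The host system then needs to know only the command to run, not the language the collector is written in. The function name `run_collector` and the JSON-lines contract are hypothetical.

```python
import json
import subprocess
from typing import List


def run_collector(command: List[str], timeout_s: float = 30.0) -> List[dict]:
    """Run an external collector and parse its stdout as JSON lines.

    The collector can be written in any language, as long as it prints one
    JSON object per line and exits with status 0 on success.
    """
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    if proc.returncode != 0:
        raise RuntimeError(
            f"collector exited with {proc.returncode}: {proc.stderr.strip()}")
    # Skip blank lines; every remaining line must be a complete JSON document.
    return [json.loads(line) for line in proc.stdout.splitlines() if line.strip()]
```

A collector written in Go, Java, or a shell script would be invoked in exactly the same way, for example `run_collector(["./orders-collector", "--since", "2020-01-01"])`, where the executable name and flag are made up for illustration.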

Failure isolation should be built into the system

If a specific data collector, or set of collectors, fails, the entire system should not fail, and the other collectors should continue working without any problems. This directly implies that the system should be loosely coupled.
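The idea can be sketched in a few lines: each collector runs independently, and a failure in one is recorded instead of being allowed to stop the others. The helper `run_isolated` below is a hypothetical name, and running collectors as threads in one process is a simplification; it contains exceptions but not crashes or resource exhaustion, for which separate processes or services would be needed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple


def run_isolated(collectors: Dict[str, Callable[[], object]]) -> Dict[str, Tuple[str, object]]:
    """Run every collector and report each outcome separately.

    Returns a mapping of collector name to ("ok", result) or ("failed", message).
    """
    outcomes: Dict[str, Tuple[str, object]] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(collectors))) as pool:
        futures = {name: pool.submit(fn) for name, fn in collectors.items()}
        for name, future in futures.items():
            try:
                outcomes[name] = ("ok", future.result())
            except Exception as exc:
                # The failure stays confined to this collector's entry.
                outcomes[name] = ("failed", str(exc))
    return outcomes
```

With one healthy and one failing collector, the healthy one still returns its data, and the caller can decide what to do about the failure, such as retrying or alerting.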

There may be other requirements specific to your needs, but for the majority of systems, these requirements serve as a solid foundation for designing a data-collection system.

With these requirements guiding us, let's now start converting them into an architecture. As you have learned by now, any architecture definition typically involves defining the architectural principles, assumptions, capabilities, components, and deployment information. There are multiple ways of defining the architecture of a system. For a complex system, it makes sense to trace each aspect of the architecture back to the requirements listed by the stakeholders. For example, if we list a distributed system as an architectural principle, it links directly to the second and sixth requirements (independently developed collectors and failure isolation). Such an approach gives you clarity and helps you stay on course, or correct course, without incurring much overhead.

But for this chapter, we will keep things simple. We will list the architectural principles in brief, we will assume certain things, and then we will design our system components.
