What constitutes a data ecosystem?

Nowadays, data comes from a variety of sources, at varying speeds, and in a number of different formats. Understanding data and its relevance is the most important task for any data-driven organization. 

To understand the importance of data, the data czars in an organization should look at all possible sources of data that may be important to them. Being far-sighted helps, although, given the pace of modern society, it is almost impossible to gather data from every relevant source. Hence, it is important that the people who identify relevant data sources are also well aware of the business landscape in which they operate. This knowledge will help tremendously in averting problems later. They should also be aware that data can be sourced from both inside and outside an organization, since, at the broadest level, data is first classified as either internal or external.

Given the opportunity, internal data should be converted into information first. Handling internal data first helps the organization to understand its importance early in the life cycle, without needing to set up a complex system, thereby keeping the process agile. It also gives the technical team an opportunity to understand which technology and architecture would be most appropriate in their situation. Distinguishing internal from external data also helps organizations not only to put a reliability rating on the data, but also to define any security rules that apply to it.

So, what are the different sources of data that an organization can utilize to its benefit? The following diagram depicts a part of the landscape that constitutes the data ecosystem. I say "a part" because the landscape is so huge that listing every source in it would not be possible:

The data sources mentioned above can be categorized as internal or external, depending upon the business segment in which an organization is involved. For example, for an organization such as Facebook, all the social media-related data on its website would constitute an internal source, whereas the same data would represent an external source for an advertising firm.

As you may have already noticed, the preceding set of data can broadly be classified into three sub-categories:

Structured data

This type of data has a well-defined structure that can be parsed easily by any standard machine parser, and it usually comes with a schema that defines that structure. For example, incoming data in XML format with an associated XML schema constitutes what is known as structured data. Examples of such data include Customer Relationship Management (CRM) data and Enterprise Resource Planning (ERP) data.
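
To make the idea concrete, here is a minimal Python sketch of how a standard parser handles a structured record. The XML layout, the element names, and the `REQUIRED_FIELDS` tuple are assumptions made up for illustration; the tuple is only a simplified stand-in for a real XML schema, which Python's standard library does not validate on its own.

```python
import xml.etree.ElementTree as ET

# Hypothetical CRM record; the element names are illustrative only.
CUSTOMER_XML = """
<customer id="42">
  <name>Ada Lovelace</name>
  <email>ada@example.com</email>
  <since>2019-03-01</since>
</customer>
"""

# Simplified stand-in for a schema: fields every record must contain.
REQUIRED_FIELDS = ("name", "email", "since")


def parse_customer(xml_text):
    """Parse one customer record and return it as a plain dict."""
    root = ET.fromstring(xml_text.strip())
    record = {"id": int(root.attrib["id"])}
    for field in REQUIRED_FIELDS:
        element = root.find(field)
        if element is None or element.text is None:
            # A record that breaks the agreed structure is rejected outright.
            raise ValueError(f"missing required field: {field}")
        record[field] = element.text.strip()
    return record


customer = parse_customer(CUSTOMER_XML)
```

Because the structure is known in advance, every field can be pulled out by name, and a record that does not match the expected structure can be rejected before it enters the system.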

Semi-structured data

Semi-structured data consists of data that does not have a formal schema associated with it. Log data from different machines can be regarded as semi-structured data. For example, a firewall log statement consists of the following fields as a minimum: the timestamp, host IP, destination IP, host port, and destination port, as well as some free text describing the event that took place resulting in the generation of the log statement. 
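
The following Python sketch shows one way such a log statement might be split into its fixed fields and its free text. The log layout, the sample line, and the names `LOG_PATTERN` and `parse_firewall_line` are assumptions for illustration; real firewall products each use their own formats.

```python
import re

# Assumed layout: timestamp, host IP:port, "->", destination IP:port, free text.
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+)\s+"
    r"(?P<host_ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<host_port>\d+)\s+->\s+"
    r"(?P<dest_ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<dest_port>\d+)\s+"
    r"(?P<message>.*)"
)


def parse_firewall_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Ports are numeric; the message stays as unparsed free text.
    fields["host_port"] = int(fields["host_port"])
    fields["dest_port"] = int(fields["dest_port"])
    return fields


sample = (
    "2024-01-15T10:32:07Z 10.0.0.5:51514 -> 203.0.113.9:443 "
    "DENY outbound connection blocked by policy"
)
entry = parse_firewall_line(sample)
```

Note that no schema ships with the log: the pattern encodes what we believe the layout to be, and the trailing message remains free text that needs further processing before it yields information.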

Unstructured data

Finally, we have data that is unstructured. When I say unstructured, what I really mean is that, looking at the data, it is hard to derive any structured information directly from the data itself. It does not mean that we can't get information from the unstructured data. Examples of unstructured data include video files, audio files, and blogs, while most of the data generated on social media also falls under the category of unstructured data.

One thing to note about any kind of data is that, more often than not, each piece of data will have metadata associated with it. For example, when we take a picture using our cellphone, the picture itself constitutes the data, whereas its properties, such as when it was taken, where it was taken, what the focal length was, its brightness, and whether it was modified by software such as Adobe Photoshop, constitute its metadata.
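
The split between data and metadata can be sketched in Python as follows. The `Photo` class, its field names, and the sample values are hypothetical and chosen only to mirror the cellphone example; real image files carry their metadata in formats such as EXIF, which this sketch does not read.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Photo:
    pixels: bytes  # the data itself (placeholder bytes here)
    metadata: dict = field(default_factory=dict)  # facts about the data


photo = Photo(
    pixels=b"placeholder-image-bytes",
    metadata={
        "taken_at": datetime(2024, 5, 4, 18, 30),
        "location": (48.8584, 2.2945),  # latitude, longitude
        "focal_length_mm": 26,
        "edited_with": None,  # e.g. the name of an editing tool, if any
    },
)


def taken_in_year(photos, year):
    """Select photos by metadata alone, without inspecting the pixels."""
    return [
        p for p in photos
        if p.metadata.get("taken_at") is not None
        and p.metadata["taken_at"].year == year
    ]


matches = taken_in_year([photo], 2024)
```

The point of the sketch is that the metadata is structured and queryable even when the data it describes, the image itself, is not.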

Sometimes, it is also difficult to categorize data clearly. Consider a security firm that sells hardware appliances to its customers. Each appliance is installed at the customer's location and collects access log data. This data belongs to the end customer, who has given permission for it to be used for one specific purpose: detecting security threats. Thus, even though the data resides at the security organization, it still cannot be used (without consent) for any purpose other than detecting a threat for that specific customer.

This brings us to our next topic: data sharing.
