11
Putting All Together

After reading this chapter, you should be able to

  • Compare Big Data platform design alternatives
  • Review Big Data systems and tools
  • Discuss various Big Data challenges

The previous chapters discussed the components of Big Data platforms and the technology requirements of modern Big Data platforms, exploring the technologies used in different settings and the tools and systems that address those demands. This chapter brings them together by comparing platform design alternatives, reviewing Big Data systems and tools, and discussing the challenges of designing Big Data platforms.

11.1 Platforms

Designing and building a Big Data platform requires a series of decisions. Working solutions are available and often used by organizations to deal with Big Data; however, these solutions might be incomplete, incoherent, and hard to maintain. An overall vision for the platform helps organizations avoid this. Nevertheless, shifting from individual systems to a platform requires effort and dedication from executive management and the other parties involved.

Big Data platforms rest on infrastructure, and one of the key factors is the deployment model. An in‐house solution requires additional maintenance costs because the organization must provision both resources and talent. On the other hand, infrastructure created by a cloud provider can be used to build the platform. The last alternative is a hybrid infrastructure, where some resources and systems come from the cloud provider and the rest of the infrastructure is an in‐house solution.

The key considerations are the long‐term cost and the agility of the adopted solution. If a cloud provider is chosen and later becomes overly costly, the decision turns out to be a poor one. On the other hand, if an in‐house solution is chosen and cannot keep up with new technology, e.g. upgrades or maintenance, the system will not be agile enough. Hence, fundamental infrastructure decisions should be made under scrutiny, as the current state of the systems and infrastructure plays an important role in these decisions. This section compares the different infrastructure choices for Big Data platforms.

11.1.1 In‐house Solutions

An in‐house solution requires provisioning, upgrades, maintenance, and support of the software stack regardless of where it is deployed. An in‐house solution can be employed on top of on‐premise hardware or on resources managed by cloud providers. When utilizing an in‐house solution, there is full control over the software: depending on the needs, the software can be patched or adapted to the environment. This flexibility comes with a cost. Any system that a company maintains needs administration. The level of administration depends on the software, but the administration cost is always part of the problem. Moreover, the level of administration depends on the infrastructure the software runs on. If it is an on‐premise infrastructure, administrators need to worry about hardware failures as well as software problems.

11.1.1.1 Cloud Provisioning

Cloud provisioning of hardware makes it easy to scale by adding more nodes to the infrastructure through the cloud provider's API or user interface. It is also possible to create cloud provider‐specific images and deploy them automatically. Software updates and maintenance of the overall system are relatively easy because removing virtual hardware costs nothing. If one or more nodes in the overall system become unhealthy for any reason, the administrator can replace them with new virtual nodes. Most Big Data solutions are horizontally scalable, so adding and removing nodes is not burdensome.
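
As a rough illustration of how such provisioning can be scripted, the sketch below adds a worker node through one provider's API (AWS EC2 via boto3); the image ID, instance type, and tags are hypothetical placeholders, and other providers expose equivalent calls.

    # Minimal sketch: provision one additional worker node via a cloud API.
    # All identifiers below are hypothetical examples.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # pre-built image with the Big Data stack
        InstanceType="r5.2xlarge",         # memory-heavy worker node
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "role", "Value": "bigdata-worker"}],
        }],
    )
    print(response["Instances"][0]["InstanceId"])

Replacing an unhealthy node then amounts to terminating the old instance and running the same call again.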

Although cloud‐based hardware provisioning is convenient, the software still needs maintenance and support. A dedicated team of administrators is needed to support the rest of the company, with onboarding processes for new team members and time to ramp up on the infrastructure that has been built. What's more, cloud‐based hardware provisioning is costlier than owning hardware. The ability to easily provision new nodes weighs on the budget because virtual hardware is not cheap; it can often be an order of magnitude more expensive than traditional hardware.

11.1.1.2 On‐premise Provisioning

On‐premise hardware provisioning requires a different level of expertise, both technical and nontechnical. It requires knowledge of Big Data system software and of hardware configuration for servers, racks, and switches. It also entails new roles, such as data center technicians. Ultimately, on‐premise provisioning often needs an upfront investment and ongoing planning for growth.

The maintenance of bare metal hardware involves different processes and automation. Provisioning and configuration management software such as Ansible, Chef, and Terraform can define the infrastructure as code. Moreover, hardware provisioning requires hardware experience and negotiation with hardware vendors. Warranty, support, and price are all part of the maintenance process that the company needs to handle.

Hardware provisioning also requires some physical work. One or more data center technicians are needed to put together new racks, switches, and servers. When hardware problems occur in various components, the technicians have to be engaged, and some problems need to be addressed immediately. Such requirements end up imposing on‐call duties on technicians and administrators.

If the company does not have an existing on‐premise infrastructure, it is really hard to invest in such an area just for Big Data platforms. If the company already has to maintain its own infrastructure due to scale or compliance, it is easier to convince the executive team to sponsor on‐premise Big Data systems. Even after the initial investment, the team has to plan for growth and for maintenance costs such as hardware failures. Growth can usually be estimated from trends unless the company sees huge spikes in usage.

Although it seems quite painful to have an on‐premise infrastructure for Big Data platforms, it can also be very cost‐efficient. Once the hardware is bought, the rest of the expense is pretty much electricity. If the company already has expertise in maintaining infrastructure, it would not be a huge undertaking to provision a new cluster of machines. Furthermore, the performance of bare metal servers can be far better than virtual hardware. With on‐premise hardware, the company also gets a chance to upgrade physical components, such as disks, while maintaining the rest.

Bare metal instances are also available from cloud providers. Nevertheless, this amounts to renting the hardware, not purchasing it. Purchasing actual hardware and tuning it is what keeps the cost down; otherwise, the difference is small, and renting bare metal might not be a good choice given the additional maintenance problems.

11.1.2 Cloud Providers

Cloud providers implement Big Data systems on top of open‐source software solutions. They offer similar systems with pre‐configured parameters optimized to run on their infrastructure. Additionally, they provide data storage facilities, such as object storage and data warehousing solutions. Most cloud solutions integrate very well within the boundaries of a cloud provider. Integration points can exist either through workflow solutions or some scripting that runs on serverless computing. There are many reasons why a cloud provider might be a good fit for Big Data platforms, such as affordability, time‐to‐market, ongoing cost, talent, and scalability.

The upfront investment cost for Big Data systems might be a big problem for smaller organizations. For a small but fast‐growing organization, tying up money in infrastructure can become a hurdle. Moreover, there is no guarantee that the overall platform would work for the organization. At times, it might be wiser to buy a service rather than implement the solution in‐house. In such cases, scrapping the existing platform and outsourcing Big Data needs to a provider might satisfy the needs.

Cloud providers make it very easy to build a Big Data platform rapidly. Many recipes are available for building data storage, data processing, data discovery, and data science systems, and these systems integrate seamlessly within a cloud provider. From data transfers to data processing, each cog needed to implement the overall platform is ready. Moreover, iterative development of system components is possible; therefore, building a minimum viable product (MVP) is easier. Once the MVP is built, scaling the platform is more natural, as it is very probable that the systems underneath the platform can scale horizontally by adding nodes.

Cloud providers also charge the company steadily, based on the infrastructure expenses actually incurred, and there is no upfront payment. At any point, the infrastructure can be destroyed, and there is no need to keep it running all day long. For example, a Spark application can be spun up to read data from the object store and save the result to a data warehouse. Hence, there is no reason to provision permanent resources for such tasks, and the organization can save money.
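
As a sketch of this pattern, the following PySpark snippet reads raw data from an object store, aggregates it, and writes the result to a data warehouse over JDBC; the bucket, table, and connection details are hypothetical, and the appropriate JDBC driver is assumed to be on the classpath.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("daily-aggregation").getOrCreate()

    # Read raw events from the object store (path is a hypothetical example).
    events = spark.read.parquet("s3a://example-data-lake/events/date=2021-06-01/")

    # Aggregate and push the result to the warehouse over JDBC.
    daily = events.groupBy("country").count()
    (daily.write
          .format("jdbc")
          .option("url", "jdbc:postgresql://warehouse.example.com:5432/analytics")
          .option("dbtable", "daily_country_counts")
          .option("user", "etl_user")
          .option("password", "***")
          .mode("overwrite")
          .save())

    spark.stop()

Because the cluster exists only for the duration of this job, it can be destroyed as soon as the write finishes.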

Although cloud providers offer excellent benefits, they still have downsides like vendor lock‐in and outages.

11.1.2.1 Vendor Lock‐in

Cloud providers offer many solutions, not only for Big Data but for many other areas. An organization that depends on the cloud provider can seamlessly use solutions within that provider. Nevertheless, there is no standard for many of these solutions, and interoperability or portability between cloud vendors is not great either. Once a Big Data platform gets built in one cloud provider using its offering, it becomes increasingly hard to move to another provider, and doing so raises many technical and nontechnical challenges.

An organization can try to stay away from solutions that require tight integration with a cloud provider. Nonetheless, this strategy can limit the number of solutions available and might require additional software components to be built. A good solution might be an abstraction layer on top of cloud providers. Most cloud providers offer object stores with similar interfaces. A standard interface can be implemented for all common operations and integrated with the cloud provider. Hence, the cloud provider in the infrastructure would become a configuration. If the same strategy is applied to all components, the vendor lock‐in problem can disappear.
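
One possible shape for such an abstraction layer is sketched below: a small interface for the common object‐store operations with provider‐specific implementations selected from configuration. The class and bucket names are hypothetical, and a real implementation would cover more operations such as listing and deleting objects.

    from abc import ABC, abstractmethod

    class ObjectStore(ABC):
        """Provider-neutral interface for the operations the platform needs."""

        @abstractmethod
        def put(self, key: str, data: bytes) -> None: ...

        @abstractmethod
        def get(self, key: str) -> bytes: ...

    class S3Store(ObjectStore):
        def __init__(self, bucket: str):
            import boto3
            self._s3 = boto3.client("s3")
            self._bucket = bucket

        def put(self, key: str, data: bytes) -> None:
            self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)

        def get(self, key: str) -> bytes:
            return self._s3.get_object(Bucket=self._bucket, Key=key)["Body"].read()

    class GCSStore(ObjectStore):
        def __init__(self, bucket: str):
            from google.cloud import storage
            self._bucket = storage.Client().bucket(bucket)

        def put(self, key: str, data: bytes) -> None:
            self._bucket.blob(key).upload_from_string(data)

        def get(self, key: str) -> bytes:
            return self._bucket.blob(key).download_as_bytes()

    def object_store_from_config(cfg: dict) -> ObjectStore:
        # The provider becomes just a configuration value.
        return {"aws": S3Store, "gcp": GCSStore}[cfg["provider"]](cfg["bucket"])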

11.1.2.2 Outages

Cloud providers promise high uptime guarantees, and in most cases they even beat them. However, a cloud provider can still go offline for one or more services in a region. If the Big Data platform hosts critical components such as fraud detection, such an outage can become a huge issue. Depending on the risks and budget, platform components can be designed to live in multiple cloud providers; nevertheless, such systems are not cheap to develop and maintain.

11.1.3 Hybrid Solutions

Hybrid solutions were briefly discussed in the Big Data storage chapter, with on‐demand computing as the main approach. Cloud providers support on‐demand computing by creating and destroying a computing cluster. The on‐demand computing model helps finish jobs quicker with a larger cluster and incurs costs only while the job runs. The idea for a hybrid platform follows the same model. Experimental setups can live with cloud providers due to the ease of development.

On‐demand computing enables the processing of offline tasks such as large daily ETLs or model training. Upon receiving results for these tasks, the output can be streamed down to an on‐premise cluster. The tricky part is building tooling to support computing both on‐premise and in the cloud. Ideally, how the computing gets done should be invisible to data analysts, data engineers, and data scientists; the only thing that matters is the output of the computing. Enabling on‐demand computing requires a setup where on‐premise and cloud providers can share data. For instance, an object store might share data between an on‐premise cluster and cloud solutions like data warehouses and data marts. Nevertheless, some needs, such as stream processing, might not be met very well with this model.

Experimental projects can always leverage cloud provider solutions as a head start. Once a project takes concrete shape, utilizing an on‐premise cluster or investing in a particular solution might become an option. Cloud providers enable building solutions without taking on much risk other than human resources. Nevertheless, it might be harder to implement the same infrastructure on an on‐premise cluster and stabilize it; cloud providers offer better stability simply because they serve many customers and have had to address many edge cases.

Although hybrid solutions provide access to cloud offerings, they still have problems such as vendor lock‐in and downtime. Another alternative is to use containerized software and build solutions on top of Kubernetes.

11.1.3.1 Kubernetes

Kubernetes is a powerful open‐source container orchestration system. At a high level, Kubernetes runs and orchestrates containerized applications over a cluster of machines, manages their complete life cycle, and has become the industry standard, available from all cloud providers. This book has referred to containerized applications and Kubernetes several times. Kubernetes is becoming a popular system on which Big Data applications can run easily. The advantage of running Big Data applications on Kubernetes comes from interoperability: for example, the same Spark job can run on an on‐premise Kubernetes cluster or on one hosted by a cloud provider.

Due to the rapid adoption of Kubernetes in many areas, new versions of Big Data solutions come with direct support for Kubernetes. Applications like Spark and Airflow support Kubernetes natively. Moreover, most data science applications come with direct integration with Kubernetes for training purposes. When it comes to streaming solutions, there is still some work to be done on Kubernetes adoption, but applications will support it better eventually. Thus, Kubernetes can provide an application layer that runs in any environment.
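
As a minimal sketch of this interoperability, the snippet below configures a Spark session against a Kubernetes cluster manager; the API server address and container image are hypothetical placeholders, and client‐mode details such as driver networking are omitted. Pointing the same application at a managed Kubernetes cluster from a cloud provider only changes these configuration values.

    from pyspark.sql import SparkSession

    # The application code stays the same whether the cluster is on-premise
    # or cloud-hosted; only the master URL and image change (hypothetical values).
    spark = (
        SparkSession.builder
        .appName("portable-spark-job")
        .master("k8s://https://k8s.example.internal:6443")
        .config("spark.kubernetes.container.image", "registry.example.com/spark:3.1.2")
        .config("spark.executor.instances", "4")
        .getOrCreate()
    )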

11.2 Big Data Systems and Tools

Throughout the book, many different Big Data tools and systems have been discussed. The amount of tooling available in the Big Data ecosystem keeps multiplying, and a Big Data platform needs only some of it; a platform can potentially achieve similar results with a fraction of the systems discussed. When choosing systems and tooling, common sense and a prudent strategy must guide the adoption of new technologies. The methods chosen should solve business problems and deliver good results. In this section, previously mentioned technologies and some new ones are discussed.

11.2.1 Storage

The amount of data that needs to be stored for Big Data is substantial. The solutions need to address ease of access, cost, and resiliency. The storage system or systems should allow tools and users to explore data under well‐defined access policies. Business pillars should be able to expand the storage by buying hardware or services while keeping costs reasonably low. Storage is the base layer: the storage system chosen should replicate the data and rarely, if ever, lose it. The system can go offline due to outages but should still recover, disaster scenarios aside. Given these requirements, there are two ways of storing Big Data: one is storing it for offline purposes only, and the other handles online and offline purposes together. The Big Data platform can use both kinds of solutions and have integrations between the two.

11.2.1.1 File‐Based Storage

A Big Data platform needs a foundational layer for storing data, and file‐based storage systems address this suitably. File‐based storage systems are fundamental for data lakes and for data intake from many sources. They are durable, cost‐efficient, and have granular access controls. There are two prevalent file‐system‐like storage solutions: the Hadoop distributed file system (HDFS) and object stores.

HDFS is designed to handle an enormous amount of data. If HDFS is configured only to store data, with dense nodes, the amount of data one node can store goes even higher. With this setup, HDFS is responsible for data storage, and processing systems stream data out of HDFS to process it. One of the key aspects of realizing such a mechanism is to set up adequate bandwidth for the nodes. Bandwidth is essential for recovery and streaming: applications that stream data from HDFS should not get clogged because of the bandwidth, and in case of failure there is more data to stream back to a replacement node. Thus, bandwidth plays an important role in recovery time, and recovery will always take more time with dense nodes because a single node stores much more data.
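
A rough back‐of‐envelope calculation shows why node density and bandwidth dominate recovery time; the numbers below are illustrative assumptions, not measurements.

    # Illustrative estimate of the time to re-replicate a failed node's data.
    # All figures are assumptions for the sake of the example.
    node_capacity_tb = 100          # data stored on the failed (dense) node
    recovery_bandwidth_gbps = 10    # aggregate bandwidth available for recovery

    bytes_to_move = node_capacity_tb * 1e12
    bytes_per_second = recovery_bandwidth_gbps * 1e9 / 8

    hours = bytes_to_move / bytes_per_second / 3600
    print(f"~{hours:.1f} hours to restore replication")   # roughly 22 hours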

The other alternative for file‐based storage is object stores. Object stores expose a file‐system‐like API with directories, client tools, web user interfaces, access control, and integration with many Big Data processing tools. The advantage of object stores is maintenance: compared to an HDFS‐based solution, object stores do not require much administration other than access control. They also have extra features like object tagging. As discussed in the storage chapter, an organization can leverage both HDFS and an object store through connectors. However, an object store should be enough for most organizations if there are no compliance requirements.

11.2.1.2 NoSQL

Horizontally scalable data storage systems are vital for large‐scale distributed services. In some scenarios, it is ideal to mirror changes from the production database to the analytics database immediately. Luckily, some NoSQL solutions can easily mirror production data through different replication strategies. Technologies like Spark can connect to the analytics replicas through connector APIs and execute analytics queries. Cassandra, HBase, and MongoDB are well‐known NoSQL solutions that readily allow analytics processing. Moreover, cloud providers also have their own in‐house NoSQL database solutions. NoSQL solutions such as Cassandra are discussed in Appendix A.

11.2.2 Processing

Throughout the book, Big Data processing has been reviewed from different angles. Big Data processing involves different methodologies depending on the needs of the organization. Organizations might need near‐real‐time systems to make fast decisions, accurate computation over the data, or both at the same time. As discussed in the previous chapters, all data streams in from different sources. The question is whether the system has to answer a user request directly or within a period that is not bound to the context of a user interaction.

Depending on the requirements, the Big Data platform might have to offer a solution that uses batch processing, stream processing, or a combination of the two. For example, pipelines are set up for machine learning: models are trained offline and deployed to production, streaming data is prepared, and the prepared data is fed into the trained model to get results. The tricky part is often maintaining a consistent view, shared understanding, and code sharing among the different parts of the solution pipeline. The following sections discuss techniques for Big Data processing, with additional information about methods for combining them.

11.2.2.1 Batch Processing

Batch processing requires the accumulation of data over a window of time. The window size can vary from minutes to days. For near‐real‐time systems, the batch window is typically on the order of minutes or hours; for offline operations, it could be on the order of days or more. Batch processing can run on huge data sizes, as it can involve days' worth of data. Thus, the computation technology should address the needs for recovery and performance.

If the system executes micro‐batches, computing data in memory becomes a natural choice considering the performance gains. With periodic checkpoints to recover from failures, the system can be robust enough. For extensive processing, it can become important to write intermediate results to persistent storage to avoid massive recomputation between stages. Technologies like Spark can address both needs, while technologies like Hive are appropriate for huge tasks. Meanwhile, a technology like Presto is a good choice for medium‐size data, as it cannot recover from failures. Apart from dedicated batch systems, stream processing engines can also be leveraged for batch processing; technologies like Flink can handle both stream and batch processing.

Regardless of the batch operation or technology, there are many common problems such as data skew, data type errors, and complexity. The data skew problem can happen due to anomalies, e.g. bots. There are a couple of ways to deal with data skew: one is to partition the data differently or apply techniques like salting; another is to preprocess outliers and remove them. For data type errors, a similar approach can be applied: if a field is an integer and some portion of the data does not conform to integer values, those records can be dismissed; otherwise, the whole batch operation would be rejected. The complexity comes from many ETL pipelines feeding data from many sources. Reviewing pipelines from time to time to decrease complexity can help, but organizations usually do not have enough time for it.
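
As a hedged sketch of the salting technique mentioned above, the PySpark snippet below spreads a skewed key over several artificial partitions before aggregating; the input path and column names are hypothetical.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("salting-example").getOrCreate()
    events = spark.read.parquet("s3a://example-data-lake/events/")   # hypothetical path

    # Spread a hot key (e.g. a bot-heavy user_id) over N artificial partitions
    # by appending a random salt, aggregate per salted key, then merge.
    N = 16
    salted = events.withColumn("salt", (F.rand() * N).cast("int"))
    partial = (salted.groupBy("user_id", "salt")
                     .agg(F.count("*").alias("partial_count")))
    totals = (partial.groupBy("user_id")
                     .agg(F.sum("partial_count").alias("event_count")))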

11.2.2.2 Stream Processing

Stream processing can address many needs, such as detecting anomalies and fraudulent transactions and improving customer experience. Stream processing also relies on windows, but the window sizes are minimal, and its infrastructure generally involves a messaging layer and perhaps a processing engine. Messaging middleware also offers stream processing capabilities; nevertheless, streaming engines give much more control and flexibility over what one can run.

Messaging middleware is the base for stream processing. Logs, feeds, and events can stream through messaging middleware. Logging solutions like Fluentd can publish logs to messaging middleware. Applications can create feeds for tracking, monitoring, and so on. In addition, applications can create events for various user actions or flows. Messaging middleware solutions such as Kafka or Pulsar can do quick aggregations on these different types of messages. Although simple aggregations cover some of the surface area, they do not cater to all stream processing needs.

Stream processing engines can combine many streams and generate insights, alarms, and actions based on the aggregations. Solutions such as Heron, Flink, and Spark Streaming integrate well with messaging middleware and other data sources and therefore provide many different ways to aggregate data. Stream processing engines can serve a good range of infrastructure needs.
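
As one hedged example of such an integration, the snippet below uses Spark Structured Streaming to consume a Kafka topic and count events per key in one‐minute windows; the broker address and topic name are hypothetical, and the Kafka connector package is assumed to be on the classpath.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("stream-aggregation").getOrCreate()

    raw = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "kafka.example.com:9092")
                .option("subscribe", "user-events")
                .load())

    counts = (raw.selectExpr("CAST(key AS STRING) AS key", "timestamp")
                 .groupBy(F.window("timestamp", "1 minute"), "key")
                 .count())

    query = (counts.writeStream
                   .outputMode("update")
                   .format("console")
                   .start())
    query.awaitTermination()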

11.2.2.3 Combining Batch and Streaming

Many organizations need both batch and stream processing at the same time. The lambda architecture has been the adopted solution for a while to combine the two: it attempts to provide accuracy through batch processing and real‐time results through stream processing. Lambda architecture is discussed further in Appendix A. Although it helps achieve both stream and batch processing, it is hard to maintain due to the different code bases for batch and stream processing. Several technologies have come onto the scene to address both needs.

Stream engines like Flink can handle both stream and batch processing: batch processing runs on bounded streams, and stream processing runs on unbounded streams. Internally, stream engines optimize performance on bounded streams through specialized algorithms and data structures, while handling events promptly for unbounded streams. APIs for bounded and unbounded streams might differ slightly; luckily, engines are evolving to provide a single interface for both.

Another approach to combining batch and stream processing comes from abstraction layers such as Apache Beam. As discussed in Appendix A, solutions like Apache Beam offer a unified programming model that can represent and transform data sets of any size, whether they arrive as bounded or unbounded streams and from any data source. The programming model internally handles bounded and unbounded data through different components. Moreover, such abstractions can run on multiple backends such as Flink, Spark, and Samza, making them easy to adapt or evolve.
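
A minimal Beam pipeline in Python illustrates the unified model; the input and output paths are hypothetical, and the runner option could be switched to a Flink or Spark runner without changing the transforms.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(runner="DirectRunner")   # swap for a Flink or Spark runner

    with beam.Pipeline(options=options) as p:
        (p
         | "Read" >> beam.io.ReadFromText("input/logs-*.txt")
         | "Split" >> beam.FlatMap(lambda line: line.split())
         | "Pair" >> beam.Map(lambda word: (word, 1))
         | "Count" >> beam.CombinePerKey(sum)
         | "Write" >> beam.io.WriteToText("output/wordcount"))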

11.2.3 Model Training

Machine‐learning algorithms have corresponding implementations in popular machine‐learning frameworks such as scikit‐learn. The main task is generally setting up good automation for machine learning model training. With a good workflow setup, model training can run on the provided environment, such as Kubernetes, Spark, and so forth. Once the model gets trained, the workflow might involve steps to push the model to production for consumption. The Big Data platform's main goal is to allow data scientists to train their models and easily productionalize them.
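
A hedged sketch of one such training step is shown below: it trains a scikit‐learn model and persists the artifact so that a later workflow step can push it to production. The dataset path, columns, and artifact location are hypothetical.

    import joblib
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # One workflow step: train a model and persist the artifact for deployment.
    df = pd.read_parquet("features/training_set.parquet")
    X, y = df.drop(columns=["label"]), df["label"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    print("holdout accuracy:", model.score(X_test, y_test))

    joblib.dump(model, "artifacts/model-v1.joblib")   # picked up by the deploy step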

11.2.4 A Holistic View

The solution space of Big Data systems and tools falls into a few verticals: storage, processing engines, exploration, pipelines, and supporting tools, as illustrated in Figure 11.1. Depending on the company's budget and size, there should be a solution for each vertical to provide a reliable Big Data platform overall. Although there are many alternatives in each of the verticals, it is better to invest resources consistently. For example, if a company wants to use Spark as its processing engine, it may adjust the rest of the software stack to be compatible with it. However, this does not mean people in the organization should be limited to the tooling chosen for the Big Data platform.

Figure 11.1 Big Data platform verticals.

The storage layer consists of a foundation like HDFS or an object store into which data can be poured. The processing layer can be tools like Spark and Flink. The processing technologies can run on multiple resource managers such as Kubernetes or YARN; it depends on how the company invests in infrastructure. If the company already has experience with Kubernetes, running Spark on Kubernetes might be easier than building a Hadoop cluster. For exploration, many software choices can be leveraged, such as Presto, and data may be transferred to data warehouses or data marts for exploration and business intelligence. When it comes to pipeline tools, there are good choices like Airflow and Kubeflow. Various supporting tools can be used for security, access control, discovery, and other quality concerns.
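
As an illustration of the pipeline vertical, the sketch below wires the ingest, process, and load steps of a hypothetical daily pipeline into an Airflow DAG; the task commands are placeholders for the organization's own scripts.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_events_pipeline",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        ingest = BashOperator(task_id="ingest", bash_command="python ingest.py")
        process = BashOperator(task_id="process", bash_command="spark-submit process.py")
        load = BashOperator(task_id="load", bash_command="python load_warehouse.py")

        ingest >> process >> load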

While constructing solutions, the harmony of the tools should be kept in mind. Solutions sometimes do not achieve the desired results; even then, good integration with the existing tools and systems should be considered. For instance, most Apache projects integrate well with each other. When building a system or tool, existing projects should be incorporated. Integrations help solutions get adopted faster, allow systems to generate value quicker, and expand the product horizon.

11.3 Challenges

Designing a Big Data platform is a long ride amidst many challenges. It entails an understanding of the overall usage, deployment, and maintenance of several Big Data systems. The more components are introduced, the more challenging it becomes. Each new system under a Big Data platform should seamlessly integrate with the rest. The Big Data platform also has to evolve constantly: adding new solutions and deprecating old ones must be done for a better platform experience. In this section, the challenges of designing Big Data platforms are discussed.

11.3.1 Growth

Growth is a nice challenge to have. It means the organization is doing well and the needs are multiplying; potentially, there is much more budget to spend on the Big Data platform as well. Growth can happen in many dimensions: not just raw data growth but also growth in the number of data sources and the number of systems. Each new data source and Big Data system requires investment in human resources and infrastructure.

Data growth is a natural problem for all growing organizations. The growth comes from additional services and new customers or users, and it requires thorough research into the data's patterns and seasonality effects. More data requires adding more nodes to multiple systems in the infrastructure. If the platform lives in a cloud environment, it is easier to adjust to spikes. It is harder to adjust immediately for an on‐premise solution, so growth predictions should also include a buffer to meet sudden demands. Another matter to consider is the delivery time of new hardware from the day of order, since experience shows that order completion times can vary a lot.

As the organization grows, there can be more teams, more partners, and more acquisitions, and it becomes necessary to integrate with the new data sources to make the best decisions. New data source integration might involve simple configuration or serious development time when integrations happen through APIs. Adding more integration points increases the operational responsibility and coverage of a Big Data platform. The more resources are shared, e.g. on YARN, the more contention can happen between different sets of jobs.

Lastly, the organization might need to explore more systems and adopt more tools due to different needs across different organizational pillars. The Big Data platform should support new technologies and adopt new solutions. Nevertheless, adopting a Big Data system or solution is not easy: it has to be configured, well understood, and operationalized, and all of this takes time and experience.

11.3.2 SLA

Big Data shapes decisions daily, hourly, or even on the scale of minutes. A delay in Big Data processing may cause revenue losses and inefficiencies. Therefore, delivering data on time is a significant concern for Big Data platforms. Well‐timed delivery depends on many factors like downstream data, resource management, runbooks, and human errors. The Big Data platform should address these concerns from many directions to limit interruptions.

A large portion of interruptions comes from downstream processing delays. Several things can be done to address downstream processing problems. The first step is to identify the owner; if the platform has a discovery tool, this is relatively easy, and the owner can be paged about the delivery problem. The next is to allow partial delivery when it is acceptable. The last is to hold a retrospective meeting about the delivery problem with the owning team.

If the platform depends on an execution engine shared among many teams, then a contention problem caused by one team can result in delays. Luckily, most resource managers have ways to separate workloads through queues and namespaces. The Big Data platform should allow easy configuration of workloads. Moreover, preemption can prevent workloads from exceeding their limits and provide a fair share of resources depending on the attributed quota.

Setting up service level agreements (SLAs) is a good way to prevent further delays in processing. Nevertheless, SLAs without actionable items would be rather useless and would only wake someone up for no good reason. For each SLA defined, an associated runbook can be attached to the workflow or discovery tool. The platform should provide the necessary tooling for setting up runbook documentation.

Humans are error‐prone, and the platform can offer various safeguards to decrease incidents. One of the key risks in data processing is making changes and applying them to the data. The platform should enable people to make private runs, so they can modify a script or workflow easily and test the results on sampled data. Another good piece of tooling is anomaly detection, so that results can be validated; anomaly detection can catch problems using various simple aggregations.
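
A very simple form of such anomaly detection compares the latest run against a trailing baseline, as sketched below; the counts and the threshold of three standard deviations are arbitrary example values.

    import statistics

    def row_count_anomaly(history, today, threshold=3.0):
        """Flag today's row count if it deviates from the trailing baseline."""
        mean = statistics.mean(history)
        stdev = statistics.pstdev(history) or 1.0   # guard against zero variance
        return abs(today - mean) > threshold * stdev

    # Example: a sudden drop in delivered rows triggers the alert.
    recent = [1_020_000, 995_000, 1_050_000, 1_010_000, 1_000_000]
    print(row_count_anomaly(recent, today=430_000))   # True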

11.3.3 Versioning

Versioning data is not fun in any part of the software stack, and it is not fun in the Big Data world either. Versioning data includes documenting metadata, data validation, schema evolution, and tracking changes to data. The Big Data platform should offer solutions for the different aspects of data versioning; nevertheless, there is no single tool that encapsulates all of these needs.

Documentation seems like a burden, but it is an excellent form of written communication. A person looking at a data model may not know the conditions under which the model was designed. A piece of information attached to the model can help maintainers or people from other departments understand the context. In theory, every model should be self‐documented: data definition language (DDL) files for the model should be source‐controlled, and models should have descriptions and comments on their attributes.

Moreover, additions such as ALTER commands should also carry attribute comments. With this approach, the organization would not have any undocumented models. If the platform has a data discovery tool that can crawl models together with their documentation, people can quickly discover them on demand.

One of the cleanup tasks is dealing with corrupted data. Validating data can partially address corrupted data issues: if there are proper validation steps when publishing or ingesting data, the cleanup step can be lighter. Validation steps can check data against different validation rules. Type validation is the most obvious one, where the validation step checks whether the input matches the type defined in the schema. The domain might also require rules that involve multiple fields or custom validation. The critical part is the integration of validation with the rest of the infrastructure; it is hard to implement a central mechanism that validates all models, as they come from multiple sources. The better the validation, the cleaner the data the platform receives.
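
A minimal sketch of record‐level type validation is shown below: records that do not conform to the schema are dismissed instead of rejecting the whole batch. The schema and sample records are hypothetical.

    # Hypothetical schema: field name -> expected Python type.
    SCHEMA = {"user_id": int, "country": str, "amount": float}

    def is_valid(record: dict) -> bool:
        return all(
            field in record and isinstance(record[field], expected)
            for field, expected in SCHEMA.items()
        )

    records = [
        {"user_id": 42, "country": "DE", "amount": 9.99},
        {"user_id": "not-an-int", "country": "DE", "amount": 9.99},
    ]
    clean = [r for r in records if is_valid(r)]
    rejected = [r for r in records if not is_valid(r)]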

The evolution of schemas is both a technical and a nontechnical effort. If schema updates contain no breaking changes relative to the older version, they can happen worry‐free. Nevertheless, breaking changes require deprecation and coordination steps. Deprecation involves making sure that no downstream data depends on the deprecated attribute and that consumers use alternative or new fields. Downstream teams should receive updates so that they can apply the necessary changes in time. The platform might offer a discovery tool that provides lineage over fields; once downstream entities are identified, their owners can be informed of upcoming field updates.

Keeping track of changes that happened to data is another challenge. It is particularly important for machine learning tasks. Data scientists often need to version experiments to produce the same results given the same conditions. They want to keep track of data sets and the code that runs the experiments to build the model. Moreover, a data scientist might want to work on the same data for different purposes. Keeping a stable version of data becomes essential for collaboration.

11.3.4 Maintenance

System maintenance is the standard process of applying updates to existing software. It helps improve performance, address security problems, and fix bugs. In the Big Data realm, there are many tools and interconnected systems to maintain, and teams build additional tooling or systems for internal use cases. Keeping all these systems and tools in production and up to date is hard. While running an update, many things can go wrong, and no system, regardless of how small, is free of maintenance and administration. Some maintenance tasks can be offloaded to cloud providers or Big Data system vendors. Maintenance has several categories, such as server maintenance, Big Data system maintenance, and tool maintenance.

Server maintenance includes hardware maintenance and OS maintenance. Hardware maintenance requires replacing parts such as disks, CPUs, and cooling fans. The first task in hardware maintenance is the detection of hardware failures through alarms, typically with a tool such as Nagios. The rest is understanding the nature of the failure, removing the server from the cluster, ordering parts if necessary or contacting the manufacturer under warranty, and putting the server back in the cluster. The second part is OS maintenance, which includes applying security patches, updating system software, and OS upgrades. Ideally, the server can be taken out of the server farm, upgraded smoothly, and put back with the others. Nevertheless, updates might not go that smoothly; therefore, applying updates to servers in a test environment first decreases the chance of surprises in production.

Big Data system maintenance covers software updates and upgrades to Big Data systems. The most common way of applying updates and upgrades is through a rolling update mechanism. Operational teams generally implement playbooks that execute the update procedure, and they can test these playbooks in virtual environments to catch potential issues before they reach production. There are also steps for client tools: with upgrades on the systems, client tools and user machines need updates too.

To reduce the time and resources spent on maintenance, cloud providers offer solutions that require much less maintenance thanks to virtualization. Moreover, there is management software like Ambari or Cloudera Manager to help with maintenance, monitoring, and deployment. Some of these solutions also have enterprise plugins and add‐ons that reduce maintenance further through additional functionality. Even with good solutions for deployment and maintenance, some operations, such as integrations and onboarding, still need to be done. The more systems the platform has, the higher the operational burden.

11.3.5 Deprecation

Big Data is still an emerging field, with many tools and systems appearing to replace existing solutions or solve new problems. Although new technologies become available, many organizations run outdated solutions. These solutions may still do the job, but they might not receive updates or support anymore. Before adopting new solutions, operational teams need to slowly deprecate existing systems and move users and dependent systems off the system that is on the deprecation path.

The deprecation process starts with analyzing the existing system and why it makes sense to deprecate it in favor of another. The next step is to make a feature comparison between the old system and the new one; often, new systems do not have feature parity with the existing system. The third step is to find the dependent clients and systems that use the existing system. With all the necessary information compiled, the deprecation process can begin.

The deprecation process can start by notifying existing customers with a clear plan for the deprecation. Many customers might not be happy with the decision, since they need to apply migration steps. Nevertheless, it is costly to operate two systems serving the same purpose without much benefit in a Big Data platform; maintaining the existing system, even if it is very stable, still requires some effort and resources. Once the operational team is sure about the deprecation, they can execute it. Some people will miss all the announcements and their systems will break, but there is not much to be done about it; the best mitigation is to help them with the migration.

11.3.6 Monitoring

Monitoring is a critical piece of Big Data platforms. Monitoring involves keeping track of physical resources, virtualized resources, cloud services, the systems running on them, and the data hosted on them. It is quite challenging to monitor every system under a Big Data platform, and it requires engagement from the operational team. The more centralized the monitoring solution, the easier it gets to have a global understanding of systems and services. Monitoring needs some combination of metrics recording, visualization, alerting, and planning.

Recording metrics from multiple platforms into a central location is quite challenging. Operational teams can leverage tools like Prometheus to record metrics from various components, and there are Prometheus exporters for numerous systems in the Big Data area. Cloud providers offer their own metric recording solutions, and there are also enterprise solutions that help collect data. Recording metrics can commence using these solutions. One beneficial practice is heartbeat monitoring, which essentially gives simple information about whether a system is up or down. Other helpful metrics are the overall latency and the load on a system. With a combination of these three, critical problems can be detected easily. However, teams can leverage many other metrics to get a better understanding of the systems in the Big Data platform.
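
As a small sketch of how heartbeat and latency metrics can be exposed for central collection, the snippet below uses the Python Prometheus client; the metric names and the probed endpoint are hypothetical.

    import time
    import requests
    from prometheus_client import Gauge, start_http_server

    UP = Gauge("warehouse_up", "1 if the warehouse endpoint responds, else 0")
    LATENCY = Gauge("warehouse_latency_seconds", "Round-trip time of the health check")

    def probe(url="http://warehouse.example.com/health"):
        start = time.time()
        try:
            ok = requests.get(url, timeout=5).status_code == 200
        except requests.RequestException:
            ok = False
        UP.set(1 if ok else 0)
        LATENCY.set(time.time() - start)

    if __name__ == "__main__":
        start_http_server(8000)        # metrics scraped from :8000/metrics
        while True:
            probe()
            time.sleep(30)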

Once the metrics are available in a central repository, the teams can use them to set up dashboards, configure alerts, and plan various parts of the Big Data platform. Dashboards can give a quick overview of the systems and components in the Big Data platform. Alerting is necessary to keep systems healthy and responsive. Lastly, metrics provide a glance at the direction in which systems are heading; with this information, management can estimate costs and order hardware or plan expansion.

11.3.7 Trends

New technologies keep emerging in the Big Data realm that make the Big Data life cycle more comfortable. On the other hand, organizations cannot spend unlimited engineering resources on adopting new technologies. Organizations need to be cognizant of the latest technologies while delivering results using existing Big Data systems. Although new technologies are attractive at first, they often come with stability and community problems.

Most organizations do not need bleeding‐edge technology for their Big Data platform. Early adopters of such technologies have to deal with many edge cases until the technology gets stable, and in many cases the organization has to contribute back to the technology. While some organizations are in a position to contribute back, many are not. Therefore, in most cases organizations should closely track impactful changes that have proven to work for other organizations.

Moreover, replacing or upgrading existing technology inside the company is not free either. People within the organization have become comfortable with one or more systems, and introducing new technology can cause friction if there are no obvious benefits. Even if there are apparent advantages, there are engineering challenges in integrating new technology into the existing Big Data platform. However, this does not mean the platform should not accept superior technologies; the Big Data platform should evolve with better systems and tooling when needed.

11.3.8 Security

Security has been discussed from different angles in a previous chapter. Security for Big Data platforms is about tooling, managing access control on data, and planning for breaches and leaks. Data leaks can damage an organization's image and also put the organization's users or customers at risk. Credential leaks can give bad actors access to take down systems or steal sensitive data. Data breaches can result in the exposure of confidential details through a direct attack on the organization.

Security is not only a problem of the Big Data platform; it is a general problem for the organization. Nevertheless, while building systems and solutions for Big Data platforms, standard security practices should be applied. In the early days of Big Data technologies, security was not the first issue that people tackled. Nowadays, most open‐source Big Data systems come with security solutions such as vaults, credential APIs, and integrations with authentication/authorization providers. Security aspects should be considered in the design of a Big Data platform.

11.3.9 Testing

Big Data processing involves multiple data sources, data arriving in large volumes and formats, billions of rows, and the need to make critical decisions. With so many systems blended together, many steps can go wrong. Testing can partly help with error pruning. Several kinds of testing strategies keep systems and the data they produce consistent, reproducible, and reliable. In a Big Data platform, parts of the machinery can use unit testing, functional testing, integration testing, performance testing, system testing, and A/B testing.

Unit testing is well‐established for many languages with good framework support, yet it is still hard to implement in the context of ETL jobs. A unit test is good for testing the behavior of one component. If expectations can be set for a given data set, it is possible to unit test the pipeline's behavior: hand‐provided data can be placed in the input sources, the ingestion part mocked, and the expected rows in the outcome table asserted. The execution engine needs to understand schema and table replacement to run unit tests. Ideally, test data should stay in the unit test schema and be ready for other test runs. If tests can be added as prechecks before accepting code, a continuous integration (CI) process with additional automation can be achieved.
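
A hedged sketch of such a test is shown below: a single transform is exercised on hand‐provided data using pandas and pytest, standing in for the ingestion and outcome tables; the column names are hypothetical.

    import pandas as pd

    def add_daily_totals(events: pd.DataFrame) -> pd.DataFrame:
        """Transform under test: aggregate event amounts per user and day."""
        return (events.groupby(["user_id", "date"], as_index=False)["amount"]
                      .sum()
                      .rename(columns={"amount": "daily_total"}))

    def test_add_daily_totals():
        # Hand-provided input replaces the ingestion step.
        events = pd.DataFrame({
            "user_id": [1, 1, 2],
            "date": ["2021-06-01"] * 3,
            "amount": [10.0, 5.0, 7.5],
        })
        result = add_daily_totals(events)
        assert result.loc[result.user_id == 1, "daily_total"].item() == 15.0
        assert result.loc[result.user_id == 2, "daily_total"].item() == 7.5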

Functional tests check that there are no errors in the various phases of the Big Data life cycle. In the data acquisition phase, they validate the completeness of the data, and various integrity checks can be done to make sure errors are within bounds. Once the data is validated, the next step is to check the data processing and the overall pipeline outcome. If the data conforms to the expected standards, which can be tested through anomaly detection, there is a level of confidence that the data is reliable. The last part is data presentation, where a final validation check of the outcome table or dashboard is done.

Big Data systems integrate with many components, and those components are constantly changing. There is always a chance that the integration with a component breaks for various reasons, such as credential rotation. Integration tests should basically check whether the platform is still able to integrate two components. These tests can run hourly or a handful of times a day so that problems get resolved before a massive backlog of jobs fails because of them.

Performance testing is important to determine whether the system needs additional resources and to find bottlenecks in the platform. Bottlenecks can arise in data ingestion, data processing, or data consumption out of the platform. Performance testing can happen across many components to get an overview of the platform, and systems can also be tested individually. System performance testing can give an overall idea of how a given Big Data system performs under heavy load and summarize throughput and other measures. End‐to‐end performance testing can reveal which systems get overwhelmed under stress; identifying those systems unlocks the opportunity to tune them.

A/B testing is the last method and is very useful for assessing the effectiveness of various machine‐learning models. It helps in understanding the effects of variables in a controlled environment and provides an opportunity to quantify errors correctly; it can help identify the probability of making an error given the machine learning model. A/B testing requires post‐processing to determine bias and statistical analysis to determine the outcome of a test. Data science life cycle tools help with setting up A/B testing, and some solutions specifically support running A/B tests.
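
As one hedged example of the statistical analysis step, the snippet below applies a two‐proportion z‐test to compare the conversion rates of a control and a treatment group; the counts are hypothetical, and the 0.05 significance level is an arbitrary convention.

    from statsmodels.stats.proportion import proportions_ztest

    # Hypothetical conversions and impressions for control and treatment groups.
    conversions = [1_320, 1_480]
    impressions = [25_000, 25_000]

    z_stat, p_value = proportions_ztest(count=conversions, nobs=impressions)
    if p_value < 0.05:
        print(f"Significant difference between variants (p={p_value:.4f})")
    else:
        print(f"No significant difference detected (p={p_value:.4f})")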

There are many other methodologies and strategies for testing a Big Data platform from various angles. An organization might not implement all testing strategies, but it should define what is essential and dedicate resources to making sure the chosen strategies get implemented.

11.3.10 Organization

A major challenge in designing, building, and evolving a Big Data platform is the organization itself. If the organization does not have a data‐driven culture, it is hard to establish key habits and make room for developing a Big Data platform. Thus, it is critical to get the organization on board with Big Data. Organizations may not change easily, and the value proposition of Big Data might not be obvious to everyone. Leadership has to establish the organization's position on Big Data. Even if the organization understands Big Data's value, it may lack data‐driven decision‐making, a common strategy, and a unified approach.

Embedding Big Data into daily decision‐making activities requires changing habits. Some organizations are inherently better at it because they have been data‐driven from their establishment. In many cases, incorporating Big Data into daily routines needs executive sponsorship. Leadership can promote a data‐driven approach to key decisions, and when people get positive results with data, the organization can celebrate those achievements so that data becomes the norm for action.

In many cases, organizations want to use Big Data for many decisions. They start building pipelines and try to make sense of the data in some way, and different business units make disparate attempts to get the best out of it. Without a common strategy, organizations will often spend resources on similar tasks. A couple of systems wired together in different parts of the organization does not bring a Big Data platform to life. Therefore, organizations need a common strategy and must build the key infrastructure for the Big Data platform.

A common strategy brings a unified approach and centralizes some of the infrastructure pieces into the platform. The infrastructure team can design, deploy, and maintain the core pieces of the platform, while business units can leverage the platform and support the infrastructure team. A unified approach can help increase efficiency across different business units: different teams encounter the same problems and can transfer knowledge to one another. With a unified approach, teams can better master the systems that the organization decides to maintain.

11.3.11 Talent

Big Data requires a diverse set of skills, including communication, analytical skills, software engineering, operational experience, statistics, and business acumen. In addition to these skills, designing a Big Data platform requires understanding different Big Data systems, how they integrate, how they work together, their pain points, and a vision for their evolution. Although engineers can acquire some of these skills through self‐study, they have to learn a large portion on the job. Experience is decisive, since real‐life production experience means dealing with edge cases, failures, and sometimes catastrophes. Finding talent with some of these skills and retaining it is a big task.

A Big Data platform team needs a mix of skill sets; one person who has all the required skills is not a common scenario. It is rather important to find people who complement each other: one team member can be good at security, another at automation, and the team as a whole can design, deploy, and maintain Big Data systems with its collective wisdom. Effectively, management should target skill sets that do not exist or are not sufficient to handle the team's work. Management often has to sacrifice one or more skill sets when hiring; it is hard to find talent that perfectly fits the environment. Hence, organizations should look for a good match combined with a learning path.

Retaining existing talent has very complex dynamics and is out of the scope of this book. Nevertheless, having an environment in which to learn, try, and fail helps engineers excel at what they do. Good engineers take pride in their work, and giving them enough freedom along with supporting activities is another way to help. Teams that work well are a huge asset; spending time and budget on well‐functioning teams is never a loss but a considerable gain in the long run.

11.3.12 Budget

Budget is a defining factor for many of the systems and much of the talent for Big Data. Although it would be nice to have the flexibility to use any technology or hire any talent, organizations are restricted by budget in reality. The budget limits training for existing staff, recruitment, external consulting, and investment in additional services or resources. An organization has to keep existing Big Data systems running while introducing new ones.

The challenge for a Big Data platform is prioritizing tasks. Every system or team might seem to need many new resources; however, many savings can be found under scrutiny. There are two major ways to cut down the cost: automation and optimization. Many procedures need to run as part of infrastructure maintenance, provisioning, and so forth, and the more procedures a team can automate, the less work needs to be done. Likewise, many systems can be tweaked to run with less memory or CPU through proper configuration. With a combination of the two, the running cost of a Big Data platform can be lowered.
