Spring Cloud Function plays a critical role in the hybrid cloud or on-premises/private cloud space. Building events/data pipelines that span across the local datacenter and the cloud increases complexity due to the firewall boundaries. Data pipelines play an important role when you want to acquire, move, or transform data as it comes (streaming) or when it’s offline (batch).
So, what are event data pipelines?
4.1 Data Event Pipelines
All data regarding the incident is collated, processed, analyzed, and sent to various target systems, both internal and external to OnStar. This is an example of an event data pipeline.
A data pipeline is a set of processes that takes raw data from disparate sources, and then filters, transforms, and moves the data to target stores, where it can be analyzed and presented in a visually compelling way.
The data can be structured or unstructured. Unstructured data can come from a multitude of sources, from Internet-based platforms such as Twitter to automotive devices emitting telemetry.
Structured data usually originates from within the company’s applications. This data needs to be combined at a landing zone of object stores. These object stores can be on-premises or in the cloud. Once the data is ingested into the object stores, it is retrieved and transformed, usually through an ETL process, into other datastores such as data lakes or data warehouses. This data is then further analyzed using BI tools such as Power BI and Tableau, or further processed by AI/ML workloads.
This whole process—from data ingestion to consumption—is called the data pipeline.
Let’s dive a bit deeper into the various sub-processes outlined in the data pipeline process.
4.1.1 Acquire Data
Acquiring data is the first step in the data pipeline process. Here, business owners and data architects decide what data to use to fulfill the requirements for a specific use case.
For example, in the case of a collision event detection for OnStar, the data needs to be acquired from sensors. This sensor data then needs to be combined with data from internal and external partners, like finance, towing partners, rental partners, and so on.
Data Identification
| Data | Type of Data | Data Source |
|---|---|---|
| Sensor | Unstructured (JSON, CSV) | Sensors |
| Rental info | Relational (RDBMS) | External rental agency |
| User info | Relational (RDBMS) | OnStar registration |
| Vehicle info | Relational (RDBMS) | Vehicle data systems |
| Towing info | Relational (RDBMS) | External towing company |
4.1.2 Store/Ingest Data
The data from various sources is acquired and stored as raw data into the store of choice. The method of ingestion can be stream-based or batch-based.
For example, in the case of OnStar, sensor data is streamed at regular intervals or is event-based. Other data, such as rental info, can be either batch-based or on-demand, query-driven. The raw datastore can be an S3 object store hosted in the cloud or on-premises.
4.1.3 Transform Data
The data that is stored raw in the object store is then processed. The process may include converting or transforming unstructured data into structured data and storing it in an RDBMS database.
For example, in OnStar, partner data and internal data will be combined and transformed into a common data model. The sensor data will also be transformed into an RDBMS format.
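As an illustration of this kind of transformation, the sketch below parses a raw CSV sensor reading into a structured record that could be written to an RDBMS row. The field names and input format are assumptions for the example, not OnStar's actual schema:

```java
// Illustrative sketch: turn a raw CSV sensor reading into a structured
// record ready for relational storage. Field names are assumed.
public class SensorTransform {

    // Structured form of one sensor event.
    record SensorRecord(String vin, double speedKmh, boolean airbagDeployed) {}

    // Raw input format (assumed): "VIN,speed,airbagFlag", e.g. "1GKS193,87.5,1"
    static SensorRecord transform(String rawCsv) {
        String[] parts = rawCsv.split(",");
        return new SensorRecord(
                parts[0].trim(),
                Double.parseDouble(parts[1].trim()),
                "1".equals(parts[2].trim()));
    }

    public static void main(String[] args) {
        SensorRecord r = transform("1GKS193,87.5,1");
        System.out.println(r.vin() + " " + r.speedKmh() + " " + r.airbagDeployed());
    }
}
```

In a real pipeline, a function like `transform` would sit between the ingestion step and the load step, with the resulting record mapped to a table row.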
4.1.4 Load Data
Once the data is transformed, it is then loaded into a single or multiple databases with a common schema. This schema is based on a predefined data model specific to the use case. The target datastore can be a data lake or another RDBMS store. Here again, it depends on the type of analysis that needs to be done.
For example, if this is an OLTP type of analysis, the data needs to be processed and sent to requesting systems quickly. This would require an RDBMS store. Data that needs to be available for reporting and research can be stored in a data lake.
4.1.5 Analyze Data
During this sub-process, the data that is stored in a data lake or an RDBMS will be analyzed using tools such as Tableau, Power BI, or a dedicated web page for reporting.
In the case of OnStar, data that is stored in the data lake will be analyzed using Tableau or Power BI, while the data that needs immediate attention will be analyzed by a custom dashboard or reporting interface on the web.
Spring Cloud Function plays an integral role in the whole process, especially when combined with tools such as Spring Cloud Data Flow, AWS Glue, Google Cloud Dataflow, and so on. You will dive deep into these tools in this chapter.
4.2 Spring Cloud Function, Spring Cloud Data Flow, and Spring Cloud Stream
Spring Cloud Data Flow (SCDF) is a Spring.io-based product that supports the creation and deployment of a data pipeline. It supports batch and event-driven data, which makes it versatile. SCDF pipelines can be built programmatically or wired up through a GUI. It is heavily based on the Spring Framework and Java, which makes it very popular among Spring developers.
As you can see from the dashboard, you can build stream-based or batch-based (task) data pipelines and manage these through a single dashboard.
SCDF, unlike the other data pipeline tools available in the cloud, can be deployed in a Kubernetes, Docker, or Cloud Foundry environment, making it a portable tool for data pipeline development and deployment.
4.2.1 Spring Cloud Function and SCDF
Spring Cloud Function and SCDF are perfectly matched, as they are built out of the same framework, Spring. You can deploy Spring Cloud Function as a source, processor, sink, or as a trigger for the pipeline. Since the data pipelines are usually invoked sporadically for processing data, you can optimize utilization of resources and costs with a Spring Cloud Function.
Let’s look at a sample implementation of Spring Cloud Function with SCDF.
In this example, you will build a simple data pipeline using RabbitMQ as a source, do a simple transformation, and store the messages in a log. You will publish sample vehicle information into a RabbitMQ topic called VehicleInfo and do a simple transformation, then store it in a log file.
The flow will be:

RabbitMQ ➤ Transform ➤ Log
The prerequisites are:

- SCDF deployed on Kubernetes or locally in Docker
- Kubernetes or Docker
- A RabbitMQ cluster/instance
- A queue to publish messages
- Code from GitHub at https://github.com/banup-kubeforce/SCDF-Rabbit-Function.git
Additional prerequisites for each environment can be found at https://dataflow.spring.io/docs/installation
Local machine: Docker Compose is an easy way to get an instance of Spring Cloud Data Flow up and running. The details of the installation are available at https://dataflow.spring.io/docs/installation/local/docker/
Kubernetes: Instructions for installing this on Kubernetes are available at https://dataflow.spring.io/docs/installation/kubernetes/.
You can now access SCDF using the external IP. For example, http://20.241.228.184:8080/dashboard
The next step is to add applications. Spring provides a standard set of templates that you can use to build your pipeline.
Step 2: Add Applications to your SCDF Instance
Figure 4-8 shows the options for adding applications. Custom-built applications can be registered through the Registering One or More Applications option; you import the application coordinates from an HTTP URI or from a properties file. There is also an option to import prebuilt starters from Spring. The starters come in Maven- or Docker-based variants, bound to either RabbitMQ or Kafka. RabbitMQ and Kafka are used as backbone messaging systems for internal SCDF components and not for external use. When deploying to a Kubernetes cluster, you have to choose a Docker-based starter. When deploying locally in a Docker environment, you can choose either Maven- or Docker-based starters.
These prebuilt templates come in three categories—source, processor, and sink. They allow you to wire up a data pipeline without the need for coding. If you want a custom component, you can follow the examples in https://dataflow.spring.io/docs/stream-developer-guides/.
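As an illustration, a bulk-import properties file registers applications as `<type>.<name>=<uri>` pairs. The coordinates and versions below are assumptions for this sketch; check the current starter list on dataflow.spring.io before using them:

```
source.rabbit=docker:springcloudstream/rabbit-source-rabbit:3.2.1
processor.transform=docker:springcloudstream/transform-processor-rabbit:3.2.1
sink.log=docker:springcloudstream/log-sink-rabbit:3.2.1
```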
The next step is to create a stream using the starter templates you loaded.
Step 3: Create a Stream
1. Pick the source.
2. Pick a processor.
3. Pick a sink.
4. Configure RabbitMQ.
5. Configure the transform.
6. Configure the sink.
7. Wire up the data pipeline.
8. Deploy the stream.
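These steps can also be captured in SCDF's stream definition DSL from the Data Flow shell. This sketch assumes the prebuilt rabbit source, transform processor, and log sink are registered; the stream name and SpEL expression are illustrative:

```
stream create --name vehicle-pipeline \
  --definition "rabbit --queues=VehicleInfo | transform --expression=payload.toUpperCase() | log" \
  --deploy
```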
Step 4: Create a function to publish data to RabbitMQ
Here, you create a Spring Cloud Function to publish the data to start the data-pipeline process.
You need the following components:

- SenderConfig
- QueueSender
- SenderFunction
- A POM with the necessary dependencies for RabbitMQ and Spring Cloud Function, as shown in Listing 4-1
pom.xml with RabbitMQ Dependencies
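Listing 4-1 is not reproduced here, but a minimal sketch of the relevant dependencies would look like the following. Versions are managed by the Spring Boot and Spring Cloud BOMs; verify the artifacts against your build:

```xml
<dependencies>
    <!-- Expose Spring Cloud Function beans over HTTP -->
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-function-web</artifactId>
    </dependency>
    <!-- RabbitMQ support via Spring AMQP -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-amqp</artifactId>
    </dependency>
</dependencies>
```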
application.properties
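The application.properties listing is omitted here; a minimal sketch for a local RabbitMQ instance might look like the following. The `spring.rabbitmq.*` keys are standard Spring Boot properties; `queue.name` is an assumed custom property for this example:

```properties
spring.rabbitmq.host=localhost
spring.rabbitmq.port=5672
spring.rabbitmq.username=guest
spring.rabbitmq.password=guest
# Name of the queue used in this example (assumed custom property)
queue.name=VehicleInfo
```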
1. Create the SenderConfig Component
SenderConfig.java
2. Create the QueueSender Component
QueueSender.java
3. Wrap the Sender in Spring Cloud Function framework
SenderFunction.java
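The listings are omitted here, but the shape of the components can be sketched in plain Java. In the real project, QueueSender would delegate to Spring AMQP's RabbitTemplate and SenderFunction would be registered as a bean so Spring Cloud Function can expose it over HTTP; the class names follow the listing captions above, and everything else is an assumption:

```java
import java.util.function.Function;

// QueueSender: in the real project this wraps RabbitTemplate.convertAndSend;
// here it is an interface so the sketch stays self-contained.
interface QueueSender {
    void send(String message);
}

// SenderFunction: the Spring Cloud Function wrapper. Registered as a bean,
// it becomes reachable over HTTP as /senderFunction.
class SenderFunction implements Function<String, String> {
    private final QueueSender sender;

    SenderFunction(QueueSender sender) {
        this.sender = sender;
    }

    @Override
    public String apply(String vehicleJson) {
        sender.send(vehicleJson);
        return "Sent to VehicleInfo: " + vehicleJson;
    }
}

public class SenderSketch {
    public static void main(String[] args) {
        // Stand-in for RabbitMQ: collect sent messages in memory.
        StringBuilder queue = new StringBuilder();
        SenderFunction fn = new SenderFunction(queue::append);
        System.out.println(fn.apply("{\"vin\":\"1GKS193\"}"));
        System.out.println("queue holds: " + queue);
    }
}
```

Wiring the function as a bean (rather than calling RabbitTemplate directly from a controller) is what lets SCDF or any HTTP client trigger the publish without extra plumbing.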
Step 5: Test the function using Postman
Use a GET request in Postman and provide the URL to the senderFunction.
You should get a successful response.
Check the RabbitMQ queue for any messages.
Now you have a function that can post messages to a RabbitMQ topic. The SCDF data pipeline will be listening to the queue and will start processing.
You have seen how to create a data pipeline in SCDF that monitors a topic in RabbitMQ. You created a Spring Cloud Function that posts messages into the RabbitMQ topic. Spring Cloud Function can also be deployed as a source in SCDF; more information on how to develop code for SCDF is available on the SCDF site.
4.3 Spring Cloud Function and AWS Glue
AWS Glue works very similarly to Spring Cloud Data Flow in that you can wire up a data pipeline that has a source, processor, and sink. More information can be found at https://us-east-2.console.aws.amazon.com/gluestudio/home?region=us-east-2#/.
Spring Cloud Function can participate in the data pipeline process as a trigger, or simply by integrating with one of the components in the data pipeline.
For example, if you have AWS Kinesis as a source and you need to get data from a vehicle, you can have Spring Cloud Function stream the data that it gets into AWS Kinesis.
In the example in this section, you will be publishing data into AWS Kinesis and then kick off an AWS Glue job manually.
The flow will be:
Spring Cloud Function ➤ Kinesis ➤ AWS Glue Job ➤ S3
The prerequisites are:

- Subscription to AWS, with AWS Glue, Kinesis, and S3 available
- AWS Glue job with Kinesis as the source and S3 as the target
- Code from GitHub at https://github.com/banup-kubeforce/Kinesis_trigger.git
It is assumed that you have some knowledge of AWS Glue, as we do not delve into the details of this product. The focus is on creating the Spring Cloud Function.
4.3.1 Step 1: Set Up Kinesis
You can get to Kinesis at https://us-east-1.console.aws.amazon.com/kinesis/home?region=us-east-1#/home.
You can now connect and publish to the stream using your Spring Cloud Function.
4.3.2 Step 2: Set Up AWS Glue
You can access the AWS Glue Studio at https://us-east-1.console.aws.amazon.com/gluestudio/home?region=us-east-1#/.
Once you have the subscription, you can begin creating a Glue job.
Create a job called vehicledatapipeline. Use Amazon Kinesis as the source and Amazon S3 as the target. Ensure that you set the proper configurations for each of these components.
Now you have a job in AWS Glue that you can trigger manually or via a function.
4.3.3 Step 3: Create a Function to Load Data into Kinesis
The function consists of the following pieces:

- application.properties
- pom.xml dependencies
- TrackDetail.java
- ProducerService.java
- ProducerServiceImpl.java
- ProducerFunction.java
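The listings are not reproduced here; the following sketch shows their shape in plain Java. In the real project, ProducerServiceImpl would call the AWS SDK's KinesisClient.putRecord with the stream name and a partition key. The class names follow the captions above; the fields and JSON layout are assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// TrackDetail: the payload published into Kinesis (fields are assumed).
record TrackDetail(String vin, double latitude, double longitude) {
    String toJson() {
        return String.format(Locale.US,
                "{\"vin\":\"%s\",\"lat\":%.4f,\"lon\":%.4f}",
                vin, latitude, longitude);
    }
}

// ProducerService: the real implementation would call
// KinesisClient.putRecord(...) from the AWS SDK v2.
interface ProducerService {
    void publish(TrackDetail detail);
}

// ProducerFunction: what Spring Cloud Function would expose over HTTP POST.
class ProducerFunction {
    private final ProducerService service;

    ProducerFunction(ProducerService service) {
        this.service = service;
    }

    String apply(TrackDetail detail) {
        service.publish(detail);
        return "Data saved";
    }
}

public class KinesisSketch {
    public static void main(String[] args) {
        List<String> stream = new ArrayList<>();  // in-memory stand-in for Kinesis
        ProducerFunction fn = new ProducerFunction(d -> stream.add(d.toJson()));
        System.out.println(fn.apply(new TrackDetail("1GKS193", 42.3314, -83.0458)));
        System.out.println(stream.get(0));
    }
}
```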
8: Test with Postman. Run a POST-based test against ProducerFunction to publish data into Kinesis.
You will get a message that the data is saved.
10: Run Glue manually. From the Glue Studio, start the process by clicking Run, as shown in Figure 4-24.
In this section, you learned how to create a Spring Cloud Function that can post data into AWS Kinesis that is part of the data pipeline. You learned that you can publish data into Kinesis and trigger the AWS Glue pipeline manually, but I also encourage you to explore other ways you can implement Spring Cloud Function for AWS Glue, such as creating and deploying triggers. More information on how to create AWS Glue triggers in Spring is available at https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/examples-glue.html.
4.4 Spring Cloud Function and Google Cloud Dataflow
Google Cloud Dataflow is very similar to Spring Cloud Data Flow, in that it allows you to wire up a data pipeline with a source, processor, and sink. The Dataflow product is easier to develop with. You can read about Dataflow and its capabilities at https://cloud.google.com/dataflow.
For the example in this section, you will create a dataflow that includes cloud pub/sub:
Spring Cloud Function ➤ Dataflow {Cloud Pub/Sub ➤ Cloud Storage}
The prerequisites are:

- Subscription to Google Cloud Dataflow
- A Cloud Pub/Sub instance
- A Cloud Storage bucket
- Code from GitHub at https://github.com/banup-kubeforce/GooglePubSub
Step 1: Create and configure a Cloud Pub/Sub instance.
Before coming to this step, ensure that you are subscribed to Cloud Pub/Sub and that you have proper subscriptions to the APIs.
Now you are ready to build the Dataflow data pipeline.
Step 3: Create a data pipeline. Navigate to the Dataflow dashboard and create a pipeline. I created a pipeline using a prebuilt template.
1: Pick the template
2: Set the parameters for pub/sub and cloud storage.
3: Verify the creation of the data pipeline.
Step 4: Create the Spring Cloud Function.
The function consists of:

- MessageEntity class to formulate the message
- TrackDetail class as the entity class
- PubSubPublisher class that publishes data to the topic
- ProducerFunction class to implement the Spring Cloud Function
- Maven dependencies
- application.properties
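The application.properties listing is not shown; a minimal sketch might look like the following. The `spring.cloud.gcp.*` keys are standard Spring Cloud GCP properties; the topic property name and all values are assumptions for this example:

```properties
# Standard Spring Cloud GCP keys (values are examples)
spring.cloud.gcp.project-id=my-gcp-project
spring.cloud.gcp.credentials.location=file:/path/to/service-account.json
# Topic used by PubSubPublisher in this example (assumed custom property)
pubsub.topic=vehicle-topic
```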
1: Create the MessageEntity class.
MessageEntity.java
2: Create the TrackDetail class.
TrackDetail.java
3: Create the PubSubPublisher.
PubSubPublisher.java
4: Create the Spring Cloud Function.
ProducerFunction.java
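The four listings are omitted here; the following sketch shows their shape in plain Java. In the real project, PubSubPublisher would use Spring Cloud GCP's PubSubTemplate to publish to the topic. The class names follow the captions above; the fields are assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// MessageEntity: formulates the message body sent to Pub/Sub.
record MessageEntity(String id, String payload) {}

// TrackDetail: the vehicle entity being published (fields assumed).
record TrackDetail(String vin, String status) {}

// PubSubPublisher: the real implementation would call
// PubSubTemplate.publish(topic, message) from Spring Cloud GCP.
class PubSubPublisher {
    private final List<MessageEntity> topic;  // in-memory stand-in for a topic

    PubSubPublisher(List<MessageEntity> topic) {
        this.topic = topic;
    }

    void publish(MessageEntity m) {
        topic.add(m);
    }
}

// ProducerFunction: the Spring Cloud Function entry point.
class ProducerFunction implements Function<TrackDetail, String> {
    private final PubSubPublisher publisher;

    ProducerFunction(PubSubPublisher publisher) {
        this.publisher = publisher;
    }

    @Override
    public String apply(TrackDetail t) {
        publisher.publish(new MessageEntity(t.vin(), t.status()));
        return "published " + t.vin();
    }
}

public class PubSubSketch {
    public static void main(String[] args) {
        List<MessageEntity> topic = new ArrayList<>();
        ProducerFunction fn = new ProducerFunction(new PubSubPublisher(topic));
        System.out.println(fn.apply(new TrackDetail("1GKS193", "collision")));
        System.out.println("topic size: " + topic.size());
    }
}
```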
5: Run the application and test if a message is published to Cloud Pub/Sub.
6: Verify that the message has been loaded into Cloud Storage.
In this section, you learned how to use Spring Cloud Function to trigger a Google Cloud Dataflow-based data pipeline.
4.5 Summary
This chapter explained how to create dataflows and data pipelines, whether on-premises using SCDF or in the cloud. For the cloud, you can use SCDF or cloud-native tools.
Spring Cloud Function is versatile and can be used in the context of data pipelines as a trigger or as part of the flow.
With AWS Glue and Google Cloud Dataflow, you saw that you can use Spring Cloud Function as a trigger for the flows. This requires some additional coding by adding the relevant libraries and invoking the flow.
Upcoming chapters discuss other use cases of Spring Cloud Function.