What is big data architecture?

The sheer volume of collected data can cause problems. As more and more data accumulates, managing and moving it, along with its underlying big data infrastructure, becomes increasingly difficult. The rise of cloud providers has made it easier to move applications to the data rather than the data to the applications. Multiple data sources result in increased volume, velocity, and variety. The following are some common computer-generated data sources:

  • Application server logs: Logs from applications and games
  • Clickstream logs: From website clicks and browsing
  • Sensor data: Weather, water, wind energy, and smart grids
  • Images and videos: Traffic and security cameras

Computer-generated data can vary from semi-structured logs to unstructured binaries. Pattern matching and correlation across these sources can generate recommendations, particularly for social networking and online gaming. You can also use computer-generated data to track application and service behavior, and combine it with content such as blogs, reviews, emails, and pictures to gauge brand perception.
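
To make the notion of semi-structured logs concrete, the following is a minimal Python sketch that parses one web-access log line into a structured record. The log format, field names, and sample line are hypothetical illustrations, not a format prescribed here:

    import re
    from datetime import datetime

    # Hypothetical access-log format: ip - [timestamp] "METHOD path ..." status
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) - \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
        r'(?P<status>\d{3})'
    )

    def parse_log_line(line: str) -> dict | None:
        """Extract structured fields from one log line; return None on mismatch."""
        match = LOG_PATTERN.match(line)
        if match is None:
            return None
        record = match.groupdict()
        record["ts"] = datetime.strptime(record["ts"], "%d/%b/%Y:%H:%M:%S %z")
        record["status"] = int(record["status"])
        return record

    print(parse_log_line('203.0.113.9 - [21/May/2024:10:15:32 +0000] "GET /cart HTTP/1.1" 200'))

Once every line parses into the same set of fields, downstream pattern matching and correlation become ordinary queries over structured records.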

Human-generated data includes emails, search queries, and social content; it feeds natural language processing, sentiment analysis of products or companies, and product recommendations. Social graph analysis can produce product recommendations based on your circle of friends, suggest jobs you may find interesting, or even generate reminders for your friends' birthdays, anniversaries, and so on.
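
As a concrete illustration of social graph analysis, here is a minimal sketch that recommends the products most popular among a user's direct friends. The graph, the purchase history, and the recommend function are hypothetical examples, not a production algorithm:

    from collections import Counter

    # Hypothetical friendship graph and purchase history.
    friends = {
        "alice": ["bob", "carol"],
        "bob": ["alice", "dave"],
        "carol": ["alice"],
        "dave": ["bob"],
    }
    purchases = {
        "bob": ["headphones", "keyboard"],
        "carol": ["headphones", "monitor"],
        "dave": ["mouse"],
    }

    def recommend(user: str, top_n: int = 2) -> list[str]:
        """Rank products bought by the user's direct friends by popularity."""
        counts = Counter(
            item
            for friend in friends.get(user, [])
            for item in purchases.get(friend, [])
        )
        return [item for item, _ in counts.most_common(top_n)]

    print(recommend("alice"))  # ['headphones', 'keyboard']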

In data architecture, the general flow of a big data pipeline starts with data and ends with insight. How you get from start to finish depends on a host of factors. The following diagram illustrates the data workflow pipeline that a data architecture needs to be designed around:

Big data pipeline for data architecture design

As shown in the preceding diagram, the standard workflow of the big data pipeline includes the following steps (a minimal code sketch follows the list):

  1. Data is collected (ingested) by an appropriate tool.
  2. The data is stored in a persistent way.
  3. The data is processed or analyzed. The processing/analysis solution takes the data from storage, performs operations on it, and then stores the processed data again.
  4. The processed data is then used by other processing/analysis tools, or by the same tool again, to extract further answers from the data.
  5. To make the answers useful to business users, they are visualized using a business intelligence (BI) tool or fed into a machine learning (ML) algorithm to make future predictions.
  6. Once the appropriate answers have been presented to users, the resulting insight allows them to make further business decisions.
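
The following minimal Python sketch walks through the six steps with in-memory stand-ins for real services (a list for ingestion, a dict for storage, print for visualization). All names and the clickstream records are illustrative assumptions, not specific pipeline tools:

    from collections import Counter

    # In-memory stand-in for a persistent data store (step 2).
    storage: dict[str, list[dict]] = {}

    def collect(events: list[dict]) -> list[dict]:
        """Step 1: ingest raw events from a source."""
        return events

    def store(key: str, records: list[dict]) -> None:
        """Step 2: persist records under a storage key."""
        storage[key] = records

    def process(key: str) -> None:
        """Steps 3 and 4: read from storage, aggregate, and store the result again."""
        counts = Counter(event["page"] for event in storage[key])
        store(key + ":processed", [{"page": p, "clicks": c} for p, c in counts.items()])

    def visualize(key: str) -> None:
        """Steps 5 and 6: present the answers so users can act on them."""
        for row in storage[key + ":processed"]:
            print(row)

    store("clicks", collect([{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]))
    process("clicks")
    visualize("clicks")

In a real pipeline, each function would be replaced by a dedicated tool, but the shape of the flow (collect, store, process, visualize) stays the same.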

The tools you deploy in your pipeline determine your time-to-answer: the latency between when your data is created and when you can derive insight from it. The best way to architect a data solution with latency in mind is to decide how to balance throughput against cost, because higher performance, and therefore lower latency, usually comes at a higher price.
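
As a back-of-the-envelope illustration of this trade-off, the sketch below compares two hypothetical pipeline configurations. All numbers are invented for illustration, not benchmarks:

    # Hypothetical configurations: batch waits for a scheduled run, so its
    # time-to-answer is long but cheap; streaming answers within seconds at
    # a higher per-gigabyte price.
    pipelines = {
        "nightly_batch": {"time_to_answer_s": 24 * 3600, "cost_per_gb_usd": 0.05},
        "streaming": {"time_to_answer_s": 5, "cost_per_gb_usd": 0.50},
    }

    for name, config in pipelines.items():
        print(f"{name}: time-to-answer {config['time_to_answer_s']} s "
              f"at ${config['cost_per_gb_usd']:.2f}/GB processed")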
