The 3 V's
by Anuj Kumar
Architecting Data-Intensive Applications
The 3 V's stand for:
Volume
Variety
Velocity