Operationalizing the Data Lake
by Jon King, Holden Ackerman
Table of Contents
Acknowledgments
Foreword
Introduction
Overview: Big Data’s Big Journey to the Cloud
My Journey to a Data Lake
A Quick History Lesson on Big Data
The Second Phase of Big Data Development
Weather Update: Clouds Ahead
Bringing Big Data and Cloud Together
Commercial Cloud Distributions: The Formative Years
Big Data and AI Move Decisively to the Cloud, but Operationalizing Initiatives Lag
We Believe in the Cloud for Big Data and AI
1. The Data Lake: A Central Repository
What Is a Data Lake?
Data Lakes and the Five Vs of Big Data
Data Lake Consumers and Operators
Operators
Consumers (Both Internal and External)
Challenges in Operationalizing Data Lakes
2. The Importance of Building a Self-Service Culture
The End Goal: Becoming a Data-Driven Organization
Foster a Culture of Data-Driven Decision Making
Build an Organizational Structure That Supports a Self-Service Culture
Putting a Self-Service Technological Infrastructure in Place
Challenges of Building a Self-Service Infrastructure
Lack of Specialized Expertise
Disparity and Distribution of Data
Organizational Resistance
Reluctance to Commit to Open Source
3. Getting Started Building Your Data Lake
The Benefits of Moving a Data Lake to the Cloud
Key Benefit: The Ability to Separate Compute and Storage
When Moving from an Enterprise Data Warehouse to a Data Lake
Cloud Data Warehouse
Distributed SQL
How Companies Adopt Data Lakes: The Maturity Model
Stage 1: Aspiration—Thinking About Moving Away from the Data Warehouse
Stage 2: Experimentation—Moving from a Data Warehouse to a Data Lake
Stage 3: Expansion—Moving the Data Lake to the Cloud
Stage 4: Inversion
Stage 5: Nirvana
4. Setting the Foundation for Your Data Lake
Setting Up the Storage for the Data Lake
Immutable Raw Storage Bucket
Optimized Storage Bucket
Scratch Database
The Sources of Data
Getting Data into the Data Lake
Automating Metadata Capture
Data Types
Structured Data
Semi-Structured Data
Unstructured Data
Storage Management in the Cloud
Data Governance
5. Governing Your Data Lake
Data Governance
Privacy and Security in the Cloud
Security Governance
Financial Governance
A Deeper Dive into Why the Cloud Makes Solid Financial Sense
How to Mitigate Cloud Costs: Autoscaling
Spot Instances
Measuring Financial Impact
Qubole’s Approach to Autoscaling
6. Tools for Making the Data Lake Platform
The Six-Step Model for Operationalizing a Cloud-Native Data Lake
Step 1: Ingest Data
Step 2: Store, Monitor, and Manage Your Data
Step 3: Prepare and Train Data
The Importance of Data Confidence
Tools for Data Preparation
Step 4: Model and Serve Data
Tools for Deploying Machine Learning in the Cloud
Open Source Machine Learning Tools
Managed Machine Learning Services
Cloud Machine Learning Services
Step 5: Extract Intelligence
Tools for Extracting Intelligence
Getting Data Out of Your Data Lake
Presto for Ad Hoc Analytics
Step 6: Productionize and Automate
Tools for Moving to Production and Automating
Open Source Workflow Schedulers
ETL Managed Services
7. Securing Your Data Lake
Consideration 1: Understand the Three “Distinct Parties” Involved in Cloud Security
Consideration 2: Expect a Lot of Noise from Your Security Tools
Consideration 3: Protect Critical Data
Consideration 4: Use Big Data to Enhance Security
8. Considerations for the Data Engineer
Top Considerations for Data Engineers Using a Data Lake in the Cloud
Protect Your Users
Ensure That Data Governance Is in Place
Designate Areas for Raw and Optimal Data Storage
Considerations for Data Engineers in the Cloud
Summary
9. Considerations for the Data Scientist
Data Scientists Versus Machine Learning Engineers: What’s the Difference?
Data Scientist Use Cases
How a Data Scientist Begins a Project
Top Considerations for Data Scientists Using a Data Lake in the Cloud
10. Considerations for the Data Analyst
A Typical Experience for a Data Analyst
Top Considerations for Data Analysts Using a Data Lake in the Cloud
11. Case Study: Ibotta Builds a Cost-Efficient, Self-Service Data Lake
12. Conclusion
Best Practices for Operationalizing the Data Lake
General Best Practices