Book Description

The world’s most valuable resource is data. Companies across all industry verticals use data-driven insights as a key competitive advantage. But transforming raw data into insights can take days or weeks when you want it done in minutes or hours. Data scientists spend nearly 80% of their time on data engineering rather than developing insights. And most organizations can't scale their data science teams fast enough to keep up with growing business needs for better, faster insights.

This book will help data engineers, data scientists, and data team managers address these issues by building a self-service data science platform that gives everyone in the organization the ability to extract insights from data. Data scientists, software engineers, product managers, and marketers can use it to discover, transform, and analyze data and publish automated insights in production.

This book is not:

  • A deep dive into the “shiny new” technologies, or any one specific technology
  • A silver bullet technology for building a self-service portal. Organizations differ in their maturity, people, process, and technology and require tailored solutions

This book is:

  • A collection of must-have operational capabilities for building a self-service data portal
  • A blueprint for achieving better and faster insights
  • A process for democratizing data engineering expertise across an organization
  • A practical and indispensable guide for any decision-maker, implementer, or strategist working with an organization’s data science platform

Table of Contents

  1. Metadata Catalog Service
    1. Journey Map Context
      1. Understanding Datasets
      2. Analyzing Datasets
      3. Knowledge Scaling
    2. Minimizing Time to Interpret
      1. Extracting Technical Metadata
      2. Extracting Operational Metadata
      3. Gathering Tribal Knowledge
    3. Defining Requirements
      1. Technical Metadata Extractor Requirements
      2. Operational Metadata Requirement
      3. Tribal Knowledge Aggregator
    4. Implementation Service
      1. Source-specific Connectors Pattern
      2. Lineage Correlation Pattern
      3. Tribal Knowledge Pattern
    5. Summary
  2. Search Service
    1. Journey Map Context
      1. Feasibility Analysis of Business Problem
      2. Selecting relevant datasets for data prep
      3. Re-using existing artifacts for prototyping
    2. Minimizing Time to Find
      1. Indexing of datasets and artifacts
      2. Ranking results
      3. Access control
    3. Defining Requirements
      1. Indexer Requirements
      2. Ranking Requirements
      3. Access Control Requirements
      4. Non-functional Requirements
    4. Implementation Patterns
      1. Push-Pull Indexer Pattern
      2. Hybrid Search Ranking Pattern
      3. Catalog Access Control Pattern
    5. Summary
  3. Feature Store Service
    1. Journey Map Context
      1. Finding Available Features
      2. Training Set Generation
      3. Feature Pipeline for Online Inference
    2. Minimize Time to Featurize
      1. Feature Computation
      2. Feature Serving
    3. Defining Requirements
      1. Feature Computation
      2. Serving
      3. Non-functional Requirements
    4. Implementation
      1. Hybrid Feature Computation Pattern
      2. Feature Registry Pattern
    5. Summary
  4. Data Movement Service
    1. Journey Map Context
      1. Aggregating Data Across Sources
      2. Moving Raw Data to Specialized Query Engines
      3. Moving Processed Data to Serving Stores
      4. Exploratory Analysis Across Sources
    2. Minimizing Time to Data Availability
      1. Data Ingestion Configuration and Change Management
      2. Compliance
      3. Data Quality Verification
    3. Defining Requirements
      1. Ingestion Requirements
      2. Transformation Requirements
      3. Compliance Requirements
      4. Verification Requirements
      5. Non-functional Requirements
    4. Implementation Patterns
      1. Batch Ingestion Pattern
      2. Database Change Data Capture Ingestion Pattern
      3. Event Aggregation Pattern
    5. Summary