Book Description

Companies working to become data-driven often view data scientists as heroes, but that overlooks the vital role that data engineers play in the process. While data scientists focus on finding new insights from datasets, data engineers deal with preparation: obtaining, cleaning, and creating enhanced versions of the data an organization needs. In this report, Andy Oram examines how the role of data engineer has quickly evolved.

DBAs, software engineers, developers, and students will explore the responsibilities of modern data engineers and the skills and tools necessary to do the job. You’ll learn how to deal with software engineering concepts such as rapid and continuous development, automation and orchestration, modularity, and traceability. Decision makers considering a move to the cloud will also benefit from the in-depth discussion this report provides.

This report covers:

  • Major tasks of data engineers today
  • The different levels of structure in data and ways to maximize its value
  • Capabilities of third-party cloud options
  • Tools for ingestion, transfer, and enrichment
  • Using containers and VMs to run the tools
  • Software engineering development
  • Automation and orchestration of data engineering

Table of Contents

  1. The Evolving Role of the Data Engineer
    1. Data Engineering Today
      1. Evaluation Process
      2. Business Intelligence (BI) and Serving the Analysts
      3. Example of Data Exploration
      4. Limitations on Data Use
      5. Data Is Different Today
    2. Structuring Data
      1. Fields, Columns, and Schemas
      2. Example: Duplication and Normalization
      3. Structured Storage Formats
      4. Data Warehouses and Data Lakes
      5. Database Options
      6. Access to Data Stores: SQL and APIs
      7. Cloud Storage
      8. Object and Tiered Storage
      9. Partitioning
    3. Choosing the Right Data Processing Engine
      1. Data Ingestion and Transfer: Message Brokers
      2. Streaming Data Processing
      3. Example Workflow for Streaming Tools
    4. Development Best Practices
      1. Common Development Tools
      2. Metrics and Evaluation
    5. Orchestration
      1. Resource Management
      2. Scheduling
      3. Fault Tolerance and Checkpoints
    6. Conclusion
  2. Appendix. Best Practices for Managing Resources
    1. Containers and Virtual Machines
    2. Operating System Support