Index
A
- A/A testing, Sonia Mehta
- A/B testing, Sonia Mehta-Sonia Mehta
- abstractions, Julien Le Dem
- access, monitoring/logging/testing, Monitor, Log, and Test Access
- ACID (atomicity, consistency, isolation, and durability), Einat Orr-Einat Orr
- actors model, Message Passing/Actors Model
- Agile, Bob Haffner
- alerting, Logging and Alerting
- algorithmic logic, Jeff Magnusson
- algorithmic randomness, eliminating, Dr. Tianhui Michael Li
- Amazon Kinesis outage (2020), Matthew Housley, PhD-Further Reading
- Amazon Redshift, James Densmore
- Amazon S3, S3 Compatibility
- Amazon Web Services (AWS), Christiano Anderson
- analytics, microservice architecture and, Elias Nema-Elias Nema
- Apache Avro, From Generic Data Lake to Data Structure Store
- Apache Cassandra, Paul Brebner
- Apache Hive, Jonathan Seidman
- Apache Kafka
- Apache Spark
- application interface, Application Interface and Service Guarantee
- arrival time window (ATW) batching model, Arrival Time Window Batching Model
- arrival_timestamp, Raghotham Murthy
- assets, total opportunity cost of ownership, Joe Reis-Joe Reis
- automation
- autonomy, data engineering for, Jeff Magnusson-Bake Data-Flow Logic into Tooling and Infrastructure
- Avro, From Generic Data Lake to Data Structure Store
- AWS (Amazon Web Services), Christiano Anderson
B
- batch data pipeline, core problems in, Batch and Real-Time Systems
- batch processes, streaming versus, Dean Wampler, PhD-Dean Wampler, PhD, Marta Paes Moreira and Fabian Hueske
- batching model, for data pipelines, Raghotham Murthy-ATW and DTW Batching in the Same Pipeline
- behavioral patterns, Mukul Sood
- Bezos, Jeff, Joe Reis
- bias, systematic data issues and, Dr. Sandeep Uttamchandani
- big data, Ami Levin-Ami Levin
- big-data-focused engineering/engineers, Types of Data Engineering
- Bonewald, Silona, Emily Riederer
- business (algorithmic) logic, Jeff Magnusson
- business capabilities, structuring data science team around, Eric Colson
- business context, for data projects, Andrew Stevenson-Andrew Stevenson
- business data, on dashboards, Valliappa (Lak) Lakshmanan-Valliappa (Lak) Lakshmanan
- business goals, data platform production and, Align Your Product’s Goals with the Goals of the Business
- business metrics, ML project failures and, Dr. Sandeep Uttamchandani
- business rules, as checks on data quality, Emily Riederer
- buy-in, from stakeholders, Gain Feedback and Buy-in from the Right Stakeholders
C
- C10K problem, Solving the C10K Problem
- canonical data model, Boris Lublinsky
- CAP (consistency, availability, partition tolerance) theorem, Paul Doran-Paul Doran
- career building, Vijay Kiran-Vijay Kiran
- Cassandra, Paul Brebner
- caution, not relying on, Don’t Rely on Caution
- change data capture (CDC), Raghotham Murthy-Raghotham Murthy, Gunnar Morling
- change log, Raghotham Murthy
- checkpointing, Scott Haines
- close of books, ATW and DTW Batching in the Same Pipeline-ATW and DTW Batching in the Same Pipeline
- cloud technologies
- column names, as contracts, Emily Riederer-Emily Riederer
- commit log, distributed, Gunnar Morling
- compiled libraries, From Generic Data Lake to Data Structure Store
- compression (see data compression)
- computation component of data pipeline, Computation Component
- concurrency, Amazon Kinesis outage and, Matthew Housley, PhD-Further Reading
- consent metadata, Attach Consent Metadata
- constraints, haiku/software parallels, Understand the Constraints Up Front
- containers as a service, Rustem Feyzkhanov
- context (see data context)
- contracts, column names as, Emily Riederer-Emily Riederer
- controlled vocabulary, for naming fields, Emily Riederer
- coordination costs, Eric Colson
- copying data, Copying Data Creates Problems
- cost-per-byte, value-per-byte versus, Dhruba Borthakur
- costs
- creational patterns, Mukul Sood
- creativity, haiku/software parallels, Engage the Creative Side of Your Brain
- credit (recognition), for data engineering work, Jesse Anderson-Jesse Anderson
D
- DAG (directed acyclic graph), Mukul Sood-Mukul Sood, Emily Riederer
- dashboards, Valliappa (Lak) Lakshmanan-Valliappa (Lak) Lakshmanan
- data
- data analysis, change data capture and, Raghotham Murthy-Raghotham Murthy
- data collection, consensual/privacy-aware, Katharine Jarmul-Drop or Encrypt Sensitive Fields
- data communities, Emily Riederer-Emily Riederer, Gain Feedback and Buy-in from the Right Stakeholders, Tom Baeyens
- data compression, hidden costs of, Data Compression
- data consumers, cultivating good working relationships with, Ido Shlomo-Understand Consumers’ Jobs
- data context, data validation and, Emily Riederer-Emily Riederer
- data contracts, Establishing Data Contracts
- data culture, Emily Riederer-Emily Riederer
- data dictionary, as latent documentation, Emily Riederer
- data domains, problem solving and, Matthew Seal-Matthew Seal
- data engineering (generally)
- data engineers (generally)
  - career-building, Vijay Kiran-Vijay Kiran
  - data quality for, Katharine Jarmul-Katharine Jarmul
  - as data science enablers, Lewis Gavin-So What Am I Getting At?
  - data security for, Katharine Jarmul-Ask for Help
  - getting credit in organization, Jesse Anderson-Jesse Anderson
  - maintaining your mechanical sympathy, Tobias Macey
  - and mistakes, Bartosz Mikulski-Bartosz Mikulski
  - need for, Why the Need for Data Engineers?
  - privacy responsibilities, Stephen Bailey, PhD
  - professional responsibility towards user information, Lohit VijayaRenu-Watch Your Data Footprint
  - short-term goals versus longer-term considerations, Matthew Seal
  - skills needed for, Vijay Kiran
  - traditional data professionals versus, Why the Need for Data Engineers?
  - two types of, Types of Data Engineers
- data format, hidden costs of, Data Format
- data gravity, Tobias Macey
- data infrastructure, pitfalls of complexity, Matthew Seal
- data input/output (I/O), hidden costs of, Lohit VijayaRenu-Data Serialization
- data lake, Vinoth Chandar-Implementation
- data latency, Dhruba Borthakur
- data lineage, Julien Le Dem-Julien Le Dem
- data mesh, Barr Moses and Lior Gavish-The Final Link: Observability
- data migration, Nimrod Parasol
- data model contract, Agreeing on a Data Model Contract
- data mutiny, preventing, Sean Knapp-Sean Knapp
- data observability, Barr Moses-Introducing Data Observability, The Final Link: Observability
- data orchestration system, Embracing Data Silos
- data pipelines
  - automation of tests, Tom White-Make It Easy to Add More Tests
  - batching model for, Raghotham Murthy-ATW and DTW Batching in the Same Pipeline
  - common pitfalls of single-stage approach, Common Pitfalls
  - data latency and, Dhruba Borthakur
  - data quality testing, Katharine Jarmul-Katharine Jarmul
  - design patterns for reusability/modularity/extensibility, Mukul Sood-Mukul Sood
  - evolution with business growth, Chris Heinzmann-Chris Heinzmann
  - execution time issues, Rustem Feyzkhanov-Rustem Feyzkhanov
  - message queues, Scott Haines
  - ML project failures and, Dr. Sandeep Uttamchandani
  - need for business data on dashboards, Valliappa (Lak) Lakshmanan-Valliappa (Lak) Lakshmanan
  - setting foundations before writing code for, Meghan Kwartler-Meghan Kwartler
  - visualization as part of frontend, Emily Riederer
- data platforms
- data preparation, Clean Data == Better Model
- data privacy, Stephen Bailey, PhD-Stephen Bailey, PhD
- data processing, best practices for, Christian Lauer-Conclusion
- data products
- data provenance, Track Data Provenance
- data quality, Katharine Jarmul-Katharine Jarmul
- data science projects
- data science teams
- data scientists
- data security, Katharine Jarmul-Ask for Help
- data serialization, hidden costs of, Data Serialization
- data silos
- data stack
- data stores
- data structure frameworks, From Generic Data Lake to Data Structure Store
- data systems, implications of CAP theorem for, Paul Doran-Paul Doran
- data testing
- data time window (DTW) batching model, Data Time Window Batching Model
- data validation, Emily Riederer-Emily Riederer, Anthony Burdi-Anthony Burdi
- data value chain, Jesse Anderson
- data warehouses (DWHs), James Densmore-James Densmore, Gleb Mezhanskiy-Zero Maintenance
- data-engineering projects, ten must-ask questions for, Haidar Hadi-Question 10: Who Will Be Taking Over This Project?
- data-flow logic, Jeff Magnusson, Move the Logic to the Edges of the Pipelines
- database administration, Database Administration, ETL, and Such
- database-replication software, Ensure Transaction Security
- DataOps
- datasets
- data_timestamp, Raghotham Murthy
- Davenport, Thomas, Ami Levin
- decorator patterns, Mukul Sood
- Dehghani, Zhamak, Barr Moses and Lior Gavish
- design patterns, Mukul Sood
- DevOps, data observability and, Introducing Data Observability
- directed acyclic graph (DAG), Mukul Sood-Mukul Sood, Emily Riederer
- disaggregation, Disaggregated Data Stack
- discoverability, metadata services and, Discoverability
- distributed commit log, Gunnar Morling
- distributed computing, Adi Polak-Conclusions
- distributed data systems, Paul Doran-Paul Doran
- distributed shared memory models, Distributed Shared Memory Model
- division of labor, specialists and, Eric Colson
- documentation
- domain experts (subject matter experts)
- domain-driven design, Barr Moses and Lior Gavish
- DTW (data time window) batching model, Data Time Window Batching Model
- dual-writes, avoiding, Gunnar Morling-Gunnar Morling
- DWHs (data warehouses), James Densmore-James Densmore, Gleb Mezhanskiy-Zero Maintenance
E
- ELT (extract, load, transform)
- embedded collaboration, Embedded Collaboration at Its Heart
- encryption
- end users, input on data projects from, Andrew Stevenson
- ethics, in handling of user information, Lohit VijayaRenu-Watch Your Data Footprint
- ETL (extract, transform, load)
  - best practices for, Christian Lauer-Conclusion
  - data latency, Dhruba Borthakur
  - end of, Paul Singman-Taking the First Steps
  - from data scientist's perspective, Database Administration, ETL, and Such
  - implementing reusable patterns in, Implement Reusable Patterns in the ETL Framework
  - maintainability and, Chris Moradi-Chris Moradi
  - replacing with ITD, Replacing ETL with Intentional Data Transfer-Taking the First Steps
  - scaling, Chris Heinzmann-Chris Heinzmann
- European Union (EU), Katharine Jarmul
- Evans, Eric, Barr Moses and Lior Gavish
- event time, Marta Paes Moreira and Fabian Hueske
- eventual consistency, Denise Koessler Gosnell, PhD-Denise Koessler Gosnell, PhD
- execution time, Rustem Feyzkhanov
- extensibility, design patterns for, Mukul Sood-Mukul Sood
F
- facade patterns, Mukul Sood
- failure
- FAQ lists, Emily Riederer
- feedback, from stakeholders, Gain Feedback and Buy-in from the Right Stakeholders
- forecasting data, domain knowledge to interpret, Thomas Nield
- frontend, latent documentation for, Emily Riederer-Emily Riederer
- fundamental knowledge, Pedro Marcelino-Pedro Marcelino
- fundamentals, defined, Pedro Marcelino
H
- Hadoop, Disaggregated Data Stack, Jonathan Seidman, Joe Reis
- haiku approach to writing software, Mitch Seymour-Engage the Creative Side of Your Brain
- hardware, mechanical sympathy and, Tobias Macey-Tobias Macey
- Hariri, Hadi, Thomas Nield
- heroism, reasons to avoid, Don’t Be a Hero
- hidden costs, of data input/output, Lohit VijayaRenu-Data Serialization
- Hive, Jonathan Seidman, Why Does It Happen?
- hope, not relying on, Don’t Rely on Hope
- horizontal scalability, Paul Brebner
I
- infrastructure automation, Christiano Anderson-Christiano Anderson
- ingestion, automation of data validation and, Anthony Burdi
- innovation, data engineering for, Jeff Magnusson-Bake Data-Flow Logic into Tooling and Infrastructure
- integration, proprietary versus open source software, Paige Roberts
- intentional transfer of data (ITD), Replacing ETL with Intentional Data Transfer-Taking the First Steps
- interoperability, data warehouse, Interoperability
- inventory control, eventual consistency and, Denise Koessler Gosnell, PhD-Denise Koessler Gosnell, PhD
- issue resolution, Issue Resolution
- iteration, Eric Colson
L
- late data, Ariel Shaqed-Ariel Shaqed
- latencies
- latent documentation, Emily Riederer-Emily Riederer
- legacy data, data lakes and, Scott Haines-From Generic Data Lake to Data Structure Store
- libraries, compiled, From Generic Data Lake to Data Structure Store
- Lindy effect, Pedro Marcelino
- link attack, Stephen Bailey, PhD
- Linux, threads in, Operating System Threading
- listening, talking versus, Steven Finkelstein-Steven Finkelstein
- log-centric architectures, messages in, Boris Lublinsky-Boris Lublinsky
- logging, data test failure and, Logging and Alerting
- logical tests, for QA, Sonia Mehta
- long-term solutions, short-term needs versus, Joel Nantais-Joel Nantais
M
- machine learning (ML) projects, Dr. Sandeep Uttamchandani-Dr. Sandeep Uttamchandani
- machine learning teams, Matthew Seal
- maintainability, ETL tasks and, Chris Moradi-Chris Moradi
- maintenance, data warehouses and, Zero Maintenance
- MapReduce, MapReduce Algorithm
- Massachusetts General Hospital (MGH), Stephen Bailey, PhD
- mechanical sympathy, Tobias Macey-Tobias Macey
- message passing, Scott Haines-Scott Haines, Message Passing/Actors Model
- message queue as a service, Scott Haines
- messages, defining/managing in log-centric architectures, Boris Lublinsky-Boris Lublinsky
- messaging component of data pipeline, Messaging Component
- messaging systems, prioritizing user experience in, Jowanza Joseph-Jowanza Joseph
- metadata
- metadata services, Lohit VijayaRenu-Application Interface and Service Guarantee
- metrics
- MGH (Massachusetts General Hospital), Stephen Bailey, PhD
- microservice architecture
- minimum viable product (MVP) principle, Bob Haffner
- missingness, Emily Riederer-Emily Riederer
- mistakes, effect on data engineer's career, Bartosz Mikulski-Bartosz Mikulski
- ML (machine learning) projects, Dr. Sandeep Uttamchandani-Dr. Sandeep Uttamchandani
- modern data stack, modern metadata for, Prukalpa Sankar-Embedded Collaboration at Its Heart
- modularity, design patterns for, Mukul Sood-Mukul Sood
- monitoring, Bartosz Mikulski
- MVP (minimum viable product) principle, Bob Haffner
O
- object stores, S3 Compatibility
- observability, Barr Moses-Introducing Data Observability, The Final Link: Observability
- offset tracking, Scott Haines
- open source frameworks
- OpenLineage, Julien Le Dem
- operating system threading, Operating System Threading
- operational lineage, Julien Le Dem
- Orbitz Worldwide, Jonathan Seidman
- orchestration, Orchestrate, Orchestrate, Orchestrate
- OUTER joins, Elias Nema
P
- PACELC theorem, Paul Doran
- parallel processing, message passing and, Scott Haines-Scott Haines
- parallelization, data pipeline construction and, Rustem Feyzkhanov
- patterns, tools versus, Bas Geerdink-Bas Geerdink
- perfection, as enemy of the good, Bob Haffner-Bob Haffner
- pipelines (see data pipelines)
- practical tests, for QA, Sonia Mehta
- practices, tools versus, Bas Geerdink-Bas Geerdink
- predicate pushdowns, Julien Le Dem
- price elasticity, data warehouses and, Price Elasticity
- privacy, Katharine Jarmul-Drop or Encrypt Sensitive Fields, Stephen Bailey, PhD-Stephen Bailey, PhD
- problem solving, data domains and, Matthew Seal-Matthew Seal
- processing time, Marta Paes Moreira and Fabian Hueske
- product, data platform as, Barr Moses and Atul Gupte-Sign Off on Baseline Metrics for Your Data and How You Measure It
- professional identity, dangers of building around a single tool stack, Thomas Nield-Thomas Nield
- Project SWIFT, Steven Finkelstein-Steven Finkelstein
- projection pushdowns, Julien Le Dem
- proprietary software, open source versus, Paige Roberts-Paige Roberts
- pushdowns, Julien Le Dem
R
- random number generators, Dr. Tianhui Michael Li
- RDBMS (see relational databases)
- real-time dashboards, Valliappa (Lak) Lakshmanan-Valliappa (Lak) Lakshmanan
- recognition, for data engineering work, Jesse Anderson-Jesse Anderson
- recovery, repeatability and, Repeatability
- Redshift, James Densmore
- relational databases
- reliability, in data engineering context, Reliability
- remote procedure calls (RPCs), What Are Small Files, and Why Are They a Problem?
- repeatability, Repeatability
- replication lag, Raghotham Murthy
- reproducibility
- reproducible data science projects, engineering of, Dr. Tianhui Michael Li-Dr. Tianhui Michael Li
- return on investment (ROI), Rustem Feyzkhanov
- reusability, design patterns for, Mukul Sood-Mukul Sood
- root cause identification, for data test failure, Root Cause Identification
- RPCs (remote procedure calls), What Are Small Files, and Why Are They a Problem?
- rules, automated enforcement of, Anthony Burdi-Anthony Burdi
- runtime performance, speed of innovation versus, Chris Moradi
S
- S3, object stores and, S3 Compatibility
- scaling and scalability
- schema changes, Dr. Sandeep Uttamchandani
- schema management, Schema Management
- schemas, for NoSQL databases, Kirk Kirkconnell
- security
- sensors, low-cost, Dr. Shivanand Prabhoolall Guness-Dr. Shivanand Prabhoolall Guness
- serialization, hidden costs of, Data Serialization
- shared memory models, Distributed Shared Memory Model
- sharing of data, Thomas Nield-Thomas Nield
- short-term needs, long-term solutions versus, Joel Nantais-Joel Nantais
- shuffle mechanism, MapReduce Algorithm
- silos (see data silos)
- silver-bullet syndrome, Thomas Nield
- simplicity, in software writing, Keep It as Simple as Possible
- site reliability engineering (SRE) teams, Bartosz Mikulski
- small files
- SMEs (see subject matter experts)
- Smith, Adam, Eric Colson
- software engineering
- Spark (see Apache Spark)
- specialists, generalists versus, Eric Colson-Eric Colson
- speed, overoptimizing for, Speed
- SQL (Structured Query Language)
  - knowledge as prerequisite for data engineer career, Vijay Kiran
- SQL-focused engineering/engineers, Types of Data Engineering
- SRE (site reliability engineering) teams, Bartosz Mikulski
- staging tables, Create and Support Staging Tables
- stakeholders
- storage component of data pipeline, Storage Component
- storage layer, Julien Le Dem-Julien Le Dem
- streaming
- strong consistency, Denise Koessler Gosnell, PhD
- structural patterns, Mukul Sood
- structured data, continued relevance of, SQL and Structured Data Are Still In
- structured queries
- subject matter experts (SMEs)
- sustainability, data platform production and, Prioritize Long-Term Growth and Sustainability over Short-Term Gains
- Sweeney, Latanya, Stephen Bailey, PhD
T
- talking, listening versus, Steven Finkelstein-Steven Finkelstein
- tardy data, Ariel Shaqed-Ariel Shaqed
- TCO (total cost of ownership), Joe Reis
- teams
  - avoiding data mutiny, Sean Knapp-Sean Knapp
  - common mistakes of data analytics teams, Christopher Bergh-Do DataOps
  - cultivating good working relationships with data consumers, Ido Shlomo-Understand Consumers’ Jobs
  - data domains and problem solving, Matthew Seal-Matthew Seal
  - developing attitude and culture for, Joel Nantais
  - effect of growth and specialization on data quality, How Good Data Turns Bad
  - embedded collaboration and, Embedded Collaboration at Its Heart
  - generalists on, Eric Colson-Eric Colson
  - microservices and, Elias Nema-Elias Nema
  - potential problems in ML projects, Dr. Sandeep Uttamchandani-Dr. Sandeep Uttamchandani
  - replacing ETL with ITD, Replacing ETL with Intentional Data Transfer-Taking the First Steps
  - schema changes and, Dr. Sandeep Uttamchandani
  - two types of, Types of Data Engineers
  - user request handling, Amanda Tomlinson-Amanda Tomlinson
  - when data science team failed to produce value, Joel Nantais-Joel Nantais
- technology, total opportunity cost of ownership, Joe Reis-Joe Reis
- testing
- threads/threading
- time-based data, Ariel Shaqed-Ariel Shaqed
- TOCO (see total opportunity cost of ownership)
- tools
- total cost of ownership (TCO), Joe Reis
- total opportunity cost of ownership (TOCO), Joe Reis-Joe Reis
- transaction security, Ensure Transaction Security
- transformations
W
- wait times, Eric Colson
- WAL (write-ahead log), Raghotham Murthy
- watermarks, Marta Paes Moreira and Fabian Hueske
- Wealth of Nations, The (Smith), Eric Colson
- why questions, Bas Geerdink
- windowing, Dean Wampler, PhD
- working relationships, with data consumers, Ido Shlomo-Understand Consumers’ Jobs
- write-ahead log (WAL), Raghotham Murthy