Staging area

In a classic data warehouse, this zone is usually a database and/or a schema in it that used to hold a copy of the data from the source systems. The staging area is necessary because most of the time, data sources are not stored on the same server as the data warehouse. Even if they are on the same server, we prefer a copy of them for the following reasons:

  • Preserve data integrity. All data is copied over from a specific point in time. This ensures that we have consistency between tables.
  • We might need specific indexes that we could not create in the source system. When we query the data, we're not necessarily making the same links (joins) in the source system. Therefore, we might have to create indexes to increase query performance.
  • Querying the source might have an impact on the performance of the source application. Usually, the staging area is used to bring just the changes from the source systems. This prevents processing too much data from the data source.

Not to mention that the data source might be files: CSV, XML, and so on. It’s much easier to bring their content in relational tables. From a modern data warehouse perspective, this means storing the files in HDFS and separating them using dates.

In a modern data warehouse, if we’re in the cloud only, relational data can still be stored in databases. The only difference might be in the location of the databases. In Azure, we can use Azure SQL tables or Azure data warehouse.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
13.59.177.14