2.4. Sources of Data

Returning to the example of the report from Table 2.1, it is vital to understand how the fact table (i.e., Table 2.2) and other tables were filled in with data. For instance, in order to calculate the sales quantity for a given product, one must

  1. check which transactional data system the source data (i.e., the invoices or orders, on the basis of which one can specify the quantity of sales) are stored in.
  2. determine the interface between the system with the source data and the data warehouse and enter suitable data. Information from the given day should be added to the previous data to extend the history of sales by another day.
  3. aggregate the data from the invoices, which are single transactional business events, to show the results for a whole month; take, for instance, the first row (Sales_Quantity in May 2008 for PC ABC) in Table 2.2. To calculate profitability, you must integrate sales data with the data related to costs, which are downloaded from a financial controlling system.

Figure 2.2. A demonstration dimensional model (star model).

Source: author.

This simplified scenario shows the key issues connected with designing and using data warehouses. In general, there are the following sources of data:

  • Data files. These could be both textual and electronic (e.g., Excel files).
  • Transactional systems.10 These are systems that, by making repeatable transactions, support the basic business processes of a company. The most popular among them are information management systems such as the following:
    • Enterprise resource planning (ERP), which supports companies in terms of finance, sales and logistics, warehouse management, production, procurement, human resource management, and so on
    • Banking systems for handling customers’ accounts
    • Billing systems in telecommunications and energy companies
  • Internet data. These could be both data taken from Internet pages and, for example, data mined by linking to Internet services that automatically transfer suitable information in the form of a subscription
  • Other data warehouses.

It should be stressed that the sources of data may be located inside (and those are usually the most important) and also outside companies. The next chapter looks at where this information may come from.

Data from data sources are entered into a warehouse. This seemingly straightforward process usually consists of three phrases—extract-transform-load (ETL)—that are chronologically linked:

  1. Extraction of data from sources of data. The identification of the sources of data is a complex issue, and you have to specify precisely the information that should be taken to obtain the suitable data (facts and dimensions). You have to refer to unambiguous business terms and connect them with specific data in the system. In some situations, it may turn out that not all the necessary data are registered. As a result, many times you are forced to prepare the source yourself, which is a very costly additional task.
  2. Transformation of data to the form that allows for appropriate business application. This problem is related to, among other things, the integration of data:11
    • Integration of formats—that is, unification of formats of data coming from different systems. For instance, gender can be recorded in a number of different ways: m/f, M/F, male/female, and so on.
    • Semantic integration—that is, a uniform interpretation of data. The professional analysis of business requirements allows you to clarify, for instance, whether the sales are calculated on the basis of invoices or orders. If the sales are calculated on the basis of orders, it allows you to clarify whether this means orders received, carried out, and so on. If you are not precise at this stage and you do not work out an unambiguous definition of the term “sales,” the problem of different versions of truth may arise. This applies also to tables with dimensions, particularly if a source system is not integrated and thus does not have standardized master data. Usually, the problem of uniform and unambiguous definitions for all terms used in warehouse is related to the issue of representation of metadata.

      Moreover, this also includes any additional activities, including, among others, the verification of the quality of the data and its preliminary aggregation before its input into the warehouse. The verification of the quality of the data consists of checking if a given date meets previously defined criteria (e.g., you can check whether personal details, including the date of birth of employees, fit in a rational range of values). The preliminary aggregation of data is, for instance, the input not of single transactions from the store’s cash registers (i.e., receipts) but of summaries of the sales of each product for the whole day.

  3. Loading of data into a warehouse. Normally, the loading of data into the warehouse is not a single action but a repeated process, in which data are added to the warehouse at strictly defined time intervals, depending on business needs: once per day, once per month, and so on. Consequently, the warehouse contains the history of business events that enables you to scrupulously analyze their changes over time.

This information demonstrates the importance of properly administering a data warehouse. To sum up, you should pay special attention to the significant difference between source data and data stored into a warehouse. The first type of data concerns operational processing in day-to-day activities in an enterprise, whereas the second one is related to management information, often at a strategic level, which enables you to gain insights into the company’s performance from a number of angles.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.147.79.84