ETL – extract, transform, load

In any typical data-mining application, the data processing phase is broken into three stages: extract, transform, and load. This architectural pattern separates the three big concerns of a data-mining project. The reason is simple: most of the effort is always spent in cleaning and organizing the data. Because garbage in means garbage out, it becomes essential to ensure that the data we feed to a learning algorithm is clean. Here are brief descriptions of the three stages of an ETL pipeline.

Extract

At this stage, data is obtained from different data sources. For example, there could be web-server logs residing on the filesystem, customer data residing on a database server, or product data residing in a separate application altogether. So, in this stage, we fetch data from all of these sources.

In our case, every restaurant owner might also have uploaded a PDF or Word document to the Entree website. However, such a document is not well structured, since it is all free text, so we may need to extract data out of these files.
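As a rough illustration, a minimal extraction step might look like the following Python sketch. The directory path, database file, table, and column names are assumptions made for the example; they are not part of the Entree data.

```python
# Minimal extraction sketch (standard library only).
# Paths, table, and columns are illustrative assumptions.
import sqlite3
from pathlib import Path

def extract_web_logs(log_dir):
    """Read raw web-server log lines from every *.log file in a directory."""
    lines = []
    for log_file in Path(log_dir).glob("*.log"):
        lines.extend(log_file.read_text(encoding="utf-8").splitlines())
    return lines

def extract_customers(db_path):
    """Fetch customer rows from a relational store (SQLite here)."""
    with sqlite3.connect(db_path) as conn:
        cursor = conn.execute("SELECT id, name, city FROM customers")
        return cursor.fetchall()

if __name__ == "__main__":
    logs = extract_web_logs("logs/")         # filesystem source
    customers = extract_customers("crm.db")  # database source
    print(len(logs), "log lines;", len(customers), "customer rows")
```

In practice, each source would get its own extractor, and free-text documents such as the restaurant PDFs would need an additional text-extraction library before they can enter the pipeline.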

Transform

This stage involves transforming data into a common and consistent format. For example, the data from different sources might use different units (meters versus feet, or USD versus INR, and so on). Also, the data storage format might differ (CSV, XML, JSON, and so on). Several rules and transformers are applied in this stage to bring all the data sources into a common format, as shown in the sketch that follows.
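The following sketch shows what such transformation rules might look like. The field names, the feet-to-meters rule, and the choice of CSV and JSON as input formats are assumptions for the example only.

```python
# Transformation sketch: normalize units and parse heterogeneous formats.
# Field names and the unit rule are illustrative assumptions.
import csv
import io
import json

FEET_TO_METERS = 0.3048

def normalize_record(record):
    """Bring one record into the common format: distances in meters."""
    if record.get("distance_unit") == "ft":
        record["distance"] = float(record["distance"]) * FEET_TO_METERS
        record["distance_unit"] = "m"
    return record

def parse_any(raw, fmt):
    """Parse a record arriving as JSON or CSV into a plain dict."""
    if fmt == "json":
        return json.loads(raw)
    reader = csv.DictReader(
        io.StringIO(raw), fieldnames=["name", "distance", "distance_unit"]
    )
    return next(reader)

row = parse_any('{"name": "Cafe A", "distance": 300, "distance_unit": "ft"}', "json")
print(normalize_record(row))  # distance is now expressed in meters
```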

Another typical step in this stage is de-duplication (or entity resolution). Records from different sources might represent essentially the same product, but their representations could differ. For example, a product might have 10 features with slightly different descriptions in one data source, but only five features in another data source. A simple sketch of this idea follows.
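One very simple way to resolve such duplicates is to match records on a normalized name and keep the richer record. The matching rule and the field names below are assumptions for illustration; real entity resolution usually needs fuzzier matching.

```python
# De-duplication sketch: records whose normalized names match are treated
# as the same entity, and the record with more features wins.
def normalize_name(name):
    return "".join(ch for ch in name.lower() if ch.isalnum())

def merge_sources(source_a, source_b):
    """Merge two product lists, preferring the record with more features."""
    merged = {normalize_name(p["name"]): p for p in source_a}
    for product in source_b:
        key = normalize_name(product["name"])
        existing = merged.get(key)
        if existing is None or len(product.get("features", [])) > len(
            existing.get("features", [])
        ):
            merged[key] = product
    return list(merged.values())

a = [{"name": "Deep-Dish Pizza", "features": ["size", "crust"]}]
b = [{"name": "deep dish pizza",
      "features": ["size", "crust", "toppings", "price", "calories"]}]
print(merge_sources(a, b))  # one record survives: the richer of the two
```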

Load

Finally, once all the data is processed and brought into a single format, it is stored in a data store. This data store could be a sophisticated database or simply a plain text file. By this time, the system should have ensured that the data presented to a learning algorithm is free from missing or inconsistent values, and that it is in a format that is efficient to process.
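A minimal load step might simply write the cleaned records into a local database. The SQLite file, table, and columns below are illustrative assumptions; any database or flat file could play the same role.

```python
# Load sketch: persist cleaned, uniformly formatted records.
# The target store (SQLite) and schema are illustrative assumptions.
import sqlite3

def load(records, db_path="warehouse.db"):
    """Persist cleaned records into a single table, one row per record."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS products (name TEXT, distance REAL)"
        )
        conn.executemany(
            "INSERT INTO products (name, distance) VALUES (:name, :distance)",
            records,
        )

load([{"name": "Cafe A", "distance": 91.44}])
```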
