Data integration combines data from multiple sources to form a coherent data store. The common issues here are as follows:
Heterogeneous data: The sources share no common key
Different definitions: This is intrinsic, that is, the same data is defined differently across sources, for example under different database schemas
Time synchronization: This concerns whether the data was gathered over the same time periods
Legacy data: This refers to data left from the old system
Sociological factors: These limit what data can be gathered
There are several approaches that deal with the above issues:
Entity identification problem: Schema integration and object matching are tricky, because equivalent real-world entities from different sources have to be matched up. This is referred to as the entity identification problem.
Redundancy and correlation analysis: Some redundancies can be detected by correlation analysis. Given two attributes, such an analysis can measure how strongly one attribute implies the other, based on the available data.
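Tuple duplication: In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level, for example, where the same record appears more than once
Data value conflict detection and resolution: Attributes may differ on the abstraction level, where an attribute in one system is recorded at a different abstraction level than the same attribute in another system. For example, a sales total may be recorded per branch in one source and per region in another.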
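The redundancy and tuple duplication checks above can be illustrated with a minimal Python sketch. It assumes numeric attributes and uses the Pearson correlation coefficient, which is one common choice and not the only possible measure. The table, the attribute names (price_usd, price_cents), and the 0.95 threshold are hypothetical values chosen for illustration only.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric attributes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / sqrt(var_x * var_y)


def duplicate_tuples(rows):
    """Return each tuple that occurs more than once, mapped to its count."""
    counts = Counter(rows)
    return {row: n for row, n in counts.items() if n > 1}


# Hypothetical integrated table: (customer_id, price_usd, price_cents)
rows = [
    (1, 10.0, 1000),
    (2, 12.5, 1250),
    (3, 7.0, 700),
    (2, 12.5, 1250),  # same tuple loaded again from a second source
]

price_usd = [row[1] for row in rows]
price_cents = [row[2] for row in rows]

# Attribute-level redundancy: a coefficient near +1 or -1 suggests that
# one attribute can be derived from the other.
r = pearson(price_usd, price_cents)
REDUNDANCY_THRESHOLD = 0.95  # assumed cutoff, tune for the data at hand
is_redundant = abs(r) >= REDUNDANCY_THRESHOLD

# Tuple-level duplication: identical rows that appear more than once.
dupes = duplicate_tuples(rows)

print(f"correlation = {r:.3f}, redundant = {is_redundant}")
print(f"duplicate tuples = {dupes}")
```

Here price_cents is simply price_usd multiplied by 100, so the two attributes are perfectly correlated and one of them is a candidate for removal. A high correlation only flags a candidate; whether to drop an attribute remains a judgment call. For nominal attributes, a different measure would be needed.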