Data transformation and discretization

As noted in the previous section, certain data formats are better suited to specific data mining algorithms. Data transformation converts the original data into the format that a given data mining algorithm prefers as input, before that algorithm processes it.

Data transformation

Data transformation routines convert the data into forms appropriate for mining. The main routines are as follows:

  • Smoothing: This uses binning, regression, and clustering to remove noise from the data
  • Attribute construction: In this routine, new attributes are constructed from the given set of attributes and added to the data
  • Aggregation: In this routine, summary or aggregation operations are performed on the data
  • Normalization: Here, the attribute data is scaled so as to fall within a smaller range
  • Discretization: In this routine, the raw values of a numeric attribute are replaced by interval labels or conceptual labels
  • Concept hierarchy generation for nominal data: Here, attributes can be generalized to higher level concepts
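As an illustration of the smoothing routine listed above, the following is a minimal Python sketch of smoothing by bin means. It assumes equal-frequency bins over the sorted values; the function name smooth_by_bin_means and the sample prices are hypothetical and serve only as an example. Note that the result is returned in sorted order, not in the original order of the input.

```python
def smooth_by_bin_means(values, n_bins):
    """Sort the values, split them into equal-frequency bins,
    and replace every value with the mean of its bin."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    smoothed = []
    start = 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bin_values = ordered[start:end]
        if bin_values:
            mean = sum(bin_values) / len(bin_values)
            smoothed.extend([mean] * len(bin_values))
        start = end
    return smoothed


# Hypothetical sample data: nine prices split into three bins of three values.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Each value is pulled toward its neighbors in the same bin, which is how binning reduces noise. Smoothing by regression or clustering would replace the bin mean with a fitted value or a cluster representative.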

Normalization data transformation methods

To avoid dependence on the choice of measurement units for the attributes, the data should be normalized. This means transforming or mapping the data to a smaller or common range, so that all attributes carry equal weight afterward. There are many normalization methods. Let's have a look at some of them:

  • Min-max normalization: This performs a linear transformation on the original data and preserves the relationships among the original data values. It relies on the actual minimum and maximum values of the attribute, which are mapped to the two ends of the new range.
  • z-score normalization: Here, the values for an attribute are normalized based on the mean and standard deviation of that attribute. It is useful when the actual minimum and maximum of the attribute to be normalized are unknown.
  • Normalization by decimal scaling: This normalizes by moving the decimal point of the values of an attribute. The number of places moved depends on the maximum absolute value of that attribute.
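The three methods above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the z-score variant assumes the population standard deviation, and decimal scaling assumes the smallest shift that brings every absolute value below 1.

```python
import math


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min, max] to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map it to new_min.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]


def z_score(values):
    """Center on the mean and scale by the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that
    every scaled absolute value is below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


print(min_max([10, 20, 30]))               # [0.0, 0.5, 1.0]
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))   # mean 5, standard deviation 2
print(decimal_scaling([-986, 917]))        # divides by 1000
```

Min-max needs the observed minimum and maximum, so a new value outside that range falls outside the target range. The z-score and decimal scaling variants do not have that restriction.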

Data discretization

Data discretization transforms numeric data by mapping values to interval or concept labels. Discretization techniques include the following:

  • Data discretization by binning: This is a top-down unsupervised splitting technique based on a specified number of bins.
  • Data discretization by histogram analysis: In this technique, a histogram partitions the values of an attribute into disjoint ranges called buckets or bins. It is also an unsupervised method.
  • Data discretization by cluster analysis: In this technique, a clustering algorithm can be applied to discretize a numerical attribute by partitioning the values of that attribute into clusters or groups.
  • Data discretization by decision tree analysis: Here, a decision tree employs a top-down splitting approach; it is a supervised method. To discretize a numeric attribute, the method selects the value of the attribute that has minimum entropy as a split-point, and recursively partitions the resulting intervals to arrive at a hierarchical discretization.
  • Data discretization by correlation analysis: This employs a bottom-up approach by finding the best neighboring intervals and then merging them to form larger intervals, recursively. It is a supervised method.
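To make two of these techniques concrete, the following Python sketch shows unsupervised equal-width binning and a single entropy-based split of the kind used in decision tree discretization. It rests on several assumptions: the function names are hypothetical, bins are labeled by index, candidate split-points are taken as midpoints between adjacent distinct values, and only one split is computed. A full hierarchical discretization would apply best_split recursively to each resulting interval.

```python
import math
from collections import Counter


def equal_width_bins(values, n_bins):
    """Unsupervised: assign each value the index of its equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        if width == 0:
            idx = 0
        else:
            # Clamp so the maximum value lands in the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
        labels.append(idx)
    return labels


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(values, classes):
    """Supervised: return the split-point that minimizes the weighted
    class entropy of the two resulting intervals, or None if no split exists."""
    pairs = sorted(zip(values, classes), key=lambda p: p[0])
    n = len(pairs)
    best_score, best_point = float("inf"), None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal values
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        score = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if score < best_score:
            best_score = score
            best_point = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_point


print(equal_width_bins([5, 10, 15, 20, 25, 30], 3))   # [0, 0, 1, 1, 2, 2]
print(best_split([1, 2, 3, 10, 11, 12],
                 ["a", "a", "a", "b", "b", "b"]))     # 6.5
```

The binning function ignores class labels entirely, whereas best_split uses them to place the boundary where the classes separate most cleanly. That difference is what distinguishes the unsupervised and supervised methods in the list above.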