2.5 Summary

■ Data sets are made up of data objects. A data object represents an entity. Data objects are described by attributes. Attributes can be nominal, binary, ordinal, or numeric.

■ The values of a nominal (or categorical) attribute are symbols or names of things, where each value represents some kind of category, code, or state.

■ Binary attributes are nominal attributes with only two possible states (such as 1 and 0 or true and false). If the two states are equally important, the attribute is symmetric; otherwise it is asymmetric.

■ An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known.

■ A numeric attribute is quantitative (i.e., it is a measurable quantity) represented in integer or real values. Numeric attribute types can be interval-scaled or ratio-scaled. The values of an interval-scaled attribute are measured in fixed and equal units. Ratio-scaled attributes are numeric attributes with an inherent zero-point. Measurements are ratio-scaled in that we can speak of values as being an order of magnitude larger than the unit of measurement.

■ Basic statistical descriptions provide the analytical foundation for data preprocessing. The basic statistical measures for data summarization include mean, weighted mean, median, and mode for measuring the central tendency of data; and range, quantiles, quartiles, interquartile range, variance, and standard deviation for measuring the dispersion of data. Graphical representations (e.g., boxplots, quantile plots, quantile–quantile plots, histograms, and scatter plots) facilitate visual inspection of the data and are thus useful for data preprocessing and mining.

■ Data visualization techniques may be pixel-oriented, geometric-based, icon-based, or hierarchical. These methods apply to multidimensional relational data. Additional techniques have been proposed for the visualization of complex data, such as text and social networks.

■ Measures of object similarity and dissimilarity are used in data mining applications such as clustering, outlier analysis, and nearest-neighbor classification. Such measures of proximity can be computed for each attribute type studied in this chapter, or for combinations of such attributes. Examples include the Jaccard coefficient for asymmetric binary attributes and Euclidean, Manhattan, Minkowski, and supremum distances for numeric attributes. For applications involving sparse numeric data vectors, such as term-frequency vectors, the cosine measure and the Tanimoto coefficient are often used in the assessment of similarity.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
52.15.233.135