If the label is discrete, the prediction problem is called classification. In general, the target can take only one of a fixed set of values for each record (although multivalued targets are possible, particularly for the text classification problems considered in Chapter 6, Working with Unstructured Data).
If the discrete values are ordered and the ordering is meaningful, such as Worse, Bad, Good, the discrete labels can be cast to integers or doubles and the problem reduces to regression (if a record falls between Worse and Bad, it is certainly farther from being Good than from being Worse).
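As a rough illustration, here is a minimal Scala sketch of such an ordinal encoding; the label set and the specific numeric values are hypothetical, not taken from the book's datasets:

```scala
// Map ordered categorical labels to doubles so the problem can be
// treated as regression (hypothetical label set and encoding).
val ordinalEncoding = Map("Worse" -> 0.0, "Bad" -> 1.0, "Good" -> 2.0)

val labels  = Seq("Bad", "Good", "Worse", "Good")
val encoded = labels.map(ordinalEncoding)   // Seq(1.0, 2.0, 0.0, 2.0)
```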
A generic metric to optimize is the misclassification rate, which is as follows:

$$\text{misclassification rate} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}\left(\hat{y}_i \neq y_i\right)$$

Here, $N$ is the number of records, $y_i$ is the true label, $\hat{y}_i$ is the predicted label, and $\mathbb{I}$ is the indicator function.
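A plain Scala sketch of this rate follows; the label sequences are assumed inputs for illustration only:

```scala
// Misclassification rate: the fraction of records whose predicted label
// differs from the true label.
def misclassificationRate(actual: Seq[String], predicted: Seq[String]): Double = {
  require(actual.length == predicted.length, "sequences must be the same length")
  val errors = actual.zip(predicted).count { case (a, p) => a != p }
  errors.toDouble / actual.length
}

// Two out of four predictions are wrong, so the rate is 0.5
misclassificationRate(Seq("Good", "Bad", "Good", "Worse"),
                      Seq("Good", "Good", "Bad", "Worse"))
```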
However, if the algorithm can predict the distribution of possible values for the target, a more general metric, such as the KL divergence or the Manhattan distance between distributions, can be used.
KL divergence is a measure of the information lost when a probability distribution $Q$ is used to approximate a probability distribution $P$:

$$D_{KL}(P \,\|\, Q) = \sum_{i} P(i)\, \log \frac{P(i)}{Q(i)}$$
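A small Scala sketch of the discrete KL divergence; the probability vectors are assumed to be aligned, normalized, and to have $Q(i) > 0$ wherever $P(i) > 0$:

```scala
// D_KL(P || Q) for discrete distributions given as probability vectors;
// terms with p(i) = 0 contribute nothing by convention.
def klDivergence(p: Seq[Double], q: Seq[Double]): Double =
  p.zip(q).collect { case (pi, qi) if pi > 0.0 => pi * math.log(pi / qi) }.sum

klDivergence(Seq(0.5, 0.5), Seq(0.9, 0.1))  // ≈ 0.51 nats lost by using Q in place of P
```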
It is closely related to the entropy gain split criterion used in decision tree induction, as the latter is the weighted sum, over all leaf (child) nodes of a split, of the KL divergences from each leaf's probability distribution to the parent node's probability distribution.
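The sketch below checks this relationship numerically, reusing the klDivergence helper above; the class counts are made up for illustration:

```scala
// Entropy of a discrete distribution, in nats
def entropy(p: Seq[Double]): Double =
  p.collect { case pi if pi > 0.0 => -pi * math.log(pi) }.sum

def normalize(counts: Seq[Double]): Seq[Double] = counts.map(_ / counts.sum)

// Hypothetical class counts in the parent node and in the two leaves of a split
val parentCounts = Seq(8.0, 8.0)
val leafCounts   = Seq(Seq(6.0, 2.0), Seq(2.0, 6.0))
val n            = parentCounts.sum
val parentDist   = normalize(parentCounts)

// Entropy gain: parent entropy minus the weighted entropy of the leaves
val gain = entropy(parentDist) -
  leafCounts.map(c => (c.sum / n) * entropy(normalize(c))).sum

// Weighted sum of KL divergences from each leaf distribution to the parent
val weightedKl = leafCounts
  .map(c => (c.sum / n) * klDivergence(normalize(c), parentDist))
  .sum

// gain and weightedKl both come out to about 0.131 nats
```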