6
Machine Learning for Modulation Classification

6.1 Introduction

In Chapter 5 we listed a collection of signal features for modulation classification. Some of the classification decision making is based on multi-stage decision trees in which each stage utilizes a different feature. However, the need to design the decision tree and to optimize multiple decision thresholds makes this approach inconvenient.

To overcome these problems, various machine learning techniques have been employed to accomplish two major tasks in feature-based modulation classification. First, the machine learning techniques can provide a classification decision making process that is much easier to implement. Second, the machine learning techniques can reduce the dimension of the feature set. This is achieved by feature selection and feature generation, which enables the consideration of a more versatile feature set while maintaining the computational efficiency of the classifier.

In this chapter we first introduce two machine learning-based classifiers, namely k-nearest neighbour classifier and support vector machine classifier, for modulation classification in combination with the features listed in Chapter 5. Next, the issue of feature space dimension reduction is explored through different algorithms including linear regression, artificial neural network, genetic algorithm and genetic programming.

6.2 K-Nearest Neighbour Classifier

The k-nearest neighbour (KNN) classifier is a non-parametric algorithm which assigns a class to a testing signal by analyzing its k nearest reference signals in the feature space. It has been used to solve many different classification problems (Guo and Nandi, 2006; Guo, Zhang and Nandi, 2007; Espejo, Ventura and Herrera, 2010). There are three main steps in a KNN classifier.

6.2.1 Reference Feature Space

To enable KNN classification, a reference feature space must be established first. The feature space should include M reference values of each feature from each modulation class. The selection of M depends on the problem and is normally optimized heuristically. The motivation for a larger value for M is that the reference feature space provides a more accurate representation of the likely distribution of the testing signal features. On the other hand, a larger M-value is likely to impose a higher computational complexity in the later steps of the KNN classifier.

For modulation classification, Zhu et al. suggested the use of training data from the same signal source for the generation of reference feature values (Zhu, Aslam and Nandi, 2010). The advantage of this approach is that the training signal shares the same source as the testing signal, so the reference feature space is an accurate representation of the feature distribution of the testing signal. Meanwhile, the construction of the reference feature space is straightforward, as the only step required is to calculate the feature values for the training signals. However, because of the random nature of the training signals, the accuracy of the resulting feature space cannot be guaranteed.

Synthesized reference values offer more control over the construction of the reference feature space. Nevertheless, they require a hypothesized feature distribution, which is often unreliable.

6.2.2 Distance Definition

Since the classifier requires the evaluation of distances between the testing signal and the reference signals, a distance metric must be defined before a search of neighbouring reference signals can be achieved. There are many different metric systems that can be used for distance measurement in a KNN classifier. Euclidean distance is one of the most common distance metrics for KNN classifiers. Given a feature set $\{f_1, f_2, \ldots, f_L\}$ with $L$ features, the Euclidean distance between the feature sets of signals $A$ and $B$ is calculated as given in equation (6.1).

$$D(A,B)=\sqrt{\sum_{l=1}^{L}\left(f_{l}^{(A)}-f_{l}^{(B)}\right)^{2}} \tag{6.1}$$

Having established the distance measurement, the classification decision is achieved by finding the k nearest reference samples and analyzing the class membership of these k samples.

6.2.3 K-Nearest Neighbour Decision

When the distances between the test signal and all reference signals have been evaluated, the k reference signals with the smallest distances are recorded as the k nearest neighbours. The selection of the value of k should follow these rules.

  1. The value should ideally be a prime number, to avoid the case where the k neighbours consist of equal numbers of reference signals from different classes.
  2. The value should be less than the total number of reference signals from a signal class.
  3. The value of k should be big enough to avoid false classification caused by outliers.

The value of k can be optimized heuristically, since it has been shown that the classification accuracy does not vary much when k lies within a reasonable range. The final classification result is obtained by finding the class shared by the majority of the k nearest neighbours. This class is assigned to the testing signal as the classification result. A pseudo code for the KNN classifier implementation is given below.
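The following Python sketch illustrates this procedure; the use of NumPy, the synthetic feature values and the class labels are illustrative assumptions rather than a prescription from the original studies.

import numpy as np
from collections import Counter

def knn_classify(test_features, ref_features, ref_labels, k=5):
    """Assign a modulation class to a test signal by majority vote among its
    k nearest reference feature vectors (Euclidean distance)."""
    # Euclidean distances between the test feature vector and every reference vector
    distances = np.linalg.norm(ref_features - test_features, axis=1)
    # Indices of the k nearest reference signals
    nearest = np.argsort(distances)[:k]
    # Majority class among the k nearest neighbours
    votes = Counter(ref_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Example: reference feature space with M = 3 reference signals per class
ref_features = np.array([[0.10, 1.2], [0.20, 1.1], [0.15, 1.3],   # class 'BPSK'
                         [0.90, 0.2], [1.00, 0.3], [0.95, 0.1]])  # class 'QPSK'
ref_labels = np.array(['BPSK'] * 3 + ['QPSK'] * 3)
print(knn_classify(np.array([0.2, 1.0]), ref_features, ref_labels, k=3))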

The KNN classifier is non-parametric and capable of multi-class classification. However, it suffers as the number of features increases, which raises the dimension of the feature space and the complexity of the distance calculation. Therefore, some form of dimension reduction is needed to make this method viable. Another disadvantage of the KNN classifier is that the features' contributions to the classification decision are not weighted. Therefore, a feature with a relatively sparse distribution across different modulations may come to dominate the distance evaluation, and the classification of modulations that rely on other features may suffer.

6.3 Support Vector Machine Classifier

Support vector machine (SVM) provides another way to achieve classification in the existing multi-dimensional feature space. It has been adopted for the classification of many different data sets (Mustafa and Doroslovacki, 2004; Polat and Güneş, 2007; Akay, 2009). SVM achieves classification by finding the hyperplane that separates data from different classes. The hyperplane, meanwhile, is optimized by maximizing its distance to the signal samples on each side of the hyperplane. Depending on the nature of the signal being classified, the SVM classifiers can be divided into linear and non-linear versions.

The linear SVM classifiers have linear kernels. The kernel is defined by equation (6.2),

$$K(\mathbf{w},\mathbf{x})=\mathbf{w}^{T}\mathbf{x}=\sum_{k=1}^{K}w_{k}x_{k} \tag{6.2}$$

where $\mathbf{x}=[x_{1},\ldots,x_{K}]$ is the input feature vector and $\mathbf{w}=[w_{1},\ldots,w_{K}]$ is the weight vector to be optimized. The kernel defines a linear separation hyperplane (Theodoridis, 2008), as given in equation (6.3),

$$g(\mathbf{x})=\mathbf{w}^{T}\mathbf{x}+w_{0} \tag{6.3}$$

where $w_{0}$ is a constant. The classification of a two-class problem (between modulation candidates $\mathcal{M}^{(a)}$ and $\mathcal{M}^{(b)}$) is achieved simply by using the sign of $g(\mathbf{x})$, as shown in equation (6.4).

$$\hat{\mathcal{M}}=\begin{cases}\mathcal{M}^{(a)}, & g(\mathbf{x})>0\\ \mathcal{M}^{(b)}, & g(\mathbf{x})<0\end{cases} \tag{6.4}$$

To obtain the weights through training, the following optimization [equations (6.5) and (6.6)] is performed:

$$\min_{\mathbf{w},\,w_{0}}\ \frac{1}{2}\left\|\mathbf{w}\right\|^{2} \tag{6.5}$$
$$\text{subject to}\quad y_{i}\left(\mathbf{w}^{T}\mathbf{x}_{i}+w_{0}\right)\geq 1,\qquad i=1,\ldots,N \tag{6.6}$$

where $y_{i}$ is the class indicator for the $i$th feature vector ($+1$ for $\mathcal{M}^{(a)}$ and $-1$ for $\mathcal{M}^{(b)}$). An illustration of an SVM for a two-class problem is given in Figure 6.1.


Figure 6.1 Two-class feature space with linear support vector machine.

The non-linear version of the SVM classifier shares the same training and classification process, except that the linear kernel used for the hyperplane is replaced by a non-linear kernel. Our past experiments have shown that a polynomial kernel is sufficient to provide effective classification. The polynomial kernel is given by equation (6.7),

$$K(\mathbf{x}_{i},\mathbf{x}_{j})=\left(\mathbf{x}_{i}^{T}\mathbf{x}_{j}+1\right)^{d} \tag{6.7}$$

where d is the degree of the polynomial.

A general procedure of the SVM classifier for AMC is given below.
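As a sketch of this general procedure, the following fragment trains linear and polynomial SVM classifiers on labelled feature vectors and classifies a test signal. The use of scikit-learn's SVC and the synthetic feature values are assumptions made for illustration; the original work does not prescribe a particular implementation.

import numpy as np
from sklearn.svm import SVC

# Training feature vectors (rows) extracted from signals of two modulation candidates
X_train = np.array([[0.1, 1.2], [0.2, 1.1], [0.9, 0.2], [1.0, 0.3]])
y_train = np.array([+1, +1, -1, -1])   # +1 for modulation (a), -1 for modulation (b)

# Linear-kernel SVM: optimizes the separating hyperplane w.x + w0 = 0
linear_svm = SVC(kernel='linear').fit(X_train, y_train)

# Polynomial-kernel SVM of degree d = 2 for features that are not linearly separable
poly_svm = SVC(kernel='poly', degree=2).fit(X_train, y_train)

# Classification of a test signal uses only the trained hyperplane,
# not the training data themselves
x_test = np.array([[0.15, 1.0]])
print(linear_svm.predict(x_test), poly_svm.predict(x_test))

The key point is that prediction relies only on the optimized hyperplane, which is why the testing stage is inexpensive once training is complete.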

Compared with the KNN classifier, the SVM classifier only needs the training signals when establishing the separating hyperplane. Once the hyperplane is optimized, there is no need to involve the training signals in any further calculation. The benefit is that the computation needed at the testing stage is relatively inexpensive compared with KNN. However, the SVM classifier is most natural for two-class classification. Multi-class classification with SVM is possible, but its implementation is much less intuitive than the two-class case.

6.4 Logistic Regression for Feature Combination

For both KNN and SVM classifiers, it is always preferable to have as many features as possible to improve the classification accuracy. However, both classifiers suffer when the number of features increases, which is why reducing the feature space dimension is necessary. Using machine learning algorithms, there are two ways to do so. First, the feature space dimension can be reduced by eliminating features that make little or no contribution to the classification task. Second, it can be reduced by combining the existing features into fewer new features.

While feature selection is an effective way to reduce the complexity of a feature-based modulation classifier, the elimination of a feature can sometimes be destructive; when all features are useful to some degree, removing any of them degrades the classification performance. In this case, a more conservative approach to dimension reduction is needed. That is why feature combination has been considered, not just to reduce the feature dimension but also to enhance the performance of the features.

To begin with, a linear combination of the features is the simplest, but often very effective, way of combining them. Assuming we are combining K existing features into a single new feature, the linear combination is given by equation (6.8),

$$f_{\text{new}}=w_{0}+\sum_{k=1}^{K}w_{k}f_{k} \tag{6.8}$$

where $w_{k}$ is the weight of the $k$th feature $f_{k}$, $w_{0}$ is a constant, and $K$ is the total number of features available for combination. The weights are optimized through logistic regression, which aims to maximize the difference in the new feature value between different classes. It has been adopted by Zhu et al. for dimension reduction of distribution-based features (Zhu, Nandi and Aslam, 2013).

There are two common logistic regression tools in the family of generalized linear regression algorithms, namely binomial logistic regression and multinomial logistic regression. Binomial logistic regression projects the signal using a logistic function $p(\cdot)$ such that $p(\cdot)$ equals 1 when the signal is modulated with $\mathcal{M}^{(i)}$ and 0 when it is modulated with $\mathcal{M}^{(j)}$. The logistic function is given in equation (6.9),

$$p(\mathbf{f})=\frac{e^{g(\mathbf{f})}}{1+e^{g(\mathbf{f})}} \tag{6.9}$$

where $\mathbf{f}=[f_{1},\ldots,f_{K}]$ is the collection of existing features and $g(\cdot)$ is the logit function, the inverse of the logistic function $p(\cdot)$, given by equation (6.10).

$$g(\mathbf{f})=\ln\frac{p(\mathbf{f})}{1-p(\mathbf{f})}=B_{0}+\sum_{k=1}^{K}B_{k}f_{k} \tag{6.10}$$

The parameters $B_{0}$ and $B_{k}$ are often estimated using iterative processes such as the Newton–Raphson method (Hosmer and Lemeshow, 2000). The resulting estimates can then be substituted for the weights in equation (6.8).
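As an illustrative sketch (not the specific procedure of Zhu, Nandi and Aslam, 2013), the fragment below fits a binomial logistic regression with scikit-learn, which estimates $B_{0}$ and $B_{k}$ by an iterative solver, and then uses the fitted coefficients as the weights of the linear combination in equation (6.8). The feature values are synthetic.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Raw feature vectors from training signals of two modulations (synthetic values)
F_train = np.array([[2.1, 0.3, 1.0], [1.9, 0.4, 1.2],   # modulation (i)
                    [0.5, 1.8, 0.2], [0.6, 1.7, 0.1]])  # modulation (j)
y_train = np.array([1, 1, 0, 0])   # p(.) -> 1 for modulation (i), 0 for (j)

model = LogisticRegression().fit(F_train, y_train)

# The fitted intercept and coefficients play the role of w0 and wk in (6.8):
# the combined feature is the linear score w0 + sum_k wk * fk
w0, w = model.intercept_[0], model.coef_[0]
combined_feature = w0 + F_train @ w
print(combined_feature)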

Logistic regression provides a basic tool for feature selection and combination. However, it is not always well suited to multi-class classification; it is sometimes better to divide the classification into multiple steps.

6.5 Artificial Neural Network for Feature Combination

Nandi and Azzouz pioneered the use of machine learning for modulation classification by introducing the artificial neural network (ANN) for improved decision making (Nandi and Azzouz, 1997). They first proposed to use the ANN algorithm in conjunction with signal spectral-based features for the classification of analogue and digital modulations. It was afterwards adopted by many other researchers as a tool for feature selection and combination.

Unlike the decision tree given in Section 5.2, an ANN classifier does not separate the decision making into multiple stages. Instead, it can be used to consolidate the existing features and create a non-linear mapping of these features to afford new features of reduced dimensionality and enhanced performance.

For a single-layer perceptron, the trained network is similar to a linear combination of the input features and can be represented as equation (6.11).

$$y=w_{0}+\sum_{k=1}^{K}w_{k}f_{k} \tag{6.11}$$

The multi-layer perceptron (MLP) is one of the most popular mapping schemes because of its simplicity and efficient hardware implementation. The MLP is a feed-forward structure of interconnected non-linear parallel computing units called neurons. Inputs are propagated through the network layer by layer, and the MLP gives a non-linear mapping of the inputs at the output layer. Used for feature combination, the MLP can be expressed as equation (6.12),

$$y_{k}=\varphi\left(\sum_{j}w_{kj}\,\varphi\left(\sum_{i}w_{ji}x_{i}\right)\right) \tag{6.12}$$

where $y_{k}$ is the output of the MLP network, that is, the feature combination being optimized at the $k$th output node, $\varphi$ is the activation function, $w_{ij}$ is the weight from neuron $j$ to neuron $i$, and $x_{j}$ is the $j$th input feature from the feature set. A visual illustration of the MLP network for feature combination is given in Figure 6.2.


Figure 6.2 Multilayer perceptron neural network for AMC feature combination.

Azzouz and Nandi (1995) adopted back propagation (BP) to train the weights. The BP algorithm trains the weights through an iterative process by calculating the change of each weight with respect to the error function E. As an example, with a mean squared error (MSE) function the chain rule gives equation (6.13),

$$\frac{\partial E}{\partial w_{ij}}=\frac{\partial E}{\partial y_{i}}\,\frac{\partial y_{i}}{\partial u_{i}}\,\frac{\partial u_{i}}{\partial w_{ij}} \tag{6.13}$$

where yi is the output and ui is the weighted sum of the inputs of neuron i. The weight value can then be updated using a gradient descent approach as shown in equation (6.14),

$$\Delta w_{ij}=-\varepsilon\,\frac{\partial E}{\partial w_{ij}} \tag{6.14}$$

where ε is the learning rate, which dictates the speed of convergence. A high learning rate gives faster convergence but carries the risk of oscillation, whereas a low learning rate requires many more iterations to reach convergence.
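The following single-neuron sketch illustrates the update rule of equations (6.13) and (6.14), assuming a logistic activation and a squared error on one training sample; a full MLP applies the same rule layer by layer through back propagation. The feature values, target and learning rate are illustrative assumptions.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# One illustrative neuron combining K = 3 raw features into a single new feature
x = np.array([0.8, 0.1, 0.3])      # input features (synthetic values)
t = 1.0                            # target output for this training signal
w = np.array([0.2, -0.1, 0.05])    # weights, randomly initialized in practice
w0, eps = 0.0, 0.5                 # bias and learning rate (epsilon in 6.14)

for _ in range(100):
    u = w0 + w @ x                 # weighted sum of inputs (u_i in 6.13)
    y = sigmoid(u)                 # neuron output (y_i in 6.13)
    # Chain rule of (6.13) for the squared error 0.5*(t - y)^2 with a sigmoid
    delta = (y - t) * y * (1.0 - y)
    w  -= eps * delta * x          # gradient-descent update, equation (6.14)
    w0 -= eps * delta

print(sigmoid(w0 + w @ x))         # combined feature value after training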

6.6 Genetic Algorithm for Feature Selection

To overcome the issue of high dimensionality in the feature space, Wong and Nandi suggested the use of a genetic algorithm (GA) as a tool for reducing the number of features (Wong and Nandi, 2004). They used binary strings to represent the selection of different features. If there are five existing features, a binary string example could be 11000, which means that the first two features are selected for classification and the last three features are neglected.

The training of such binary strings begins with a randomly generated string. According to the initial binary string, features are selected for modulation classification with some training data. The resulting classification performance achieved by these selected features is then used as a criterion for evaluating the performance of the binary string. Based on their performance, better binary strings are selected for the evolutionary process of producing new binary strings that migrate toward the optimal solution or optimal selection of features. The two genetic operators used are crossover and mutation.

For crossover, we assume there are two parent binary strings, 11011 and 01000. The crossover would randomly choose equal numbers of bits in both parents and swap their values. In the given case, if the first four digits are selected then the ‘children’ of the crossover operation would be 01001 and 11010, and these represent two new sets of selected features (Figure 6.3).


Figure 6.3 Crossover in genetic algorithm.

Meanwhile, mutation utilizes only one parent, for example, 11011. The operation selects random digits in the parent string and generates a random value for each selected digit. Using the example, if the mutation operation selects the first, third, and fourth digits of the binary string, the resulting child string becomes 01111. Since each new value is randomly generated, it may be the same as the parent value, as seen in the fourth digit, or different, as seen in the first and third digits (Figure 6.4).


Figure 6.4 Mutation in genetic algorithm.

The processes of fitness evaluation, parent selection, and reproduction are repeated for a predefined number of generations, after which the GA is terminated. Termination can also be triggered if the average or best fitness in the current generation reaches a predefined threshold or the improvement over the last few generations becomes lower than a predefined threshold.

In the end, the binary strings in all generations are ranked by their fitness. The string with the highest fitness is selected as the final product of the GA process, and the features are then selected according to that binary string. It is worth mentioning that the GA process can be highly random because of the random initialization and mutation operations. It is therefore sometimes recommended to repeat the GA process several times to produce a few different feature selections, from which the best can be determined by a further test.
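A minimal sketch of such GA-based feature selection is given below. The population size, generation count, crossover probability and, in particular, the placeholder fitness function are assumptions made purely for illustration; in practice the fitness of a binary string is the classification accuracy obtained with the features it selects.

import random

N_FEATURES, POP_SIZE, N_GEN = 5, 20, 30

def fitness(bits):
    # Placeholder fitness: in practice, classify a set of training signals using
    # only the features with bits[i] == 1 and return the classification accuracy.
    # Here features 0 and 1 are assumed, purely for illustration, to be the useful ones.
    return bits[0] + bits[1] - 0.1 * sum(bits[2:])

def crossover(p1, p2):
    # Swap the values at randomly chosen positions between the two parents
    positions = random.sample(range(N_FEATURES), k=random.randint(1, N_FEATURES - 1))
    c1, c2 = p1[:], p2[:]
    for i in positions:
        c1[i], c2[i] = p2[i], p1[i]
    return [c1, c2]

def mutate(p):
    # Regenerate random values at randomly selected positions of a single parent
    c = p[:]
    for i in random.sample(range(N_FEATURES), k=random.randint(1, N_FEATURES)):
        c[i] = random.randint(0, 1)
    return [c]

population = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(POP_SIZE)]
for _ in range(N_GEN):
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:POP_SIZE // 2]                    # keep the fitter strings
    children = []
    while len(children) < POP_SIZE - len(parents):
        a, b = random.sample(parents, 2)
        children += crossover(a, b) if random.random() < 0.7 else mutate(a)
    population = parents + children[:POP_SIZE - len(parents)]

best = max(population, key=fitness)
print("Selected features:", [i for i, bit in enumerate(best) if bit])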

6.7 Genetic Programming for Feature Selection and Combination

Koza popularized genetic programming (GP) as another evolutionary machine learning algorithm (Koza, 1992). It has since been used for the classification of many different types of data and signals (Espejo, Ventura and Herrera, 2010). Zhu et al. first employed GP for modulation classification feature selection and combination (Zhu, Aslam and Nandi, 2010). Zhu et al. also extended the application of GP in modulation classification by combining GP with other machine learning algorithms to achieve improved classification performance (Zhu, Aslam and Nandi, 2011; Aslam, Zhu and Nandi, 2012).

GP belongs to the class of evolutionary algorithms, which attempt to emulate a Darwinian model of natural evolution. It is a machine learning methodology that optimizes a population of individuals (computer programs) with the help of fitness values. GP develops the solution of a problem in the form of a mathematical formula. Each solution is a computer program and can be represented in the form of a tree with terminal nodes (data nodes) and internal nodes (function nodes). Each individual is given a fitness value which quantifies its ability to solve the given problem; the fitness value is computed using a user-defined fitness function, whose choice depends upon the nature of the problem. The advantages of GP over other machine learning methods are listed below.

  1. No prior knowledge about the statistical distribution of the data is needed.
  2. Preprocessing of the data is not required; the data can be used directly by GP in its original form.
  3. GP returns a mathematical function as output which can be used directly in the application environment.
  4. GP has the inherent capability to select useful features and to ignore others.

A typical GP implementation follows these steps.

  1. GP starts with a randomly generated population of user-defined size.
  2. Each individual is assigned a fitness value which represents its strength in solving the given problem.
  3. Genetic operators are applied to the current generation to produce the individuals of the next generation. The genetic operators are explained in Section 6.7.2.
  4. All the individuals are given fitness values and those with better fitness values are transferred to the next generation.
  5. Steps 3 and 4 are repeated until a desired solution is achieved; otherwise GP is terminated after a number of generations set by the user.

6.7.1 Tree-structured Solution

There are different ways to represent the individuals (computer programs) in GP. One of the most common is the tree representation, and the same representation is used here. A tree has terminal nodes, internal nodes and an output node. Terminal nodes represent the inputs, internal nodes represent the functions operating on the inputs, and the output node gives the output of the tree. An example of a tree-structured mathematical formula, (A + B) × C, is given in Figure 6.5. In the case of modulation classification feature selection and combination, the input nodes are the selected raw features and the output node represents the desired new feature combination.


Figure 6.5 Tree-structured individual of genetic programming.
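A minimal sketch of this tree representation is given below, where an individual is stored as a nested tuple and evaluated recursively on raw feature values; the representation, function set and feature values are assumptions made for illustration.

import operator

# A GP individual represented as a nested tuple: an internal node holds a function
# name followed by its operand subtrees, while a terminal node is a feature name
tree = ('mul', ('add', 'A', 'B'), 'C')          # the formula (A + B) x C of Figure 6.5

FUNCTIONS = {'add': operator.add, 'sub': operator.sub, 'mul': operator.mul}

def evaluate(node, features):
    """Recursively evaluate a tree on a dictionary of raw feature values."""
    if isinstance(node, str):                   # terminal node: look up the raw feature
        return features[node]
    name, left, right = node                    # internal node: apply its function
    return FUNCTIONS[name](evaluate(left, features), evaluate(right, features))

# Hypothetical raw feature values extracted from one signal
print(evaluate(tree, {'A': 0.4, 'B': 1.1, 'C': 2.0}))   # (0.4 + 1.1) * 2.0 = 3.0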

6.7.2 Genetic Operators

Genetic operators are used for reproducing new individuals from older individuals. The operations mimic the genetic processes observed in nature. The traditional operators included in a standard GP are crossover and mutation. Semantically, crossover is intended to share the fitter parts of two different individuals in order to create a new individual which is fitter than both parents. Meanwhile, mutation generates new individuals by replacing a random branch of a parent with a randomly generated new branch, in the hope that the child will have better fitness than the parent. In practice, the semantic motive of crossover and mutation is implemented with random symbolic processes. We shall use a simple example to illustrate how crossover and mutation are achieved in standard GP.

Let us assume that there are two parent trees, each representing a mathematical formula as shown in Figure 6.6.


Figure 6.6 Parent trees selected for crossover operation.

The first step of crossover randomly selects a branch in each parent tree. The selected branch is highlighted in Figure 6.6 with dashed lines. In the second step, the selected branches are swapped between the two parents, thereby creating two new individuals as shown in Figure 6.7.


Figure 6.7 Child trees produced by crossover from trees in Figure 6.6.

For mutation, we need to select only one parent tree, as shown in Figure 6.8.


Figure 6.8 Parent tree selected for mutation and a randomly generated branch.

The first step of the mutation selects a random branch from the parent tree. In the second step, a new branch is randomly generated. Finally, the mutation is completed by attaching the randomly generated new branch at the position from which the old branch was removed. The resulting tree is the child tree of the mutation operation.
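The following sketch shows one way crossover and mutation can be carried out on tree-structured individuals stored as nested lists; the representation and the helper functions are assumptions made for illustration, not a specific GP toolkit.

import random, copy

# Trees as nested lists: [function, branch, branch]; a string is a terminal node
def branches(tree, path=()):
    """Yield (path, subtree) pairs so that a random branch can be selected."""
    yield path, tree
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from branches(child, path + (i,))

def replace(tree, path, new_branch):
    """Return a copy of tree with the branch at path replaced by new_branch."""
    if not path:
        return copy.deepcopy(new_branch)
    tree = copy.deepcopy(tree)
    node = tree
    for i in path[:-1]:
        node = node[i]
    node[path[-1]] = copy.deepcopy(new_branch)
    return tree

def gp_crossover(parent1, parent2):
    # Select a random branch in each parent and swap the two branches
    path1, branch1 = random.choice(list(branches(parent1)))
    path2, branch2 = random.choice(list(branches(parent2)))
    return replace(parent1, path1, branch2), replace(parent2, path2, branch1)

def gp_mutation(parent, random_branch):
    # Replace a random branch of the parent with a randomly generated branch
    path, _ = random.choice(list(branches(parent)))
    return replace(parent, path, random_branch)

p1 = ['mul', ['add', 'A', 'B'], 'C']
p2 = ['sub', 'D', ['mul', 'E', 'F']]
print(gp_crossover(p1, p2))
print(gp_mutation(p1, ['add', 'C', 'D']))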

6.7.3 Fitness Evaluation

Fitness evaluation is the most important design component because it is directly related to the evaluation of how well an individual in the evolution solves the given problem. If a flawed fitness criterion is used, regardless of how efficient the GP is, the end solution will deviate from the goal of the entire system.

For modulation classification, where GP serves as a feature selector and generator, the goal is to generate a combination of selected features which provides fast and accurate modulation classification. Because of the nature of the task, two different approaches have been used to define the fitness function. The first approach evaluates the quality of the new feature by measuring the intra-class tightness and inter-class separation given some training signals. To achieve such evaluation, Aslam et al. proposed to use Fisher's criterion as the fitness function for GP (Aslam, Zhu and Nandi, 2011). Assuming there are $L$ signal realizations from each of two different modulations A and B, the new feature acquired through GP can be calculated for each signal realization. Therefore, we have two sets of feature values, $\{f_{1}^{(A)},\ldots,f_{L}^{(A)}\}$ and $\{f_{1}^{(B)},\ldots,f_{L}^{(B)}\}$. To calculate the fitness of this new feature, the fitness function in equation (6.15) is employed, based on Fisher's criterion (Fisher, 1936),

$$F=\frac{\left(\mu_{A}-\mu_{B}\right)^{2}}{\sigma_{A}^{2}+\sigma_{B}^{2}} \tag{6.15}$$

where $\mu_{A}$ and $\mu_{B}$ are the means of the two sets of feature values, and $\sigma_{A}^{2}$ and $\sigma_{B}^{2}$ are the corresponding variances, as defined in equations (6.16) and (6.17).

$$\sigma_{A}^{2}=\frac{1}{L}\sum_{l=1}^{L}\left(f_{l}^{(A)}-\mu_{A}\right)^{2} \tag{6.16}$$
$$\sigma_{B}^{2}=\frac{1}{L}\sum_{l=1}^{L}\left(f_{l}^{(B)}-\mu_{B}\right)^{2} \tag{6.17}$$

It is clear from equation (6.15) that the numerator measures the separation of the features from different modulations and the denominator measures the tightness of the features from the same modulation. Therefore, the fitness function matches the desired properties of an effective feature for modulation classification.
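A sketch of Fisher's criterion as a GP fitness function, following equations (6.15) to (6.17), is given below; the feature values are synthetic and assumed purely for illustration.

import numpy as np

def fisher_fitness(f_a, f_b):
    """Fisher's criterion (6.15): squared difference of the class means divided
    by the sum of the class variances of a GP-generated feature."""
    f_a, f_b = np.asarray(f_a), np.asarray(f_b)
    return (f_a.mean() - f_b.mean()) ** 2 / (f_a.var() + f_b.var())

# Hypothetical feature values produced by one GP individual on training signals
# from modulations A and B
print(fisher_fitness([1.0, 1.2, 0.9, 1.1], [2.4, 2.6, 2.5, 2.7]))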

However, there are two drawbacks to Fisher's criterion for fitness evaluation. First, the criterion was developed under the assumption that the statistic is normally distributed. For GP-generated features it is very difficult to establish the distribution of a new feature, because it can be a very complicated combination of many existing features, and the distribution varies dramatically because of the random nature of GP; the genetic operators constantly maintain diversity in the populations, resulting in new features with diverse distributions. Secondly, in practice, the trained new features may converge to a tiny mean difference between classes accompanied by very small variances, while in other cases they may exhibit a very large mean difference with equally large variances.

Meanwhile, there is a second approach which does not share the flaws of the Fisher’s criterion-based fitness evaluation. As the ultimate goal for the new feature is to enhance classification performance, Zhu et al. used a small set of training signals in the GP evaluation and incorporated a computationally efficient classifier in the fitness evaluation (Zhu, Aslam and Nandi, 2010). The fitness, in this case, was evaluated by directly classifying the training signals with a classifier from which the average classification accuracy was used as the fitness value.
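A simplified sketch of this second approach is given below. A simple threshold on the GP-generated feature stands in for the classifier used in the fitness evaluation; the actual classifier employed by Zhu et al. is not reproduced here, so both the classifier and the feature values are assumptions made for illustration.

import numpy as np

def accuracy_fitness(gp_feature_values, labels, threshold=None):
    """Hypothetical fitness: classify the training signals with a simple threshold
    on the GP-generated feature and return the classification accuracy."""
    gp_feature_values = np.asarray(gp_feature_values)
    labels = np.asarray(labels)
    if threshold is None:
        # Midpoint between the two class means, a simple stand-in classifier
        threshold = 0.5 * (gp_feature_values[labels == 0].mean()
                           + gp_feature_values[labels == 1].mean())
    predictions = (gp_feature_values > threshold).astype(int)
    # Accept either polarity of the threshold decision
    return max(np.mean(predictions == labels), np.mean(predictions != labels))

# GP feature values of training signals from two modulations (synthetic values)
print(accuracy_fitness([0.9, 1.1, 1.0, 2.1, 1.9, 2.0], [0, 0, 0, 1, 1, 1]))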

6.8 Conclusion

In this chapter, several machine learning algorithms for modulation classification have been presented, along with approaches to feature enhancement. KNN and SVM are two supervised learning algorithms that enable the classification of modulations using multiple input features. The KNN classifier is easier to implement for multi-class problems, while the SVM classifier provides a separating hyperplane, optimized using training data, that readily achieves two-class classification. ANN offers the possibility of combining different features into a smaller feature set. GA with binary coding serves as a feature selector. GP has both the ability to select features and the flexibility to create feature combinations.

References

  1. Akay, M.F. (2009) Support vector machines combined with feature selection for breast cancer diagnosis. Expert Systems with Applications, 36 (2), 3240–3247.
  2. Aslam, M.W., Zhu, Z. and Nandi, A.K. (2011) Robust QAM Classification Using Genetic Programming and Fisher Criterion. European Signal Processing Conference (EUSIPCO), 28 August 2011, Eurasip, pp. 995–999.
  3. Aslam, M.W., Zhu, Z. and Nandi, A.K. (2012) Automatic modulation classification using combination of genetic programming and KNN. IEEE Transactions on Wireless Communications, 11 (8), 2742–2750.
  4. Azzouz, E.E. and Nandi, A.K. (1995) Automatic identification of digital modulation types. Signal Processing, 47 (1), 55–69.
  5. Espejo, P.G., Ventura, S. and Herrera, F. (2010) A survey on the application of genetic programming to classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 40 (2), 121–144.
  6. Fisher, R.A. (1936) The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7 (2), 179–188.
  7. Guo, H. and Nandi, A.K. (2006) Breast cancer diagnosis using genetic programming generated feature. Pattern Recognition, 39 (5), 980–987.
  8. Guo, H., Zhang, Q. and Nandi, A.K. (2007) Feature Generation Using Genetic Programming Based on Fisher Criterion. European Signal Processing Conference, 3 September 2007, Eurasip, pp. 1867–1871.
  9. Hosmer, D.W. and Lemeshow, S. (2000) Applied Logistic Regression, John Wiley & Sons, Inc., New York.
  10. Koza, J. R. (1992) Genetic Programming: On the Programming of Computers by Means of Natural Selection, The MIT Press, Cambridge.
  11. Mustafa, H. and Doroslovacki, M. (2004) Digital Modulation Recognition Using Support Vector Machine Classifier. Asilomar Conference on Signals, Systems and Computers, November 7–10, 2004, IEEE, pp. 2238–2242.
  12. Nandi, A.K. and Azzouz, E.E. (1997) Modulation recognition using artificial neural networks. Signal Processing, 56 (2), 165–175.
  13. Polat, K. and Güneş, S. (2007) Breast cancer diagnosis using least square support vector machine. Digital Signal Processing, 17 (4), 694–701.
  14. Theodoridis, S. (2008) Pattern Recognition, 4th edn, Academic Press, Boston.
  15. Wong, M.L.D. and Nandi, A.K. (2004) Automatic digital modulation recognition using artificial neural network and genetic algorithm. Signal Processing, 84 (2), 351–365.
  16. Zhu, Z., Aslam, M.W. and Nandi, A.K. (2010) Augmented Genetic Programming for Automatic Digital Modulation Classification. IEEE International Workshop on Machine Learning for Signal Processing, 29 August 2010, IEEE, pp. 391–396.
  17. Zhu, Z., Aslam, M.W. and Nandi, A.K. (2011) Support Vector Machine Assisted Genetic Programming for MQAM Classification. International Symposium on Signals, Circuits and Systems, 30 June 2011, IEEE, pp. 1–6.
  18. Zhu, Z., Nandi, A.K. and Aslam, M.W. (2013) Robustness Enhancement of Distribution Based Binary Discriminative Features for Modulation Classification. IEEE International Workshop on Machine Learning for Signal Processing, 22 September 2013, IEEE, pp. 1–6.