Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Document retrieval and Support Vector Machine

Support Vector Machine (SVM) is a classification algorithm applicable to both linear and nonlinear data classification. It is based on an assumption: if two classes of data cannot be divided by a hyper-plane, then after mapping the source dataset to sufficient higher dimension spaces, the optimal separating hyper-plane must exist.

Here are two concepts that need to be clearly defined:

Linearly separable: This means that a dataset can be divided into the target classes with a linear equation with the input of a training tuple.
Nonlinearly separable: This means that none of the linear equations exist in the space with the same dimension as that of the training tuple.

The linear hyper-plane can be represented as the linear discriminant equation, given the weight vector w and the training tuple x,

With the preceding equation, we have the following image to illustrate a hyper-plane:

Document retrieval and Support Vector Machine

The target of SVM is to find the optimal hyper-plane, by which the margin between data points belonging to different classes are maximized.

There are two hyper-planes with equal distance and that are parallel to the hyper-plane. They are boundary hyper-planes and all support vectors are on them. This is illustrated in the following diagram:

In the following diagram, the case of a nonlinearly separable case is illustrated:

In the following diagram, after mapping the vector from a low-dimensional space to high-dimensional space, the nonlinearly separable case will be transformed into to a linearly separable case:

The SVM algorithm

The input parameters for a dual SVM algorithm are as follows:

D, the set of training objects
K
C

The output of the algorithm is the SVM algorithm. The pseudocode snippet for this algorithm is illustrated here:

Here is the pseudocode for another version of the SVM algorithm, which is the primal kernel SVM algorithm. The input parameters for the primal kernel SVM algorithm are as follows:

D, the set of training objects
K
C

The output of the algorithm is the SVM model. The pseudocode snippet for this algorithm is illustrated here:

The R implementation

Please look up the R codes file ch_04_svm.R from the bundle of R codes for the previously mentioned algorithm. The codes can be tested with the following command:

> source("ch_04_svm.R")

Parallel version with MapReduce

Along with the online computation requests, such as mobile cloud applications for classification, a high-performance, robust, and high-accuracy classification platform or algorithm is widely required. The parallelized SVM distributes the computing of optimization by the MapReduce technique to achieve working on a large scale of datasets and high performance at the same time. There are many implementations of various versions of parallelized SVM. One of them is shown here.

The t denotes for iteration number, l for MapReduce function size, for the best hypothesis at iteration t, for sub-dataset for node l, for support vectors at node l and for global support vector:

Next, the mapper algorithm is designed and the for loop is immediately behind the while loop, which is looped for each subset:

Finally, the reducer algorithm is designed. On line 3, the code inside the for loop immediately follows the while loop. This is for training by merging the datasets to obtain support vectors and binary-class hypothesis: