Document retrieval and Support Vector Machine

Support Vector Machine (SVM) is a classification algorithm applicable to both linear and nonlinear data classification. It is based on an assumption: if two classes of data cannot be divided by a hyper-plane, then after mapping the source dataset to sufficient higher dimension spaces, the optimal separating hyper-plane must exist.

Here are two concepts that need to be clearly defined:

  • Linearly separable: This means that a dataset can be divided into the target classes with a linear equation with the input of a training tuple.
  • Nonlinearly separable: This means that none of the linear equations exist in the space with the same dimension as that of the training tuple.

The linear hyper-plane can be represented as the linear discriminant equation, given the weight vector w and the training tuple x,

Document retrieval and Support Vector Machine
Document retrieval and Support Vector Machine
Document retrieval and Support Vector Machine

With the preceding equation, we have the following image to illustrate a hyper-plane:

Document retrieval and Support Vector Machine

The target of SVM is to find the optimal hyper-plane, by which the margin between data points belonging to different classes are maximized.

There are two hyper-planes with equal distance and that are parallel to the Document retrieval and Support Vector Machine hyper-plane. They are boundary hyper-planes and all support vectors are on them. This is illustrated in the following diagram:

Document retrieval and Support Vector Machine

In the following diagram, the case of a nonlinearly separable case is illustrated:

Document retrieval and Support Vector Machine

In the following diagram, after mapping the vector from a low-dimensional space to high-dimensional space, the nonlinearly separable case will be transformed into to a linearly separable case:

Document retrieval and Support Vector Machine

The SVM algorithm

The input parameters for a dual SVM algorithm are as follows:

  • D, the set of training objects
  • K
  • C
  • The SVM algorithm

The output of the algorithm is the SVM algorithm. The pseudocode snippet for this algorithm is illustrated here:

The SVM algorithm
The SVM algorithm

Here is the pseudocode for another version of the SVM algorithm, which is the primal kernel SVM algorithm. The input parameters for the primal kernel SVM algorithm are as follows:

  • D, the set of training objects
  • K
  • C
  • The SVM algorithm

The output of the algorithm is the SVM model. The pseudocode snippet for this algorithm is illustrated here:

The SVM algorithm

The R implementation

Please look up the R codes file ch_04_svm.R from the bundle of R codes for the previously mentioned algorithm. The codes can be tested with the following command:

> source("ch_04_svm.R")

Parallel version with MapReduce

Along with the online computation requests, such as mobile cloud applications for classification, a high-performance, robust, and high-accuracy classification platform or algorithm is widely required. The parallelized SVM distributes the computing of optimization by the MapReduce technique to achieve working on a large scale of datasets and high performance at the same time. There are many implementations of various versions of parallelized SVM. One of them is shown here.

The t denotes for iteration number, l for MapReduce function size, Parallel version with MapReduce for the best hypothesis at iteration t, Parallel version with MapReduce for sub-dataset for node l, Parallel version with MapReduce for support vectors at node l and Parallel version with MapReduce for global support vector:

Parallel version with MapReduce

Next, the mapper algorithm is designed and the for loop is immediately behind the while loop, which is looped for each subset:

Parallel version with MapReduce

Finally, the reducer algorithm is designed. On line 3, the code inside the for loop immediately follows the while loop. This is for training by merging the datasets to obtain support vectors and binary-class hypothesis:

Parallel version with MapReduce

Document retrieval

One of the important applications of SVM is document retrieval, which has a static information source; the task is to fetch a ranking of documents as a response to a user request. The vector model is a widely implemented document-retrieval or information-retrieval model.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.216.95.197