The k-Nearest Neighbors (kNN) algorithm is a lazy learner: it postpones learning until a test tuple, or test instance, is provided.
Each training tuple is represented by a point in an n-dimensional space; in other words, the combination of its n attribute values locates that tuple in the space. No explicit training takes place before the arrival of the test tuple that needs to be classified. Some preprocessing may still be needed, however, such as normalizing attributes whose values are large compared with the values of other attributes. The data normalization approaches covered under data transformation can be applied here as preprocessing.
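Because kNN relies on distances, attributes measured on large scales can dominate the computation. As a brief illustration, the following Python sketch applies min-max normalization to rescale every attribute to [0, 1]; the training data shown (an age column next to an income column) is hypothetical:

    # Min-max normalization: rescale each attribute to [0, 1] so that
    # no single attribute dominates the distance computation.
    def min_max_normalize(data):
        n_attrs = len(data[0])
        mins = [min(row[i] for row in data) for i in range(n_attrs)]
        maxs = [max(row[i] for row in data) for i in range(n_attrs)]
        return [
            [(row[i] - mins[i]) / (maxs[i] - mins[i]) if maxs[i] > mins[i] else 0.0
             for i in range(n_attrs)]
            for row in data
        ]

    # Hypothetical data: age (small scale) next to income (large scale)
    training = [[25, 50000], [40, 92000], [33, 61000]]
    print(min_max_normalize(training))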
When a test tuple is given, the k training tuples nearest to it are found in the training-tuple space using a measure of the distance between the test tuple and each training tuple. These k-nearest training tuples are also known as the k-nearest neighbors. One popular measure, applicable only to numeric attributes, is the Euclidean distance in real space, given by the following equation:

$$\mathrm{dist}(X_1, X_2) = \sqrt{\sum_{i=1}^{n} \left(x_{1i} - x_{2i}\right)^2}$$
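A direct Python translation of this equation might look as follows; the two tuples are illustrative:

    import math

    # Euclidean distance between two n-dimensional numeric tuples
    def euclidean_distance(x1, x2):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

    print(euclidean_distance([1.0, 2.0], [4.0, 6.0]))  # prints 5.0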
For nominal attributes, one simple solution defines the difference between two attribute values as 0 if they are identical and 1 otherwise. As discussed earlier, various approaches are available for dealing with missing values in the attributes. The value of k is typically determined experimentally: the error rate is estimated for successive candidate values of k, and the value that yields the lowest error rate among all the training tuples is selected.
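One possible sketch of such a mixed distance in Python is shown below. It assumes, purely for illustration, that nominal attributes are represented as strings and that numeric attributes have already been normalized to [0, 1]:

    import math

    # Per-attribute difference: 0/1 for nominal (string) values,
    # absolute difference for numeric values (assumed normalized to [0, 1])
    def attribute_difference(a, b):
        if isinstance(a, str) or isinstance(b, str):
            return 0.0 if a == b else 1.0
        return abs(a - b)

    def mixed_distance(x1, x2):
        return math.sqrt(sum(attribute_difference(a, b) ** 2
                             for a, b in zip(x1, x2)))

    # Contributions: |0.2 - 0.5| = 0.3 (numeric) and 1 (nominal mismatch)
    print(mixed_distance([0.2, "red"], [0.5, "blue"]))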
The class label of the test tuple is determined by a majority vote: the most common class among its k-nearest neighbors is assigned.
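Assuming the class labels of the k nearest neighbors have already been collected into a list, this vote reduces to a short sketch using Python's standard library:

    from collections import Counter

    # Majority vote among the class labels of the k nearest neighbors
    def majority_vote(neighbor_labels):
        return Counter(neighbor_labels).most_common(1)[0][0]

    print(majority_vote(["yes", "no", "yes"]))  # prints "yes"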
The input parameters for the kNN algorithm are as follows:

- D, the set of training tuples
- z, the test tuple to be classified
- k, the number of nearest neighbors to consult

The output of the algorithm is the class of z, represented as y'.
A pseudocode snippet for kNN, in its standard formulation, is illustrated here:
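    1: Let k be the number of nearest neighbors and D be the set of training tuples
    2: for each test tuple z = (x', y') do
    3:     Compute d(x', x), the distance between z and every training tuple (x, y) in D
    4:     Select D_z, the subset of D containing the k training tuples closest to z
    5:     Assign to z the majority class among its nearest neighbors:
    6:     y' = argmax_v Σ_{(x_i, y_i) ∈ D_z} I(v = y_i)
    7: end for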
The I function in line 6 denotes an indicator function that returns the value 1 if its argument is true and 0 otherwise.
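Putting the preceding steps together, a compact end-to-end sketch of the algorithm in Python might look like this; the training data, test tuple, and choice of k = 3 are all illustrative:

    import math
    from collections import Counter

    def knn_classify(training, test_tuple, k):
        """Classify test_tuple by majority vote among its k nearest
        training tuples. Each training entry is (attributes, class_label)."""
        # Line 3 of the pseudocode: distance from z to every training tuple
        distances = [
            (math.sqrt(sum((a - b) ** 2 for a, b in zip(x, test_tuple))), y)
            for x, y in training
        ]
        # Line 4: select the k training tuples closest to z
        nearest = sorted(distances, key=lambda d: d[0])[:k]
        # Line 6: majority vote, i.e. argmax over the indicator counts
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    training = [
        ([1.0, 1.1], "A"), ([1.2, 0.9], "A"),
        ([3.0, 3.2], "B"), ([2.9, 3.1], "B"),
    ]
    print(knn_classify(training, [1.1, 1.0], k=3))  # prints "A"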