How it works...

The first step here is straightforward: we load in the sequences we're interested in and the classes they belong to. Because we load the ecoli_protein_classes.txt file into a dataframe but need a simple vector, we use the $ subset operator to extract the classes column from the dataframe; doing so returns that single column as the vector object we need. After this, the workflow proceeds as follows:
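The loading step might look like the following sketch. The FASTA file name ecoli_proteins.fa is an assumption (the recipe names only the classes file), as is the exact read.table() call; the classes column name comes from the text:

```r
library(kebabs)
library(Biostrings)

# Read the protein sequences (the FASTA file name here is illustrative)
seqs <- readAAStringSet("ecoli_proteins.fa")

# Read the class assignments into a dataframe, then pull the single
# column we need out as a plain vector with the $ operator
class_df <- read.table("ecoli_protein_classes.txt", header = TRUE)
classes <- class_df$classes
```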

  1. Decide how much of the data should be for training and how much for testing: In step 1, we choose 75% of the data as the training set when we create the training_proportion variable. This is used in conjunction with num_seqs in the sample() function to randomly choose the indices of the sequences to put into the training set. The training_set_indices variable contains integers that we will later use to subset the data. We then make a complementary vector of indices, test_set_indices, by using the square bracket, [], subset notation and the negation operator, -. This construct is an idiomatic way of creating a vector that contains every index not in training_set_indices.
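The split described in step 1 can be sketched as follows (seqs is the sequence set loaded earlier; the set.seed() call is an optional addition for reproducibility):

```r
set.seed(42)  # optional: makes the random split reproducible

num_seqs <- length(seqs)
training_proportion <- 0.75

# Randomly choose 75% of the indices for the training set
training_set_indices <- sample(1:num_seqs, training_proportion * num_seqs)

# Negative subsetting with [-...] keeps every index NOT in the training set
test_set_indices <- c(1:num_seqs)[-training_set_indices]
```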
  2. Construct and train the Support Vector Machine model: In step 2, we build our classifying model. First, we choose a kernel that maps the input data into a matrix representation that the Support Vector Machine can learn from. Here, it comes from the gappyPairKernel() function; note that there are many kernel types, and this one is well suited to sequence data. We pass kernel along to the kbsvm() function, along with the training_set_indices subset of the sequences in seqs as the x parameter and the training_set_indices subset of classes as the y parameter. Other arguments to this function determine the exact model type, the backend package, and the training parameters. There are many options for these, and they can have a strong effect on the efficacy of the final model, so it's well worth reading up and experimenting scientifically to find what works best for your particular data. The final model is saved in the model variable.
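Steps 1 and 2 together might look like the following sketch. The k and m kernel parameters and the svm-type and cost arguments are illustrative assumptions, not the recipe's exact values, and are the kind of settings worth tuning:

```r
# A gappy-pair kernel suited to sequence data; k and m (substring
# length and maximum gap) are illustrative values here
kernel <- gappyPairKernel(k = 1, m = 3)

# Train an SVM on the training subset via the e1071 backend;
# the svm type and cost are assumptions worth experimenting with
model <- kbsvm(x = seqs[training_set_indices],
               y = classes[training_set_indices],
               kernel = kernel,
               pkg = "e1071", svm = "C-svc", cost = 15)
```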
  3. Test the model on unseen data: Now that we have a model, we can use it to predict the classes of unseen proteins. This stage tells us how good the model is. In step 3, we use the predict() function with the model and the sequences we didn't use for training (the ones in test_set_indices) and get a prediction object back. Running the predictions through the evaluatePrediction() function, along with the real classes from the classes vector and a vector of all possible class labels via the allLabels argument, returns a summary of the accuracy and other metrics of the model:

##     1 -1
## 1  36 23
## -1 10 19
## 
## Accuracy:              62.500% (55 of 88)
## Balanced accuracy:     61.749% (36 of 46 and 19 of 42)
## Matthews CC:            0.250
## 
## Sensitivity:           78.261% (36 of 46)
## Specificity:           45.238% (19 of 42)
## Precision:             61.017% (36 of 59)

     We have 62% accuracy here, which is only okay; it's better than random, but we have a rather small dataset and the model isn't optimized, so with more work it could be better. Note that if you run the code, you may get different numbers: since the selection of training set sequences is random, the model might do slightly worse or better depending on the exact input data.
  4. Estimate the prediction profile of a sequence: To find the regions that are important in classification, and presumably in the function of the protein, we use the getPredictionProfile() function on a sequence. We do this in step 4 on a small 10 AA fragment extracted from the first sequence, using double-bracket list indexing to get the first sequence and single-bracket indexing to get a range; for example, seqs[[1]][1:10]. We do this simply for the clarity of the visualization in the last step; you can use whole sequences just as well. The getPredictionProfile() function needs the kernel and model objects to work.
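The prediction and evaluation calls in step 3 can be sketched as follows, assuming the seqs, classes, model, and index variables from the earlier steps:

```r
# Predict classes for the held-out test sequences
predictions <- predict(model, seqs[test_set_indices])

# Compare predictions against the true labels; allLabels ensures
# every possible class appears in the summary table
evaluatePrediction(predictions,
                   classes[test_set_indices],
                   allLabels = unique(classes))
```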
  5. Finally, we can plot() the prediction profile: The profile shows the contribution of each amino acid to the overall decision and adds to the interpretability of the learning results. Here, the fourth residue, D, makes a strong contribution to the decision made for this protein. By examining this across many sequences, the patterns contributing to the decision can be elucidated. Because of random processes in the algorithms, you may get a slightly different picture to the one that follows, and this is something you should build into your analyses: make sure that any apparent differences aren't due to random choices made in running the code. The strongest contribution should still come from "D" in this example:
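Steps 4 and 5 might be sketched like this; the featureWeights() and modelOffset() accessors used to extract the trained model's weights and offset for getPredictionProfile() are drawn from the kebabs package, but treat the exact call as an assumption to check against its documentation:

```r
# Profile a short 10 AA fragment of the first sequence:
# [[1]] gets the first sequence, [1:10] takes a range within it
fragment <- seqs[[1]][1:10]

# getPredictionProfile() needs the kernel plus the trained model's
# feature weights and decision offset
profile <- getPredictionProfile(fragment, kernel,
                                featureWeights(model),
                                modelOffset(model))

# Plot the per-residue contributions to the classification decision
plot(profile)
```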
