K-nearest neighbors

K-nearest neighbors (or KNN) is a supervised learning method. Like the prior methods we saw in this chapter, the goal is to find a function that predicts an output, y, from an unseen observation, x. Unlike many other methods (such as linear regression), KNN makes no specific assumption about the distribution of the data, which is why it is referred to as a non-parametric classifier.

The KNN algorithm is based on comparing a new observation to the K most similar instances in the training data. Similarity is defined by a distance metric between two data points. One of the most frequently used metrics is the Euclidean distance, defined as follows:

d(x,y) = ((x1−y1)^2 + (x2−y2)^2 + … + (xn−yn)^2)^(1/2)
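
As a quick illustration (a NumPy sketch that is not part of the original listing), this distance can be computed as follows:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 8.0])

# Square root of the sum of squared coordinate differences
euclidean = np.sqrt(np.sum((x - y) ** 2))
print(euclidean)  # ~7.0711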

When we review the documentation of the scikit-learn class, KNeighborsClassifier, we can observe several parameters:

One of them is the parameter p, which selects the distance metric to use:

  • When p=1, the Manhattan distance is used. The Manhattan distance is the sum of the absolute differences between the coordinates of two points (in two dimensions, the horizontal plus the vertical distance).
  • When p=2, which is the default value, the Euclidean distance is used.
  • For other values of p, the Minkowski distance is used: d(x,y) = (|x1−y1|^p + |x2−y2|^p + … + |xn−yn|^p)^(1/p). It generalizes both the Manhattan (p=1) and Euclidean (p=2) distances.
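
A short sketch (not part of the original text) shows how the Minkowski distance reduces to the Manhattan and Euclidean distances; the minkowski helper below is hypothetical:

import numpy as np

def minkowski(x, y, p):
    # (sum of |xi - yi|^p)^(1/p)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
print(minkowski(x, y, 1))  # 7.0  -> Manhattan distance
print(minkowski(x, y, 2))  # 5.0  -> Euclidean distance
print(minkowski(x, y, 3))  # ~4.498 -> Minkowski with p=3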

The algorithm calculates the distance between a new observation and every point in the training data, and keeps the K training points that are closest to it. Conditional probabilities are then calculated for each class among these K neighbors, and the new observation is assigned to the class with the highest probability (with uniform weights, this is a majority vote). The weakness of this method is the time it takes to classify a new observation, since distances to all training points must be computed. A minimal sketch of this classification step is shown below.
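
The following sketch (an illustration using NumPy, not the scikit-learn implementation we use later) shows the distance computation and majority vote from scratch:

import numpy as np
from collections import Counter

def knn_predict(X_train, Y_train, x_new, k):
    # Distance from the new observation to every training point
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Indices of the K closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the K neighbors
    return Counter(Y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
Y_train = np.array([-1, -1, 1, 1])
print(knn_predict(X_train, Y_train, np.array([4.5, 5.5]), k=3))  # 1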

To implement this algorithm in code, we will use the functions we declared in the first part of this chapter:

  1. Let's get the Google data from January 1, 2001 to January 1, 2018 (a plausible sketch of the load_financial_data helper follows the snippet):
goog_data=load_financial_data(start_date='2001-01-01',
                              end_date='2018-01-01',
                              output_file='goog_data_large.pkl')
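
This helper was declared in the first part of the chapter; a minimal sketch of what it might look like, assuming pandas and pandas_datareader with the Yahoo backend, is:

import pandas as pd
from pandas_datareader import data

def load_financial_data(start_date, end_date, output_file):
    try:
        # Reuse previously downloaded data if the pickle file exists
        df = pd.read_pickle(output_file)
    except FileNotFoundError:
        # Otherwise download the OHLC data and cache it
        df = data.DataReader('GOOG', 'yahoo', start_date, end_date)
        df.to_pickle(output_file)
    return df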

  2. We create the rule that determines when the strategy takes a long position (+1) or a short position (-1), as shown in the following code (a plausible sketch of this helper follows the snippet):
X,Y=create_trading_condition(goog_data)
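
One plausible definition of create_trading_condition, consistent with the long/short rule above (an assumption, since the helper was declared earlier in the chapter), is:

import numpy as np

def create_trading_condition(df):
    # Features: intraday price moves
    df['Open-Close'] = df.Open - df.Close
    df['High-Low'] = df.High - df.Low
    df.dropna(inplace=True)
    X = df[['Open-Close', 'High-Low']]
    # Target: +1 if the next close is higher (go long), -1 otherwise (go short)
    Y = np.where(df['Close'].shift(-1) > df['Close'], 1, -1)
    return X, Y
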
  3. We prepare the training and testing datasets, as shown in the following code (a sketch of the helper follows the snippet):
X_train,X_test,Y_train,Y_test=create_train_split_group(X,Y,split_ratio=0.8)
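
A minimal sketch of create_train_split_group, assuming it wraps scikit-learn's train_test_split with shuffling disabled so that the test set is the most recent data:

from sklearn.model_selection import train_test_split

def create_train_split_group(X, Y, split_ratio=0.8):
    # shuffle=False keeps the chronological order of the price series
    return train_test_split(X, Y, shuffle=False, train_size=split_ratio)
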
  4. In this example, we choose a KNN with K=15 and train this model using the training dataset, as shown in the following code:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn=KNeighborsClassifier(n_neighbors=15)
knn.fit(X_train, Y_train)

accuracy_train = accuracy_score(Y_train, knn.predict(X_train))
accuracy_test = accuracy_score(Y_test, knn.predict(X_test))
  5. Once the model is created, we predict whether the price will go up or down and store these values in the original data frame, as shown in the following code:
goog_data['Predicted_Signal']=knn.predict(X)
  6. In order to evaluate the strategy based on the KNN algorithm, we compare it against the return of the GOOG symbol without the strategy, as shown in the following code (plausible sketches of the return helpers follow the snippet):
# Daily log returns of the GOOG symbol
goog_data['GOOG_Returns']=np.log(goog_data['Close']/goog_data['Close'].shift(1))

cum_goog_return=calculate_return(goog_data,split_value=len(X_train),symbol='GOOG')
cum_strategy_return=calculate_strategy_return(goog_data,split_value=len(X_train))
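
The two return helpers were declared earlier in the chapter; the sketches below are plausible reconstructions (the exact signatures and the symbol default are assumptions):

def calculate_return(df, split_value, symbol):
    # Cumulative buy-and-hold return (in percent) over the test period
    return df[split_value:]['%s_Returns' % symbol].cumsum() * 100

def calculate_strategy_return(df, split_value, symbol='GOOG'):
    # Trade yesterday's predicted signal on today's return;
    # shifting the signal avoids look-ahead bias
    df['Strategy_Returns'] = df['%s_Returns' % symbol] * df['Predicted_Signal'].shift(1)
    return df[split_value:]['Strategy_Returns'].cumsum() * 100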

plot_chart(cum_goog_return, cum_strategy_return,symbol='GOOG')
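
Finally, plot_chart can be as simple as the following matplotlib sketch (again, an assumption about the helper defined earlier in the chapter):

import matplotlib.pyplot as plt

def plot_chart(cum_symbol_return, cum_strategy_return, symbol):
    plt.figure(figsize=(10, 5))
    plt.plot(cum_symbol_return, label='%s Returns' % symbol)
    plt.plot(cum_strategy_return, label='Strategy Returns')
    plt.legend()
    plt.show()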

This code produces a plot comparing the cumulative return of GOOG with the cumulative return of the KNN strategy over the test period.

[Plot: cumulative GOOG returns versus cumulative strategy returns]