Machine learning using Elasticsearch and scikit-learn

Scikit-learn is a Python machine learning library built on top of NumPy, SciPy, and Matplotlib. It provides simple tools for data mining and data analysis. According to the description on its website (see https://scikit-learn.org/stable/), we can use it in six major areas:

  • Classification: A supervised learning approach that learns from labeled data to build a model for a classifier. We then use the model to predict the category of new, unseen data (see the short sketch after this list).
  • Regression: A statistical approach for predicting continuous values from a given set of data.
  • Clustering: Grouping data into different categories.
  • Dimensionality reduction: Reducing the dimension of the data. 
  • Model selection: Tuning the hyperparameters of the model.
  • Preprocessing: Feature extraction and normalization.
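
For instance, the classification workflow in the first bullet can be sketched in a few lines. The following toy example is ours, not part of the book's project, and uses the iris dataset bundled with scikit-learn:

# Toy classification sketch (our illustration, not from the book's project):
# train a k-nearest-neighbors classifier and predict categories for new data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)          # learn a model from the given data
print(model.predict(X_test[:5]))     # predict the category of new data
print(model.score(X_test, y_test))   # mean accuracy on the held-out data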

In the last section, Machine learning with the Elastic Stack, we practiced anomaly detection with the machine learning feature of Elasticsearch. Now, we will try to solve the same problem using the simple k-means clustering function provided by scikit-learn. The purpose of k-means clustering is to partition n data points into k clusters, where each data point belongs to the cluster with the nearest mean. We will use the data from the cf_rfem_hist_price index for the analysis. For the single_metric_job machine learning job in the Create new machine learning job section, we only looked at the volume field. In this section, we will look at three fields: volume, changeOverTime, and changePercent. A small Python project that solves this clustering problem using Elasticsearch and scikit-learn can be downloaded from our GitHub repository at https://github.com/PacktPublishing/Mastering-Elasticsearch-7.0/tree/master/Chapter16/cf_rfem_hist_price. After downloading it, take a look at the cf_rfem_hist_price directory. We are extending and rewriting the same Python project from Chapter 12, Elasticsearch from Python Programming. The etl.py file, shown in the following code block, uses the low-level client to perform the search request. The query and the date histogram aggregation are reused to search the same set of data. We then collect the values of the three fields into a two-dimensional array, one three-element row per document, which is the return value of the etl() function:

from com.example.client.config.low_level_client_by_connection import ESLowLevelClientByConnection
from elasticsearch_dsl import Search
from elasticsearch_dsl.query import Q, Bool, Range, Term
from elasticsearch_dsl.aggs import A, DateHistogram

def etl(index='cf_rfem_hist_price', start_date='2018-12-26', end_date='2019-03-25', symbol='rfem'):
    ESLowLevelClientByConnection.get_instance()
    search = Search(index=index, using='high_level_client')[0:100]
    # build the query from the function parameters instead of hardcoded values
    search.query = Q(Bool(must=[Range(date={'gte': start_date, 'lte': end_date}),
                                Term(symbol=symbol)]))
    # reuse the same date histogram aggregation as in the previous sections
    aggs = A(DateHistogram(field='date', interval='1d', format='yyyy-MM-dd', min_doc_count=1))
    search.aggs.bucket('date_hist', aggs)
    response = search.execute()
    hits = response['hits']['hits']
    XX = []
    for hit in hits:
        # one row per document: [changeOverTime, changePercent, volume]
        X = []
        X.append(hit['_source']['changeOverTime'])
        X.append(hit['_source']['changePercent'])
        X.append(hit['_source']['volume'])
        XX.append(X)
    return XX

if __name__ == "__main__":
    XX = etl()
    for X in XX:
        print(X)
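
Before clustering, you can quickly sanity-check the shape of the returned data. This snippet is our own addition, assuming the index is populated as in the previous sections:

# Quick sanity check (our addition): each row should hold the three fields
# [changeOverTime, changePercent, volume] for one trading day.
import numpy as np
from etl import etl

points = np.asarray(etl())
print(points.shape)   # expected: (number_of_days, 3)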

The kmeans.py file, shown in the following code block, uses the KMeans class for clustering. We instantiate a KMeans object with the n_clusters=2 parameter to see whether we can divide the data into a normal group and an anomaly group. We read the data from the etl() function and pass it to the fit() method of the KMeans object, which assigns a cluster label to each data point: 0 for the first group and 1 for the second group. Then, we use matplotlib.pyplot to plot the data points according to their assigned labels. Let's take a look at the kmeans.py file:

from etl import etl
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# importing Axes3D registers the '3d' projection used below
from mpl_toolkits.mplot3d import Axes3D

points = etl()
Kmean = KMeans(n_clusters=2)
Kmean.fit(points)
labels = Kmean.labels_
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
for item, label in zip(points, labels):
    if label == 0:
        # cluster 0: red dots
        ax.scatter(item[0], item[1], item[2], c='r', marker='o')
    else:
        # cluster 1: blue squares
        ax.scatter(item[0], item[1], item[2], c='b', marker='s')

ax.set_xlabel('changeOverTime')
ax.set_ylabel('changePercent')
ax.set_zlabel('volume')
plt.show()
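
One caveat: k-means clusters by Euclidean distance, and the values in the volume field are several orders of magnitude larger than those in changeOverTime and changePercent, so volume effectively dominates the clustering. If the separation looks off, standardizing the features first often helps. The following is a variation of ours on the preceding script, not part of the original project:

# Variation (our suggestion): scale each feature to zero mean and unit
# variance so that volume does not dominate the distance computation.
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from etl import etl

points = etl()
scaled = StandardScaler().fit_transform(points)
labels = KMeans(n_clusters=2).fit(scaled).labels_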

To run the program, follow the step-by-step instructions in this code block:

$ cd cf_rfem_hist_price/
$ source venv/bin/activate
(venv)$ export PYTHONPATH=.:$PYTHONPATH
(venv)$ python com/example/kmeans/kmeans.py
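
Note that plt.show() requires a display. If you run the script on a headless machine (for example, over SSH), one workaround, our suggestion rather than part of the book's project, is to switch Matplotlib to a non-interactive backend and save the figure to a file:

# Headless variant (our suggestion): select a non-interactive backend before
# importing pyplot, then write the figure to a file instead of displaying it.
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
# ... build the same 3D scatter plot as in kmeans.py ...
plt.savefig('kmeans_clusters.png')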

The program will generate a graph, as follows:

The dots represent the data points labeled 0, and the squares represent the data points labeled 1. As our sample data is ideal for showing the anomaly, we don't need any further processing. If you are interested, you may want to extend the analysis to other fields in the index. Moreover, you can practice by writing a plugin that uses this code. We will draw a conclusion in the following section.
