Getting ready

We will use the TensorFlow KMeansClustering Estimator class to implement k-means. It is defined in https://github.com/tensorflow/tensorflow/blob/r1.3/tensorflow/contrib/learn/python/learn/estimators/kmeans.py. It creates a model to run k-means and inference. According to the TensorFlow docs, a KMeansClustering object is created with the following __init__ method:

__init__(
    num_clusters,
    model_dir=None,
    initial_clusters=RANDOM_INIT,
    distance_metric=SQUARED_EUCLIDEAN_DISTANCE,
    random_seed=0,
    use_mini_batch=True,
    mini_batch_steps_per_iteration=1,
    kmeans_plus_plus_num_retries=2,
    relative_tolerance=None,
    config=None
)

The TensorFlow docs define these arguments as follows:

Args:
num_clusters: The number of clusters to train.
model_dir: The directory to save the model results and log files.
initial_clusters: Specifies how to initialize the clusters for training. See clustering_ops.kmeans for the possible values.
distance_metric: The distance metric used for clustering. See clustering_ops.kmeans for the possible values.
random_seed: Python integer. Seed for PRNG used to initialize centers.
use_mini_batch: If true, use the mini-batch k-means algorithm. Or else assume full batch.
mini_batch_steps_per_iteration: The number of steps after which the updated cluster centers are synced back to a master copy. See clustering_ops.py for more details.
kmeans_plus_plus_num_retries: For each point that is sampled during kmeans++ initialization, this parameter specifies the number of additional points to draw from the current distribution before selecting the best. If a negative value is specified, a heuristic is used to sample O(log(num_to_sample)) additional points.
relative_tolerance: A relative tolerance of change in the loss between iterations. Stops learning if the loss changes less than this amount. Note that this may not work correctly if use_mini_batch=True.
config: See Estimator.
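To make the distance_metric argument more concrete, here is a minimal pure-Python sketch of the two metrics involved. It is not the TensorFlow implementation: the function names are hypothetical, and defining cosine distance as 1 minus cosine similarity is the usual convention, assumed here rather than taken from the recipe.

```python
import math

def squared_euclidean_distance(point, center):
    """Sum of squared coordinate differences (no square root is taken)."""
    return sum((p - c) ** 2 for p, c in zip(point, center))

def cosine_distance(point, center):
    """1 - cosine similarity: depends on direction only, not on vector length."""
    dot = sum(p * c for p, c in zip(point, center))
    norm_p = math.sqrt(sum(p * p for p in point))
    norm_c = math.sqrt(sum(c * c for c in center))
    return 1.0 - dot / (norm_p * norm_c)

def nearest_center(point, centers, distance=squared_euclidean_distance):
    """Index of the center closest to `point` under the chosen metric."""
    return min(range(len(centers)), key=lambda i: distance(point, centers[i]))

# Hypothetical toy data: one point and two candidate centers.
point = [4.0, 0.5]
centers = [[1.0, 0.0], [3.0, 3.0]]

print(nearest_center(point, centers))                   # squared Euclidean
print(nearest_center(point, centers, cosine_distance))  # cosine
```

On this toy data the two metrics pick different centers: the point is closer to the second center in space, but it points in almost the same direction as the first. This is why the choice of metric matters for the clusters you get.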

TensorFlow supports squared Euclidean distance and cosine distance as the measure of distance between a point and a centroid. KMeansClustering provides various methods to interact with the KMeansClustering object. In the present recipe, we will use the fit(), clusters(), and predict_cluster_idx() methods:

fit(
    x=None,
    y=None,
    input_fn=None,
    steps=None,
    batch_size=None,
    monitors=None,
    max_steps=None
)

According to the TensorFlow docs, for the KMeansClustering Estimator, we need to provide input_fn() to fit(). The clusters() method returns the cluster centers, and the predict_cluster_idx() method returns the predicted cluster indices.
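As a rough picture of what fitting, reading the cluster centers, and predicting cluster indices amount to, the following is a minimal full-batch k-means sketch in plain Python. It is a stand-in for illustration only and does not use the TensorFlow Estimator: fit_kmeans and assign are hypothetical names, the data is passed in directly rather than through an input_fn, and the stopping rule is only one plausible reading of relative_tolerance.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for each point (the 'predicted cluster index')."""
    return [min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points]

def fit_kmeans(points, num_clusters, steps=100, relative_tolerance=None, random_seed=0):
    """Full-batch k-means with random initialization. Returns (centers, loss)."""
    rng = random.Random(random_seed)
    # Random initialization: pick num_clusters distinct data points as centers.
    centers = [list(p) for p in rng.sample(points, num_clusters)]
    prev_loss = None
    loss = None
    for _ in range(steps):
        labels = assign(points, centers)
        # Update step: move each center to the mean of the points assigned to it.
        for k in range(num_clusters):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
        # Loss: sum of squared distances from each point to its assigned center.
        loss = sum(squared_distance(p, centers[label])
                   for p, label in zip(points, labels))
        # Assumed stopping rule: stop once the loss barely changes between steps.
        if (relative_tolerance is not None and prev_loss is not None
                and abs(prev_loss - loss) <= relative_tolerance * (1 + abs(prev_loss))):
            break
        prev_loss = loss
    return centers, loss

# Hypothetical toy data: two well-separated groups of three points each.
points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
          [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]

centers, loss = fit_kmeans(points, num_clusters=2, relative_tolerance=1e-4)
print(centers)                  # learned cluster centers
print(assign(points, centers))  # cluster index for each point
```

The first three points should end up sharing one cluster index and the last three the other. Which group gets index 0 depends on the random initialization.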
