Clustering web pages and OPTICS

Ordering Points To Identify the Clustering Structure (OPTICS) extends the DBSCAN algorithm. It is based on the observation that, for a fixed MinPts, density-based clusters with respect to a higher density are completely contained in density-connected sets with respect to a lower density.

To construct density-based clusters with different densities simultaneously, the objects are processed in a specific order when a cluster is expanded. According to this order, the object that is density-reachable with respect to the lowest ε value is always selected next, so that the higher density clusters are finished first.

Two major concepts are introduced to describe the OPTICS algorithm: the core-distance of an object, p, and the reachability-distance of an object, p, with respect to another object. The following image illustrates the core-distance of an object, o, and the reachability-distances of two other objects from o, for MinPts = 4.

[Figure: core-distance of o and reachability-distances from o, for MinPts = 4]

The core-distance of an object, p, is defined as the smallest radius, ε', such that the ε'-neighborhood of p contains at least MinPts data objects. If p is not a core object with respect to ε (that is, its ε-neighborhood contains fewer than MinPts data objects), the core-distance is undefined.

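Given two data objects, p and q, the reachability-distance of p from q is the smallest radius that makes p directly density-reachable from q, provided that q is a core data object. It equals the larger of the core-distance of q and the distance between q and p. If q is not a core data object, the reachability-distance is undefined.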
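
The two definitions can be made concrete with a short sketch. It is written in Python for illustration and is not taken from the ch_06_optics.R file. The function names, the use of Euclidean distance, the convention that an object counts as a member of its own neighborhood, and the use of None for "undefined" are all assumptions of this sketch.

```python
import math


def core_distance(points, p, eps, min_pts):
    """Core-distance of the object at index p, or None if it is undefined.

    Assumption of this sketch: the neighborhood of p includes p itself,
    so the core-distance is the min_pts-th smallest distance from p
    (counting the distance 0 to itself).
    """
    if len(points) < min_pts:
        return None
    dists = sorted(math.dist(points[p], other) for other in points)
    candidate = dists[min_pts - 1]
    # p is a core object only if min_pts objects fit inside radius eps.
    return candidate if candidate <= eps else None


def reachability_distance(points, p, q, eps, min_pts):
    """Reachability-distance of object p from object q, or None if q is not core."""
    cd = core_distance(points, q, eps, min_pts)
    if cd is None:
        return None
    # The radius can never be smaller than the core-distance of q.
    return max(cd, math.dist(points[q], points[p]))


if __name__ == "__main__":
    pts = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
    print(core_distance(pts, 0, eps=2.0, min_pts=4))
    print(reachability_distance(pts, 1, 0, eps=2.0, min_pts=4))
    print(core_distance(pts, 4, eps=2.0, min_pts=4))
```

In this toy dataset, the isolated object (5, 5) has fewer than MinPts objects within ε, so both its core-distance and any reachability-distance measured from it are undefined.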

OPTICS outputs the order in which the data objects of a given dataset are processed. For each data object, the core-distance and a suitable reachability-distance are calculated and output together with this ordering.

The OPTICS algorithm

OPTICS randomly selects an object from the input dataset as the current data object, p. For this object, the ε-neighborhood is fetched, the core-distance is calculated, and the reachability-distance is set to undefined.

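If p is a core data object, OPTICS recalculates the reachability-distance from p for every object, q, in the neighborhood of p that has not been processed yet. It then inserts q into OrderSeeds, or lowers the value already stored for q if the new reachability-distance is smaller. The next object to be processed is the one in OrderSeeds with the smallest reachability-distance.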
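
The expansion step described above can be sketched as follows. This is an illustrative Python version, not the code from ch_06_optics.R. The name optics_order is hypothetical, objects are scanned in index order rather than picked at random so that the result is reproducible, a binary heap with lazily skipped stale entries stands in for OrderSeeds, and None stands for "undefined".

```python
import heapq
import math


def optics_order(points, eps, min_pts):
    """Return (order, reach, core) for a list of coordinate tuples."""
    n = len(points)
    reach = [None] * n      # reachability-distance per object
    core = [None] * n       # core-distance per object
    processed = [False] * n
    order = []              # the cluster ordering

    def neighbors(i):
        # Brute-force eps-neighborhood; the object itself is included.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    def core_dist(i, nbrs):
        if len(nbrs) < min_pts:
            return None
        return sorted(math.dist(points[i], points[j]) for j in nbrs)[min_pts - 1]

    for start in range(n):  # assumption: fixed scan order, not a random pick
        if processed[start]:
            continue
        seeds = [(0.0, start)]  # heap standing in for OrderSeeds
        while seeds:
            _, p = heapq.heappop(seeds)
            if processed[p]:
                continue  # stale heap entry, a smaller value was used already
            processed[p] = True
            order.append(p)
            nbrs = neighbors(p)
            core[p] = core_dist(p, nbrs)
            if core[p] is None:
                continue  # p is not a core object, nothing to expand
            for q in nbrs:
                if processed[q]:
                    continue
                new_reach = max(core[p], math.dist(points[p], points[q]))
                if reach[q] is None or new_reach < reach[q]:
                    reach[q] = new_reach
                    heapq.heappush(seeds, (new_reach, q))
    return order, reach, core


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
    order, reach, core = optics_order(pts, eps=2.0, min_pts=3)
    for i in order:
        print(i, reach[i], core[i])
```

The first object of each dense region keeps an undefined reachability-distance, because nothing was processed before it that could reach it. The brute-force neighborhood search keeps the sketch short; a spatial index would normally replace it on larger datasets.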

Once the augmented cluster ordering of the input dataset has been generated with respect to ε and MinPts, a density-based clustering can be extracted from this order for MinPts and any clustering-distance ε' that is not larger than ε.
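
A minimal sketch of such an extraction step is shown here in Python. It assumes the ordering, reachability-distances, and core-distances are already available as lists indexed by object, with None for "undefined". The function name, the label -1 for noise, and the hand-written input values are assumptions made for illustration only.

```python
NOISE = -1


def extract_clustering(order, reach, core, eps_prime):
    """Assign a cluster label to every object by one pass over the ordering."""
    labels = [NOISE] * len(order)
    cluster_id = -1
    for o in order:
        unreachable = reach[o] is None or reach[o] > eps_prime
        if unreachable:
            # o cannot be reached from the objects before it at eps_prime.
            if core[o] is not None and core[o] <= eps_prime:
                cluster_id += 1          # o starts a new cluster
                labels[o] = cluster_id
            else:
                labels[o] = NOISE
        else:
            labels[o] = cluster_id       # o continues the current cluster
    return labels


if __name__ == "__main__":
    # Hand-written ordering for nine objects: two dense groups and one outlier.
    order = [0, 1, 2, 3, 4, 5, 6, 7, 8]
    reach = [None, 1.0, 1.0, 1.0, None, 1.0, 1.0, 1.0, None]
    core = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, None]
    print(extract_clustering(order, reach, core, eps_prime=1.5))
    print(extract_clustering(order, reach, core, eps_prime=0.5))
```

With ε' = 1.5, the pass yields two clusters and one noise object. With ε' = 0.5, no object is a core object any more, so every object is labeled as noise. The same ordering therefore serves several density levels without being recomputed.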

The summarized pseudocode for the OPTICS algorithm, together with a couple of supporting functions, is as follows:

[Figure: pseudocode for the OPTICS algorithm and its supporting functions]

The R implementation

Please take a look at the R code file, ch_06_optics.R, from the bundle of R code for the previously mentioned algorithm. The code can be tested with the following command:

> source("ch_06_optics.R")

Clustering web pages

Clustering web pages can be used to group related texts or articles, and it can serve as a preprocessing step for supervised learning. It enables automated categorization.

Web pages vary widely in structure (well formed, or badly formed with no clear structure) and in content.

The Yahoo! industry web page data (CMU WebKB Project, 1998) is used as the test data here.
