Clustering for Data Reduction and Approximation

Computer Graphics & Geometry

Clustering for Data Reduction and Approximation

T. Schreiber
University of Kaiserslautern, Department of Computer Science

6750 Kaiserslautern, Germany

Tel.: +49 (631) 205 2887

FAX: +49 (631) 205 3270

Contents

Abstract
1. Introduction
2. Multidimensional Voronoi Diagrams
3. The Method of k-Means
4. The Proposed Method
5. Applications
References

Abstracts: This paper describes a solution to the following problem: Given a large set of arbitrarily distributed and weighted data points, find a much smaller set of cluster center points, which minimizes the sum of the squared distances between each data point and the nearest cluster center point. The weight of each data point may be chosen by the user depending on the importance or correctness of that point. An efficient clustering algorithm using local procedures on Voronoi diagrams is introduced. At the same time the method produces hierarchical Delaunay triangulations of the data points at different degrees of accuracy.

Key words: approximation; clustering algorithm; data reduction; Delaunay triangulation; hierarchical representation; Voronoi diagram.

1. Introduction

Clustering means grouping of similar objects by minimizing a certain criterion function or other object dependent properties. Clustering techniques are very common and useful in many applications, like data analysis, data reduction, digital image processing, pattern recognition and computer graphics. In the past many algorithms have been developed [1]. Heckbert [2] introduced the median-cut algorithm for color reduction in computer graphics which subdivided the RGB color space recursively by hyperplanes through the median and perpendicular to the coordinate axis with the greatest expansion. The mean-split algorithm from Wu and Witten [3] chooses the partitioning point to be the mean rather than the median. Wan, Wong and Prusinkiewicz [4] compute an optimal cut point for each axis, and subdivide the data set by a hyperplane perpendicular to that axis, on which the overall reduction of expected variance is the largest among all projected distributions. A comparison of these three algorithms is also given in [4]. The knot selection algorithm from Franke and McMahon [5],[6] uses the k-means method

which is reviewed in Section 3. They determine the cluster with the most and the least numbers of data points and move their cluster center points towards each other until all clusters contain approximately the same number of data points. Dierks [7] extended this method to three dimensions.

In this paper an algorithm is presented to solve the following problem: Given a set P of arbitrarily distributed data points P1,P2,...,PN with weights W1,W2,...,WN (Pi пїЅ Ed, 0 < Wi пїЅ пїЅ), in Euclidean space Ed, find a much smaller set Q of cluster center points q1,q2,...,qm (qj пїЅ Ed, m << N), which minimizes the sum of the squared distances between each data point and the nearest cluster center point.

Let n ( m << n пїЅ N ) denote the number of distinct input points p1,p2,...,pn. The new weights w1,w2,...,wn of p1,p2,...,pn are the sum of the weights Wi of equal points Pi (i =1,2,...,N).

Thus, the task is to determine the cluster center points qj and the index sets Ij that the error S is minimized with and the index sets

Ij = {i : pi is closer to qj , than to any other qk ,"i пїЅ k}

||.|| denotes the Euclidean norm. Because it is sufficient to minimize the sj , a necessary condition is that the partial derivations

The global minimum is achieved, if the cluster point qj lies in the weighted mean of all data points pi (i пїЅ Ij )

Formula (1) is true, because of the assumption, that all Wi > 0 (i =1,2,...,N).

The computation of the global minimum of S is a NP-complete problem [7]. There exist possibilities of grouping n data points to m cluster. Therefore the task is to compute the index sets for unknown cluster center point locations. For small data sets Koontz, Narendra and Fukunaga [8] compute the global minimum of S with a branch and bound algorithm. They reduce the number of grouping possibilities by lower and upper bound estimations of j.

Fig. 1. Voronoi diagram

All other methods reaches in general a local minimum of S in a more or less heuristic way. This study proposes a combination of an adaptive subdivision algorithm and a k-means iteration by a sequential insertion of cluster center points (see Section 3 and 4). The boundaries of sj form a multidimensional Voronoi diagram (see Section 2).

Thus, local procedures may be applied to increase the efficiency of the algorithm. Applications for using the cluster center points instead of the data points or the triangulation of the cluster center points for a surface approximation are given in Section 5.

2. Multidimensional Voronoi Diagrams

A multidimensional Voronoi diagram is a partitioning of the space Ed in regions Rj with the following properties: Each point qj lies in exactly one region Rj, which consists of all points x of Ed that are closer to qj than to any other point qk (j пїЅ k). The points j are called Voronoi points in this context

With this definition the index sets Ij can be defined as

Ij = {i : pi lies in region Rj}

Bowyer [10], Watson [11] and Palacios-Velez, Renaud [12] describe algorithms to compute multidimensional Voronoi diagrams by the sequential insertion of new points. The insertion and also the deletion or movement of a point are local procedures. That means that the change of one point affects only a small region of the diagram and can be computed independent of the number of Voronoi points, if the region is known and all points are scattered over the entire space. The dual of the Voronoi diagram is the Delaunay triangulation, which results from the Voronoi diagram by joining all Voronoi points, which have a common border, with straight lines. See Fig. 1 for the Voronoi diagram of a set of points and Fig. 2 for the dual Delaunay triangulation. For more details on Voronoi diagrams and triangulations see [13].

Fig. 2. Delaunau triangulation

3. The Method of k-Means

The method of k-means is an iterative procedure of moving an initial configuration of cluster center points into the centroids of their related data points. This ensures the global minimum of each sj for given cluster center point locations. If a cluster center point has been moved, some data points near the boundary may change the cluster. This causes a new iteration step because the pertaining cluster center points do not lie in the centroids any more, and therefore they have to be relocated. Selim and Ismail [14] show that this process converges after a finite number of iterations towards a local minimum of S until a quadratic metric is used.

Algorithm 1: k-Means

Step 1: Initialize a good starting cluster center point configuration q1,q2,...,qm with at least one data point for each cluster and compute the index sets Ij for j=1,2,...,m.

Step 2: For each cluster (j = 1,2,...,m) do: Move the cluster center point qj in the weighted mean of the data points pi (i пїЅIj) and update the index sets.

Step 3: Repeat Step 2 until no cluster has been changed.

This method produces good results, but the error depends heavily on the initial configuration. That means that other starting configurations cause other end configurations with different errors. Also the number of cluster center points is fixed. Thus, it is not possible to use as many cluster as necessary to yield a given error tolerance. A later change of cluster and/or data points could not efficiently be done, because the updates of the index sets require a great computational effort.

For each iteration, the distances between all data points and all cluster center points have to be calculated, if no data structure enables a limitation of the affected area. The Voronoi diagram structure reduces these calculations to the regions of all neighboring cluster.

4. The Proposed Method

In this section a Voronoi diagram based multidimensional clustering algorithm is described to find a minimum for S. The idea is based on an adaptive sequential insertion of a new cluster center point in the region with the largest error sj.

Algorithm 2: Adaptive Clustering

Step 1: Initialize the first cluster center point with the weighted mean of all data points. The corresponding region of the Voronoi diagram is the entire space.

Step 2: Divide the set of data points of the region Re with the largest error into two sets and determine the new index sets and their centroids.

First, compute the coordinate axis k with the largest projected variance where xl denotes the l-th component of vector x and qe the cluster center point of region Re. Then divide all points pi (iпїЅIe) by a hyperplane through qe perpendicular to the k-the coordinate axis. The new index sets Ie1, Ie2 and their cluster centers m1, m2 are determined as follows

Step 3: Update the Voronoi diagram:

Move the cluster center point qe into the center m1, insert a new cluster center point at m2 and update the index sets of the affected regions.

Step 4: For all modified regions:

Move the cluster center point into the weighted centroid of the related data points. Update the Voronoi diagram and the index sets.

Step 5: Repeat Steps 2-4 until the clustering requirement is met.

This requirement could be met, if a given number of cluster center points are inserted, and/or the mean or maximum error is lower than a given value and/or each cluster contains no more than a given number of data points.

See Fig. 3 and 4 for a 2 dimensional example of grouping 192 non weighted data points into 12 and 33 cluster. The dots 'пїЅ' denote the data point and the 'o' the cluster center point locations.

Fig. 3. 192 data points in 12 clustersFig. 4. 192 data points in 33 clusters

The author proposes the subdivision method described above, because it yields good results and does not require too much computational effort. But the user may employ any other method, especially, if he knows more about the distribution of the data. If rotational invariance is desired, a subdivision perpendicular to the best fit hyperplane could be adapted.

Notice, that a different criterion function can be used to determine the region that should be subdivided (step 2). Thus, it is possible to optimize a second function, like the demand of a nearly equal number of data points for each cluster. After each iteration the cluster center points and their Delaunay triangulation (or just their

changes) may be stored for future use. This leads to a triangular based hierarchical surface representation of the data points at different levels of accuracy, which may be used in computer graphic applications for the rendering of the same object at different resolutions.

5. Applications

The first example relates to a weighted data reduction. Let the data points be colors in RGB space and all weights Wi =1 (i =1,2,...,N). Thus, each wj is simply the number of pixels with the same color pj (j =1,2,...,n).

Fig. 8 shows a true color picture in 174370 different colors with a resolution of 546пїЅ430 pixels. The Fig. 5 to 7 show the reductions to 16, 64 and 256 colors respectively. Each original color is represented by the related cluster center point.

However, to yield better results, the weights may be chosen depending the physical properties of the human eye and/or an additional dithering may applied. But this was not our intention here.

Fig. 5. 16 colorsFig. 6. 64 colors

Fig. 7. 256 colorsFig. 8. True color picture

The next example proves the efficiency of using the Delaunay triangulation of the cluster center points for the approximation of the surface where the data points come from. Notice, that no additional computational effort is necessary. Given 1000 non weighted data points generated randomly from the test function

F(x,y) =

0.75exp [-0.25(9x-2)2 - 0.25(9y - 2)2]

+0.75exp [-(9x+1)2/49 - (9y + 1)/10]

+0.5exp [-0.25(9x-7)2 - 0.25(9y - 3)2]

0.2exp [-(9x-4)2 - (9y - 7)2]

over the domain [0,1]пїЅ[0,1] plus 4 points at the corners of the domain. These 4 additional points are added to each representation to yield a representation over the whole domain.

See Fig. 14 for the test function and Fig. 13 for the approximation by the triangulation of the data points.

Fig. 9. By triangulation of 25 pointsFig. 10. By triangulation of 50 points

Fig. 11. By triangulation of 100 points Fig. 12. By triangulation of 250 points

Fig. 13. By triangulation of data points Fig. 14. Test function

The Fig. 9 to 12 show the approximations which are generated by the triangulation of 25, 50, 100 and 250 cluster center points respectively. Each function value is computed by a linear interpolation over the vertices of the pertaining triangle. The behavior of the function is already visible using a small amount of cluster center points. The lack of accuracy at the boundaries results from long and small triangles which are forced by the corner points. However, higher order triangular interpolants may reach lower approximation errors. The additional degrees of freedom may be used to improve the approximation and/or the continuity of the resulting surface.

References

[1] J.A.Hartigan, 'Clustering Algorithms', Wiley, New York, 1975.

[2] P.Heckbert, 'Color image quantization for frame buffer display', ACM Trans. Computer Graphics, 16, (3), 297-304, (1982).

[3] X.Wu and I.H.Witten, 'A fast k-means type clustering algorithm', Dept. Computer Science, Univ. of Calgary, Canada, May 1985.

[4] S.J.Wan, S.K.M.Wong and P.Prusinkiewicz, 'An Algorithm for Multidimensional Data Clustering', ACM Trans. on Math. Soft., 14, (2), 153-162, (1988).

[5] J.R.McMahon, 'Knot Selection for Least Squares Approximation using Thin Plate Splines', Mastes Thesis, Naval Postgraduate School, Monterey, USA, June 1986.

[6] J.R.McMahon and R.Franke, 'An Enhanced Knot Selection Algorithm for Least Squares Approximation using Thin Plate Splines', Trans. of the Seventh Army, Conf. on Applied Math. And Computing, ARO Report 90-1.

[7] T.Dierks, 'The Modelling and Visualization of Scattered Volumetric Data', Masters Thesis, Arizona State University, USA, Dec. 1990.

[8] I.Hyafil and R.L.Rivest, 'Construction optimal binary decision trees is NP-complete', Inf. Process. Lett., (5), 15-17, (1976).

[9] W.L.G.Koontz, P.M.Narendra and K.Fukunaga, 'A Branch and Bound Clustering Algorithm', IEEE Trans. on Comp., c-24, (9), (1975).

[10] A.Bowyer, 'Computing Dirichlet tesselations', Comp. Journal, 24, (2), 162-166, (1981).

[11] D.F.Watson, 'Computing the n-dimensional Delaunay tesselation with application to Voronoi polytops', Comp. Journal, 24, (2), 162-166, (1981).

[12] O.Palacios-Velez and B.C.Renaud, 'A Dynamic Hierarchical Subdivision Algorithm for Computing Delaunay Triangulations and Other Closest-Point Problems', ACM Trans. on Math. Soft., 16, (3), 275-292, (1990).

[13] F.P.Preparata and M.I.Shamos, Computational Geometry, Springer, 1985.

[14] S.Z.Selim and M.A.Ismail, 'K-means-type algorithms: A generalized convergence theorem and characterization of local optimality', IEEE Trans. Pattern Anal. Mach. Intell., PAMI-6, (1), 81-87, (1986).