An extensive comparative study of cluster validity indices
Olatz Arbelaitz, Ibai Gurrutxaga, Javier Muguerza, Jesús M. Pérez, Iñigo Perona
Department of Computer Architecture and Technology, University of the Basque Country UPV/EHU, Manuel Lardizabal 1, 20018 Donostia, Spain
Received 3 February 2012; received in revised form 26 July 2012; accepted 31 July 2012; available online 9 August 2012.
Keywords: Cluster validity index
The validation of the results obtained by clustering algorithms is a fundamental part of the clustering
process. The most widely used approaches to cluster validation are based on internal cluster validity indices.
Although many indices have been proposed, there is no recent extensive comparative study of their
performance. In this paper we show the results of an experimental work that compares 30 cluster
validity indices in many different environments with different characteristics. These results can serve
as a guideline for selecting the most suitable index for each possible application and provide a deep
insight into the performance differences between the currently available indices.
© 2012 Elsevier Ltd. All rights reserved.
Clustering is an unsupervised pattern classification method that partitions the input space into clusters. The goal of a clustering algorithm is to produce a partition where objects within a cluster are similar and objects in different clusters are dissimilar. Therefore, the purpose of clustering is to identify natural structures in a dataset [1–4], and it is widely used in many fields such as psychology [5], biology [4], pattern recognition [3], image processing [6] and computer security [7].
Once a clustering algorithm has processed a dataset and obtained a partition of the input data, a relevant question arises: how well does the proposed partition fit the input data? This question is relevant for two main reasons. First, an optimal clustering algorithm does not exist. In other words, different algorithms, or even different configurations of the same algorithm, produce different partitions, and none of them has proved to be the best in all situations [8]. Thus, in an effective clustering process we should compute different partitions and select the one that best fits the data. Secondly, many clustering algorithms are not able to determine the number of natural clusters in the data, and therefore they must initially be supplied with this information, frequently known as the k parameter. Since this information is rarely known in advance, the usual approach is to run the algorithm several times with a different k value for each run. Then, all the partitions are evaluated and the partition that best fits the data is selected. The process of estimating how well a partition fits the structure underlying the data is known as cluster validation [1].
Cluster validation is a difficult task and lacks the theoretical background that other areas, such as supervised learning, have. Moreover, a recent work argues for the suitability of context-dependent evaluation methods [9]. Nevertheless, the authors also state that the analysis of cluster validation techniques is a valid research question in some contexts, such as the optimization of clustering algorithms. Moreover, in our opinion, cluster validation tools analyzed in context-independent evaluations will greatly contribute to context-dependent evaluation strategies. Therefore, our work is based on a general, context-independent cluster evaluation process.
In this context, it is usual to classify cluster validation techniques into three groups (internal, external and relative validation), but the classification criteria are not always clear [10,1,2,11]. In any case, there is a clear distinction between validation techniques if we focus on the information available in the validation process. Some techniques, related to external validation, validate a partition by comparing it with the correct partition. Other techniques, related to internal validation, validate a partition by examining just the partitioned data. Obviously, the former only make sense in a controlled test environment, since in a real application the underlying structure of the data is unknown and, therefore, the correct partition is not available.
When the correct partition is available, the usual approach is to compare it with the partition proposed by the clustering algorithm using one of the many indices that compare data partitions, e.g. Rand, Adjusted Rand, Jaccard, Fowlkes–Mallows or Variation of Information [12].
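The pair-counting idea behind several of these measures can be sketched in a few lines of pure Python. This is an illustrative implementation of the Rand and Jaccard indices only; the function names are ours, not from the paper.

```python
from itertools import combinations

def pair_counts(labels_a, labels_b):
    """Count object pairs by agreement: together in both partitions (ss),
    together only in A (sd), only in B (ds), in neither (dd)."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            ss += 1
        elif same_a:
            sd += 1
        elif same_b:
            ds += 1
        else:
            dd += 1
    return ss, sd, ds, dd

def rand_index(labels_a, labels_b):
    # fraction of pairs on which the two partitions agree
    ss, sd, ds, dd = pair_counts(labels_a, labels_b)
    return (ss + dd) / (ss + sd + ds + dd)

def jaccard_index(labels_a, labels_b):
    # agreement restricted to pairs grouped together in at least one partition
    ss, sd, ds, dd = pair_counts(labels_a, labels_b)
    return ss / (ss + sd + ds)
```

For example, comparing the labelings `[0, 0, 0, 1, 1, 1]` and `[0, 0, 1, 1, 1, 1]` yields a Rand index of 2/3 and a Jaccard index of 4/9.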
On the other hand, when the correct partition is not available
there are several approaches to validating a partition. One of
them is to focus on the partitioned data and to measure the
Corresponding author. Tel.: +34 943015166; fax: +34 943015590.
E-mail addresses: [email protected] (O. Arbelaitz), [email protected] (I. Gurrutxaga), [email protected] (J. Muguerza), [email protected] (J.M. Pérez), [email protected] (I. Perona).
Pattern Recognition 46 (2013) 243–256
compactness and separation of the clusters. In this case another type of index is used, e.g. Dunn [13], Davies–Bouldin [14] or Calinski–Harabasz [15]. Another, more recent, approach is stability-based validation [16,17], which is not model dependent and does not require any assumption of compactness. This approach does not directly validate a partition; instead, it relies on the stability of the clustering algorithm over different samples of the input dataset.
The differences between the mentioned validation approaches make it difficult to compare all of them in the same framework. This work focuses on the first approach mentioned, which directly estimates the quality of a partition by measuring the compactness and separation of the clusters. Although there is no standard terminology, in the remainder of this paper we will call this kind of index a Cluster Validity Index (CVI). For the indices that compare two partitions we will use the term Partition Similarity Measure.
Previous works have shown that there is no single CVI that outperforms the rest [18–20]. This is not surprising, since the same occurs in many other areas, and it is why we usually deal with multiple clustering algorithms, partition similarity measures, classification algorithms, validation techniques, etc. It is therefore obvious that researchers and practitioners need some guidelines on which particular tool they should use in each environment.
Focusing on internal cluster validation, we can find some works that compare a set of CVIs and could, therefore, be used as guidelines for selecting the most suitable CVI in each environment. However, most of these comparisons accompany the proposal of a new CVI [6,21–24] or variants of known CVIs [25,8,26] and, unfortunately, the experiments are usually performed in restricted environments: few CVIs compared on few datasets, with just one clustering algorithm involved. There are few works that do not propose a new CVI but compare existing ones in order to draw some general conclusions [10,18,27,20]. Surprisingly, the 25-year-old paper of Milligan and Cooper [20] is the work most cited as a CVI comparison reference. Certainly, to the best of our knowledge, nobody has since published such an extensive and systematic comparative study.
In this paper we present the results of an extensive CVI comparison along the same lines as Milligan and Cooper [20], which is the last work that compared such a large set of CVIs (30) based on the results obtained in hundreds of environments. We claim that we have improved on the referenced work in three main areas. First, we can compare many new indices that did not exist in 1985 and discard those that have rarely been used since. Second, we can take advantage of the increases in computational power achieved in recent decades to carry out a wider experiment. Finally, thanks to the advances in communication technologies, we can easily make all the detailed results available in electronic format, so that every reader can access them and focus on the results that are relevant to their particular environment.
Moreover, our work is based on a corrected methodology that avoids an incorrect assumption made by the usual CVI comparison methodology [28]. Therefore, we present two main contributions in this paper. First, we present the main results of the most extensive CVI comparison ever carried out. Second, this comparison is the first extensive one carried out with the methodological correction proposed by Gurrutxaga et al. [28]. Moreover, although the experiment's size prevents us from publishing all the results in this paper, they are all available in electronic format on the web.1
The next section discusses other works related to CVI comparison. Section 3 describes all the cluster validity indices compared in this work and Section 4 describes the particular details of the experimental design. In Section 5 we show the main results of the work and, finally, we draw some conclusions and suggest some possible extensions in Section 6.
2. Related work
Most of the works that compare CVIs use the same approach: a set of CVIs is used to estimate the number of clusters in a set of datasets partitioned by several algorithms. The number of successes of each CVI in the experiment can be called its score and is considered an estimator of its "quality". For a more formal description of this methodology, and a possible alternative to it, see [28].
Despite this widely used approach, most of the works are not comparable, since they differ in the CVIs compared, the datasets used and the analysis of the results. In this section we overview some of the works that compare a set of CVIs, focusing on the experimental designs.
The paper published by Milligan and Cooper [20] in 1985 is still the work of reference on internal cluster validation. That work compared 30 CVIs. The authors called them "stopping criteria" because they were used to stop the agglomerative process of a hierarchical clustering algorithm [2,4], and this is why the experiments were done with hierarchical clustering algorithms (single-linkage, complete-linkage, average-linkage and Ward). They used 108 synthetic datasets with a varying number of non-overlapped clusters (2, 3, 4 or 5), dimensionality (4, 6 or 8) and cluster sizes. They presented the results in a tabular format, showing the number of times that each CVI predicted the correct number of clusters. Moreover, the tables also included the number of times that the prediction of each CVI overestimated or underestimated the real number of clusters by 1 or 2.
The same tabular format was used by Dubes [27] two years later. The novelty of this work is that the author used some tables where the score of each CVI was shown according to the values of each experimental factor (clustering algorithm, dataset dimensionality, number of clusters). Moreover, he used a statistical test to assess the effect of each factor on the behaviour of the compared CVIs. Certainly, the use of statistical tests to validate experimental results is not common practice in clustering, as opposed to other areas such as supervised learning. The main drawback of this work is that it compares just 2 CVIs (Davies–Bouldin and the modified Hubert statistic). The experiment was performed in 2 parallel batches of 32 and 64 synthetic datasets, 3 clustering algorithms (single-linkage, complete-linkage and CLUSTER) and 100 runs. The datasets' characteristics were controlled in the generation process, using different sizes (50 or 100 objects), dimensionality (2, 3, 4 or 5), number of clusters (2, 4, 6 or 8), sampling window (cubic or spherical) and cluster overlap.
In 1997, Bezdek et al. [29] published a paper comparing 23 CVIs based on 3 runs of the EM algorithm and 12 synthetic datasets. The datasets were formed by 3 or 6 Gaussian clusters and the results were presented in tables that showed the successes of every CVI on each dataset. Another work, which compared 15 CVIs, was performed by Dimitriadou et al. [18] based on 100 runs of the k-means and hard competitive learning algorithms. The 162 datasets used in this work were composed of binary attributes, which made the experiment and the presentation of the results somewhat different from the previously mentioned ones.
More recently, Brun et al. [10] compared 8 CVIs using several clustering algorithms: k-means, fuzzy c-means, SOM, single-linkage, complete-linkage and EM. They used 600 synthetic datasets based on 6 models with varying dimensionality (2 or 10), cluster shape (spherical or Gaussian) and number of clusters (2 or 4). The novelty in this work can be found in the
1 http://www.sc.ehu.es/aldapa/cvi.
comparison methodology. The authors compared the partitions obtained by the clustering algorithms with the correct partitions and computed an error value for each partition. Then, the "quality" of each CVI was measured as its correlation with the measured error values. In this work, not just internal but also external and relative indices were examined. The results show that the Rand index is highly correlated with the error measure.
The mentioned correlation between the error measure and the Rand index makes one think about the adequacy of the error as a definitive measure. In the recent work of Gurrutxaga et al. [28], the authors accepted that there is no single way of establishing the quality of a partition and proposed using one of the available external indices, or, even better, several of them. This is the first work that clearly confronted a methodological drawback ignored by many authors, but noticed by others [10,22,23,20]. Since the main goal of that work was to present a modification of the traditional methodology, they compared just 7 CVIs based on 7 synthetic and 3 real datasets and 10 runs of the k-means algorithm.
Other CVI comparisons can be found where new CVIs are proposed, but in this case the experiment is usually limited. It is common to find works comparing 5 or 10 CVIs on a similar number of datasets [6,21,22,25,8,26,24].
3. Cluster validity indices
In this section we describe the 30 CVIs compared in this work. First, to simplify and reduce the CVI description section, we define the general notation used in this paper and the particular notations used to describe several indices.
3.1. Notation

Let us define a dataset $X$ as a set of $N$ objects represented as vectors in an $F$-dimensional space: $X = \{x_1, x_2, \ldots, x_N\} \subseteq \mathbb{R}^F$. A partition or clustering in $X$ is a set of disjoint clusters $C = \{c_1, \ldots, c_K\}$ such that $\bigcup_{c_k \in C} c_k = X$ and $c_k \cap c_l = \emptyset\ \forall k \neq l$. The centroid of a cluster $c_k$ is its mean vector, $\bar{c}_k = (1/|c_k|) \sum_{x_i \in c_k} x_i$ and, similarly, the centroid of the dataset is the mean vector of the whole dataset, $\bar{X} = (1/N) \sum_{x_i \in X} x_i$.

We will denote the Euclidean distance between objects $x_i$ and $x_j$ as $d_e(x_i, x_j)$. We define the Point Symmetry-Distance [30] between the object $x_i$ and the cluster $c_k$ as

$$d_{ps}(x_i, c_k) = \frac{1}{2}\, P_{\min}^{2}\bigl(d_e(2\bar{c}_k - x_i, x_j) : x_j \in c_k\bigr).$$

The point $2\bar{c}_k - x_i$ is called the symmetric point of $x_i$ with respect to the centroid of $c_k$. The function $P_{\min}$ can be seen as a variation of the min function, where $P_{\min}^{n}$ computes the sum of the $n$ lowest values of its argument. Similarly, we can define the $P_{\max}$ function as an analogue variation of the max function.

Finally, let us define $n_w$, since it is used by several indices: $n_w$ is the number of object pairs in a partition that are in the same cluster.
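The Point Symmetry-Distance described above can be sketched as follows: reflect $x_i$ through the centroid of $c_k$ and halve the summed distances from that symmetric point to its 2 nearest cluster members. The implementation and names are ours, not the paper's.

```python
import numpy as np

def point_symmetry_distance(x, cluster, n_nearest=2):
    """Point Symmetry-Distance of point x with respect to cluster (an
    array of shape (n_points, dim))."""
    centroid = cluster.mean(axis=0)
    symmetric = 2 * centroid - x                    # symmetric point of x
    dists = np.linalg.norm(cluster - symmetric, axis=1)
    return np.sort(dists)[:n_nearest].sum() / 2     # P_min^2(...) / 2
```

For a perfectly symmetric configuration the distance drops towards the average spacing of the nearest members; e.g. for the cluster `[[0, 0], [2, 2]]` and `x = [0, 0]`, the symmetric point is `[2, 2]` and the value is `sqrt(8) / 2`.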
3.2. Index definitions
Next, we describe the 30 CVIs compared in this work. We focused on CVIs that can be easily evaluated by the usual methodologies and avoided those that could lead to confusion due to the need for a subjective decision by the experimenter. Therefore, we discarded indices that need to determine a "knee" in a plot, such as the Modified Hubert index [31], need to tune a parameter, or need some kind of normalization, such as the v_SV index [32] or the Jump index [33]. We have also avoided fuzzy indices, since our goal was to focus on crisp clustering. In brief, we focused on crisp CVIs that allow selection of the best partition based on their lowest or highest value.
Most of the indices estimate the cluster cohesion (within- or intra-variance) and the cluster separation (between- or inter-variance) and combine them to compute a quality measure. The combination is performed by a division (ratio-type indices) or a sum (summation-type indices) [25].
For each index we define an abbreviation that will be helpful in the results section. Moreover, we accompany each abbreviation with an up or down arrow. The down arrow denotes that a lower value of that index means a "better" partition. The up arrow means exactly the opposite.
Dunn index (D↑) [13]: This index has many variants, some of which will be described next. It is a ratio-type index where the separation is estimated by the nearest neighbour distance between clusters and the cohesion by the maximum cluster diameter. The original index is defined as

$$D = \frac{\min_{c_k, c_l \in C,\, k \neq l}\, \bigl( \min_{x_i \in c_k,\, x_j \in c_l} d_e(x_i, x_j) \bigr)}{\max_{c_k \in C}\, \bigl( \max_{x_i, x_j \in c_k} d_e(x_i, x_j) \bigr)}.$$
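A direct (brute-force) sketch of the original Dunn index follows; the implementation is ours and is meant only to make the ratio concrete.

```python
import numpy as np

def dunn(X, labels):
    """Dunn index: minimum between-cluster nearest-neighbour distance
    divided by the maximum cluster diameter. Higher is better."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    # separation: smallest distance between points of different clusters
    sep = min(
        np.linalg.norm(a - b)
        for i, ca in enumerate(clusters)
        for cb in clusters[i + 1:]
        for a in ca for b in cb
    )
    # cohesion: largest within-cluster pairwise distance (diameter)
    diam = max(
        np.linalg.norm(a - b) for c in clusters for a in c for b in c
    )
    return sep / diam
```

For two tight clusters 10 units apart with diameter 1, the index evaluates to 10.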
Calinski–Harabasz (CH↑) [15]: This index obtained the best results in the work of Milligan and Cooper [20]. It is a ratio-type index where the cohesion is estimated based on the distances from the points in a cluster to its centroid. The separation is based on the distances from the centroids to the global centroid, as defined in Section 3.1. It can be defined as

$$CH = \frac{N - K}{K - 1} \cdot \frac{\sum_{c_k \in C} |c_k|\, d_e(\bar{c}_k, \bar{X})^2}{\sum_{c_k \in C} \sum_{x_i \in c_k} d_e(x_i, \bar{c}_k)^2}.$$
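The variance-ratio structure of this index can be sketched as follows (our own implementation, shown only to make the between/within decomposition explicit):

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Calinski-Harabasz index: between-group dispersion over within-group
    dispersion, each scaled by its degrees of freedom. Higher is better."""
    N = len(X)
    ks = np.unique(labels)
    K = len(ks)
    overall = X.mean(axis=0)
    # separation: squared centroid-to-global-centroid distances, size-weighted
    between = sum(
        (labels == k).sum()
        * np.linalg.norm(X[labels == k].mean(axis=0) - overall) ** 2
        for k in ks
    )
    # cohesion: squared point-to-centroid distances
    within = sum(
        np.linalg.norm(x - X[labels == k].mean(axis=0)) ** 2
        for k in ks for x in X[labels == k]
    )
    return (between / (K - 1)) / (within / (N - K))
```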
Gamma index (G↓) [34]: The Gamma index is an adaptation of Goodman and Kruskal's Gamma index and can be defined as

$$G = \frac{\sum_{c_k \in C} \sum_{x_i, x_j \in c_k,\, i < j} dl(x_i, x_j)}{n_w \bigl( \binom{N}{2} - n_w \bigr)},$$

where $dl(x_i, x_j)$ denotes the number of all object pairs in $X$, $x_k$ and $x_l$, that fulfil two conditions: (a) $x_k$ and $x_l$ are in different clusters, and (b) $d_e(x_k, x_l) < d_e(x_i, x_j)$. In this case the denominator is just a normalization factor.
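The counting scheme described above can be sketched directly: for every within-cluster pair, count the between-cluster pairs that are strictly closer, then normalize. This is our own reading of the definition; names are ours.

```python
import numpy as np
from itertools import combinations

def gamma_index(X, labels):
    """Gamma-style discordance count. Lower is better: 0 means no
    between-cluster pair is closer than any within-cluster pair."""
    pairs = list(combinations(range(len(X)), 2))
    within, between = [], []
    for i, j in pairs:
        d = np.linalg.norm(X[i] - X[j])
        (within if labels[i] == labels[j] else between).append(d)
    between = np.sort(between)
    n_w, n_t = len(within), len(pairs)
    # for each within-cluster distance, count strictly smaller between-cluster distances
    dl_sum = sum(int(np.searchsorted(between, d)) for d in within)
    return dl_sum / (n_w * (n_t - n_w))
```

On well-separated clusters every between-cluster distance exceeds every within-cluster distance, so the index is 0.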
C-Index (CI↓) [35]: This index is a type of normalized cohesion estimator. Using the $P_{\min}$ and $P_{\max}$ functions defined in Section 3.1, it can be written as

$$CI = \frac{S - P_{\min}^{n_w}\bigl(d_e(x_i, x_j) : i < j\bigr)}{P_{\max}^{n_w}\bigl(d_e(x_i, x_j) : i < j\bigr) - P_{\min}^{n_w}\bigl(d_e(x_i, x_j) : i < j\bigr)},$$

where $S$ is the sum of the distances between all object pairs that share a cluster.
Davies–Bouldin index (DB↓) [14]: This is probably one of the most used indices in CVI comparison studies. It estimates the cohesion based on the distance from the points in a cluster to its centroid and the separation based on the distance between centroids. It is defined as

$$DB = \frac{1}{K} \sum_{c_k \in C}\, \max_{c_l \in C,\, l \neq k} \frac{S(c_k) + S(c_l)}{d_e(\bar{c}_k, \bar{c}_l)}, \qquad S(c_k) = \frac{1}{|c_k|} \sum_{x_i \in c_k} d_e(x_i, \bar{c}_k).$$
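The worst-pair averaging that defines this index can be sketched as follows (our own implementation):

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index: for each cluster, take the worst ratio of
    summed mean scatters to centroid distance; average over clusters.
    Lower is better."""
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    # mean point-to-centroid distance of each cluster
    scatter = np.array([
        np.linalg.norm(X[labels == k] - cents[i], axis=1).mean()
        for i, k in enumerate(ks)
    ])
    K = len(ks)
    ratios = [
        max((scatter[i] + scatter[j]) / np.linalg.norm(cents[i] - cents[j])
            for j in range(K) if j != i)
        for i in range(K)
    ]
    return sum(ratios) / K
```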
Silhouette index (Sil↑) [36]: This index is a normalized summation-type index. The cohesion is measured based on the distances between all the points in the same cluster and the separation is based on the nearest neighbour distance. It is defined as

$$Sil = \frac{1}{N} \sum_{c_k \in C} \sum_{x_i \in c_k} \frac{b(x_i, c_k) - a(x_i, c_k)}{\max\{a(x_i, c_k),\, b(x_i, c_k)\}},$$

where $a(x_i, c_k)$ is the mean distance from $x_i$ to the other points of its cluster and $b(x_i, c_k)$ is the lowest mean distance from $x_i$ to the points of another cluster.
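A per-point sketch of this index follows; it is our own illustrative implementation (singleton clusters are not handled).

```python
import numpy as np

def silhouette(X, labels):
    """Silhouette index: average of (b - a) / max(a, b) over all points,
    where a is the mean within-cluster distance and b the lowest mean
    distance to another cluster. Higher is better."""
    ks = np.unique(labels)
    scores = []
    for i, x in enumerate(X):
        d = np.linalg.norm(X - x, axis=1)
        own = labels == labels[i]
        a = d[own].sum() / (own.sum() - 1)   # exclude the point itself
        b = min(d[labels == k].mean() for k in ks if k != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```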
4. Experimental setup
In this section we describe the experiment performed to compare the CVIs listed in the previous section. As shown in Section 2, there are many possible experimental designs for such a comparison. Since we wanted to compare the CVIs in a wide variety of configurations, we designed an experiment with several factors. Unfortunately, due to combinatorial explosion, we had to limit each factor to just a few levels, which finally led us to an experiment with 6480 configurations.
The comparative methodology that we used is a variation of the traditional problem of estimating the number of clusters of a dataset. The usual approach is to run a clustering algorithm over a dataset with a set of different values for the k parameter (the number of clusters of the computed partition), obtaining a set of different partitions. Then, the evaluated CVI is computed for all the partitions. The number of clusters in the partition obtaining the best value is considered the prediction of the CVI for that particular dataset. If this prediction matches the true number of clusters, the prediction is considered successful.
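The traditional loop just described can be sketched as follows. The tiny k-means with deterministic farthest-point initialisation and the variance-ratio (Calinski-Harabasz-style) scorer are illustrative stand-ins; all names are ours, not the paper's.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Minimal Lloyd k-means with deterministic farthest-point init."""
    cents = [X[0]]
    for _ in range(k - 1):                  # pick points far from chosen centres
        d = np.min([np.linalg.norm(X - c, axis=1) for c in cents], axis=0)
        cents.append(X[np.argmax(d)])
    cents = np.array(cents)
    for _ in range(iters):                  # Lloyd iterations
        labels = np.argmin(
            np.linalg.norm(X[:, None] - cents[None], axis=2), axis=1)
        cents = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else cents[j] for j in range(k)])
    return labels

def variance_ratio(X, labels):
    """Calinski-Harabasz-style score; higher = better partition."""
    ks = np.unique(labels)
    N, K = len(X), len(ks)
    overall = X.mean(axis=0)
    between = sum((labels == j).sum()
                  * np.linalg.norm(X[labels == j].mean(axis=0) - overall) ** 2
                  for j in ks)
    within = sum((np.linalg.norm(X[labels == j] - X[labels == j].mean(axis=0),
                                 axis=1) ** 2).sum() for j in ks)
    return (between / (K - 1)) / (within / (N - K))

def best_k(X, k_range):
    # the CVI's prediction: the k whose partition scores best
    return max(k_range, key=lambda k: variance_ratio(X, kmeans(X, k)))

# three well-separated Gaussian clusters; the loop should recover k = 3
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2))
               for c in [(0.0, 0.0), (8.0, 0.0), (4.0, 7.0)]])
```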
The variation we used modifies the problem so that the CVIs are not used to estimate the correct number of clusters. Instead, they are used to predict which is the "best" partition in the mentioned set of partitions. The "best" partition is defined as the one that is most similar to the correct one, as measured by a partition similarity measure, which is not always the one with the correct number of clusters. For a formal and more detailed description see [28]. In order to avoid the possible bias introduced by the selection of a particular partition similarity measure, we replicated all the experiments using three partition similarity measures: Adjusted Rand [31], Jaccard [41] and Variation of Information.
We used three clustering algorithms to compute partitions from the datasets: k-means, Ward and average-linkage [2]. These algorithms are well known, and it is easy to obtain different partitions by modifying the parameter that controls the number of clusters of the output partition. Each algorithm was used to compute a set of partitions with the number of clusters ranging from 2 to $\sqrt{N}$, where $N$ is the number of objects in the dataset. In the case of the real datasets, the number of clusters in a partition was limited to 25 to avoid computational problems with large datasets.
As usual, we used several synthetically generated datasets for the CVI evaluation. Furthermore, we also compared the CVIs using 20 real datasets drawn from the UCI repository [43]. In any case, it is important to note that results based on real datasets should be analyzed with caution, since these datasets are usually intended to be used with supervised learning and, therefore, they are not always well adapted to the clustering problem [9]. On the contrary, synthetic datasets avoid many of the problems found with real datasets. For instance, in synthetic datasets categories exist independently of human experience and their characteristics can be easily controlled by the experiment designer.
The synthetic datasets were created to cover all the possible combinations of five factors: number of clusters (K), dimensionality (dim), cluster overlap (ov), cluster density (den) and noise level (nl). We defined two types of overlap: strict, meaning that the ov overlap level must be exactly satisfied, and bounded, meaning that ov is the maximum allowed overlap.
A fixed hypercubic sampling window is defined to create all the synthetic datasets. The window is defined by the (0, 0, ..., 0) and (50, 50, ..., 50) coordinates. In a similar way, a reduced sampling window is defined by the (3, 3, ..., 3) and (47, 47, ..., 47) coordinates. Then, the centre of the first cluster, c_0, is randomly drawn in the reduced sampling window based on a uniform distribution. The first cluster is created by randomly drawing its points following a multivariate normal distribution of dim dimensions with centre c_0 and the identity as covariance matrix. All points located outside the sampling window are removed and new points are drawn to replace them.
The remaining clusters will have n_min points, which produces a density asymmetry when den ≠ 1. This occurs because a different number of points will be located in approximately the same region.
In particular, we build the remaining K−1 clusters as follows: if the overlap is bounded, the centre of the cluster, c_i, is drawn uniformly from the reduced sampling window. Otherwise, a previously created cluster centre, c_k, is randomly selected and the new cluster centre, c_i, is set to a random point located at a distance of 2ov from c_k. In any case, if d_e(c_i, c_l) < 2ov for any c_l ≠ c_i, the cluster centre is discarded and a new one is selected. Once the cluster centre has been defined, the cluster is built by drawing n_min points in the same way as we did for the first cluster.
Finally, when all the clusters have been built, nl·N_0 points are randomly created following a uniform distribution in the sampling window, where N_0 is the number of non-noise points in the dataset.
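The generation scheme above can be sketched in simplified form: every cluster gets n_min points drawn from a unit-covariance Gaussian inside the [0, 50]^dim window (points falling outside are redrawn), centres come from the reduced [3, 47]^dim window, and uniform noise is added as a fraction nl of the non-noise points. Overlap control and the denser first cluster are deliberately omitted; names are ours.

```python
import numpy as np

def make_cluster(center, n_points, rng):
    """Gaussian cluster clipped to the [0, 50]^dim window by rejection."""
    dim = len(center)
    points = np.empty((0, dim))
    while len(points) < n_points:
        draw = rng.multivariate_normal(center, np.eye(dim),
                                       size=n_points - len(points))
        inside = np.all((draw >= 0) & (draw <= 50), axis=1)
        points = np.vstack([points, draw[inside]])   # redraw rejected points
    return points

def make_dataset(K, n_min, dim, nl, rng):
    centers = rng.uniform(3, 47, size=(K, dim))      # reduced sampling window
    clusters = [make_cluster(c, n_min, rng) for c in centers]
    n0 = K * n_min                                   # number of non-noise points
    noise = rng.uniform(0, 50, size=(int(nl * n0), dim))
    return np.vstack(clusters + [noise])

rng = np.random.default_rng(0)
X = make_dataset(K=4, n_min=50, dim=2, nl=0.1, rng=rng)
```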
The values of the parameters used to create the synthetic datasets are shown in Table 1, making 72 different configurations. As we created 10 datasets for each configuration, we used 720 synthetic datasets. Multiplying this value by three partition similarity measures and three clustering algorithms, we obtain the 6480 configurations previously mentioned. Notice that the n_min parameter ensures that every cluster is composed of at least n_min points.
Fig. 1 shows an example of 4 two-dimensional datasets we have used. In the figure we can see how the different values of the generation parameters affect the point distribution in the datasets. Fig. 1a shows a dataset with four clusters, with no cluster overlap, no noise and no density asymmetry. The other three plots show datasets with similar characteristics except for the overlap, density and noise parameters.
The 20 real datasets and their main characteristics are shown in Table 2. In this case the experiment is based on 180 configurations: 20 datasets, 3 algorithms and 3 partition similarity measures.
Including synthetic and real datasets, and taking into account the different number of partitions computed for each dataset, each of the 30 CVIs was computed for 156,069 partitions.
5. Results

One of the goals of this work is to present the results in such a way that readers can focus on the particular configurations they are interested in. However, the vast amount of results obtained prevents all of them from being shown in this paper. Therefore, we focus here on the overall results, drawing some important conclusions. However, all the detailed results are available on the web.1
In this section we first describe the results obtained for the synthetic datasets, and then the results for the real datasets are described. Finally, we present a brief discussion on the use of statistical analysis in clustering and we show the conclusions we drew by applying some statistical tests to the results.
5.1. Synthetic datasets
The overall results for the synthetic datasets are shown in Fig. 2. The figure shows the percentage of correct guesses
Table 1
Values of the parameters used in the synthetic dataset generation step.

K: 2, 4, 8
dim: 2, 4, 8
ov: 1.5 (strict), 5 (bounded)
Fig. 1. Two-dimensional plots of four synthetic datasets used in the experiment. (a) A "neutral" dataset with no cluster overlap, no density asymmetry and no noise. (b) A similar dataset with high cluster overlap. (c) A dataset with cluster density asymmetry. (d) A dataset with noise.
Table 2
The characteristics of the real datasets drawn from the UCI repository.
Dataset No. of objects Features Classes
Breast tissue 106 9 6
Breast Wisconsin 569 30 2
Ecoli 336 7 8
Glass 214 9 7
Haberman 306 3 2
Ionosphere 351 34 2
Iris 150 4 3
Movement libras 360 90 15
Musk 476 166 2
Parkinsons 195 22 2
Segmentation 2310 19 7
Sonar all 208 60 2
Spectf 267 44 2
Transfusion 748 4 2
Vehicle 846 18 4
Vertebral column 310 6 3
Vowel context 990 10 11
Wine 178 13 3
Winequality red 1599 11 6
Yeast 1484 8 10
(successes) achieved by each CVI, sorted by the number of successes. Notice that this percentage refers to the 6480 configurations. The graph shows that Silhouette achieves the best overall results and is the only index that exceeds a 50% score. DB and CH also show good results, with success rates beyond 45%.
It is also noticeable that in most cases variations of a CVI behave quite similarly; they appear in contiguous positions in the figure. The clearest cases are the generalized Dunn indices that use the Δ3 cohesion estimator (gD33, gD43 and gD53) and the graph theory based Dunn indices (D_MST, D_RNG and D_GG).
Next we show a similar graph for each experimental factor. In these graphs the score of each CVI is shown for each value of the analyzed factor. We keep the CVI order shown in Fig. 2, so a decreasing graph denotes that the analyzed factor does not change the overall ranking.
First of all, let us focus on the graph corresponding to the partition similarity measure. Remember that this is a parameter of the validation methodology we used (see Section 4). In Fig. 3 we can see that the selected partition similarity measure does not affect the results. This suggests that the CVI comparison is not affected by the particular selection of a parameter of the evaluation methodology and, therefore, we can be confident of the results. Also notice that although Adjusted Rand and Jaccard show very similar results, the use of the VI partition similarity measure produces slightly higher success rates.
In the following figures a similar breakdown can be found with regard to the characteristics of the datasets. In Fig. 4 we can see how the number of clusters in the datasets affects the results. As expected, all the CVIs obtain better results with fewer clusters: the average result drops from 50.2% for k=2 to 30.7% (k=4) and 24.8% (k=8). We can also see that for high values of this parameter the differences between the CVIs are reduced. Furthermore, some indices, such as COP, show little sensitivity to this parameter, making COP the best CVI for k=8.
With respect to dimensionality (see Fig. 5), the results show that the difficulty imposed by an increase in the number of dimensions does not severely affect the behaviour of the CVIs, except for NI. Moreover, some indices, such as Sym, show better behaviour for datasets with higher dimensionality. Silhouette also achieves the best results for every value analyzed for this parameter.

Fig. 2. Overall results for the experiment with synthetic datasets.

Fig. 3. Results for synthetic datasets broken down by partition similarity measure.
Let us now focus on the results shown in Fig. 6. This graph shows that, as expected, datasets with no overlapping clusters lead to better CVI success rates. The average result decreases from 52.9% to 17.6% when well separated clusters are replaced by overlapping clusters. The graph also shows that although this parameter does not severely affect the overall trend, some CVIs, e.g. DB, COP and SymDB, are more strongly affected by cluster overlap. Some others, such as G, CI and OS, seem not to work at all when clusters overlap.
With respect to the density of the clusters, Fig. 7 shows that having a cluster four times denser than the others does not severely affect the CVIs. It seems that the best behaving indices are quite insensitive to this parameter, while the rest show better results when density heterogeneity is present. Silhouette is again, clearly, the CVI showing the best results.

Noise level, the last dataset characteristic analyzed in this work, has a major impact on the scores of the CVIs (Fig. 8). In fact, the scores in noisy environments are on average three times lower than they are when no noise is present. Silhouette, and especially SDbw, are the main exceptions to this rule, since they show similar score values for noisy and noiseless environments. Besides, the overall trend is not always followed, and CH is the CVI that achieves the best results when no noise is present.
Finally, Fig. 9 shows how the clustering algorithm used in the experiment affects the scores of the indices evaluated. Although we cannot find a clear pattern, it seems that the overall comparative results are not severely affected, since the decreasing pattern of the graph is broadly maintained. Most of the CVIs obtain their worst results for the k-means algorithm, but there are some exceptions where the opposite holds: COP, G, CI and OS are the most remarkable examples. Silhouette is again the index achieving the best results for the hierarchical algorithms, but CH is the best CVI when k-means is used as the clustering algorithm.
Fig. 4. Results for synthetic datasets broken down by number of clusters.

Fig. 5. Results for synthetic datasets broken down by dimensionality.
5.2. Real datasets
In this section we show the results obtained for the 20 real datasets, following a similar style to the one we used for the synthetic datasets. Obviously, since we do not have control over the dataset design, the number of experimental factors is reduced to 2: partition similarity measure and clustering algorithm.

First, in Fig. 10 we show the overall results for the real datasets. A quick comparison with the overall results for the synthetic datasets (Fig. 2) shows that the results are qualitatively similar. Most of the CVIs that obtained the worst results with synthetic datasets are also in the tail of the ranking for real datasets. Focusing on the head of the ranking, we can see that the generalized Dunn indices (gD33, gD43 and gD53) remain in a similar position; SF, the graph theory based Dunn indices and COP improve their position; and Silhouette and CH go down the ranking. Considering these results, we can say that the mentioned generalizations of the Dunn index show the steadiest results.
Returning to the two experimental factors involved in the
experiments with real datasets, in Fig. 11 we show the results
broken down by partition similarity measure. In this case it seems
that the selected partition similarity measure can affect the
results. Although Jaccard and VI follow the overall pattern, the
Adjusted Rand index does not. Furthermore, it is clear that in
every case the average scores are much lower when Adjusted Rand
is used, dropping from 39.1% (VI) or 31.1% (Jaccard) to 10.0%.
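The behaviour of the three partition similarity measures is easy to reproduce. The sketch below is a minimal pure-Python implementation of the pair-counting Jaccard index, the Adjusted Rand index and the Variation of Information; the function names are ours, for illustration only, and do not come from the paper's code:

```python
from collections import Counter
from itertools import combinations
from math import comb, log

def pair_counts(labels_a, labels_b):
    """Count object pairs placed together: in both partitions (n11),
    in partition A (na) and in partition B (nb)."""
    n11 = na = nb = 0
    for i, j in combinations(range(len(labels_a)), 2):
        sa = labels_a[i] == labels_a[j]
        sb = labels_b[i] == labels_b[j]
        na += sa
        nb += sb
        n11 += sa and sb
    return n11, na, nb

def jaccard(labels_a, labels_b):
    """Pair-counting Jaccard index: pairs together in both over
    pairs together in at least one partition."""
    n11, na, nb = pair_counts(labels_a, labels_b)
    return n11 / (na + nb - n11)

def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand index: pair agreement corrected for chance."""
    n11, na, nb = pair_counts(labels_a, labels_b)
    total = comb(len(labels_a), 2)
    expected = na * nb / total
    max_index = (na + nb) / 2
    return (n11 - expected) / (max_index - expected)

def variation_of_information(labels_a, labels_b):
    """VI(A, B) = H(A|B) + H(B|A); 0 iff the partitions coincide."""
    n = len(labels_a)
    pa, pb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), nab in joint.items():
        p_ab = nab / n
        vi -= p_ab * (log(p_ab / (pa[a] / n)) + log(p_ab / (pb[b] / n)))
    return vi
```

Note that VI is a distance (lower is better), while Jaccard and Adjusted Rand are similarities, which is why the success criterion must be inverted for VI.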
With regard to the clustering algorithm used (see Fig. 12), the
results are contradictory. On the one hand, if we focus on k-means
and Ward, it seems that this factor does not severely affect the
results. On the other hand, the results for average-linkage reduce
the differences between CVIs and do not follow the overall pattern:
in this case Sym shows the best results, while SF achieves the
highest success rates for k-means and Ward.
5.3. Statistical tests
Although the assessment of the experimental results using
statistical tests is a widely studied technique in machine learning,
it is rarely used in the clustering area. Among the works cited in
Fig. 6. Results for synthetic datasets broken down by cluster overlap.
Fig. 7. Results for synthetic datasets broken down by density.
Section 2, just Dubes [27] used a statistical test to assess the
influence of each experimental factor on the results obtained.
However, in our case we focused on checking whether the
observed differences between CVIs were statistically significant.
We argue that an effort should be made by the clustering and
statistics communities to adapt these tools to clustering and
effectively introduce them in the area. These types of tests would
be even more important in extensive comparative works such as
the one described in this paper. Therefore, although it is not the
goal of this work, we propose a possible direct adaptation of a
comparison method used in supervised learning. This method has
been chosen due to the proximity of the supervised learning area
to clustering and because the use of statistical tests in this area
has been widely studied [44-46].
We next describe the method and the proposed adaptation. Then,
we conclude this section by discussing the results obtained when
we applied the proposed tests to the results of the experiment
carried out in our work to compare the performance of the CVIs.

We based our statistical method on a common scenario in
supervised learning where classification algorithms are compared.
In this case it is usual to run the algorithms on several
datasets and to compute a "quality" estimate, such as the
accuracy or the AUC value, for each algorithm and dataset pair.
A usual approach is to test the quality values achieved by all the
algorithms for each dataset independently [45]. However, Demšar
[44] recently argued that a single test based on all the algorithms
and all the datasets is a better choice. One of the advantages of
this method is that the different values compared in the statistical
test are independent, since they come from different datasets.

We have adapted the method proposed by Demšar [44] and
subsequently extended by García and Herrera [46] to CVI
comparisons. In brief, we simply replaced the classification algorithms
Fig. 8. Results for synthetic datasets broken down by noise.
Fig. 9. Results for synthetic datasets broken down by clustering algorithm.
by CVIs. However, this is not enough, since in our experiments we
obtained a Boolean value for each CVI-configuration pair instead
of a "quality" estimate. Moreover, the configurations we obtained
by varying the clustering algorithm and the partition similarity
measure are based on the same dataset, so it can be argued that
they are not sufficiently independent.
Our solution was to sum, for each dataset, the number of
successes each CVI obtained over every clustering algorithm and
partition similarity measure pair. Moreover, in order to obtain a
more precise estimate, we also added the number of successes obtained
in every run (remember that we created 10 datasets for each
combination of dataset characteristics). We thus obtained 72
values ranging from 0 to 90 for each CVI, which gave us a "quality"
estimate over independent datasets. Finally, we applied the
statistical tests with no further modifications.
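The aggregation step just described can be sketched as follows. The grid sizes match the experiment (3 clustering algorithms, 3 similarity measures, 10 runs, 72 dataset configurations), but every name and the toy success oracle are purely illustrative:

```python
from itertools import product

# Illustrative experimental grid (names are ours, not the paper's code).
CVIS = ["Silhouette", "CH", "DB*"]
CONFIGS = range(72)                     # 72 synthetic dataset configurations
ALGORITHMS = ["k-means", "ward", "average"]
MEASURES = ["ARI", "Jaccard", "VI"]
RUNS = range(10)                        # 10 replicate datasets per configuration

def aggregate(success):
    """success(cvi, config, algo, measure, run) -> bool.
    Returns, per CVI, the 72 per-configuration success counts.
    Each count ranges from 0 to 90 (3 algorithms x 3 measures x 10 runs)."""
    counts = {cvi: [0] * len(CONFIGS) for cvi in CVIS}
    for cvi, cfg, algo, m, run in product(CVIS, CONFIGS, ALGORITHMS, MEASURES, RUNS):
        counts[cvi][cfg] += success(cvi, cfg, algo, m, run)
    return counts

# Toy oracle: Silhouette always succeeds, the others never do.
counts = aggregate(lambda cvi, *rest: cvi == "Silhouette")
```

The 72 resulting values per CVI come from different datasets, which is what makes them independent enough for the tests described next.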
The tests we used were designed for all-to-all comparisons of
multiple classifiers (CVIs, in our case). We used the Friedman test
to check whether any statistical difference existed and the Nemenyi
test for pairwise CVI comparison [44]. Furthermore, we performed
additional pairwise CVI comparisons with the Shaffer test, as
suggested by García and Herrera [46]. In both cases we performed
the tests at the 5% and 10% significance levels.
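As a sketch of how these rank-based tests operate, the following pure-Python code computes the Friedman statistic from average ranks and the Nemenyi critical difference. It is an illustration under our own naming, not the paper's code; in practice scipy.stats.friedmanchisquare performs the Friedman test, and q_alpha must be looked up in a Studentized-range table:

```python
from math import sqrt

def avg_ranks(rows):
    """rows: one score list per dataset (higher = better).
    Returns the average rank of each column (rank 1 = best, ties averaged)."""
    k = len(rows[0])
    totals = [0.0] * k
    for row in rows:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1                   # extend the tied group
            r = (i + j) / 2 + 1          # average of ranks i+1 .. j+1
            for t in range(i, j + 1):
                totals[order[t]] += r
            i = j + 1
    return [t / len(rows) for t in totals]

def friedman_stat(rows):
    """Friedman chi-square statistic over n datasets and k methods."""
    n, k = len(rows), len(rows[0])
    R = avg_ranks(rows)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in R) - k * (k + 1) ** 2 / 4)

def nemenyi_cd(k, n, q_alpha):
    """Nemenyi critical difference: two methods differ significantly
    when their average ranks differ by more than this value."""
    return q_alpha * sqrt(k * (k + 1) / (6 * n))
```

With 30 CVIs and 72 independent values each, even modest rank differences become detectable, which is consistent with the extremely small p-value reported below.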
The main conclusion obtained by applying the above tests is
that there are undoubtedly statistically significant differences
between the 30 CVIs, as the Friedman test categorically shows
with a p-value on the order of 10^-80. All the performed pairwise
comparisons show very similar results, so in Fig. 13 we only
show the results for the most powerful test we performed:
Shaffer at the 10% significance level.
Since the statistical tests used are based on average rank values,
the figure shows all the CVIs sorted by average rank. The results are
very similar to those based on average scores (Fig. 2), but there are
a couple of differences that should be underlined. First of all, the
CVI order changed slightly, although most of the movements occurred
in the central part of the ranking. Secondly, the CVIs formed quite
well separated groups. In the first group there are 10 indices with
an average rank between 9 and 13. Counting the variations of a CVI
as a single index, this group contains six indices: Silhouette,
Fig. 10. Overall results for real datasets.
Fig. 11. Results for real datasets broken down by partition similarity measure.
Davies–Bouldin, Calinski–Harabasz, generalized Dunn, COP and
SDbw. There is also a crowded central group of 14 CVIs with an
average rank between 14 and 17; and, finally, a group of six indices
with an average rank between 19 and 23.
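As an illustration of the index that heads this ranking, the following is a minimal pure-Python sketch of the Silhouette width (Euclidean distances; the s(i) = 0 convention for singleton clusters follows common practice and is an assumption of this sketch, not a detail taken from the paper):

```python
from math import dist          # Euclidean distance, Python 3.8+
from statistics import mean

def silhouette(points, labels):
    """Mean silhouette width of a partition; values in [-1, 1], higher is better."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        if len(clusters[l]) == 1:        # common convention for singleton clusters
            scores.append(0.0)
            continue
        # a: mean distance to the other points of the same cluster (cohesion)
        a = mean(dist(p, q) for q in clusters[l] if q is not p)
        # b: mean distance to the nearest other cluster (separation)
        b = min(mean(dist(p, q) for q in pts)
                for c, pts in clusters.items() if c != l)
        scores.append((b - a) / max(a, b))
    return mean(scores)
```

A well-separated partition yields values close to 1, while assigning points to the wrong clusters drives the index below 0, which is the behaviour the success criterion of the experiments rewards.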
The bars in the figure group the indices that do not show
statistically significant differences among them. The high overlap
between bars makes it difficult to draw categorical conclusions,
but in the following we summarize the information in the graph
and remark on the most interesting points:
- No significant difference exists between CVIs in the same group.
- All the CVIs in the first group perform significantly better than the CVIs in the third group.
- The best behaving CVI, Sil, obtains significantly better results than all the CVIs in the second group, except Sym.
- All the CVIs in the second group, except Sym and SymDB, show no statistically significant differences with at least one CVI in the third group.
In conclusion, the data does not show sufficiently strong
evidence to distinguish a small set of CVIs as being significantly
better than the rest. Nevertheless, there is a group of about 10
indices that seems to be recommendable, and Silhouette, Davies–
Bouldin* and Calinski–Harabasz are at the top of this group. We
also applied the statistical tests to the experiment subsets shown
in the results section, but in no case could any CVI be considered
significantly better than the others.

6. Conclusions and further work
In this paper we presented a comparison of 30 cluster validity
indices over an extensive set of configurations. It is, to the best of
our knowledge, the most extensive CVI comparison published to
date. Moreover, it is the first non-trivial CVI comparison
that uses the methodological correction recently proposed by
Gurrutxaga et al. [28].
Due to the huge size of the experiment, we have not been able to
show all the results obtained. However, the interested reader can
access them in electronic format on the web. The great advantage of
this is that readers can focus on the results for the configurations
they are interested in; we therefore provide a tool that enables
them to select the most suitable CVIs for their particular application.
This procedure is highly recommendable, since no single CVI showed
a clear advantage over the rest in every context, although the
Silhouette index obtained the best results in many of them.
We next summarize the main conclusions we drew from the CVI
comparison. First, we observed that some CVIs appear to be more
suitable for certain configurations, although the results were not
conclusive. Furthermore, the overall trend never changed dramatically
when we focused on a particular factor. Another fact worth noting is
that the results for real and synthetic datasets are qualitatively
similar, although they show disagreements for some particular indices.

With regard to the experimental factors, noise and cluster
overlap had the greatest impact on CVI performance. The number
of successes is dramatically reduced when noise is present or
clusters overlap. In particular, the inclusion of 10% random noise
reduces the average score to a third of its value. A very similar
score reduction was found when the clusters were moved closer
together so that they highly overlapped. Another remarkable and
surprising fact is that some indices showed better results in
(a priori) more complex configurations. For example, some indices
improved their results when the dimensionality of the datasets
increased or when the homogeneity of the cluster densities
disappeared.
Finally, we confirmed that the selection of a partition similarity
measure that enables the correction of the experimental methodology
is not a critical factor. Nevertheless, it is clear that it can
produce some variations in the results, so our suggestion is to use
several of them to obtain more robust results. Our work shows
that CVIs appear to be better adapted to the VI and Jaccard
partition similarity measures than to Adjusted Rand.
Fig. 12. Results for real datasets broken down by clustering algorithm.
Fig. 13. Results for the Shaffer test with a significance level of 10%.
A statistical significance analysis of the results showed that
there are three main groups of indices, and that the indices in the
first group (Silhouette, Davies–Bouldin, Calinski–Harabasz,
generalized Dunn, COP and SDbw) behave better than the indices in
the last group (Dunn and its Point Symmetry-Distance based
variation, Gamma, C-Index, Negentropy increment and OS-Index),
the differences being statistically significant.
This work also raises some questions and therefore suggests
some future work. It is obvious that this type of work can always
be improved: although we consider that we performed an
extensive comparison, there is room for extending it to include
more CVIs, datasets, clustering algorithms and so on. In this
context, noise and overlap would appear to be the most interesting
factors to analyse in greater depth. We also limited this work
to crisp clustering, so a fuzzy CVI comparison would be a natural
continuation. The analysis of other kinds of indices, such as
stability based ones, would also be of great interest.
Finally, we argued that statistical tests are a very valuable tool
in data mining and that an effort should be made to use them
more widely in clustering. We adapted a method widely accepted
in the supervised learning area for our work, but this is just a first
approach to the problem and there is a vast field of theoretical
research still to be addressed.
Acknowledgments

This work was funded by the University of the Basque Country,
general funding for research groups (Aldapa, GIU10/02), by the
Science and Education Department of the Spanish Government
(ModelAccess project, TIN2010-15549), by the Basque Government's
SAIOTEK program (Datacc project, S-PE11UN097) and by the
Diputación Foral de Gipuzkoa (Zer4you project, DG10/5).
1 M. Halkidi, Y. Batistakis, M. Vazirgiannis, On clustering validation techniques,
Journal of Intelligent Information Systems 17 (2001) 107–145.
2 A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Inc., Upper
Saddle River, NJ, USA, 1988.
3 B. Mirkin, Clustering for Data Mining: A Data Recovery Approach, Chapman & Hall/CRC, Boca Raton, Florida, 2005.
4 P.H.A. Sneath, R.R. Sokal, Numerical Taxonomy, Books in Biology, W.H.
Freeman and Company, San Francisco, 1973.
5 K.J. Holzinger, H.H. Harman, Factor Analysis, University of Chicago Press.
6 C.-H. Chou, M.-C. Su, E. Lai, A new cluster validity measure and its application
to image compression, Pattern Analysis and Applications 7 (2004) 205–220.
7 D. Barbará, S. Jajodia (Eds.), Applications of Data Mining in Computer Security, Kluwer Academic Publishers, Norwell, Massachusetts, 2002.
8 N.R. Pal, J. Biswas, Cluster validation using graph theoretic concepts, Pattern
Recognition 30 (1997) 847–857.
9 I. Guyon, U. von Luxburg, R.C. Williamson, Clustering: science or art?, in: NIPS
2009 Workshop on Clustering Theory, Vancouver, Canada, 2009.
10 M. Brun, C. Sima, J. Hua, J. Lowey, B. Carroll, E. Suh, E.R. Dougherty, Model-based evaluation of clustering validation measures, Pattern Recognition 40.
11 D. Pfitzner, R. Leibbrandt, D. Powers, Characterization and evaluation of
similarity measures for pairs of clusterings, Knowledge and Information
Systems 19 (2009) 361–394.
12 V. Batagelj, M. Bren, Comparing resemblance measures, Journal of Classification 12 (1995) 73–90.
13 J.C. Dunn, A fuzzy relative of the ISODATA process and its use in detecting
compact well-separated clusters, Journal of Cybernetics 3 (1973) 32–57.
14 D.L. Davies, D.W. Bouldin, A clustering separation measure, IEEE Transactions
on Pattern Analysis and Machine Intelligence 1 (1979) 224–227.
15 T. Calinski, J. Harabasz, A dendrite method for cluster analysis, Communications in Statistics 3 (1974) 1–27.
16 A. Ben-Hur, A. Elisseeff, I. Guyon, A stability based method for discovering structure in clustered data, in: Biocomputing 2002, Proceedings of the Pacific Symposium, vol. 7, 2002, pp. 6–17.
17 A.K. Jain, J. Moreau, Bootstrap technique in cluster analysis, Pattern Recogni-
tion 20 (1987) 547–568.
18 E. Dimitriadou, S. Dolnicar, A. Weingessel, An examination of indexes for determining the number of clusters in binary data sets, Psychometrika 67.
19 U. Maulik, S. Bandyopadhyay, Performance evaluation of some clustering
algorithms and validity indices, IEEE Transactions on Pattern Analysis and
Machine Intelligence 24 (2002) 1650–1654.
20 G.W. Milligan, M.C. Cooper, An examination of procedures for determining
the number of clusters in a data set, Psychometrika 50 (1985) 159–179.
21 M. Halkidi, M. Vazirgiannis, A density-based cluster validity approach using
multi-representatives, Pattern Recognition Letters 20 (2008) 773–786.
22 A. Hardy, On the number of clusters, Computational Statistics & Data Analysis 23 (1996) 83–96.
23 L.F. Lago-Fernández, F. Corbacho, Normality-based validation for crisp clustering, Pattern Recognition 43 (2010) 782–795.
24 K.R. Žalik, Validity index for clusters of different sizes and densities, Pattern Recognition Letters 32 (2011) 221–234.
25 M. Kim, R.S. Ramakrishna, New indices for cluster validity assessment,
Pattern Recognition Letters 26 (2005) 2353–2363.
26 S. Saha, S. Bandyopadhyay, Performance evaluation of some symmetry-based
cluster validity indexes, IEEE Transactions on Systems, Man, and Cybernetics,
Part C 39 (2009) 420–425.
27 R.C. Dubes, How many clusters are best? – an experiment, Pattern Recogni-
tion 20 (1987) 645–663.
28 I. Gurrutxaga, J. Muguerza, O. Arbelaitz, J.M. Pérez, J.I. Martín, Towards a standard methodology to evaluate internal cluster validity indices, Pattern Recognition Letters 32 (2011) 505–515.
29 J.C. Bezdek, W.Q. Li, Y. Attikiouzel, M. Windham, A geometric approach to
cluster validity for normal mixtures, Soft Computing—A Fusion of Founda-
tions, Methodologies and Applications 1 (1997) 166–179.
30 S. Bandyopadhyay, S. Saha, A point symmetry-based clustering technique for
automatic evolution of clusters, IEEE Transactions on Knowledge and Data
Engineering 20 (2008) 1441–1457.
31 L. Hubert, P. Arabie, Comparing partitions, Journal of Classification 2 (1985).
32 D.-J. Kim, Y.-W. Park, D.-J. Park, A novel validity index for determination of
the optimal number of clusters, IEICE Transactions on Information and
Systems E84-D (2001) 281–285.
33 C.A. Sugar, G.M. James, Finding the number of clusters in a dataset, Journal of
the American Statistical Association 98 (2003) 750–763.
34 F.B. Baker, L.J. Hubert, Measuring the power of hierarchical cluster analysis,
Journal of the American Statistical Association 70 (1975) 31–38.
35 L.J. Hubert, J.R. Levin, A general statistical framework for assessing categorical
clustering in free recall, Psychological Bulletin 83 (1976) 1072–1080.
36 P. Rousseeuw, Silhouettes: a graphical aid to the interpretation and valida-
tion of cluster analysis, Journal of Computational and Applied Mathematics
20 (1987) 53–65.
37 J.C. Bezdek, N.R. Pal, Some new indexes of cluster validity, IEEE Transactions
on Systems, Man, and Cybernetics, Part B 28 (1998) 301–315.
38 M. Halkidi, M. Vazirgiannis, Clustering validity assessment: finding the optimal partitioning of a data set, in: Proceedings of the First IEEE International Conference on Data Mining (ICDM'01), California, USA, 2001, pp. 187–194.
39 S. Saitta, B. Raphael, I. Smith, A bounded index for cluster validity, in:
P. Perner (Ed.), Machine Learning and Data Mining in Pattern Recognition,
Lecture Notes in Computer Science, vol. 4571, Springer, Berlin, Heidelberg,
2007, pp. 174–187.
40 I. Gurrutxaga, I. Albisua, O. Arbelaitz, J.I. Martín, J. Muguerza, J.M. Pérez, I. Perona, SEP/COP: an efficient method to find the best partition in hierarchical clustering based on a new cluster validity index, Pattern Recognition 43 (2010) 3364–3373.
41 P. Jaccard, Nouvelles recherches sur la distribution florale, Bulletin de la Société Vaudoise des Sciences Naturelles 44 (1908) 223–370.
42 M. Meilă, Comparing clusterings by the variation of information, in: Proceedings of the Sixteenth Annual Conference on Computational Learning Theory (COLT), 2003, pp. 173–187.
43 A. Frank, A. Asuncion, UCI machine learning repository, 2010.
44 J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1–30.
45 T.G. Dietterich, Approximate statistical tests for comparing supervised classification learning algorithms, Neural Computation 10 (1998).
46 S. García, F. Herrera, An extension on "statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons, Journal of Machine Learning Research 9 (2008) 2677–2694.
Olatz Arbelaitz received the M.Sc. and Ph.D. degrees in Computer Science from the University of the Basque Country in 1993 and 2002, respectively. She is an Associate Professor in the Computer Architecture and Technology Department of the University of the Basque Country. She has worked on autonomous robotics, combinatorial optimization and supervised and unsupervised machine learning techniques, focusing lately on web mining.
Ibai Gurrutxaga received the M.Sc. and Ph.D. degrees in Computer Science from the University of the Basque Country in 2002 and 2010, respectively. He is an Associate Professor in the Computer Architecture and Technology Department of the University of the Basque Country. He works in data mining and pattern recognition, focusing on supervised and unsupervised classification (decision trees, clustering, computer security and intrusion detection), and on high performance computing.
Javier Muguerza received the M.Sc. and Ph.D. degrees in Computer Science from the University of the Basque Country in 1990 and 1996, respectively. He is an Associate
Professor in the Computer Architecture and Technology Department of the University of the Basque Country. His research interests include data mining, pattern
recognition and high performance computing.
Jesús M. Pérez received the M.Sc. and Ph.D. degrees in Computer Science from the University of the Basque Country in 1993 and 2006, respectively. He is an Associate Professor in the Computer Architecture and Technology Department of the University of the Basque Country. His research interests include data mining and pattern recognition techniques, focusing on classifiers with explanation capacities, learning from imbalanced data and statistical analysis.
Iñigo Perona received the M.Sc. degree in Computer Science from the University of the Basque Country in 2008. He holds a grant to pursue the Ph.D. at the Computer Architecture and Technology Department of the University of the Basque Country. He works in data mining and pattern recognition, focusing on supervised and unsupervised classification (web mining, clustering, computer security and intrusion detection).