Finding the optimal number of clusters using cluster validation


This module finds the optimal number of components (clusters) for a given dataset.
To do so, we first run the k-means algorithm with different numbers of clusters,
from 1 up to a fixed maximum. We then assess cluster validity with the $$C$$-index and select the number of
clusters with the lowest $$C$$-index. This index is defined as follows:

\begin{equation}
\label{C-index}
C = \frac{S-S_{min}}{S_{max}-S_{min}}
\end{equation}

where $$S$$ is the sum of distances over all pairs of patterns from the same cluster. Let $$l$$
be the number of those pairs. Then $$S_{min}$$ is the sum of the $$l$$ smallest distances when all
pairs of patterns are considered (i.e. including pairs whose patterns belong to different clusters).
Similarly, $$S_{max}$$ is the sum of the $$l$$ largest distances over all pairs. Since $$S$$ is close to $$S_{min}$$ when the within-cluster pairs are also the closest pairs overall, a small value
of $$C$$ indicates a good clustering. In the following code, I have generated 4 clusters, but since two of them are very close, they merge into one and the optimal number of clusters is 3.
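
To make the definition concrete, here is a minimal standalone sketch of the $$C$$-index. It is independent of the experiment code referred to above. The helper name `c_index` and the four toy points are assumptions made for illustration, and it uses Euclidean distance and only the Python standard library.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def c_index(points, labels):
    """Return (S - S_min) / (S_max - S_min) for a labelled set of points."""
    pairs = list(combinations(range(len(points)), 2))
    distances = [dist(points[i], points[j]) for i, j in pairs]

    # S: sum of distances over pairs in the same cluster; l: number of such pairs.
    within = [d for (i, j), d in zip(pairs, distances) if labels[i] == labels[j]]
    l = len(within)
    s = sum(within)

    # S_min / S_max: sums of the l smallest / l largest distances over all pairs.
    ordered = sorted(distances)
    s_min = sum(ordered[:l])
    s_max = sum(ordered[len(ordered) - l:])

    if s_max == s_min:
        raise ValueError("C-index is undefined when S_max equals S_min")
    return (s - s_min) / (s_max - s_min)


# Hypothetical toy data: two tight pairs of points, far apart from each other.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good_labels = [0, 0, 1, 1]  # each tight pair forms its own cluster
bad_labels = [0, 1, 0, 1]   # each cluster spans both pairs

print(c_index(points, good_labels))           # 0.0
print(round(c_index(points, bad_labels), 3))  # 0.994
```

The labelling that matches the two tight pairs reaches the minimum value of 0, while the mismatched labelling scores close to 1. Note that with a single cluster every pair is a within-cluster pair, so $$S = S_{min} = S_{max}$$ and the index is undefined. This sketch raises an error in that case, so a scan over cluster counts would need to handle it separately.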