Kernel Density Estimation (KDE) for estimating probability distribution function



There are several approaches for estimating the probability distribution function of a given data:

1)Parametric
2)Semi-parametric
3)Non-parametric

A parametric one is GMM via algorithm such as expectation maximization. Here is my other post for expectation maximization.

Example of Non-parametric is the histogram, where data are assigned to only one bin and depending on the number bins that fall within an interval the height of histogram will be determined.

Kernel Density Estimation (KDE) is an example of a non-parametric method for estimating the probability distribution function. It is very similar to histogram but we don’t assign each data to only to a bin. In KDE we use a kernel function which weights data point, depending on how far are they from the point $$x$$.

\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^n k\bigg(\frac{ x-x_i  }{h}\bigg)

where $$h$$ is a bandwidth parameter and $$k$$ is the kernel function. One choice for kernel function is the Gaussian (normal distribution)  but there are other kernel functions (uniform, triangular, biweight, triweight, Epanechnikov) that can be used as well. Choosing too small or too bog values for bandwidth might overfit or under fit our estimation. A rule of thumb for choosing bandwidth is Silverman rule.

Silhouette coefficient for finding optimal number of clusters

Silhouette coefficient is another method to determine the optimal number of clusters. Here I introduced c-index earlier. The silhouette coefficient of a data measures how well data are assigned to its own cluster and how far they are from other clusters. A silhouette close to 1 means the data points are in an appropriate cluster and a silhouette coefficient close to −1 implies out data is in the wrong cluster. The following is python code for computing the coefficient and plotting number fo clusters vs Silhouette coefficient.

Finding optimal number of Clusters by using Cluster validation


This module finds the optimal number of components (number of clusters) for a given dataset.
In order to find the optimal number of components for, first we used k-means algorithm
with a different number of clusters, starting from 1 to a fixed max number. Then we checked the cluster validity by deploying $$C-index$$ algorithm and select the optimal number of
clusters with lowest $$C-index$$. This index is defined as follows:

\label{C-index}
C = \frac{S-S_{min}}{S_{max}-S_{min}}

where $$S$$ is the sum of distances over all pairs of patterns from the same cluster. Let $$l$$
be the number of those pairs. Then $$S_{min}$$ is the sum of the l smallest distances if all
pairs of patterns are considered (i.e. if the patterns can belong to different clusters).
Similarly,$$Smax$$ is the sum of the$$l$$ largest distance out of all pairs. Hence a small value
of $$C$$ indicates a good clustering. In the following code, I have generated 4 clusters, but since two of them are very close, they packed into one and the optimal number of clusters is 3.

Gradient descent method for finding the minimum


Gradient descent is a very popular method for finding the maximum/ minimum point of a given function. It’s very simple yet powerful but may trap in the local minima. Here I try to find the minimum of the following function:

$$z= -( 4 \times e^{- ( (x-4)^2 +(y-4)^2 ) }+ 2 \times e^{- ( (x-2)^2 +(y-2)^2 ) } )$$

Here I have  solved this function with Limited-Memory CMA-ES (LM-CMA-ES)  and as you can see it didn’t trap in the local minima:

The path traversed with Limited-Memory CMA-ES (LM-CMA-ES) algorithm to reach the minimum of the function.

Here if we start at $$x=4$$ and  $$y=2.2$$ we will trap in the local minima

if try different start point, for example, $$x=3.5$$ and  $$y=2.2$$ we will find the minima:

code available at my GitHub.