In this video, I explain the **“Naive Bayes Classifier”**. The example has been solved with Python in my other post here.


# Naive Bayes Classifier Example with Python Code

In the example below I implemented a “**Naive Bayes classifier**” in Python, and afterwards I used the “**sklearn**” package to solve it again:

and the output is:

```
male posterior is: 1.54428667821e-07
female posterior is: 0.999999845571
Then our data must belong to the female class
Then our data must belong to the class number: [2]
```
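The original code block for this example did not survive in this archive snapshot. As a minimal sketch of the same idea, the following fits sklearn’s `GaussianNB` on made-up two-class data (the dataset, cluster centers, and test point are assumptions for illustration, not the post’s original values) and prints the predicted class and the posterior probabilities:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up two-class data: class 1 centered at the origin, class 2 at (5, 5)
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), 5 + rng.randn(20, 2)])
y = np.array([1] * 20 + [2] * 20)

clf = GaussianNB()
clf.fit(X, y)

# A point near the class-2 center should be assigned to class 2
test_point = np.array([[4.8, 5.2]])
print("predicted class:", clf.predict(test_point))
print("posterior probabilities:", clf.predict_proba(test_point))
```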

# Density-Based Spatial Clustering (DBSCAN) with Python Code

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a data clustering algorithm. It is a density-based clustering algorithm because it finds clusters starting from the estimated density distribution of the corresponding nodes.

It starts with an arbitrary starting point that has not been visited.

This point’s epsilon-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started; otherwise, the point is labeled as noise. Then a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or more noise. DBSCAN requires two parameters: the neighborhood radius epsilon (eps) and the minimum number of points required to form a cluster (minPts). If a point is found to be part of a cluster, its epsilon-neighborhood is also part of that cluster.

I implemented the pseudocode from the DBSCAN Wikipedia page:

```
DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
      mark P as visited
      N = getNeighbors(P, eps)
      if sizeof(N) < MinPts
         mark P as NOISE
      else
         C = next cluster
         expandCluster(P, N, C, eps, MinPts)

expandCluster(P, N, C, eps, MinPts)
   add P to cluster C
   for each point P' in N
      if P' is not visited
         mark P' as visited
         N' = getNeighbors(P', eps)
         if sizeof(N') >= MinPts
            N = N joined with N'
      if P' is not yet member of any cluster
         add P' to cluster C
```
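The pseudocode above maps directly onto sklearn’s `DBSCAN`. Here is a minimal sketch on synthetic data (the blob locations, `eps`, and `min_samples` are assumptions chosen for illustration); points labeled `-1` correspond to the NOISE points in the pseudocode:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus two isolated outliers (synthetic data)
rng = np.random.RandomState(0)
X = np.concatenate([rng.randn(50, 2),                   # blob around (0, 0)
                    10 + rng.randn(50, 2),              # blob around (10, 10)
                    [[30.0, -30.0], [-30.0, 30.0]]])    # outliers -> noise

db = DBSCAN(eps=2.0, min_samples=5).fit(X)
labels = db.labels_                                     # -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("noise points:", int(np.sum(labels == -1)))
```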

# Kernel Density Estimation (KDE) for estimating probability distribution function

There are several approaches for estimating the probability distribution function of given data:

1) Parametric

2) Semi-parametric

3) Non-parametric

A parametric example is the **GMM**, fitted via an algorithm such as **expectation maximization**. Here is my other post on expectation maximization.

An example of a non-parametric method is the **histogram**, where each data point is assigned to exactly one bin, and the height of each bar is determined by the number of points that fall within its interval.

**Kernel Density Estimation** (KDE) is another example of a non-parametric method for estimating the probability distribution function. It is very similar to the histogram, but we don’t assign each data point to a single bin. In KDE we use a kernel function that weights each data point depending on how far it is from the point \( x \):

\begin{equation}

\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^n k\bigg(\frac{ x-x_i }{h}\bigg)

\end{equation}

where \( h \) is a bandwidth parameter and \( k \) is the kernel function. One choice for the kernel function is the Gaussian (normal distribution), but there are other kernel functions (uniform, triangular, biweight, triweight, Epanechnikov) that can be used as well. Choosing too small or too big a value for the bandwidth might overfit or underfit our estimation. A rule of thumb for choosing the bandwidth is Silverman’s rule.
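As a quick sketch, Silverman’s rule of thumb for a Gaussian kernel is \( h = \big(\frac{4\hat{\sigma}^5}{3n}\big)^{1/5} \approx 1.06\,\hat{\sigma}\,n^{-1/5} \). The sample below is synthetic, just to show the computation:

```python
import numpy as np

# Silverman's rule of thumb: h = (4 * sigma^5 / (3 * n)) ** (1/5)
rng = np.random.RandomState(0)
x = rng.randn(1000)                  # synthetic standard-normal sample
sigma = np.std(x, ddof=1)            # sample standard deviation
h = (4 * sigma**5 / (3 * len(x))) ** 0.2
print("Silverman bandwidth:", h)     # roughly 0.27 for sigma ~ 1, n = 1000
```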

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

n_basesample = 1000
np.random.seed(234234)
xn = np.random.randn(n_basesample)

gkde = stats.gaussian_kde(xn)
ind = np.linspace(-7, 7, 101)
kdepdf = gkde.evaluate(ind)

plt.figure()
# plot histogram of sample (density=True replaces the removed normed=1)
plt.hist(xn, bins=20, density=True)
# plot data generating density
plt.plot(ind, stats.norm.pdf(ind), color="r", label='Original Function')
# plot estimated density
plt.plot(ind, kdepdf, label='kde estimation', color="g")
plt.title('Kernel Density Estimation')
plt.legend()
plt.show()
```

# Silhouette coefficient for finding optimal number of clusters

The **silhouette coefficient** is another method to determine the optimal number of clusters. I introduced the **c-index** earlier here. The silhouette coefficient of a data point measures how well it is assigned to its own cluster and how far it is from other clusters. A silhouette close to 1 means the data point is in an appropriate cluster, while a silhouette coefficient close to −1 implies the data point is in the wrong cluster. The following is Python code for computing the coefficient and plotting the number of clusters vs. the silhouette coefficient.

```python
from sklearn import cluster, metrics
import numpy as np
import matplotlib.pyplot as plt

obs = np.concatenate((np.random.randn(100, 2),
                      20 + np.random.randn(300, 2),
                      -15 + np.random.randn(200, 2)))

silhouette_score_values = list()
NumberOfClusters = range(2, 30)

for i in NumberOfClusters:
    classifier = cluster.KMeans(i, init='k-means++', n_init=10, max_iter=300,
                                tol=0.0001, verbose=0, random_state=None,
                                copy_x=True)
    classifier.fit(obs)
    labels = classifier.predict(obs)
    print("Number Of Clusters:", i)
    score = metrics.silhouette_score(obs, labels, metric='euclidean',
                                     sample_size=None, random_state=None)
    print("Silhouette score value:", score)
    silhouette_score_values.append(score)

plt.plot(NumberOfClusters, silhouette_score_values)
plt.title("Silhouette score values vs Numbers of Clusters")
plt.show()

Optimal_NumberOf_Components = NumberOfClusters[
    silhouette_score_values.index(max(silhouette_score_values))]
print("Optimal number of components is:", Optimal_NumberOf_Components)
```

# Finding the Optimal Number of Clusters by Using Cluster Validation


This module finds the optimal number of components (number of clusters) for a given dataset. In order to find the optimal number of components, we first run the k-means algorithm with different numbers of clusters, from 1 up to a fixed maximum. Then we check the cluster validity with the \( C \)-index and select the number of clusters with the lowest \( C \)-index. This index is defined as follows:

\begin{equation} \label{C-index}

C = \frac{S-S_{min}}{S_{max}-S_{min}}

\end{equation}

where \( S \) is the sum of distances over all pairs of patterns from the same cluster. Let \( l \) be the number of those pairs. Then \( S_{min} \) is the sum of the \( l \) smallest distances if all pairs of patterns are considered (i.e., if the patterns can belong to different clusters). Similarly, \( S_{max} \) is the sum of the \( l \) largest distances out of all pairs. Hence a small value of \( C \) indicates a good clustering. In the following code, I have generated 4 clusters, but since two of them are very close, they are packed into one, and the optimal number of clusters is 3.
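The original code block is missing from this archive snapshot. As a minimal sketch, the following computes the \( C \)-index exactly as defined above, over k-means clusterings of three well-separated synthetic blobs (the data and the range of \( k \) tried are assumptions for illustration, not the post’s original setup):

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

def c_index(X, labels):
    """C = (S - S_min) / (S_max - S_min): S is the sum of within-cluster
    pairwise distances, l their count, and S_min / S_max the sums of the
    l smallest / largest distances over all pairs."""
    d = pdist(X)                              # condensed pairwise distances
    i, j = np.triu_indices(len(X), k=1)       # pair indices in pdist order
    same = labels[i] == labels[j]             # within-cluster pairs
    S = d[same].sum()
    l = int(same.sum())
    d_sorted = np.sort(d)
    S_min, S_max = d_sorted[:l].sum(), d_sorted[-l:].sum()
    return (S - S_min) / (S_max - S_min)

# Three well-separated blobs (synthetic data)
rng = np.random.RandomState(0)
X = np.concatenate([rng.randn(50, 2),
                    [0, 20] + rng.randn(50, 2),
                    [20, 0] + rng.randn(50, 2)])

scores = {k: c_index(X, KMeans(n_clusters=k, n_init=10,
                               random_state=0).fit_predict(X))
          for k in range(2, 7)}
best = min(scores, key=scores.get)
print("optimal number of clusters:", best)
```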

# Gradient descent method for finding the minimum


Gradient descent is a very popular method for finding the minimum (or maximum) of a given function. It’s simple yet powerful, but it may get trapped in a local minimum. Here I try to find the minimum of the following function:

$$ z= -( 4 \times e^{- ( (x-4)^2 +(y-4)^2 ) }+ 2 \times e^{- ( (x-2)^2 +(y-2)^2 ) } )$$

Here I have solved this function with Limited-Memory CMA-ES (LM-CMA-ES), and as you can see it did not get trapped in the local minimum:

Here, if we start at \( x=4 \) and \( y=2.2 \), we get trapped in the local minimum.

If we try a different starting point, for example \( x=3.5 \) and \( y=2.2 \), we find the minimum:

The code is available on my GitHub.
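Since the full code lives on GitHub, here is a minimal sketch of plain gradient descent on the function above (the step size, iteration count, and the two starting points are illustrative assumptions, chosen so that one run lands in each basin):

```python
import numpy as np

def z(p):
    x, y = p
    return -(4 * np.exp(-((x - 4)**2 + (y - 4)**2))
             + 2 * np.exp(-((x - 2)**2 + (y - 2)**2)))

def grad(p):
    # Analytic gradient of z
    x, y = p
    e1 = np.exp(-((x - 4)**2 + (y - 4)**2))
    e2 = np.exp(-((x - 2)**2 + (y - 2)**2))
    return np.array([8 * (x - 4) * e1 + 4 * (x - 2) * e2,
                     8 * (y - 4) * e1 + 4 * (y - 2) * e2])

def gradient_descent(p0, lr=0.1, steps=2000):
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        p = p - lr * grad(p)
    return p

p_local = gradient_descent([2.5, 2.2])   # converges near (2, 2): local minimum
p_global = gradient_descent([3.5, 3.5])  # converges near (4, 4): global minimum
print("from (2.5, 2.2):", p_local, "z =", z(p_local))
print("from (3.5, 3.5):", p_global, "z =", z(p_global))
```

The two runs illustrate the trapping behavior discussed above: which minimum gradient descent reaches depends entirely on the starting point.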