# Installing NVIDIA DIGITS on Ubuntu 16.04

### caffe

Install caffe as explained in my other post here.

### DIGITS

#### Open in the browser:

http://localhost:5000/

# Hierarchical Clustering in Python



Hierarchical clustering is a method of clustering which builds a hierarchy of clusters. It can be agglomerative or divisive.

1. Agglomerative: At the first step, every item is its own cluster. Then the closest clusters are merged step by step into bigger clusters until all data is in one cluster (bottom up). With a priority queue the complexity is $$O(n^2 \log n)$$.
2. Divisive: At the beginning, all items are in one big cluster. Then we iteratively split this cluster into smaller clusters (top down). With an exhaustive search over the possible splits the complexity is $$O(2^n)$$.

To merge or divide clusters we need a way to measure the distance between two clusters. The common metrics for the distance between clusters are:

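• Single Link: smallest distance between a point in one cluster and a point in the other.
• Complete Link: largest distance between a point in one cluster and a point in the other.
• Average Link: average distance over all such pairs of points.
• Centroid: distance between the centroids of the two clusters.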

Depending on the definition of ‘shortest distance’ (single/complete/average/centroid link), we get different hierarchical clustering methods.
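To make the four definitions concrete, here is a minimal sketch that computes each of them for two small clusters. It assumes Euclidean distance, and the helper names (`pairwise_distances`, `cluster_distance`) and the toy points are made up for illustration.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distance between every point in a and every point in b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def cluster_distance(a, b, method="single"):
    """Distance between clusters a and b under the chosen linkage."""
    d = pairwise_distances(a, b)
    if method == "single":      # closest pair of points
        return d.min()
    if method == "complete":    # farthest pair of points
        return d.max()
    if method == "average":     # mean over all pairs
        return d.mean()
    if method == "centroid":    # distance between the cluster means
        return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    raise ValueError(f"unknown method: {method}")


# Hypothetical toy clusters on a line, so the numbers are easy to check by hand.
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 0.0]])

for method in ("single", "complete", "average", "centroid"):
    print(method, cluster_distance(cluster_a, cluster_b, method))
```

For these points the single link distance is 2, the complete link distance is 5, and the average and centroid distances are both 3.5.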

Hierarchical Algorithms:

1. Single Link: at each iteration, the two clusters that have the closest pair of elements are merged into a bigger cluster.
2. Average Link: the distance between two clusters is the average distance over all pairs of points, one from each cluster. The two clusters with the minimum of these distances merge into a bigger cluster.
3. Complete Link: the distance between two clusters is the distance between their two points that are farthest away from each other. The two clusters with the minimum of these distances merge into a bigger cluster.
4. Minimum spanning tree (MST): In a connected graph, a spanning tree is a subgraph without any cycle in which all vertices are still connected. If the edges have weights, an MST is a spanning tree whose total edge weight is minimal. The MST may not be unique.

To visualize the outcome of hierarchical clustering we often use a “dendrogram”.
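As a rough sketch of how this looks in practice, assuming SciPy is available, the linkage methods above can be run with `scipy.cluster.hierarchy.linkage` and the resulting tree passed to `dendrogram`. The five points below are toy data I made up for the illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Hypothetical toy data: two tight pairs and one isolated point.
points = np.array([
    [0.0, 0.0], [0.0, 1.0],
    [5.0, 5.0], [5.0, 6.0],
    [10.0, 0.0],
])

# One merge table per linkage method; each row is (cluster_i, cluster_j, distance, size).
merges = {m: linkage(points, method=m) for m in ("single", "complete", "average")}

# Cut the single-link tree so that at most three clusters remain.
labels = fcluster(merges["single"], t=3, criterion="maxclust")
print(labels)

# no_plot=True returns the dendrogram layout without drawing it;
# drop the flag (with matplotlib installed) to get the actual figure.
tree = dendrogram(merges["single"], no_plot=True)
print(tree["ivl"])  # leaf order along the dendrogram axis
```

With this data the two tight pairs end up in their own clusters and the isolated point forms the third one.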

The following graph represents the following matrix:

Minimum spanning tree of the graph.
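A minimal sketch of computing an MST from a weight matrix, assuming SciPy's `scipy.sparse.csgraph.minimum_spanning_tree`. The matrix below is a hypothetical 5-vertex example chosen for illustration, not the one from the figure; a zero entry means there is no edge.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# Hypothetical symmetric weight matrix of an undirected graph (0 = no edge).
weights = np.array([
    [0, 2, 0, 6, 0],
    [2, 0, 3, 8, 5],
    [0, 3, 0, 0, 7],
    [6, 8, 0, 0, 9],
    [0, 5, 7, 9, 0],
], dtype=float)

# The result is a sparse matrix that keeps only the edges of the tree.
mst = minimum_spanning_tree(weights).toarray()

edges = [(int(i), int(j), float(mst[i, j])) for i, j in zip(*np.nonzero(mst))]
total_weight = float(mst.sum())

for i, j, w in edges:
    print(f"{i} - {j}: {w}")
print("total weight:", total_weight)
```

A spanning tree on 5 vertices always has 4 edges; for this matrix the minimum total weight is 16.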

# Naive Bayes Classifier Example with Python Code

In the example below I implemented a “Naive Bayes classifier” in Python, and then I used the “sklearn” package to solve the same problem again:
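A minimal sketch of such a comparison, assuming Gaussian likelihoods for each feature and synthetic two-class data. The class `SimpleGaussianNB` and the data are made up for this illustration, so treat it as one possible way to set things up rather than a definitive implementation.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB


class SimpleGaussianNB:
    """Hypothetical from-scratch Gaussian Naive Bayes, for illustration only."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Class priors P(c) and per-class, per-feature mean and variance.
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.vars_ = np.array([X[y == c].var(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # log P(c | x) is proportional to log P(c) + sum_j log N(x_j | mean_cj, var_cj),
        # where the sum over features j is the "naive" independence assumption.
        diff = X[:, None, :] - self.means_[None, :, :]
        log_lik = -0.5 * (np.log(2 * np.pi * self.vars_)[None] + diff ** 2 / self.vars_[None])
        log_post = np.log(self.priors_)[None, :] + log_lik.sum(axis=2)
        return self.classes_[np.argmax(log_post, axis=1)]


# Synthetic training data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
X_test = np.array([[0.0, 0.5], [4.2, 3.8], [-1.0, 0.0], [5.0, 5.0]])

mine = SimpleGaussianNB().fit(X, y).predict(X_test)
theirs = GaussianNB().fit(X, y).predict(X_test)
print("from scratch:", mine)
print("sklearn:     ", theirs)
```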

and the output is:

# Density-Based Spatial Clustering (DBSCAN) with Python Code

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a data clustering algorithm. It is density-based because it finds a number of clusters starting from the estimated density distribution of the corresponding nodes.

It starts with an arbitrary starting point that has not been visited.

This point’s epsilon-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started; otherwise the point is labeled as noise. If a point is found to be a dense part of a cluster, its epsilon-neighborhood is also part of that cluster. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise. DBSCAN requires two parameters: epsilon (eps) and the minimum number of points required to form a cluster (minPts).

I implemented the pseudocode from the DBSCAN Wikipedia page:
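A minimal sketch along the lines of that pseudocode, assuming Euclidean distance and a brute-force neighborhood search. The function names, the label constants and the toy points are my own choices for the illustration; `min_pts` counts the point itself.

```python
import numpy as np

NOISE = -1
UNVISITED = 0


def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (including idx itself)."""
    dists = np.linalg.norm(points - points[idx], axis=1)
    return list(np.flatnonzero(dists <= eps))


def dbscan(points, eps, min_pts):
    labels = np.full(len(points), UNVISITED)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] != UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE            # may still become a border point later
            continue
        cluster_id += 1                  # i is a core point: start a new cluster
        labels[i] = cluster_id
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id   # noise reached from a core point is a border point
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is a core point too: keep expanding
    return labels


# Hypothetical toy data: two dense squares and one far-away outlier.
points = np.array([
    [0, 0], [0, 1], [1, 0], [1, 1],
    [10, 10], [10, 11], [11, 10], [11, 11],
    [50, 50],
], dtype=float)

labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

Here the two squares come out as clusters 1 and 2, and the outlier keeps the noise label -1.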

# Kernel Density Estimation (KDE) for estimating probability distribution function



There are several approaches for estimating the probability distribution function of given data:

1. Parametric
2. Semi-parametric
3. Non-parametric

A parametric example is a Gaussian mixture model (GMM), fitted with an algorithm such as expectation maximization. Here is my other post on expectation maximization.

An example of a non-parametric method is the histogram, where each data point is assigned to exactly one bin, and the height of each bar is determined by the number of data points that fall within that bin's interval.

Kernel Density Estimation (KDE) is another example of a non-parametric method for estimating the probability distribution function. It is very similar to the histogram, but we don’t assign each data point to only one bin. In KDE we use a kernel function which weights each data point depending on how far it is from the point $$x$$.

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^n k\bigg(\frac{x-x_i}{h}\bigg)$$

where $$h$$ is a bandwidth parameter, $$n$$ is the number of data points and $$k$$ is the kernel function. One choice for the kernel function is the Gaussian (normal distribution), but there are other kernel functions (uniform, triangular, biweight, triweight, Epanechnikov) that can be used as well. Choosing a bandwidth that is too small overfits the data (a spiky estimate), while one that is too big underfits it (an oversmoothed estimate). A rule of thumb for choosing the bandwidth is Silverman's rule.
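The formula can be sketched directly in NumPy. This is a minimal illustration that assumes a Gaussian kernel, one-dimensional data, and the common form of Silverman's rule of thumb, $$h = 0.9 \min(\sigma, \mathrm{IQR}/1.34)\, n^{-1/5}$$; the function names and the synthetic sample are made up for the example.

```python
import numpy as np


def gaussian_kernel(u):
    """Standard normal density, used as the kernel k."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)


def silverman_bandwidth(data):
    """Silverman's rule of thumb: 0.9 * min(std, IQR / 1.34) * n^(-1/5)."""
    n = len(data)
    std = np.std(data, ddof=1)
    q75, q25 = np.percentile(data, [75, 25])
    return 0.9 * min(std, (q75 - q25) / 1.34) * n ** (-0.2)


def kde(x, data, h):
    """Evaluate f_hat(x) = 1/(n h) * sum_i k((x - x_i) / h) at every x."""
    u = (x[:, None] - data[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (len(data) * h)


# Synthetic sample from a standard normal distribution.
rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 500)

h = silverman_bandwidth(data)
grid = np.linspace(-6.0, 6.0, 1201)
density = kde(grid, data, h)

# A density should integrate to roughly 1 (simple Riemann sum over the grid).
area = float(np.sum(density) * (grid[1] - grid[0]))
print("bandwidth:", h)
print("area under the estimate:", area)
```

Multiplying or dividing `h` by a large factor and re-plotting `density` against `grid` is an easy way to see the oversmoothed and spiky regimes.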

# Silhouette coefficient for finding optimal number of clusters

The silhouette coefficient is another method to determine the optimal number of clusters. I introduced the C-index earlier here. The silhouette coefficient of a data point measures how well it is assigned to its own cluster and how far it is from the other clusters. A silhouette coefficient close to 1 means the data points are in an appropriate cluster, and a silhouette coefficient close to −1 implies our data is in the wrong cluster. The following is Python code for computing the coefficient and plotting the number of clusters vs. the silhouette coefficient.
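For a single point the coefficient is $$s = \frac{b - a}{\max(a, b)}$$, where $$a$$ is its mean distance to the other points of its own cluster and $$b$$ is its mean distance to the points of the nearest other cluster. A minimal sketch of the search over the number of clusters, assuming scikit-learn's `KMeans` and `silhouette_score` and synthetic data with three well-separated groups (the data and the range of `k` are my own choices for the illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight blobs around hypothetical centers.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 10.0]])
X = np.vstack([c + rng.normal(0.0, 0.5, (40, 2)) for c in centers])

# Mean silhouette coefficient for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(k, round(float(s), 3))
print("best number of clusters:", best_k)
```

To get the plot, pass `list(scores)` and `list(scores.values())` to matplotlib's `plt.plot`; the peak of the curve marks the suggested number of clusters.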