
Introduction to Hierarchical Clustering
Until this point, we have shown that hierarchies can be excellent structures for organizing information, since they clearly show the nested relationships among data points. While this is helpful for understanding the parent/child relationships between items, hierarchies can also be very handy when forming clusters. Expanding on the animal example of the prior section, imagine that you were simply presented with two features of animals: their height (measured from the tip of the nose to the end of the tail) and their weight. Using this information alone, you then have to recreate the same structure in order to identify which records in your dataset correspond to dogs or cats, as well as their respective subspecies.
Since you are only given animal heights and weights, you won't be able to extrapolate the specific names of each species. However, by analyzing the features that you have been provided, you can develop a structure within the data that serves as an approximation of which animal species exist in your data. This perfectly sets the stage for an unsupervised learning problem that is well solved with hierarchical clustering. In the following figure, you will see the two features laid out on the left, with animal height in the left-hand column and animal weight in the right-hand column. This data is then charted on a two-axis plot, with height on the x axis and weight on the y axis:

Figure 2.4: An example of a two-feature dataset comprising animal height and animal weight
One way to approach hierarchical clustering is by starting with each data point serving as its own cluster and recursively joining the most similar points together to form clusters – this is known as agglomerative hierarchical clustering. We will go into more detail about the different ways of approaching hierarchical clustering in a later section.
In the agglomerative hierarchical clustering approach, the concept of data point similarity can be thought of in the same paradigm that we saw with k-means. In k-means, we used the Euclidean distance to calculate the distance from the individual points to the centroids of the expected "k" clusters. For this approach to hierarchical clustering, we will reuse the same distance metric to determine the similarity between the records in our dataset.
By recursively grouping individual records from the data with their most similar records, you end up building a hierarchy from the bottom up: the individual single-member clusters eventually join together into one single cluster at the top of the hierarchy.
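To make this concrete, here is a minimal sketch of the Euclidean distance calculation between two hypothetical animal records; the height and weight values here are made up purely for illustration:
import numpy as np

# Two hypothetical animal records: [height, weight]
animal_a = np.array([95.0, 30.0])
animal_b = np.array([46.0, 4.5])

# Euclidean distance: the square root of the sum of squared feature differences
distance = np.sqrt(np.sum((animal_a - animal_b) ** 2))
print(distance)  # smaller values indicate more similar records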
Steps to Perform Hierarchical Clustering
To understand how agglomerative hierarchical clustering works, we can trace the steps of a simple toy example as its data points merge together to form a hierarchy (a minimal code sketch of these steps follows the list):
- Given n sample data points, view each point as an individual "cluster" with just that one point as a member.
- Calculate the pairwise Euclidean distance between the centroids of all the clusters in your data.
- Merge the closest pair of clusters together into one cluster.
- Repeat Step 2 and Step 3 until you reach a single cluster containing all the data in your set.
- Plot a dendrogram to show how your data has come together in a hierarchical structure. A dendrogram is simply a diagram that is used to represent a tree structure, showing an arrangement of clusters from top to bottom.
- Decide what level you want to create the clusters at.
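The following is a minimal, purely illustrative sketch of steps 1 to 4 using centroid distances. It is not how you would implement this in practice (SciPy's linkage function, used in the exercise later, is far more efficient), but it mirrors the merge-the-closest-pair loop described above:
import numpy as np

def naive_agglomerative(points):
    # Step 1: treat each point as its own single-member cluster
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Step 2: compute centroids and the pairwise distances between them
        centroids = [np.mean(c, axis=0) for c in clusters]
        best = (np.inf, None, None)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(centroids[i] - centroids[j])
                if d < best[0]:
                    best = (d, i, j)
        # Step 3: merge the closest pair of clusters, then repeat (step 4)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

points = [np.array(p) for p in [(1, 7), (-5, 9), (-9, 4), (4, -2)]]
for left, right, dist in naive_agglomerative(points):
    print(left, "+", right, "at distance", round(dist, 2))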
An Example Walk-Through of Hierarchical Clustering
While slightly more complex than k-means, hierarchical clustering does not change too much from a logistical perspective. Here is a simple example that walks through the preceding steps in slightly more detail:
- Given a list of four sample data points, view each point as its own single-member cluster whose centroid is the point itself, with point indices from 0 to 3:
Clusters (4): [ (1,7) ], [ (-5,9) ], [ (-9,4) ], [ (4,-2) ]
Centroids (4): [ (1,7) ], [ (-5,9) ], [ (-9,4) ], [ (4,-2) ]
- Calculate the pairwise Euclidean distance between the centroids of all the clusters. In the matrix displayed in the following diagram, the point indices run from 0 to 3 both horizontally and vertically, showing the distance between the respective points. Along the diagonal are extremely high values to ensure that we do not select a point as its own neighbor (since it technically is the "closest" point). Notice that the values are mirrored across the diagonal (a short sketch of producing such a matrix follows the figure):
Figure 2.5: An array of distances
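As a rough illustration of how a matrix like the one in Figure 2.5 could be produced, here is a small sketch using SciPy's pdist and squareform helpers, with the diagonal filled with infinity so that a point is never chosen as its own neighbor (the exact formatting in the figure may differ):
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[1, 7], [-5, 9], [-9, 4], [4, -2]])

# Pairwise Euclidean distances as a square, symmetric matrix
dist_matrix = squareform(pdist(points, metric="euclidean"))

# Fill the diagonal with infinity so a point is never matched with itself
np.fill_diagonal(dist_matrix, np.inf)
print(np.round(dist_matrix, 2))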
- Group the closest point pairs together.
In this case, points [1,7] and [-5,9] join into a cluster since they are closest, with the remaining two points left as single-member clusters:
Figure 2.6: An array of distances
Here are the resulting three clusters:
[ [1,7], [-5,9] ]
[-9,4]
[4,-2]
- Calculate the centroid of the two-member cluster, as follows:
mean([ [1,7], [-5,9] ]) = [-2,8]
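As a quick check, the same centroid can be computed with NumPy's mean function along the first axis:
import numpy as np

cluster_1 = np.array([[1, 7], [-5, 9]])
print(np.mean(cluster_1, axis=0))  # [-2.  8.]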
- Place this new centroid alongside the two remaining single-member centroids and recalculate the pairwise distances.
Clusters (3):
[ [1,7], [-5,9] ]
[-9,4]
[4,-2]
Centroids (3):
[-2,8]
[-9,4]
[4,-2]
The output will be similar to the following diagram, with the shortest distance called out using a red arrow (a short snippet reproducing these distances follows the figure):
Figure 2.7: An array of distances
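The values behind Figure 2.7 can be approximated by recomputing the distances between the three current centroids; the smallest entry pairs the merged cluster's centroid, [-2, 8], with the point [-9, 4]:
import numpy as np
from scipy.spatial.distance import pdist, squareform

centroids = np.array([[-2, 8], [-9, 4], [4, -2]])
dist_matrix = squareform(pdist(centroids, metric="euclidean"))
np.fill_diagonal(dist_matrix, np.inf)
print(np.round(dist_matrix, 2))  # the smallest value, ~8.06, pairs [-2, 8] with [-9, 4]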
- Since it is the shortest distance away from the merged cluster's centroid, point [-9,4] is added to cluster 1:
Clusters (2):
[ [1,7], [-5,9], [-9,4] ]
[4,-2]
- With only point [4,-2] left, the furthest away from its neighbors, you can simply add it to cluster 1 to unify all the clusters:
Clusters (1):
[ [ [1,7], [-5,9], [-9,4], [4,-2] ] ]
- Plot a dendrogram to show the relationship between the points and the clusters:

Figure 2.8: A dendrogram showing the relationship between the points and the clusters
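A dendrogram similar to the one in Figure 2.8 can be reproduced with SciPy's linkage and dendrogram helpers applied to the four toy points; the exact layout and merge heights may differ slightly from the figure:
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[1, 7], [-5, 9], [-9, 4], [4, -2]])
toy_linkage = linkage(points, method="centroid", metric="euclidean")

dendrogram(toy_linkage)
plt.show()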
At the end of this process, you can visualize the hierarchical structure that you created through a dendrogram. This plot shows how similar the data points are to each other and will look familiar, resembling the hierarchical tree structures that we discussed earlier. Once you have this dendrogram structure, you can interpret how the data points relate to each other and subjectively decide at which "level" the clusters should exist.
Revisiting the earlier animal taxonomy example that involved dog and cat species, imagine that you were presented with the following dendrogram:

Figure 2.9: An animal taxonomy dendrogram
The great thing about hierarchical clustering and dendrograms is that you can see the entire breakdown of potential clusters to choose from. If you were just interested in grouping your species dataset into dogs and cats, you could stop clustering at the first level of the grouping. However, if you wanted to group all species into domesticated or non-domesticated animals, you could stop clustering at level two.
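One way to express this "stop at a chosen level" idea in code is SciPy's fcluster function with the maxclust criterion, which cuts a hierarchy into a requested number of flat clusters. The data below is a hypothetical stand-in for the animal height/weight features, not the actual taxonomy dataset:
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical [height, weight] records standing in for the animal dataset
animal_features = np.array([[95, 30], [100, 34], [48, 4], [46, 5], [250, 190], [240, 180]])
animal_linkage = linkage(animal_features, method="centroid", metric="euclidean")

# Cut the hierarchy at a coarse level (2 clusters); a larger t gives a finer grouping
labels = fcluster(animal_linkage, t=2, criterion="maxclust")
print(labels)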
Exercise 7: Building a Hierarchy
Let's try implementing the preceding hierarchical clustering approach in Python. With the intuition laid out, we can now explore the process of building a hierarchy of clusters with some helper functions provided in SciPy. SciPy is an open source library that packages functions that are helpful in scientific and technical computing, including easy implementations of linear algebra and calculus-related methods. In addition to SciPy, we will be using Matplotlib to complete this exercise:
- Generate some dummy data as follows:
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
%matplotlib inline
# Generate a random cluster dataset to experiment on. X = coordinate points, y = cluster labels (not needed)
X, y = make_blobs(n_samples=1000, centers=8, n_features=2, random_state=800)
- Visualize the data as follows:
plt.scatter(X[:,0], X[:,1])
plt.show()
The output is as follows:
Figure 2.10: A plot of the dummy data
After plotting this simple toy example, it should be pretty clear that our dummy data comprises eight clusters.
- We can easily generate the linkage matrix using SciPy's built-in linkage function:
# Generate the linkage matrix (the hierarchy of merges) with the 'linkage' function
distances = linkage(X, method="centroid", metric="euclidean")
print(distances)
The output is as follows:
Figure 2.11: A matrix of the distances
Looking at this output, you can see how much the choice of hyperparameters drives the linkage matrix you end up with. If you recall our previous steps, linkage works by calculating the distances between clusters of data points and recording which clusters merge, and at what distance. In the linkage function, we have the option to select both the metric and the method (we will cover more on this later).
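If you want to inspect what the linkage matrix actually contains, each row describes one merge: the indices of the two clusters being joined, the distance between them, and the number of original samples in the newly formed cluster. A quick peek at the first few merges:
# Each row: [cluster index 1, cluster index 2, merge distance, sample count in new cluster]
print(distances[:5])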
- After we determine the linkage matrix, we can easily pass it through the dendrogram function provided by SciPy:
dn = dendrogram(distances)
plt.show()
The output is as follows:
Figure 2.12: A dendrogram of the distances
This plot will give us some perspective on the potential breakouts of our data.
- Using this information, we can wrap up our exercise on hierarchical clustering by using the fcluster function from SciPy. The number 3 in the following example represents the maximum inter-cluster distance threshold hyperparameter that you will set. This hyperparameter is tunable based on the dataset that you are looking at; however, it is supplied for you as 3 for this exercise:
scipy_clusters = fcluster(distances, 3, criterion="distance")
plt.scatter(X[:,0], X[:,1], c=scipy_clusters)
plt.show()
The output is as follows:

Figure 2.13: A scatter plot of the distances
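As an optional sanity check that is not part of the original exercise, you can count how many flat clusters the distance threshold of 3 actually produced:
import numpy as np

# Number of distinct cluster labels returned by fcluster
print(len(np.unique(scipy_clusters)))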
By simply calling a few helper functions provided by SciPy, you can easily implement agglomerative clustering in just a few lines of code. While SciPy does help with many of the intermediate steps, this example is still a bit more verbose than what you will probably see in your regular work. We will cover more streamlined implementations later.