Linkage - Applied Unsupervised Learning with Python


Linkage

In Exercise 7, Building a Hierarchy, you implemented hierarchical clustering using what is known as centroid linkage. Linkage is the rule that determines how the distance between two clusters is calculated, and the right choice depends on the type of problem you are facing. Centroid linkage was chosen for that exercise because it essentially mirrors the centroid search that we used in k-means. However, it is not the only option when it comes to clustering data points together. Two other popular choices for determining distances between clusters are single linkage and complete linkage.

Single linkage uses the minimum distance between any pair of points, one taken from each of the two clusters, as its linkage criterion. Put simply, it combines clusters based on the closest points between the two clusters. This is expressed mathematically as follows:

dist(a, b) = min( dist( a[i], b[j] ) )
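As a minimal sketch of this formula, the following computes the single linkage distance between two small clusters. The function name `single_linkage_distance` and the two example clusters are hypothetical and chosen only for illustration; Euclidean distance is assumed.

```python
import numpy as np
from scipy.spatial.distance import cdist


def single_linkage_distance(a, b):
    """Smallest Euclidean distance between any point in a and any point in b."""
    # cdist returns the full matrix of pairwise distances dist(a[i], b[j]).
    return cdist(a, b).min()


# Two hypothetical clusters lying on the x-axis.
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [7.0, 0.0]])

single = single_linkage_distance(cluster_a, cluster_b)
print(single)  # 2.0, the gap between (1, 0) and (3, 0)
```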

Complete linkage is the opposite of single linkage: it uses the maximum distance between any pair of points, one taken from each of the two clusters, as its linkage criterion. Put simply, it combines clusters based on the furthest points between the two clusters. This is expressed mathematically as follows:

dist(a, b) = max( dist( a[i], b[j] ) )
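The complete linkage formula can be sketched in the same way, this time with plain NumPy broadcasting. The function name `complete_linkage_distance` and the example clusters are hypothetical, and Euclidean distance is assumed.

```python
import numpy as np


def complete_linkage_distance(a, b):
    """Largest Euclidean distance between any point in a and any point in b."""
    # Broadcasting builds every difference a[i] - b[j] in one array.
    diffs = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diffs, axis=-1).max()


# Two hypothetical clusters lying on the x-axis.
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [7.0, 0.0]])

complete = complete_linkage_distance(cluster_a, cluster_b)
print(complete)  # 7.0, the gap between (0, 0) and (7, 0)
```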

Determining which linkage criterion is best for your problem is as much art as science, and it depends heavily on your particular dataset. One reason to choose single linkage is that your data is similar in a nearest-neighbor sense, so that when points do differ, they are extremely dissimilar. Since single linkage works by finding the closest points, it will not be affected by these distant outliers. Conversely, complete linkage may be a better option if your clusters are far apart from one another but quite dense internally. Centroid linkage has similar benefits but falls apart if the data is very noisy and the "centers" of the clusters are less clearly defined. Typically, the best approach is to try a few different linkage criteria and see which one fits your data in the way that is most relevant to your goals.
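To see how much the criterion changes the result, here is a small sketch that runs SciPy's linkage function on four hypothetical one-dimensional points and records the distance at which each merge happens. The points and the `heights` dictionary are illustrative assumptions, not part of the exercise data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four hypothetical 1-D points, stored as a column so SciPy sees 4 observations.
points = np.array([[0.0], [1.0], [3.0], [7.0]])

heights = {}
for method in ["single", "complete", "centroid"]:
    merges = linkage(points, method=method)
    # Column 2 of the linkage matrix holds the distance of each merge.
    heights[method] = merges[:, 2]
    print(method, heights[method])
```

The first merge (points 0 and 1) happens at distance 1 under every method. After that the methods diverge: the final merge happens at distance 4 for single linkage, 7 for complete linkage, and about 5.67 for centroid linkage.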

Activity 2: Applying Linkage Criteria

Recall the dummy data of the eight clusters that we generated in the previous exercise. In the real world, you may be given real data that resembles discrete Gaussian blobs in the same way. Imagine that the dummy data represents different groups of shoppers in a particular store. The store manager has asked you to analyze the shopper data in order to classify the customers into different groups, so that they can tailor marketing materials to each group.

Using the data already generated in the previous exercise, or by generating new data, you are going to analyze which linkage types do the best job of grouping the customers into distinct clusters.

Once you have generated the data, read the documentation supplied with SciPy to understand which linkage types are available in the linkage function. Then, evaluate the linkage types by applying them to your data. The linkage types you should test are shown in the following list:

['centroid', 'single', 'complete', 'average', 'weighted']

By completing this activity, you will gain an understanding of linkage criteria, which is important for judging how effective your hierarchical clustering is. The aim is to see how linkage criteria play a role in different datasets and how the right choice can turn a useless clustering into a valid one.

You may realize that we have not covered all of the previously mentioned linkage types. A key part of this activity is learning how to read the docstrings provided with packages in order to explore all of their capabilities.
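One way to read a docstring from a script or notebook is sketched below. It assumes SciPy is installed; the 500-character cutoff is an arbitrary choice to keep the output short.

```python
from scipy.cluster.hierarchy import linkage

# Every documented Python function exposes its docstring as __doc__.
doc = linkage.__doc__
print(doc[:500])

# Check which of the activity's method names the docstring mentions.
methods = ['centroid', 'single', 'complete', 'average', 'weighted']
mentioned = [name for name in methods if name in doc]
print(mentioned)
```

In an interactive session, help(linkage) shows the same text in full.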

Here are the steps required to complete this activity:

Visualize the dataset that we created in Exercise 7, Building a Hierarchy.
Create a list with all the possible linkage method hyperparameters.
Loop through each of the methods in the list that you just created and display the effect they have on the same dataset.

You should generate a plot for each linkage type and use the plots to comment on which linkage types are most suitable for this data.
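The steps above could be sketched roughly as follows. This is not the book's solution: the blob data is a NumPy stand-in generated here rather than the Exercise 7 dataset, and cutting each hierarchy into at most eight flat clusters with fcluster and criterion='maxclust' is an assumption made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import fcluster, linkage

# Stand-in data (assumption): eight Gaussian blobs placed on a circle.
rng = np.random.default_rng(0)
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
centers = 10 * np.column_stack([np.cos(angles), np.sin(angles)])
X = np.vstack([c + rng.normal(scale=0.7, size=(100, 2)) for c in centers])

methods = ['centroid', 'single', 'complete', 'average', 'weighted']
labels_by_method = {}

fig, axes = plt.subplots(1, len(methods), figsize=(20, 4))
for ax, method in zip(axes, methods):
    # Build the hierarchy with the current linkage method.
    Z = linkage(X, method=method)
    # Cut the hierarchy into at most eight flat clusters.
    labels = fcluster(Z, t=8, criterion='maxclust')
    labels_by_method[method] = labels
    # Color each point by its cluster label.
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=5)
    ax.set_title(method)
plt.show()
```

Comparing the panels side by side makes it easier to judge which methods recover the groups you expect in the data.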

The plots that you will generate should look similar to the ones in the following diagram:

Figure 2.14: The expected scatter plots for all methods

Note

The solution for this activity is on page 310.