Semi-supervised learning
There are many problems where the amount of labeled samples is very small compared with the potential number of elements. A direct supervised approach is infeasible because the data used to train the model couldn't be representative of the whole distribution, so therefore it's necessary to find a trade-off between a supervised and an unsupervised strategy. Semi-supervised learning has been mainly studied in order to solve these kinds of problems. The topic is a little bit more advanced and won't be covered in this book (the reader who is interested can check out Mastering Machine Learning Algorithms, Bonaccorso G., Packt Publishing). However, the main goals that a semi-supervised learning approach pursues are as follows:
- The propagation of labels to unlabeled samples considering the graph of the whole dataset. The samples with labels become attractors that extend their influence to the neighbors until an equilibrium point is reached.
- Performing a classification training model (in general, Support Vector Machines (SVM); see Chapter 7, Support Vector Machines, for further information) using the labeled samples to enforce the conditions necessary for a good separation while trying to exploit the unlabeled samples as balancers, whose influence must be mediated by the labeled ones. Semi-supervised SVMs can perform extremely well when the dataset contains only a few labeled samples and dramatically reduce the burden of building and managing very large datasets.
- Non-linear dimensionality reduction considering the graph structure of the dataset. This is one of most challenging problems due to the constraints existing in high-dimensional datasets (that is, images). Finding a low-dimensional distribution that represents the original one minimizing the discrepancy is a fundamental task necessary to visualize structures with more than three dimensions. Moreover, the ability to reduce the dimensionality without a significant information loss is a key element whenever it's necessary to work with simpler models. In this book, we are going to discuss some common linear techniques (such as Principal Component Analysis (PCA) that the reader will be able to understand when some features can be removed without impacting the final accuracy but with a training speed gain.
It should now be clear that semi-supervised learning exploits the ability of finding out separating hyperplanes (classification) together with the auto-discovery of structural relationships (clustering). Without a loss of generality, we could say that the real supervisor, in this case, is the data graph (representing the relationships) that corrects the decisions according to the underlying informational layer. To better understand the logic, we can imagine that we have a set of users, but only 1% of them have been labeled (for simplicity, let's suppose that they are uniformly distributed). Our goal is to find the most accurate labels for the remaining part. A clustering algorithm can rearrange the structure according to the similarities (as the labeled samples are uniform, we can expect to find unlabeled neighbors whose center is a labeled one). Under some assumptions, we can propagate the center's label to the neighbors, repeating this process until every sample becomes stable. At this point, the whole dataset is labeled and it's possible to employ other algorithms to perform specific operations. Clearly, this is only an example, but in real life, it's extremely common to find scenarios where the cost of labeling millions of samples is not justified considering the accuracy achieved by semi-supervised methods.