Resampling with replacement
The most common way to address this issue is based on a resampling procedure. This approach is extremely simple but, unfortunately, it has many drawbacks. Considering the previous example, we could decide to upsample class 1 so as to match the number of samples belonging to class 0. However, we can only use the existing data and, after every sampling step, we restart from the original dataset (replacement), so the same point can be drawn more than once. To better understand the procedure, let's suppose that we generate the dataset by employing the scikit-learn make_classification function (we are going to use it many times in the upcoming chapters):
from sklearn.datasets import make_classification
nb_samples = 1000
weights = (0.95, 0.05)
X, Y = make_classification(n_samples=nb_samples, n_features=2, n_redundant=0, weights=weights, random_state=1000)
We can check the shape of the two subarrays like so:
print(X[Y==0].shape)
print(X[Y==1].shape)
(946, 2)
(54, 2)
As expected (we have imposed a class weighting), the first class is dominant. In upsampling with replacement, we proceed by sampling from the dataset that's limited to the minor class (1), until we reach the desired number of elements. As we perform the operation with replacement, it can be iterated any number of times, but the resultant dataset will always contain points sampled from 54 possible values. In scikit-learn, it's possible to perform this operation by using the built-in resample function:
import numpy as np
from sklearn.utils import resample
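# Draw (with replacement) as many class-1 points as there are class-0 points
X_1_resampled = resample(X[Y==1], n_samples=X[Y==0].shape[0], random_state=1000)
# Stack the untouched class-0 samples and the upsampled class-1 samples
Xu = np.concatenate((X[Y==0], X_1_resampled))
# Build the matching labels: the original zeros followed by one label 1 per resampled point
Yu = np.concatenate((Y[Y==0], np.ones(shape=(X[Y==0].shape[0], ), dtype=np.int32)))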
The function samples from the subarray X[Y==1], generating the number of samples specified through the n_samples parameter (in our case, we have chosen to create two classes with the same number of elements). In the end, it's necessary to concatenate the subarray containing the samples with label 0 with the upsampled one (the same is also done with the labels). If we check the new shapes, we obtain the following:
print(Xu[Yu==0].shape)
print(Xu[Yu==1].shape)
(946, 2)
(946, 2)
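To make the limitation of upsampling with replacement concrete, the following minimal sketch rebuilds the same dataset and counts how many distinct points the upsampled minority class actually contains. The variable names n_major, n_minor, and n_unique are introduced here only for illustration, and the check relies on np.unique with axis=0 to count distinct rows:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

# Same dataset as in the text
X, Y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           weights=(0.95, 0.05), random_state=1000)

n_major = X[Y == 0].shape[0]
n_minor = X[Y == 1].shape[0]

# Upsample class 1 with replacement until it matches class 0
X_1_resampled = resample(X[Y == 1], n_samples=n_major, random_state=1000)

# Count the distinct rows among the resampled points (illustrative check)
n_unique = np.unique(X_1_resampled, axis=0).shape[0]

print(X_1_resampled.shape[0], "resampled points")
print(n_unique, "distinct points, out of", n_minor, "original ones")
```

However many points are drawn, the number of distinct rows can never exceed the size of the original minority class, which is why this procedure balances the classes without adding any new information.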
As expected, the classes are now balanced. Clearly, the same result can be obtained by downsampling the major class, but this choice should be carefully analyzed because, in this case, there is an information loss. Whenever the dataset contains many redundant samples, this operation is less dangerous but, as often happens, removing valid samples can negatively impact the final accuracy because some feature values might never be seen during the training phase. Even if resampling with replacement is not extremely powerful (as it cannot generate new samples), I normally suggest upsampling as a default choice. Downsampling the major class is only justified when the variance of the samples is very small (there are many samples around the mean), and it's almost always an unacceptable choice for uniform distributions.
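For comparison, downsampling the major class can be sketched with the same resample function. This is only an illustration under one assumption that the text does not state: the points are drawn without replacement (replace=False), so that no class-0 sample is duplicated. The names X_0_down, Xd, and Yd are introduced here for the example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

# Same dataset as in the text
X, Y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                           weights=(0.95, 0.05), random_state=1000)

n_minor = X[Y == 1].shape[0]

# Keep only n_minor class-0 points; replace=False is an assumption of this
# sketch, chosen so that every retained point is distinct
X_0_down = resample(X[Y == 0], replace=False, n_samples=n_minor,
                    random_state=1000)

# Rebuild the balanced dataset and its labels
Xd = np.concatenate((X_0_down, X[Y == 1]))
Yd = np.concatenate((np.zeros(n_minor, dtype=Y.dtype),
                     np.ones(n_minor, dtype=Y.dtype)))

print(Xd[Yd == 0].shape)
print(Xd[Yd == 1].shape)
```

The two classes are balanced again, but every class-0 point that was not selected is discarded, which is the information loss discussed above.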