
R2 score

Another very important metric used in regression is called the coefficient of determination, or R2. It measures the amount of variance in the data that is explained by the regression model. In other words, given the variance of the data generating process $p_{data}$, this metric is proportional to the probability of correctly predicting new samples that actually belong to $p_{data}$. Intuitively, if the regression hyperplane approximates the majority of samples with an error below a fixed threshold, we can assume that future values will be correctly estimated. On the other hand, if, for example, the slope yields a small error only for a part of the dataset, the probability of wrong future predictions increases because the model is not able to capture the complete dynamics.

To introduce the measure, let's define the residual as the following quantity:

$$r_i = y_i - \hat{y}_i$$

In other words, it is the difference between the sample and the prediction. So, R2 is defined as follows:

$$R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2}$$

The term $\bar{y}$ represents the average computed over all samples. For our purposes, R2 values close to 1 mean an almost-perfect regression, while values close to 0 (or negative) imply a bad model.
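To make the definition concrete, the following minimal sketch (using small hypothetical arrays rather than the chapter's dataset) computes R2 directly from the formula and checks it against scikit-learn's r2_score:

import numpy as np
from sklearn.metrics import r2_score

# Hypothetical ground-truth values and predictions, for illustration only
y_true = np.array([2.0, 3.5, 4.0, 5.5, 7.0])
y_pred = np.array([2.2, 3.1, 4.3, 5.0, 6.8])

# R^2 = 1 - SS_res / SS_tot, following the definition above
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
print(1.0 - ss_res / ss_tot)      # manual computation
print(r2_score(y_true, y_pred))   # same value computed by scikit-learn

With this definition in place, it's very easy to use this metric together with CV: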

from sklearn.model_selection import cross_val_score

print(cross_val_score(lr, X, Y, cv=10, scoring='r2').mean())
0.2

The result is low, meaning that the model can easily fail on future predictions (contrary to the result provided by the score() method). In fact, we can confirm this by considering the standard deviation:

print(cross_val_score(lr, X, Y, cv=10, scoring='r2').std())
0.599

This is not surprising, considering the variance of the original dataset and the presence of several anomalies. The great advantage of a CV method is that it minimizes the risk of selecting only points that are close to the regression hyperplane.
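As a side note, the two snippets above each re-run the whole cross-validation. A minimal variant (reusing the same lr, X, and Y as before) stores the fold scores once, which avoids the duplicated work and also exposes the individual folds:

from sklearn.model_selection import cross_val_score

# Run the 10-fold CV once and reuse the resulting scores
scores = cross_val_score(lr, X, Y, cv=10, scoring='r2')
print(scores)                       # per-fold R2 values
print(scores.mean(), scores.std())  # summary statistics

Inspecting the per-fold values makes it easy to see which splits are responsible for the high dispersion.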

The reader should be aware that such a model is very likely to yield inaccurate predictions, and shouldn't be tempted to search for a train/test split that merely produces a better-looking score. A reasonable solution is characterized by a high CV score mean and a low standard deviation. On the other hand, when the CV standard deviation is high, another solution should be employed, because the algorithm is very sensitive to the structure of the training set. We are going to analyze some more powerful models at the end of this chapter; however, the choice must normally be restricted to non-linear solutions, which are able to capture complex dynamics (and, thus, to explain more variance).
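As a preview of what such a non-linear solution might look like, here is a minimal sketch based on polynomial regression (the degree-2 choice is arbitrary, and X and Y are the same arrays used above); it can be evaluated with exactly the same CV pattern:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Polynomial regression: expand the features, then fit a linear model
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
scores = cross_val_score(model, X, Y, cv=10, scoring='r2')
print(scores.mean(), scores.std())

If the underlying dynamics are indeed non-linear, both a higher mean and a lower standard deviation should be observed.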