R squared

This quantity is often misunderstood and/or misrepresented. Let’s assume we have some random variable $y$ with mean $\bar{y}$ and some estimator for it $\hat{y}$. The commonly accepted fundamental definition of $R^2$ is:

$R^2 = 1 - \frac{\sum_i(\hat{y}_i - y_i)^2}{\sum_i(\bar{y} - y_i)^2} = 1 - \frac{SSE}{SST}$

Sometimes people call this the “fraction of explained variance”. But that’s only the right way to look at it under special circumstances. All the equation above shows is that $R^2$ “compares” the model $\hat{y}$ to the baseline model $\bar{y}$. It’s a little like comparing a random walk model to something someone clever cooked up hoping to beat it. Note immediately that there’s nothing stopping $R^2$ from being negative, so the square in the notation is unfortunate in that respect. If you choose a bad enough model, say the constant guess $\hat{y} = \bar{y} + \alpha$ for any $\alpha \in \mathbb{R}$ with $\alpha \neq 0$, then $R^2$ will be negative*.
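Here is a minimal sketch of the definition in plain Python, just to make the sign behaviour concrete. The function name `r_squared` and the toy data are my own choices for illustration, not anything from a library.

```python
from statistics import fmean


def r_squared(y, y_hat):
    """R^2 = 1 - SSE/SST, computed straight from the definition."""
    y_bar = fmean(y)
    sse = sum((p - t) ** 2 for p, t in zip(y_hat, y))  # model errors
    sst = sum((y_bar - t) ** 2 for t in y)             # mean-model errors
    return 1.0 - sse / sst


# Toy data (illustrative only).
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_bar = fmean(y)

r2_perfect = r_squared(y, y)                        # exact predictions
r2_mean = r_squared(y, [y_bar] * len(y))            # the baseline itself
r2_shifted = r_squared(y, [y_bar + 1.0] * len(y))   # constant guess, alpha = 1

print(r2_perfect, r2_mean, r2_shifted)
```

The baseline scores zero against itself, and the constant guess that is off by one lands below zero.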

But the comparison made here is not simply comparing the errors of the two models. Instead it compares two variances (which harkens back to Principal Component Analysis, and I will update this blog with that discussion) and then subtracts that ratio from one. If the model for $\hat{y}$ has the property that the errors are zero on average, then the numerator is proportional to the variance of the errors. The denominator is proportional to the variance of $y$, with the same factor of proportionality. So, if the errors $y - \hat{y}$ are denoted by $\epsilon$, that ratio is just $\frac{V(\epsilon)}{V(y)}$. So whenever the variance of the errors is greater than the variance of $y$, $R^2$ ends up in negative territory.
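A quick numerical check of that claim, as a sketch with made-up numbers: the helper names and the data are assumptions for illustration. The first set of predictions has errors that sum to zero; the second is the same set shifted by one, so its errors no longer average to zero.

```python
from statistics import fmean, pvariance


def sse_over_sst(y, y_hat):
    """The ratio that appears in the definition of R^2."""
    y_bar = fmean(y)
    sse = sum((p - t) ** 2 for p, t in zip(y_hat, y))
    sst = sum((y_bar - t) ** 2 for t in y)
    return sse / sst


def variance_ratio(y, y_hat):
    """V(errors) / V(y), using population variances."""
    errors = [t - p for p, t in zip(y_hat, y)]
    return pvariance(errors) / pvariance(y)


y = [2.0, 4.0, 5.0, 4.0, 5.0]
unbiased = [2.8, 3.4, 4.0, 4.6, 5.2]    # errors sum to zero
biased = [p + 1.0 for p in unbiased]    # errors average -1

ratios_unbiased = (sse_over_sst(y, unbiased), variance_ratio(y, unbiased))
ratios_biased = (sse_over_sst(y, biased), variance_ratio(y, biased))

print(ratios_unbiased)  # the two ratios agree
print(ratios_biased)    # SSE/SST is larger than the variance ratio
```

When the errors do not average to zero, the variance ratio ignores the bias and $SSE/SST$ does not, so the two readings of $R^2$ part ways.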

As for the fraction of explained variance interpretation, it only makes sense when $R^2$ is non-negative, and (for errors that are zero on average) that turns out to be the case only when $Cov(y, \hat{y}) \geq \frac{1}{2}V(\hat{y})$, which follows from the simple relation below:

$V(\epsilon) = V(y) + V(\hat{y}) - 2Cov(y, \hat{y})$
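To spell out the step, here is a short derivation. It assumes the errors are zero on average, so that $R^2 = 1 - V(\epsilon)/V(y)$, and that $V(y) > 0$:

$R^2 \geq 0 \iff V(\epsilon) \leq V(y)$

$\iff V(y) + V(\hat{y}) - 2Cov(y, \hat{y}) \leq V(y)$

$\iff Cov(y, \hat{y}) \geq \frac{1}{2}V(\hat{y})$

As one example of the special circumstances: for a least-squares fit that includes an intercept, the fitted values are uncorrelated with the residuals, so $Cov(y, \hat{y}) = V(\hat{y})$. The condition then holds automatically and $R^2$ reduces to $V(\hat{y})/V(y)$.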

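Although that all seems obvious given the simple equations above, it’s actually *not* always done in mainstream practice. For example, the statsmodels package in Python gets this mixed up. Try fitting sm.OLS(Y, X) with no constant column in X (that is, without calling sm.add_constant first) and checking out the excellent $R^2$: when the model has no constant, statsmodels swaps the denominator for the uncentered sum of squares $\sum y_i^2$, so the comparison is against the model $\hat{y} = 0$ rather than $\bar{y}$. Better yet, check out the statsmodels VIF calc, which makes the same mistake. I will add more discussion around that to this blog.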
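To see how much the choice of denominator matters, here is a sketch in plain Python that does not call statsmodels at all. It fits a one-regressor least-squares line through the origin by hand and scores it two ways: against $\sum (y_i - \bar{y})^2$ as in the definition above, and against $\sum y_i^2$, which is the uncentered variant. The data and variable names are invented for illustration.

```python
from statistics import fmean

# Toy data: y hovers around 10 and has nothing to do with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.2, 9.9, 10.1, 10.0, 9.8]

# Least-squares slope for the no-intercept model y ~ b * x.
b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
y_hat = [b * xi for xi in x]

sse = sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat))
y_bar = fmean(y)
sst_centered = sum((yi - y_bar) ** 2 for yi in y)   # baseline: y_hat = y_bar
sst_uncentered = sum(yi * yi for yi in y)           # baseline: y_hat = 0

r2_centered = 1.0 - sse / sst_centered
r2_uncentered = 1.0 - sse / sst_uncentered

print(r2_centered, r2_uncentered)
```

The same fit looks respectable against the zero baseline and terrible against the mean, which is the comparison $R^2$ is supposed to make.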

* To see that any constant guess other than the mean itself will result in a negative R squared, consider one such guess as $\hat{y} = \bar{y} + \alpha$ with $\alpha \neq 0$. Then we have,

$SSE = \sum (\bar{y} - y_i + \alpha)^2 = \sum \left(\bar{y}^2 + y_i^2 + \alpha^2 + 2 \bar{y}\alpha - 2 y_i \alpha - 2 \bar{y} y_i\right)$

$=\sum \left[(\bar{y} - y_i)^2 + \alpha^2 - 2\alpha(y_i - \bar{y})\right]$

$=SST + \delta$

and it remains to show that $\delta$ is greater than zero. Using $\sum y_i = n\bar{y}$:

$\delta = \sum \left[\alpha^2 - 2\alpha(y_i - \bar{y})\right] = n\alpha^2 - 2\alpha \sum{y_i} + 2n\alpha \bar{y} = n\alpha^2 > 0$

And therefore, as long as $SST > 0$, $R^2 = 1 - SSE/SST = - \delta / SST < 0$

Some useful discussions:

https://www.mathworks.com/help/stats/coefficient-of-determination-r-squared.html

https://en.wikipedia.org/wiki/Coefficient_of_determination