Model condition number and kappa()
A colleague got the following warning when fitting a regression model in Python:
The condition number is large, 3.11e+03. This might indicate that there are strong multicollinearity or other numerical problems.
We can compute this condition number in R via `kappa()`:
m_mpg_disp_wt_int <- lm(mpg ~ disp * wt, mtcars)
kappa(m_mpg_disp_wt_int, exact = TRUE)
#> [1] 8508.152
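Note that `exact = TRUE` matters here: by default, `kappa()` returns a fast estimate of the condition number rather than the exact SVD-based value. A quick comparison (the default estimate is only guaranteed to be in the right ballpark):

```r
m <- lm(mpg ~ disp * wt, mtcars)
kappa(m)               # fast estimate of the condition number
kappa(m, exact = TRUE) # exact value, computed via SVD
```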
This kappa score measures the sensitivity of the model: how much the estimates can change, relatively speaking, in response to small changes in the inputs. “The condition number bounds how much the solution z of a system of equations Az=c can change, on a relative basis, when its components A and c are changed,” says whuber on CV.
With `exact = TRUE`, R computes kappa from the ratio of the first and last singular values of the upper triangle of the model’s QR matrix. (I read the source code.)
zero_out_lower_tri <- function(m) { m[lower.tri(m)] <- 0; m }
d <- m_mpg_disp_wt_int$qr$qr |>
  # the model has 4 coefficients, so keep the square 4x4 block
  _[1:4, 1:4] |>
  zero_out_lower_tri() |>
  svd() |>
  _$d
d[1] / d[length(d)]
#> [1] 8508.152
There is a lot of mathematical indirection here, but the intuition I’m hanging onto is that the SVD describes how a matrix warps a “circle” into an “ellipse”, and kappa is the ratio of the longest and shortest axes of that ellipse.
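To make that ellipse intuition concrete, here is a minimal sketch on a small matrix of my own invention (not one of the model matrices above). This matrix stretches the unit circle to a width of 10 in one direction and squashes it to 0.1 in the other, so its kappa should be their ratio:

```r
# stretches x by 10, squashes y to 0.1
A <- matrix(c(10, 0, 0, 0.1), nrow = 2)
# singular values: the semi-axis lengths of the ellipse
s <- svd(A)$d
s[1] / s[2]
#> [1] 100
kappa(A, exact = TRUE)
#> [1] 100
```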
We can reduce this condition number by (roughly) centering our predictors:
mean(mtcars$disp)
#> [1] 230.7219
mtcars$disp_230 <- mtcars$disp - 230
mean(mtcars$wt)
#> [1] 3.21725
mtcars$wt_3 <- mtcars$wt - 3
m_mpg_disp_wt_int <- lm(mpg ~ disp_230 * wt_3, mtcars)
kappa(m_mpg_disp_wt_int, exact = TRUE)
#> [1] 447.5556
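Centering changes the coordinates but not the model itself: the centered predictors and their interaction span the same column space as the raw ones, so the fitted values are identical. A quick check, refitting both models inline with `I()` instead of the precomputed columns:

```r
m_raw      <- lm(mpg ~ disp * wt, mtcars)
m_centered <- lm(mpg ~ I(disp - 230) * I(wt - 3), mtcars)
# same predictions, better-conditioned computation
all.equal(fitted(m_raw), fitted(m_centered))
#> [1] TRUE
```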
For simple non-lmem models, I prefer this kind of rough centering to nice round numbers, so that the intercept is easy to describe and the centering constants are easy to remember. When more exact centering is needed, I use full z-scores instead:
mtcars$z_disp <- mtcars$disp |> scale() |> as.vector()
mtcars$z_wt <- mtcars$wt |> scale() |> as.vector()
m_mpg_disp_wt_int <- lm(mpg ~ z_disp * z_wt, mtcars)
kappa(m_mpg_disp_wt_int, exact = TRUE)
#> [1] 5.065835
As we can see, this transformation made a tremendous improvement in the condition number.
Why doesn’t R raise a warning? There are probably good reasons not to warn by default: the condition number is a diagnostic score, and the thresholds for interpreting it are rules of thumb. That said, I could not find a clear answer from an R (or S) author about why R doesn’t warn about the condition number or multicollinearity.