Incorrect linear regression results #426
I forgot to say that I am using Julia 1.5.3 on Linux and 1.5.4 on Windows. Both produced the same results.
@nalimilan, could you have a quick look at the issue and confirm or reject that it is a problem with GLM?
I also started a discussion on Discourse: https://discourse.julialang.org/t/can-someone-replicate-this-glm-problem-with-linear-regression-on-your-computer/60098

There, rafael.guerra found that subtracting 2000 from Year makes the regression estimate correctly, and he wonders if the problem is due to a normalization error. This makes it more likely that the problem lies within GLM.jl.

I'm teaching a course that uses basic regression for simple forecasting. Subtracting 2000 will work around the problem, but it also means that when forecasting one needs to account for the shift, which complicates things unnecessarily. And this is part of a take-home exam, so the extra complexity will not be welcomed by students. I hope that a GLM maintainer, or someone with better Julia-fu, can fix the issue and publish a newer version of GLM, so that I can simply ask students to update the package. That would be a better and simpler solution, I think.
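For reference, a minimal sketch of that workaround, assuming the `Price ~ Year + Mileage` model and the `Car-Training.csv` file from the original post (the column name `Year2000` is mine):

```julia
using GLM, CSV, DataFrames

data = CSV.read("Car-Training.csv", DataFrame)
data.Year2000 = data.Year .- 2000   # shift Year so the values are small
lm(@formula(Price ~ Year2000 + Mileage), data)
```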
As mentioned on Discourse, this stems from `dropcollinear=true` being the default. Oddly, it doesn't fail with `dropcollinear=false`.
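A sketch of the two variants, assuming the model from the original post:

```julia
using GLM, CSV, DataFrames

data = CSV.read("Car-Training.csv", DataFrame)

# Default in recent GLM versions: pivoted Cholesky that may drop "collinear" columns
lm(@formula(Price ~ Year + Mileage), data; dropcollinear=true)

# Plain Cholesky, no column dropping; succeeds on this data
lm(@formula(Price ~ Year + Mileage), data; dropcollinear=false)
```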
Thank you. This indeed solved the issue. I will prepare a note for students so that they are aware. As I have commented on Discourse, I'm not sure whether `dropcollinear=true` is a sane default.
As you have mentioned on Discourse, Stata also drops collinear variables by default.
The cause is this line (line 107 in 16268be). I don't understand Cholesky factorization well enough to say why; it's odd that it misjudges the rank here.
Just curious: how do you debug in Julia? Is there a trace function, or do you just print a lot of intermediate results to see where things went wrong?
In this instance I had a feeling, and I knew the library well enough to know what functions were being called. But for debugging I generally use [...]
Thank you!
I'd like to try to help, but I don't fully understand the code. Could you rewrite the relevant call so it's easier to follow?

A `!` is added after the function name; what does it mean? What does the `Val(true)` argument do? And what does the `tol` keyword control?
There are two distinct misunderstandings here. The first is Julia syntax. The syntax `cholesky!(F)` is the same as `cholesky(F)`, except that the trailing `!` marks, by convention, a function that mutates its arguments: it overwrites `F` with the factorization instead of working on a copy. The `Val(true)` argument selects the pivoted variant of the factorization, and the `tol` keyword is the threshold used when determining the numerical rank.

The next misunderstanding is what's going on mathematically. To better understand why we call a Cholesky factorization at all: the least-squares problem is solved via the normal equations, which require factorizing `X'X`.
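A tiny illustration of the `!` convention (my example, not code from the package):

```julia
using LinearAlgebra

A = [4.0 2.0; 2.0 3.0]
F = cholesky(A)     # non-mutating: works on a copy, A is unchanged
B = copy(A)
G = cholesky!(B)    # mutating: B's storage is overwritten by the factor
```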
It seems that the default tolerance (`tol=-1`, i.e. LAPACK's default) is too aggressive here. With `tol=1e-5` the pivoted factorization finds full rank, while `tol=1e-4` and the default both report rank deficiency:

```julia
julia> X = results.mm.m;

julia> F = Hermitian(float(X'X));

julia> cholesky(F, Val(true), tol=1e-5, check=true)
CholeskyPivoted{Float64, Matrix{Float64}}
U factor with rank 3:
3×3 UpperTriangular{Float64, Matrix{Float64}}:
 3.6764e5  18721.7   9.31674
      ⋅     9036.63  4.49425
      ⋅         ⋅    0.00369674
permutation:
3-element Vector{Int64}:
 3
 2
 1

julia> cholesky(F, Val(true), tol=1e-4, check=true)
ERROR: RankDeficientException(1)
[...]

julia> cholesky(F, Val(true), tol=-1, check=true)
ERROR: RankDeficientException(1)
[...]
```

The default tolerance is defined like this:
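Presumably this referenced LAPACK's rule: per the `pstrf` documentation, a negative `tol` falls back to `n * eps * max(diag(A))`. A sketch of that rule (my paraphrase of the LAPACK docs, not the code that was quoted here):

```julia
using LinearAlgebra

# LAPACK dpstrf's default pivoted-Cholesky tolerance when tol < 0
lapack_default_tol(A) = size(A, 1) * eps(eltype(A)) * maximum(diag(A))
```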
That code seems to use the QR decomposition, so do you think we can compare the tolerances?
That's a good point. Let's try fixest. They define their own Cholesky factorization here, but their tolerance isn't directly comparable: there's some sort of scaling that goes on, such that the keyword argument passed to their routine doesn't mean the same thing as the `tol` we pass to LAPACK.

I wonder if it's worth pinging Laurent on this. He's been saying he's doing more things in Julia.
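I could not verify fixest's source here, but if the scaling is the usual kind (rescaling X'X to unit diagonal before factorizing), then its tolerance is relative to each column's scale rather than absolute. A sketch of that idea, purely as an assumption about the approach:

```julia
using LinearAlgebra

# Hypothetical: pivoted Cholesky of a diagonally rescaled X'X,
# so a fixed tolerance acts relative to each column's scale
function scaled_cholesky(A::Symmetric; tol=1e-10)
    d = sqrt.(diag(A))
    S = Symmetric(Matrix(A) ./ (d * d'))  # unit diagonal
    cholesky(S, Val(true); tol=tol, check=false)
end
```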
Thank you. While I am not able to fully follow the technical discussion, I am grateful to see the progress. I think part of the issue is that the Year variable in my exam dataset is nearly invariant: the values are very close to the mean and do not vary by more than around 1% of it. This raises the question of how much variation, relative to the mean, is enough to avoid the problem.

Below I create a new variable that follows a normal distribution, with a larger mean than Year and a smaller deviation relative to that mean.
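(A sketch of what is being described, with made-up parameters: a mean larger than Year's and a spread well under 1% of that mean.)

```julia
using GLM, DataFrames, Random

Random.seed!(1)
data.Z = 100_000 .+ 100 .* randn(nrow(data))   # mean 1e5, sd 100 (0.1% of the mean)
lm(@formula(Price ~ Z + Mileage), data)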
Could you test whether an even smaller deviation relative to the mean still works? Is the tolerance the right knob to turn here? What are the practical implications of changing it? And I am wondering if it is possible to release a new version of GLM.jl with a safer default, so that I can simply ask students to update the package.
While changing the tolerance would make this particular dataset work, it doesn't address the underlying conditioning problem. I would suggest creating a variable of years since 2008, i.e. `Year .- 2008`, as in the sketch below.

This is a bummer of a bug, though. It's unfortunate that we are uncovering this behavior, but thank you for bringing it to our attention.
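A sketch of the suggested reparameterization, using the dataset from the original post:

```julia
using GLM, CSV, DataFrames

data = CSV.read("Car-Training.csv", DataFrame)
data.YearsSince2008 = data.Year .- 2008    # small, well-scaled regressor
lm(@formula(Price ~ YearsSince2008 + Mileage), data)
```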
Digging into the history of that default now.
@pdeffebach That threshold stems from the days when that was the threshold used in MixedModels.

Changing the tolerance won't always fix the problem here, though, because if you're that ill-conditioned, then you may or may not hit the threshold on different architectures. (Floating-point math isn't associative, so e.g. using AVX and other SIMD instructions can change the result. This led to intermittent test failures in MixedModels depending on whether or not the test was running on a machine with AVX512 support, which is what motivated the discussion that I linked.)

As mentioned in #375, the extra precision from the QR decomposition can sometimes help, albeit at a slight performance penalty. But the best choice here is an affine transformation of `Year`.
All that said, I think we should probably revisit the default tolerance and document how rank deficiency is handled.
I agree with @palday here. In particular, I believe that age is a more meaningful variable here. It doesn't change the general issue, though: numerical rank determination is not clear-cut. I just want to mention that QR wouldn't make a difference for this particular problem:

```julia
julia> cond(cholesky(Symmetric(X'X), Val(true)).U)
9.957869521461289e7

julia> cond(qr(X, Val(true)).R)
9.957869522989982e7
```

Notice that both correspond to reduced rank with respect to the R tolerance.
The numerical concerns are important, but I'm not super happy with a situation where we silently drop a variable that other software estimates without complaint. Could we change the tolerance to something less aggressive?

This problem was caused by me, when I made `dropcollinear=true` the default. Is there some other heuristic we can use to determine rank deficiency outside of Cholesky? Of course, we can always revert the default.
@pdeffebach The numerical issues won't go away. Poorly conditioned models will often have misleading standard errors and other pathologies. I'm pretty strongly opposed to making things "too easy", because there will always be cases where these things come to the surface. Model diagnostics (does my model seem to actually fit my data? did something go wrong in fitting the model?) and centering and scaling covariates are applied-stats 101. Users don't have to know how our internal numerics work, but they should understand that numerical issues can and do arise in real-world statistical problems, and that there are ways to check for false convergence (diagnostics) and ways to help the computer (centering and scaling).

Note that in your historical counterexample, there's a problem: [...]
As for other heuristics:

- QR with pivoting: already discussed above.
- Singular value decomposition: but this is expensive (see the sketch below).
- We could look at the condition number to guess whether or not a given calculation is likely to be ill-conditioned, but that's also an extra, non-trivial operation, and it doesn't tell us what to do, just that whatever we do there might be problems.

We have a whole page explaining how numerical rank deficiency crops up in mixed models, and how rank deficiency is well defined in theory but really, really challenging in practice: https://juliastats.org/MixedModels.jl/stable/rankdeficiency/
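As an illustration of the SVD route (a sketch, not anything GLM currently does; the helper name is mine):

```julia
using LinearAlgebra

# Numerical rank from singular values, with a LAPACK-style relative tolerance
function svd_rank(X; rtol=maximum(size(X)) * eps(float(eltype(X))))
    s = svdvals(X)                 # singular values, in decreasing order
    count(>(rtol * first(s)), s)   # values above the relative threshold
end
```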
It might be a good idea to benchmark the singular value decomposition. As an iterative method it is definitely more expensive than direct methods like Cholesky and QR, but some of the recent SVD methods are surprisingly fast. The point is that the condition number is a property of the singular values. @andreasnoack may be able to offer more informed commentary on the SVD methods that are available; I haven't really kept up.
@andreasnoack I'm not equipped to do a full numerical-instability analysis of R's algorithm, but here is a full MWE for you, in R (via fixest) and in Julia.
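(Both code blocks were lost; here is a reconstruction of the Julia side, assuming the dataset and model discussed throughout the thread. The R side would be the corresponding `feols` call in fixest.)

```julia
using GLM, CSV, DataFrames

data = CSV.read("Car-Training.csv", DataFrame)
lm(@formula(Price ~ Year + Mileage), data)   # drops the intercept under the default tolerance
```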
I don't understand what the R code is doing differently under the hood.
@pdeffebach It appears that your R code is not using the Cholesky factorization from R itself but a specific method which, judging from its name, I expect is coded in C++.
Yeah, I know. It felt like a decent benchmark because fixest is so widely used in practice.
I'm also a bit worried that we fail where R succeeds. As this report shows, people are going to blame us for it. Do we have any idea whether this particular dataset is very special, or whether we can expect frequent failures like this? Maybe, when in doubt, we should err on the side of treating the matrix as full rank?
I hadn't read the whole thread of this issue. I think fixest is a distraction, and we should concentrate on the methods in base R if we are going to compare to the GLM results. There is a difference between using Cholesky and QR to solve a least-squares problem, because Cholesky involves forming X'X while QR works directly on X, as does SVD. The condition number of X'X is the square of the condition number of X, so there is an immediate disadvantage in using Cholesky, and it will affect the estimate of the rank.

Having said that, I will point out that there is no reliable method of determining the rank of an arbitrary matrix numerically. In floating-point arithmetic you simply can't do it, because arbitrarily small perturbations can change the answer. The best you can do is to come up with a scheme for handling numerical rank deficiencies and document it.

This is an ill-conditioned problem, which means that different systems will give different answers regarding the number of estimable coefficients. (Standard techniques such as mean-centering the regressor variables will help.) The condition number of the X matrix is about 10^8, and the condition number of X'X is about 10^16, which means it is computationally singular.

```julia
julia> using CSV, DataFrames, LinearAlgebra, Statistics

julia> data = CSV.read("/home/bates/Downloads/Car-Training.csv", DataFrame);

julia> X = hcat(ones(nrow(data)), data.Year, data.Mileage);

julia> cond(X)
9.957869522988558e7

julia> cond(Symmetric(X'X))
9.915916546575758e15
```

By comparison, if you mean-center the `Year` and `Mileage` columns,

```julia
julia> Xcen = hcat(ones(nrow(data)), data.Year .- mean(data.Year), data.Mileage .- mean(data.Mileage));

julia> cond(Xcen)
21489.22367695712

julia> cond(Symmetric(Xcen'Xcen))
4.617867342370313e8
```
It is unhelpful to use expressions like "succeed" or "fail" in these circumstances. This is an ill-conditioned problem, and there is no "correct" answer.
If it is felt that we should emphasize numerical accuracy over speed, then I would be willing to work on a rewrite of the package to use a QR factorization by default. We have various methods in MixedModels to try to detect and handle rank deficiency, but none of them are foolproof.
Right. Let me phrase it differently: people are less likely to complain if estimation returns a finite coefficient for each variable when two variables are near-collinear (however meaningless the coefficients may be). They are used to checking for collinearity issues using various indicators, and, more importantly, they can drop one of the problematic columns and estimate another model. But if we silently drop a variable, as we do here, we are more likely to annoy them, and the solution is less trivial than dropping one of the predictors. So I'd rather err on the side of assuming the matrix is full rank.

Favouring accuracy by default indeed sounds like a reasonable approach to me, and it doesn't prevent making it easy to use faster methods. Though I can't really comment on the details, TBH.
It's worth noting that [...]. This may be a naive suggestion, but is there a reason we can't calculate what LAPACK's default tolerance would be, and then just divide that by 10?
For the record, in fixest the collinearity check is performed on a scaled version of X'X, as noted above.

That said, in case of an ill-conditioned model, the variables removed do differ between implementations.

Just my 2 cents: I stand with the general opinion that having the same variables removed to avoid collinearity is not possible across software. Different algorithms imply different solutions (QR vs Cholesky, plus different ways to pivot and different tolerance levels), so the only way to have the same results would be to apply the same algorithms across software, which is not desirable (since the liberty of implementation would be lost). And anyway, it's always the user's responsibility to provide a proper model.
In SAS, you get a warning when using this type of data with `proc reg`:

```sas
proc reg data=work.cars;
  model Price = Year Mileage;
run;
```

```
WARNING: The range of variable Year is so small relative to its mean that there may be loss of accuracy in the computations. You may need to rescale the variable to have a larger value of RANGE/abs(MEAN), for example, by using PROC STANDARD M=0;
```

I like that it is only a warning, and the suggestion is not too prescriptive, as other options could be good depending on the problem. With `proc glm`, one can instead treat Year as a classification variable:

```sas
proc glm data=work.cars;
  class Year;
  model Price = Year Mileage / solution;
run;
```

My view is that we should transparently indicate that the problem is ill-conditioned when that is the case (as here), and provide some diagnostics if we have them.
Thank you all for the discussion. I have not been able to make time to look into the technical details, but I agree that offering diagnostics or hints would be very helpful from the perspective of a user.
I think I just got bitten by this issue as well. Not sure if this adds to the discussion at all (it's a bit above my head), but my issue seems to happen with moderately large numbers (on the order of [...]).

The MWE has code to run in [...].
I'm having the same problem, in a regression with ~100 regressors ranging between 0 and 1 and ~9 million observations. That is, I have a linear regression that runs fine in R and Stata but not with GLM.jl's `lm()`. I understand that the data matrix may be nearly degenerate, but the regression is also meaningful: what is called an event study for a difference-in-differences design. So for me this problem is a deterrent to using GLM at all. I wonder if Cholesky vs QR decomposition is the key difference in my case?
Would you be able to share the data, or some artificial data that causes the same issue? Otherwise it's hard to debug. Also, could you elaborate on what "incorrect" and "running fine" mean here? If the matrix is nearly singular, then it's hard to conclude anything just from the coefficients. Are the predicted values off?
I'm sorry to have been vague. I shouldn't post the data publicly, and it's a big file, but I could email it to someone. I am reasonably confident that the issue is the imprecision of the Cholesky decomposition (the design matrix is extremely ill-conditioned), and the case that this is an issue seems to have already been made convincingly above. Moreover, after I posted this, I learned that a commit has been made to add a QR-based solver option.
Results from the same regression with `lm()` and with [...]:
I have now also encountered this issue and put together a minimum working example, as it seems the conversation has died down without the underlying issue being fixed. I've attached a CSV with data; in my case, I'm working with chunks of 300 data points at a time.

This produced the following graph. While the work-around of setting `dropcollinear=false` fixes the fit, the default behavior was quite surprising.
The model matrix is poorly conditioned, as your X values are all between 76383 and 76682. You are trying to estimate an intercept that is very far away from the observed range of the X values, which leads to collinearity, just as it did three years ago.

What do you want to do with the model that you would fit, and can it be accomplished by centering the X values?
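A sketch of the centering approach for slope extraction (hypothetical column names `x` and `y`, with synthetic values mimicking the narrow range described above):

```julia
using GLM, DataFrames, Statistics

x = 76383.0 .+ 300 .* rand(300)              # narrow range far from zero
df = DataFrame(x = x, y = 0.01 .* x .+ randn(300))
df.xc = df.x .- mean(df.x)                   # center the regressor
m = lm(@formula(y ~ xc), df)
slope = coef(m)[2]                           # the slope is unchanged by centering x
```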
I primarily want to extract the slope from the fit, as the rate of change is the parameter relevant to other analysis I'm doing (I'm being a little intentionally vague here), so I'd expect centering the X values would work. Let me give it a shot and I'll report back.
Indeed, centering the data also recovers the proper slope coefficient. With two possible workarounds (`dropcollinear=false` and centering), this is no longer blocking me.
I would tend to prefer centering, as long as it still gives you coefficients with the desired interpretation. We've been a bit slow wrapping up a few loose ends, but I'll try to allocate some time in the next few weeks to get a release with the new QR-based solver out; it's a bit slower but more numerically stable.
Hmm, I am tempted to try to understand this issue and add warnings like SAS has, to give the user concrete advice here.
Car-Training.csv
This is a bit urgent.
In a take-home exam that is ongoing, I have the attached dataset, on which students need to estimate models.

Here is the Julia code:
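(A reconstruction consistent with the rest of the thread, where the fitted object is later referred to as `results`:)

```julia
using GLM, CSV, DataFrames

data = CSV.read("Car-Training.csv", DataFrame)
results = lm(@formula(Price ~ Year + Mileage), data)
```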
The output is the following: [...]
The intercept is not estimated, and the other two coefficients are not correct.
I tried R, SPSS, and Excel; all gave the same results, which differ from Julia's. I post the results from R below: [...]
Did I do something wrong here, or is there an issue with GLM? If there is an issue with GLM, can a new version be published as soon as possible, so that I can notify the students who are taking this exam right now?