-
Notifications
You must be signed in to change notification settings - Fork 810
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Collinearity Speed Up v2 #421
Conversation
…lin_prob added comment
…lin_prob making it faster
Sorry for the late response, is there any reason you switched back to VIFs from VDFs? |
Yeah, I took a closer look at the tests, and realized that in cases of perfect collinearity, the VDPs are uniformly split between the columns. This creates a problem if you have more than 3 perfectly collinear columns. Initially I thought that nothing is ever perfectly collinear so it doesn't matter, but then I realized that the test situation is highly plausible especially if someone doesn't know what they're doing or doesn't understand their data. If you use VDPs, I think its best to use them in conjunction with another indicator of collinearity. The VDP formula and VIF formula do have a lot in common, so it could be wise to use them in conjunction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks great! Once you resolve my comments I can approve for merging, or if you want to investigate a joint approach using VDPs you can add on as well.
mage_ai/data_cleaner/cleaning_rules/remove_collinear_columns.py
Outdated
Show resolved
Hide resolved
mage_ai/data_cleaner/cleaning_rules/remove_collinear_columns.py
Outdated
Show resolved
Hide resolved
mage_ai/tests/data_cleaner/cleaning_rules/test_remove_collinear_columns.py
Show resolved
Hide resolved
mage_ai/tests/data_cleaner/cleaning_rules/test_remove_collinear_columns.py
Show resolved
Hide resolved
mage_ai/data_cleaner/cleaning_rules/remove_collinear_columns.py
Outdated
Show resolved
Hide resolved
@ChadGueli @skunichetty is this still good to go? |
Yes, it should be. |
It's been a while since I last looked at this, but it should be good - i'll give it the go ahead. |
Summary
Again, this is a major speed up. It eliminates recomputation along rows, saving a huge amount of time.
The VIFs are computed using the correlation formula, but optimized with only the diagonal inverted.
Tests
It passes all of the unittest tests. But the
test_vif_calculation
test is removed. I would be happy to add another that checks the current formulation.cc:
@tommydangerous @skunichetty