Feature/two way scaling #104
base: develop
Conversation
Enjoy these comments; I'd like to take another look before merge.
inverse_covariance/clean.py
Outdated
)


def twoway_standardize(X, axis=0, with_mean=True, with_std=True, copy=True,
rename to two_way_standardize
inverse_covariance/clean.py
Outdated
@@ -0,0 +1,323 @@
import numpy as np
Prefer the file name two_way_standard_scaler.py. If you want it to be part of a larger set of cleaning tools you could put it in a module with inverse_covariance/clean/__init__.py, but I think it's fine to leave it here where it is (flat) and just rename it, since we don't have a lot of these yet.
I addressed this by just changing the file names. If we want to subscope it to clean or preprocessing later we can, but since this is the only one, I'm keeping it flat for now.
*addressed when I push changes
inverse_covariance/clean.py
Outdated
raise NotImplemented(
    "Algorithm for sparse matrices currently not supported.")

else:
No need for else here since you are raising; you can bump the indentation to the left for this whole block after removing the else.
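For illustration only, roughly the shape being suggested (this helper function is made up, not the PR's code):

```python
# Illustrative sketch: because the raise exits the function, no else is
# needed and the dense-matrix path can sit at the outer indentation level.
import numpy as np
from scipy import sparse


def _as_dense(X):
    if sparse.issparse(X):
        raise NotImplementedError(
            "Algorithm for sparse matrices currently not supported.")
    return np.asarray(X)
```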
inverse_covariance/clean.py
Outdated
Xrow_polish = scale(Xcol_polish.T, axis=1,
    with_mean=True, with_std=with_std)
n_iter += 1
err_norm_row = np.linalg.norm(oldXrow-Xrow_polish, 'fro')
Add spaces between variables and operators: oldXrow - Xrow_polish
inverse_covariance/clean.py
Outdated
    with_mean=True, with_std=with_std)
n_iter += 1
err_norm_row = np.linalg.norm(oldXrow-Xrow_polish, 'fro')
err_norm_col = np.linalg.norm(oldXcol-Xcol_polish, 'fro')
Add spaces between variables and operators here as well.
inverse_covariance/clean.py
Outdated
print('Input is sparse')
raise NotImplementedError(
    'Algorithm for sparse matrices currently not supported.')
else:
no need for else after raise or return
inverse_covariance/clean.py
Outdated
else:
    raise NotImplementedError(
        'Two Way standardization not reversible with accuracy')
X = np.asarray(X)
this code will never execute
Make sure there is a test that exercises each of these methods; that would have picked this up.
Seems like this should be a warning, not an exception.
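A hedged sketch of what the warning version might look like (the message text is illustrative, not the PR's wording):

```python
# Sketch only: emit a warning instead of raising, then fall through to the
# approximate inverse transform.
import warnings

warnings.warn(
    "Two-way standardization is not exactly reversible; "
    "inverse_transform returns a one-step approximation.",
    UserWarning,
)
```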
sqcov_cols = np.diag(np.sqrt(var_cols))
return mu + sqcov_rows * X * sqcov_cols
Can we add some tests for two_way_standardize that test that it does what's expected, not just the interface? Like with a fixed input and expected output.
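Something along these lines, perhaps (a sketch only: it assumes twoway_standardize returns the standardized array and checks the defining row/column properties rather than a hard-coded matrix; the tolerances would depend on the implementation's convergence criterion):

```python
import numpy as np
from inverse_covariance.clean import twoway_standardize


def test_twoway_standardize_doubly_standardizes():
    # Fixed input via a seeded RNG so the test is deterministic.
    X = np.random.RandomState(0).rand(6, 4) * 10 + 5

    Xts = twoway_standardize(X)

    # After successive normalization, row and column means should be ~0
    # and row and column standard deviations ~1.
    assert np.allclose(Xts.mean(axis=0), 0, atol=1e-2)
    assert np.allclose(Xts.mean(axis=1), 0, atol=1e-2)
    assert np.allclose(Xts.std(axis=0), 1, atol=1e-2)
    assert np.allclose(Xts.std(axis=1), 1, atol=1e-2)
```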
from inverse_covariance.clean import (
    twoway_standardize
)
Can we add some tests for the TwoWayStandardScaler with fixed inputs and expected outputs that run over a number of options and ensure it's right? Also the standard sklearn check_estimator test (see example here: https://github.com/skggm/skggm/blob/develop/inverse_covariance/tests/common_test.py).
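Roughly, something like this (a sketch; the import path follows the renamed file discussed above, and check_estimator is the generic sklearn compliance suite referenced in the linked common_test.py):

```python
from sklearn.utils.estimator_checks import check_estimator
from inverse_covariance.two_way_standard_scaler import TwoWayStandardScaler


def test_two_way_standard_scaler_check_estimator():
    # Runs sklearn's generic estimator-compliance checks against the class.
    check_estimator(TwoWayStandardScaler)
```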
inverse_covariance/clean.py
Outdated
""" | ||
check_is_fitted(self, 'row_scale_') | ||
|
||
copy = copy if copy is not None else self.copy |
I think this can be shortened to copy = copy or self.copy
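One caveat worth flagging, sketched below: the two forms only agree when copy is None or truthy, since an explicit copy=False falls through to self.copy in the short form.

```python
# Illustration only, assuming self.copy is True as in the sklearn-style default.
self_copy = True

for copy in (None, True, False):
    long_form = copy if copy is not None else self_copy
    short_form = copy or self_copy
    print(copy, long_form, short_form)

# None  -> True, True    (same)
# True  -> True, True    (same)
# False -> False, True   (short form silently ignores an explicit copy=False)
```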
Alright, I've reworked this a bit; I am going to leave some questions for you @mnarayan inline.
return mu + sqcov_rows * X * sqcov_cols
@mnarayan please review all of the tests in the file to see if it's behaving as you would expect. You can add more tests to the parametrize decorator if you know of good specific small examples.
# Q: This doesnt seem to actually get used in the transform, only in
# the inverse transform which it sounds like we should not support.
@mnarayan The values computed in the fit/partial_fit routines do not get used in the two_way_standard_scaler routine, only in the "inverse transform" method. Do we actually need them or are they just cribbed from sklearn?
@mnarayan did you see this question about partial fit? If these variables are not needed, we also don't need partial_fit, can simplify everything, and have this be a transform-only transformer.
estimator=self,
)
n_rows, n_cols = np.shape(X)
if self.n_cols_ != n_cols:
This check was required to pass the sklearn check_estimator suite. The main idea is that we throw a ValueError if the dimension of the data passed to transform does not match the dimension seen by fit. Does this seem right?
Yeah this sounds good.
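For reference, a small sketch of the check being described (the helper name and error message below are illustrative, not the PR's code; only the n_cols_ attribute comes from the quoted diff):

```python
import numpy as np


def _check_consistent_n_cols(n_cols_fitted, X):
    # Reject transform inputs whose column count differs from what fit saw.
    n_rows, n_cols = np.shape(X)
    if n_cols_fitted != n_cols:
        raise ValueError(
            "X has %d columns but the scaler was fit on data with %d columns."
            % (n_cols, n_cols_fitted))
```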
# Q: Should ^ be a warning or should we just raise here and delete the
# rest of the code?
I've changed your exception to a warning, and the remaining code is the same as you had implemented. Is it desirable to retain this method or should it be removed?
TODO: tests for this method.
I don't really understand the use case of inverse_transform and only added what would be a reasonable one-step approximation to maintain consistency with the sklearn API (https://github.com/scikit-learn/scikit-learn/blob/f0ab589f/sklearn/preprocessing/data.py#L697).
Is it preferable to just create a function that does nothing, with a warning that this functionality is not supported?
I think it is better to add the approximate solution with some tests.
Then I can leave the interface, raise so that it breaks but says why, and delete the code.
I think it's preferable not to provide an approximate solution if the only benefit is interface compatibility with a different utility; it means maintenance for us and footguns for anyone using it.
There is a use case for fit/partial_fit that is not part of transform. For instance, if I want to know the row/column means of one dataset, such as the training samples, that I later want to use on the test set, I may want to retrieve the mean/variance parameters only, without applying transform.
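A hypothetical sketch of that use case (the fitted attribute names below are placeholders, not necessarily what the class exposes):

```python
import numpy as np
from inverse_covariance.two_way_standard_scaler import TwoWayStandardScaler

rng = np.random.RandomState(0)
X_train, X_test = rng.rand(20, 5), rng.rand(10, 5)

scaler = TwoWayStandardScaler().fit(X_train)

# Retrieve the training-set column statistics without transforming anything.
train_col_means = scaler.col_mean_   # placeholder attribute name
train_col_vars = scaler.col_var_     # placeholder attribute name

# Later, reuse those training-set parameters on held-out data.
X_test_std = (X_test - train_col_means) / np.sqrt(train_col_vars)
```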
Ok so then as is, it seems good. @mnarayan in order to complete this branch I need the following things from you:
You should be able to just … Be sure to run …
Implements two-way centering and scaling of an input data matrix using successive normalization. Important for unbiased estimation of second-order dependence, i.e. correlation matrices.

- twoway_standardize implements the successive normalization.
- TwoWayStandardScaler is a class analogous to StandardScaler but stripped down, without support for sparse matrices or a functional inverse_transform. Key functionality is fit, which returns both row and column means, variance, and standard deviation estimates, and transform, which merely invokes successive normalization.
- The statistics computed by fit are not identical to the ones needed for iterative centering/scaling. Thus it is inefficient in that it does not take advantage of a precomputed fit, due to the need for iterative estimation.

Reference:
"Successive Normalization of Rectangular Arrays" by Olshen and Rajaratnam
Ann. Statist. Volume 38, Number 3 (2010), 1638-1664.
https://projecteuclid.org/euclid.aos/1269452650
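A minimal usage sketch of the scaler described above (the import path follows the renamed file discussed in the review; the constructor arguments mirror StandardScaler and the twoway_standardize signature, so treat the exact parameter names as assumptions):

```python
import numpy as np
from inverse_covariance.two_way_standard_scaler import TwoWayStandardScaler

X = np.random.RandomState(42).rand(12, 5)

# fit estimates the row/column means and variances; transform runs the
# successive-normalization procedure on the data it is given.
scaler = TwoWayStandardScaler(with_mean=True, with_std=True)
X_std = scaler.fit(X).transform(X)
```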