Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG+1] Fix #10229: check_array should fail if array has strings #10495

Merged
merged 13 commits into from
Feb 22, 2018

Conversation

rtlee9
Copy link
Contributor

@rtlee9 rtlee9 commented Jan 18, 2018

Reference Issues/PRs

Fixes #10229

What does this implement/fix? Explain your changes.

Adds a deprecation warning if a non-object string-like array (defined as being a subdtype of np.flexible) is passed to check_array with dtype="numeric". Arrays with object, boolean, and number dtypes are handled as before.

Any other comments?

The added deprecation warning indicates that non-object string-like arrays will be handled as object arrays are currently handled, i.e., attempted to be converted to np.float64. It seems intuitive to me to treat all string-like arrays (object + flexible) the same, but please let me know if you prefer any alternatives.

For reference: numpy scalar dtypes

@rtlee9 rtlee9 changed the title [WIP] Fix #10229: check_array should fail if array has strings [MRG] Fix #10229: check_array should fail if array has strings Jan 18, 2018
# in the future np.flexible dtypes will be handled like object dtypes
if dtype_numeric and np.issubdtype(array.dtype, np.flexible):
warnings.warn(
"In the future, array with dtype {} will be handled as object,"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plese follow the conventions of the message given in http://scikit-learn.org/dev/developers/contributing.html#deprecation.

Unless I miss something you should say explicitly in the message that it will cause an error (rather than handled as objects which probably does not mean much to the end user).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example where I think it would make sense to handle in the same way as an object array rather than raise an error:
let X_str_num = [['1', '2'], ['3', '4']] then check_array(np.array(X_str_num, 'O'), 'numeric') returns an array with dtype float64 but check_array(X_str_num, 'numeric') and check_array(np.array(X_str_num, 'U1'), 'numeric') return an array with dtype U1.

This behavior for object arrays is expected base on the docstring, so we would just have to expand that description from object to object + flexible at the end of the deprecation cycle if we do want to treat flexible arrays the same as objects (e.g., attempt to cast to numeric).

If "numeric", dtype is preserved unless array.dtype is object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To align the deprecation warning message with the guidelines, I've added the version the change in behavior is to be expected to the warning in commit 4bcfc92. I'm not sure if we need to indicate the current version, since no changes have actually been made this version. For this reason I also haven't added a deprecation note to the docstring, but please let me know if you'd suggest otherwise.

@@ -245,6 +245,22 @@ def test_check_array():
result = check_array(X_no_array)
assert_true(isinstance(result, np.ndarray))

# deprecation warning if string-like array with dtype="numeric"
X_str = [['a', 'b'], ['c', 'd']]
assert_warns(DeprecationWarning, check_array, X_str, "numeric")
Copy link
Member

@lesteve lesteve Jan 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the warning message through assert_warns_message (here and everywhere else in this PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lesteve, fixed in commit 50cd194

@lesteve
Copy link
Member

lesteve commented Jan 19, 2018

Here's an example where I think it would make sense to handle in the same way as an object array rather than raise an error:
let X_str_num = [['1', '2'], ['3', '4']] then check_array(np.array(X_str_num, 'O'), 'numeric') returns an array with dtype float64 but check_array(X_str_num, 'numeric') and check_array(np.array(X_str_num, 'U1'), 'numeric') return an array with dtype U1.

IMO this is all these weird string to float silent conversions that we want to disable in the future. I was not aware that numpy could convert strings to float64 like this:

import numpy as np
np.array(['1', '222']).astype(np.float64)

I could not find a casting argument that allowed objects to float64 only if there were numbers in the array, please double-check. There may be an alternative approach.

To sum up what I think we should aim for in the future (after a 2 release deprecation cycle):

check_array(np.array(['1', '2'], dtype='O'), dtype='numeric') # error
check_array(np.array(['1', '2'], dtype='S1'), dtype='numeric') # error
check_array(np.array(['1', '2'], dtype='U1'), dtype='numeric') # error
check_array(np.array([1, 2], dtype='O'), dtype='numeric') # no error

@rtlee9
Copy link
Contributor Author

rtlee9 commented Jan 20, 2018

I couldn't find a casting argument in the documentation. I took another look at the code as well, and it looks like an array with dtype 'O' is set to be cast to np.float64 in this line, conditional only on being an object type and the dtype argument being numeric.

On the other hand there is a warn_on_dtype argument, which raises a DataConversionWarning if the dtype of the input array doesn't match the dtype of the returned array. Having this argument makes me feel less less weary of the automatic string -> numeric conversions, as the user has the option of being warned when this happens.

@rtlee9
Copy link
Contributor Author

rtlee9 commented Jan 20, 2018

Took a look at your examples as well - it sounds like you're suggesting we look at the elements in a numpy array to determine what the underlying type is as opposed to relying on the np.array.dtype attribute.

I haven't found a good way of doing that without checking the type of potentially every element in an array yet; for example with np.array([[1, 2, 3, '4']], dtype='O') we wouldn't know that there's a string element in this array (i.e., in order to know whether to throw an error in check_array) without iterating through it, right?

I'd have to do some more digging if this is the route we want to go -- seems like something that should be handled by numpy ideally, but I'm not sure. Should this functionality be opened as a separate issue perhaps?

@lesteve
Copy link
Member

lesteve commented Jan 22, 2018

It seems like .astype(casting='safe') does what we want to do in the future. Maybe we can have something like:

try:
    array = array.astype(dtype=np.float64, casting='safe')
except TypeError:
    warnings.warn('...') 
    array = array.astype(dtype=np.float64, casting='unsafe')

@rtlee9
Copy link
Contributor Author

rtlee9 commented Jan 24, 2018

Sounds good with the exception that object arrays even with no strings cannot be cast to float with casting='safe':

np.array([1, 2], dtype='O').astype(np.float64, casting='safe')  # raises TypeError

@lesteve
Copy link
Member

lesteve commented Jan 24, 2018

Maybe object -> float64 conversion is a conversion that we want to disable in the future, not 100% sure. If we are fine with silently converting object to float64, we can have a special case in the code.

@rtlee9
Copy link
Contributor Author

rtlee9 commented Jan 26, 2018

Agreed. In my opinion these silent conversions aren't completely unacceptable, especially given there's a warn_on_dtype argument to warn for this kind of thing.

I think this is a step in the right direction at least -- no more silently returning a string array when numeric is requested. Any additional suggestions for this PR?

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I suspect this is being unnecessarily cautious (because the current behaviour is a bug, or an oversight), I'm okay with this fix-slowly strategy.

I think that the error message should be improved: "Arrays of strings will be interpreted as decimal numbers" rather than "will be handled as arrays with dtype object". I think we should (in the future?) outright reject arrays of bytes objects.

@jnothman
Copy link
Member

Btw I've briefly checked test and example build logs to ascertain that we're not currently raising this warning anywhere.

@lesteve
Copy link
Member

lesteve commented Jan 29, 2018

I am looking at this with fresh eyes and I am wondering actually why we can not have an exception (without deprecation period) when dtype.kind is in ('S', 'U') (which covers arrays of bytes/str/unicode in Python 2/3). I would think that even if check_array lets through an array like np.array(['1', '2']), an error will be raised when this array actually gets used.

Maybe we should have an error as well for dtype.kind == 'V', in which case np.issubdtype(..., np.flexible) (as used in this PR at the time of writing) is a good way of doing grouping dtype.kind in ('S', 'U', 'V').

@jnothman
Copy link
Member

jnothman commented Jan 29, 2018 via email

@rtlee9
Copy link
Contributor Author

rtlee9 commented Jan 30, 2018

These examples suggest at least some estimators don't raise an error for decimal strings

from sklearn.linear_model import LogisticRegression
LogisticRegression().fit([['1', '1'], ['0', '1']], [0, 1])  # no error
LogisticRegression().fit(np.array([['1', '1'], ['0', '1']], dtype='O'), [0, 1])  # no error
LogisticRegression().fit(np.array([['1', '1'], ['0', '1']], dtype='U'), [0, 1])  # no error

from sklearn.svm import SVC
SVC().fit([['1', '1'], ['0', '1']], [0, 1])  # no error
SVC().fit(np.array([['1', '1'], ['0', '1']], dtype='O'), [0, 1])  # no error
SVC().fit(np.array([['1', '1'], ['0', '1']], dtype='U'), [0, 1])  # no error

I think the estimators call check_X_y from their fit method with dtype=np.float64, which calls check_array with the same dtype. check_array then attempts to convert to np.foat64 and raises a ValueError if it can't. As it's being used now it seems like the logical place for raising a decimal string error would be in check_array so I'm not too surprised it isn't raised elsewhere in the estimator.

@jnothman
Copy link
Member

jnothman commented Jan 30, 2018 via email

@rtlee9
Copy link
Contributor Author

rtlee9 commented Feb 2, 2018

It does look like the use of check_X_y / check_array at least is inconsistent across estimators; for example, MLP doesn't pass a dtype, while logistic regression, SVM, and GBR use float32 or float64.

I would rather we didn't support strings at all, but I'd prefer to do so than to support in some and not others.

^^ Makes sense to me.

@jnothman
Copy link
Member

jnothman commented Feb 3, 2018 via email

@rtlee9
Copy link
Contributor Author

rtlee9 commented Feb 16, 2018

Got it. So is there anything else I can do for this pull request? If I'm interpreting your comments correctly, we don't want to reject decimal strings in check_array quite yet.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But otherwise, yes, this is reasonable

X_byte = [[b'a', b'b'], [b'c', b'd']]
assert_warns_message(
DeprecationWarning,
"arrays of strings will be interpreted as decimal numbers "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think we should aim to reject, rather than interpret, bytes

@jnothman jnothman changed the title [MRG] Fix #10229: check_array should fail if array has strings [MRG+1] Fix #10229: check_array should fail if array has strings Feb 20, 2018
NumPy may convert string list to dtype U1 or S1 so need to accept
both deprecation warnings
@lesteve
Copy link
Member

lesteve commented Feb 20, 2018

Maybe I am missing something but this feels like a very partial solution to the original problem:

  • this only applies to dtype='numeric'. A lot of the estimators specify dtype (hacky awk-based search seems to indicate ~40% of the check_array calls specify dtype outside tests).
  • I am not really sure why we treat string arrays differently than byte arrays. I think we should strive to raise an error for both eventually.
  • I really don't understand the warning ""Beginning in version 0.22, arrays of strings will be interpreted as decimal numbers if parameter 'dtype' is numeric". I think arrays of strings are already (and probably have been for some time) interpreted as decimal numbers.

@jnothman
Copy link
Member

jnothman commented Feb 20, 2018 via email

@lesteve
Copy link
Member

lesteve commented Feb 20, 2018

OK that makes sense, thanks a lot for the details ! Somehow in my head I wanted this PR to be about silent conversion of strings/bytes to numbers ... I'll try to review in more details to move this PR forward.

@jnothman
Copy link
Member

I'm happy to deprecate decimal number string support, but we need to do so in the dtype='float' case too

@lesteve
Copy link
Member

lesteve commented Feb 21, 2018

I have looked at this a bit more and I would be in favour of doing the following things:

  • not making a difference between arrays of strings and arrays of bytes. I would do the same as numpy which can convert both fine to float64 (for better or worse). Making a difference between array of strings and array of bytes also introduces a slight discrepancy between dtype='numeric' and dtype=np.float64
  • the warning should probably be a FutureWarning (rather than a DeprecatingWarning)
  • the warning needs an advice on how to get rid of the warning (basically it is strongly recommended to use .astype(np.float64) or .astype(np.float32) before feeding it into scikit-learn)
  • possibly adding a whats_new entry

@jnothman
Copy link
Member

Let's go with that plan. Sorry @rtlee9 for the back and forth. A what's new is a good idea because a few users reported this as a bug

@rtlee9
Copy link
Contributor Author

rtlee9 commented Feb 22, 2018

No worries -- these all sounded like good suggestions to me. I've added a few commits (one per item in @lesteve 's list) and merged in master to remove conflicts in the whats_new entry. Please let me know if you have any edits, especially around warning/whats_new messaging.

Utils

- Fixed a bug in :func:`utils.validation.check_array` to raise a ``FutureWarning``
indicating that arrays of type ``np.flexible`` will be interpreted as decimal
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just say strings

# in the future np.flexible dtypes will be handled like object dtypes
if dtype_numeric and np.issubdtype(array.dtype, np.flexible):
warnings.warn(
"Beginning in version 0.22, arrays of strings will be "
Copy link
Member

@lesteve lesteve Feb 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The standard user will very likely have little idea what check_array is. This is the wording I am proposing, better suggestions more than welcome:

"Beginning in version 0.22, arrays of bytes/strings will be "
"interpreted as decimal numbers if dtype='numeric'. "
"It is recommended that you convert the array to "
"a float dtype before using it in scikit-learn, "
"for example by using "
"your_array = your_array.astype(np.float64)."

@lesteve
Copy link
Member

lesteve commented Feb 22, 2018

I pushed a tweak in the whats_new entry and in tests. The wording of the warning can maybe be improved but otherwise LGTM.

@jnothman
Copy link
Member

I'd like to hope not many people see the warning. Thanks @rtlee9

@jnothman jnothman merged commit 7108d17 into scikit-learn:master Feb 22, 2018
@rtlee9 rtlee9 deleted the check_array branch February 22, 2018 17:16
@rtlee9
Copy link
Contributor Author

rtlee9 commented Feb 22, 2018

Welcome, and thank you both for the reviews

@amueller
Copy link
Member

I'm confused as to what this will do in the future. I don't understand the message :-/

@amueller
Copy link
Member

amueller commented Jul 16, 2018

This warning happens in feature selection when we try to transform the feature names btw. via #11570.

@amueller
Copy link
Member

so the future behavior is to convert them to float? Can we change "interpret" to "convert"?

@amueller
Copy link
Member

Also we should be clearer what the current behavior is, which is passing through arbitrary strings if dtype='numeric'.

@jnothman
Copy link
Member

jnothman commented Jul 17, 2018 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

check_array(X, dtype='numeric') should fail if X has strings
4 participants