-
-
Notifications
You must be signed in to change notification settings - Fork 25.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG+1] Fix #10229: check_array should fail if array has strings #10495
Conversation
sklearn/utils/validation.py
Outdated
# in the future np.flexible dtypes will be handled like object dtypes | ||
if dtype_numeric and np.issubdtype(array.dtype, np.flexible): | ||
warnings.warn( | ||
"In the future, array with dtype {} will be handled as object," |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plese follow the conventions of the message given in http://scikit-learn.org/dev/developers/contributing.html#deprecation.
Unless I miss something you should say explicitly in the message that it will cause an error (rather than handled as objects which probably does not mean much to the end user).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an example where I think it would make sense to handle in the same way as an object array rather than raise an error:
let X_str_num = [['1', '2'], ['3', '4']]
then check_array(np.array(X_str_num, 'O'), 'numeric')
returns an array with dtype float64 but check_array(X_str_num, 'numeric')
and check_array(np.array(X_str_num, 'U1'), 'numeric')
return an array with dtype U1.
This behavior for object arrays is expected base on the docstring, so we would just have to expand that description from object to object + flexible at the end of the deprecation cycle if we do want to treat flexible arrays the same as objects (e.g., attempt to cast to numeric).
If "numeric", dtype is preserved unless array.dtype is object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To align the deprecation warning message with the guidelines, I've added the version the change in behavior is to be expected to the warning in commit 4bcfc92. I'm not sure if we need to indicate the current version, since no changes have actually been made this version. For this reason I also haven't added a deprecation note to the docstring, but please let me know if you'd suggest otherwise.
@@ -245,6 +245,22 @@ def test_check_array(): | |||
result = check_array(X_no_array) | |||
assert_true(isinstance(result, np.ndarray)) | |||
|
|||
# deprecation warning if string-like array with dtype="numeric" | |||
X_str = [['a', 'b'], ['c', 'd']] | |||
assert_warns(DeprecationWarning, check_array, X_str, "numeric") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check the warning message through assert_warns_message (here and everywhere else in this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this is all these weird string to float silent conversions that we want to disable in the future. I was not aware that numpy could convert strings to float64 like this: import numpy as np
np.array(['1', '222']).astype(np.float64) I could not find a To sum up what I think we should aim for in the future (after a 2 release deprecation cycle): check_array(np.array(['1', '2'], dtype='O'), dtype='numeric') # error
check_array(np.array(['1', '2'], dtype='S1'), dtype='numeric') # error
check_array(np.array(['1', '2'], dtype='U1'), dtype='numeric') # error
check_array(np.array([1, 2], dtype='O'), dtype='numeric') # no error |
I couldn't find a On the other hand there is a |
Took a look at your examples as well - it sounds like you're suggesting we look at the elements in a numpy array to determine what the underlying type is as opposed to relying on the np.array.dtype attribute. I haven't found a good way of doing that without checking the type of potentially every element in an array yet; for example with I'd have to do some more digging if this is the route we want to go -- seems like something that should be handled by numpy ideally, but I'm not sure. Should this functionality be opened as a separate issue perhaps? |
It seems like try:
array = array.astype(dtype=np.float64, casting='safe')
except TypeError:
warnings.warn('...')
array = array.astype(dtype=np.float64, casting='unsafe') |
Sounds good with the exception that object arrays even with no strings cannot be cast to float with np.array([1, 2], dtype='O').astype(np.float64, casting='safe') # raises TypeError |
Maybe object -> float64 conversion is a conversion that we want to disable in the future, not 100% sure. If we are fine with silently converting object to float64, we can have a special case in the code. |
Agreed. In my opinion these silent conversions aren't completely unacceptable, especially given there's a I think this is a step in the right direction at least -- no more silently returning a string array when numeric is requested. Any additional suggestions for this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I suspect this is being unnecessarily cautious (because the current behaviour is a bug, or an oversight), I'm okay with this fix-slowly strategy.
I think that the error message should be improved: "Arrays of strings will be interpreted as decimal numbers" rather than "will be handled as arrays with dtype object". I think we should (in the future?) outright reject arrays of bytes
objects.
Btw I've briefly checked test and example build logs to ascertain that we're not currently raising this warning anywhere. |
I am looking at this with fresh eyes and I am wondering actually why we can not have an exception (without deprecation period) when Maybe we should have an error as well for |
how about we start by writing a common test that says estimators should
reject decimal string input. If all estimators in scikit-learn already
raise an error for this case, I'm alright to do so in check_array.
Otherwise, I think a conservative approach like the present one is
sufficient for now.
|
These examples suggest at least some estimators don't raise an error for decimal strings from sklearn.linear_model import LogisticRegression
LogisticRegression().fit([['1', '1'], ['0', '1']], [0, 1]) # no error
LogisticRegression().fit(np.array([['1', '1'], ['0', '1']], dtype='O'), [0, 1]) # no error
LogisticRegression().fit(np.array([['1', '1'], ['0', '1']], dtype='U'), [0, 1]) # no error
from sklearn.svm import SVC
SVC().fit([['1', '1'], ['0', '1']], [0, 1]) # no error
SVC().fit(np.array([['1', '1'], ['0', '1']], dtype='O'), [0, 1]) # no error
SVC().fit(np.array([['1', '1'], ['0', '1']], dtype='U'), [0, 1]) # no error I think the estimators call |
so estimators currently setting dtype explicitly convert decimal strings.
weird then that we should be inconsistent across estimators. I would rather
we didn't support strings at all, but I'd prefer to do so than to support
in some and not others.
|
It does look like the use of
^^ Makes sense to me. |
Yes, some estimators require (or benefit for efficiency) from a particular
underlying float size. Others can work with any numeric, and avoid copying
the data in doing so.
|
Got it. So is there anything else I can do for this pull request? If I'm interpreting your comments correctly, we don't want to reject decimal strings in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But otherwise, yes, this is reasonable
X_byte = [[b'a', b'b'], [b'c', b'd']] | ||
assert_warns_message( | ||
DeprecationWarning, | ||
"arrays of strings will be interpreted as decimal numbers " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think we should aim to reject, rather than interpret, bytes
NumPy may convert string list to dtype U1 or S1 so need to accept both deprecation warnings
Maybe I am missing something but this feels like a very partial solution to the original problem:
|
I think atm with dtype='numeric' you would get an object array back, and
this would cause errors unless otherwise explicitly cast.
I object to byte arrays in python 3 being interpreted as strings because
python 3 makes a principled stand against doing so.
we are trying to maintain backwards compatibility for current uses of
dtype=float or similar, which currently eval decimal strings (and bytes:
yuck!). i don't really like this behaviour, but it's not especially
harmful, add it's what some users may expect of numpy. I'd be happy to
deprecate it, but for now this takes a more conservative approach.
Does that make some sense?
|
OK that makes sense, thanks a lot for the details ! Somehow in my head I wanted this PR to be about silent conversion of strings/bytes to numbers ... I'll try to review in more details to move this PR forward. |
I'm happy to deprecate decimal number string support, but we need to do so in the dtype='float' case too |
I have looked at this a bit more and I would be in favour of doing the following things:
|
Let's go with that plan. Sorry @rtlee9 for the back and forth. A what's new is a good idea because a few users reported this as a bug |
No worries -- these all sounded like good suggestions to me. I've added a few commits (one per item in @lesteve 's list) and merged in master to remove conflicts in the whats_new entry. Please let me know if you have any edits, especially around warning/whats_new messaging. |
doc/whats_new/v0.20.rst
Outdated
Utils | ||
|
||
- Fixed a bug in :func:`utils.validation.check_array` to raise a ``FutureWarning`` | ||
indicating that arrays of type ``np.flexible`` will be interpreted as decimal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just say strings
# in the future np.flexible dtypes will be handled like object dtypes | ||
if dtype_numeric and np.issubdtype(array.dtype, np.flexible): | ||
warnings.warn( | ||
"Beginning in version 0.22, arrays of strings will be " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The standard user will very likely have little idea what check_array is. This is the wording I am proposing, better suggestions more than welcome:
"Beginning in version 0.22, arrays of bytes/strings will be "
"interpreted as decimal numbers if dtype='numeric'. "
"It is recommended that you convert the array to "
"a float dtype before using it in scikit-learn, "
"for example by using "
"your_array = your_array.astype(np.float64)."
I pushed a tweak in the whats_new entry and in tests. The wording of the warning can maybe be improved but otherwise LGTM. |
I'd like to hope not many people see the warning. Thanks @rtlee9 |
Welcome, and thank you both for the reviews |
I'm confused as to what this will do in the future. I don't understand the message :-/ |
This warning happens in feature selection when we try to transform the feature names btw. via #11570. |
so the future behavior is to convert them to float? Can we change "interpret" to "convert"? |
Also we should be clearer what the current behavior is, which is passing through arbitrary strings if dtype='numeric'. |
interesting that transforming feature names used to work. should feature
selection use dtype=None?
|
Reference Issues/PRs
Fixes #10229
What does this implement/fix? Explain your changes.
Adds a deprecation warning if a non-object string-like array (defined as being a subdtype of
np.flexible
) is passed tocheck_array
withdtype="numeric"
. Arrays with object, boolean, and number dtypes are handled as before.Any other comments?
The added deprecation warning indicates that non-object string-like arrays will be handled as object arrays are currently handled, i.e., attempted to be converted to
np.float64
. It seems intuitive to me to treat all string-like arrays (object + flexible) the same, but please let me know if you prefer any alternatives.For reference: numpy scalar dtypes