[MRG+1] Fix #10229: check_array should fail if array has strings #10495

rtlee9 · 2018-01-18T03:11:24Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Adds a deprecation warning if a non-object string-like array (defined as being a subdtype of np.flexible) is passed to check_array with dtype="numeric". Arrays with object, boolean, and number dtypes are handled as before.

Any other comments?

The added deprecation warning indicates that non-object string-like arrays will be handled as object arrays are currently handled, i.e., an attempt will be made to convert them to np.float64. It seems intuitive to me to treat all string-like arrays (object + flexible) the same, but please let me know if you prefer any alternatives.

For reference: numpy scalar dtypes
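
To make the np.flexible criterion concrete, here is a small sketch of which dtypes it covers. The helper name `is_flexible` and the sample arrays are illustrative assumptions, not code from this PR.

```python
import numpy as np


def is_flexible(array):
    """Return True for non-object string-like dtypes (a sub-dtype of np.flexible)."""
    return bool(np.issubdtype(array.dtype, np.flexible))


# Hypothetical sample arrays, one per dtype family mentioned in the description.
samples = {
    "unicode": np.array(["a", "b"]),
    "bytes": np.array([b"a", b"b"]),
    "object": np.array(["a", "b"], dtype=object),
    "float": np.array([1.0, 2.0]),
    "bool": np.array([True, False]),
}
flags = {name: is_flexible(arr) for name, arr in samples.items()}
```

Only the unicode and bytes arrays are flagged; object, number, and boolean arrays are not, which matches the split described above.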

lesteve · 2018-01-18T10:19:29Z

sklearn/utils/validation.py

+        # in the future np.flexible dtypes will be handled like object dtypes
+        if dtype_numeric and np.issubdtype(array.dtype, np.flexible):
+            warnings.warn(
+                "In the future, array with dtype {} will be handled as object,"


Please follow the conventions of the message given in http://scikit-learn.org/dev/developers/contributing.html#deprecation.

Unless I am missing something, you should say explicitly in the message that it will cause an error (rather than being handled as objects, which probably does not mean much to the end user).

Here's an example where I think it would make sense to handle in the same way as an object array rather than raise an error:
let X_str_num = [['1', '2'], ['3', '4']] then check_array(np.array(X_str_num, 'O'), 'numeric') returns an array with dtype float64 but check_array(X_str_num, 'numeric') and check_array(np.array(X_str_num, 'U1'), 'numeric') return an array with dtype U1.

This behavior for object arrays is expected based on the docstring, so we would just have to expand that description from object to object + flexible at the end of the deprecation cycle if we do want to treat flexible arrays the same as objects (e.g., attempt to cast to numeric).

If "numeric", dtype is preserved unless array.dtype is object.

To align the deprecation warning message with the guidelines, I've added the version in which the change in behavior is expected to the warning in commit 4bcfc92. I'm not sure if we need to indicate the current version, since no changes have actually been made in this version. For this reason I also haven't added a deprecation note to the docstring, but please let me know if you'd suggest otherwise.
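
The asymmetry described in this comment can be reproduced with NumPy alone. The function `numeric_mode_sketch` below is a hypothetical stand-in for the quoted docstring rule ("dtype is preserved unless array.dtype is object"); it is not the actual check_array code.

```python
import numpy as np


def numeric_mode_sketch(array):
    """Hypothetical model of dtype="numeric": only object arrays are cast."""
    array = np.asarray(array)
    if array.dtype.kind == "O":
        # Object arrays: attempt a conversion to float64.
        return array.astype(np.float64)
    # Everything else, including string dtypes, is passed through unchanged.
    return array


X_str_num = [["1", "2"], ["3", "4"]]
as_object = numeric_mode_sketch(np.array(X_str_num, dtype=object))
as_unicode = numeric_mode_sketch(np.array(X_str_num, dtype="U1"))
as_list = numeric_mode_sketch(X_str_num)
```

Under this rule the object array comes back as float64 while the other two inputs keep a string dtype, which is the inconsistency the comment points out.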

lesteve · 2018-01-18T10:20:06Z

sklearn/utils/tests/test_validation.py

@@ -245,6 +245,22 @@ def test_check_array():
    result = check_array(X_no_array)
    assert_true(isinstance(result, np.ndarray))

+    # deprecation warning if string-like array with dtype="numeric"
+    X_str = [['a', 'b'], ['c', 'd']]
+    assert_warns(DeprecationWarning, check_array, X_str, "numeric")


Please check the warning message through assert_warns_message (here and everywhere else in this PR)

Thanks @lesteve, fixed in commit 50cd194


lesteve · 2018-01-19T07:07:19Z

Here's an example where I think it would make sense to handle in the same way as an object array rather than raise an error:
let X_str_num = [['1', '2'], ['3', '4']] then check_array(np.array(X_str_num, 'O'), 'numeric') returns an array with dtype float64 but check_array(X_str_num, 'numeric') and check_array(np.array(X_str_num, 'U1'), 'numeric') return an array with dtype U1.

IMO it is exactly these weird silent string-to-float conversions that we want to disable in the future. I was not aware that numpy could convert strings to float64 like this:

import numpy as np
np.array(['1', '222']).astype(np.float64)

I could not find a casting argument that allowed objects to float64 only if there were numbers in the array, please double-check. There may be an alternative approach.
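
As a quick illustration of the conversion mentioned above, the following sketch shows that the default `astype` call parses decimal strings and fails on anything else. The variable names are illustrative only.

```python
import numpy as np

# Decimal strings are parsed silently with the default (unsafe) casting.
decimal = np.array(["1", "222"]).astype(np.float64)

# Non-decimal strings cannot be parsed and raise a ValueError.
try:
    np.array(["a", "b"]).astype(np.float64)
    non_decimal_error = None
except ValueError as exc:
    non_decimal_error = exc
```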

To sum up what I think we should aim for in the future (after a 2 release deprecation cycle):

check_array(np.array(['1', '2'], dtype='O'), dtype='numeric') # error
check_array(np.array(['1', '2'], dtype='S1'), dtype='numeric') # error
check_array(np.array(['1', '2'], dtype='U1'), dtype='numeric') # error
check_array(np.array([1, 2], dtype='O'), dtype='numeric') # no error

rtlee9 · 2018-01-20T19:23:25Z

I couldn't find a casting argument in the documentation. I took another look at the code as well, and it looks like an array with dtype 'O' is set to be cast to np.float64 in this line, conditional only on being an object type and the dtype argument being numeric.

On the other hand, there is a warn_on_dtype argument, which raises a DataConversionWarning if the dtype of the input array doesn't match the dtype of the returned array. Having this argument makes me feel less wary of the automatic string -> numeric conversions, as the user has the option of being warned when this happens.

rtlee9 · 2018-01-20T19:29:23Z

Took a look at your examples as well - it sounds like you're suggesting we look at the elements in a numpy array to determine what the underlying type is as opposed to relying on the np.array.dtype attribute.

I haven't found a good way of doing that without checking the type of potentially every element in an array yet; for example with np.array([[1, 2, 3, '4']], dtype='O') we wouldn't know that there's a string element in this array (i.e., in order to know whether to throw an error in check_array) without iterating through it, right?

I'd have to do some more digging if this is the route we want to go -- seems like something that should be handled by numpy ideally, but I'm not sure. Should this functionality be opened as a separate issue perhaps?
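
For reference, the element-by-element check discussed here might look like the sketch below. `has_string_elements` is a hypothetical helper, not something NumPy or scikit-learn provides, and it does have to visit every element of an object array in the worst case.

```python
import numpy as np


def has_string_elements(array):
    """Hypothetical check: True if the array holds any str/bytes values."""
    array = np.asarray(array)
    if array.dtype.kind in ("S", "U"):
        # Fixed-width string dtypes: every element is a string by definition.
        return True
    if array.dtype.kind == "O":
        # Object dtype says nothing about the elements, so inspect each one.
        return any(isinstance(x, (str, bytes)) for x in array.ravel())
    return False


mixed = has_string_elements(np.array([[1, 2, 3, "4"]], dtype="O"))
numbers_only = has_string_elements(np.array([1, 2], dtype="O"))
```

The linear scan is the cost raised in the comment: the dtype alone cannot distinguish the two object arrays.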

lesteve · 2018-01-22T09:38:00Z

It seems like .astype(casting='safe') does what we want to do in the future. Maybe we can have something like:

try:
    array = array.astype(dtype=np.float64, casting='safe')
except TypeError:
    warnings.warn('...') 
    array = array.astype(dtype=np.float64, casting='unsafe')

rtlee9 · 2018-01-24T06:33:08Z

Sounds good with the exception that object arrays even with no strings cannot be cast to float with casting='safe':

np.array([1, 2], dtype='O').astype(np.float64, casting='safe')  # raises TypeError

lesteve · 2018-01-24T08:47:36Z

Maybe object -> float64 conversion is a conversion that we want to disable in the future, not 100% sure. If we are fine with silently converting object to float64, we can have a special case in the code.
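
One way the special case could look is sketched below, combining the try/except idea with a silent path for object arrays. The function name `to_float64_sketch` and the warning text are placeholders (assumptions), not the code that was merged.

```python
import warnings

import numpy as np


def to_float64_sketch(array):
    """Hypothetical conversion: warn for string dtypes, stay silent for objects."""
    array = np.asarray(array)
    try:
        # Succeeds for dtypes that NumPy can cast to float64 without risk.
        return array.astype(np.float64, casting="safe")
    except TypeError:
        if array.dtype.kind != "O":
            # Placeholder message; the real wording is discussed in the thread.
            warnings.warn("string-like input converted to float64",
                          FutureWarning)
        # Special case: object arrays fall through without a warning.
        return array.astype(np.float64, casting="unsafe")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    from_object = to_float64_sketch(np.array([1, 2], dtype="O"))
    n_after_object = len(caught)
    from_strings = to_float64_sketch(np.array(["1", "2"]))
    n_after_strings = len(caught)
```

The kind check is what keeps numeric object arrays quiet, since casting="safe" rejects them as well.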

rtlee9 · 2018-01-26T02:00:55Z

Agreed. In my opinion these silent conversions aren't completely unacceptable, especially given there's a warn_on_dtype argument to warn for this kind of thing.

I think this is a step in the right direction at least -- no more silently returning a string array when numeric is requested. Any additional suggestions for this PR?

jnothman

While I suspect this is being unnecessarily cautious (because the current behaviour is a bug, or an oversight), I'm okay with this fix-slowly strategy.

I think that the error message should be improved: "Arrays of strings will be interpreted as decimal numbers" rather than "will be handled as arrays with dtype object". I think we should (in the future?) outright reject arrays of bytes objects.

jnothman · 2018-01-26T03:31:02Z

Btw I've briefly checked test and example build logs to ascertain that we're not currently raising this warning anywhere.

lesteve · 2018-01-29T13:45:38Z

I am looking at this with fresh eyes and I am wondering actually why we can not have an exception (without deprecation period) when dtype.kind is in ('S', 'U') (which covers arrays of bytes/str/unicode in Python 2/3). I would think that even if check_array lets through an array like np.array(['1', '2']), an error will be raised when this array actually gets used.

Maybe we should have an error as well for dtype.kind == 'V', in which case np.issubdtype(..., np.flexible) (as used in this PR at the time of writing) is a good way of grouping dtype.kind in ('S', 'U', 'V').
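
A small sketch to check that the two groupings agree on a handful of dtypes. The list of dtypes is an arbitrary sample chosen for illustration.

```python
import numpy as np

# Arbitrary sample: bytes, unicode, void, object, float, int.
dtypes = [
    np.dtype("S1"),
    np.dtype("U1"),
    np.dtype("V4"),
    np.dtype(object),
    np.dtype(np.float64),
    np.dtype(np.int64),
]

by_kind = [dt.kind in ("S", "U", "V") for dt in dtypes]
by_subdtype = [bool(np.issubdtype(dt, np.flexible)) for dt in dtypes]
```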

jnothman · 2018-01-29T21:46:33Z

how about we start by writing a common test that says estimators should reject decimal string input. If all estimators in scikit-learn already raise an error for this case, I'm alright to do so in check_array. Otherwise, I think a conservative approach like the present one is sufficient for now.

rtlee9 · 2018-01-30T07:26:11Z

These examples suggest at least some estimators don't raise an error for decimal strings

import numpy as np
from sklearn.linear_model import LogisticRegression
LogisticRegression().fit([['1', '1'], ['0', '1']], [0, 1])  # no error
LogisticRegression().fit(np.array([['1', '1'], ['0', '1']], dtype='O'), [0, 1])  # no error
LogisticRegression().fit(np.array([['1', '1'], ['0', '1']], dtype='U'), [0, 1])  # no error

from sklearn.svm import SVC
SVC().fit([['1', '1'], ['0', '1']], [0, 1])  # no error
SVC().fit(np.array([['1', '1'], ['0', '1']], dtype='O'), [0, 1])  # no error
SVC().fit(np.array([['1', '1'], ['0', '1']], dtype='U'), [0, 1])  # no error

I think the estimators call check_X_y from their fit method with dtype=np.float64, which calls check_array with the same dtype. check_array then attempts to convert to np.float64 and raises a ValueError if it can't. As it's being used now, it seems like the logical place for raising a decimal string error would be in check_array, so I'm not too surprised it isn't raised elsewhere in the estimator.

jnothman · 2018-01-30T07:36:31Z

so estimators currently setting dtype explicitly convert decimal strings. weird then that we should be inconsistent across estimators. I would rather we didn't support strings at all, but I'd prefer to do so than to support in some and not others.

rtlee9 · 2018-02-02T06:51:02Z

It does look like the use of check_X_y / check_array at least is inconsistent across estimators; for example, MLP doesn't pass a dtype, while logistic regression, SVM, and GBR use float32 or float64.

I would rather we didn't support strings at all, but I'd prefer to do so than to support in some and not others.

^^ Makes sense to me.

jnothman · 2018-02-03T11:46:11Z

Yes, some estimators require (or benefit in efficiency from) a particular underlying float size. Others can work with any numeric dtype, and avoid copying the data in doing so.
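
The trade-off described here can be sketched as a dtype rule that only converts when the input is not already in an accepted set. `resolve_float_dtype_sketch` and its `accepted` parameter are hypothetical names used for illustration, not scikit-learn API.

```python
import numpy as np


def resolve_float_dtype_sketch(array, accepted=(np.float64, np.float32)):
    """Hypothetical rule: keep accepted dtypes as-is, else cast to the first."""
    array = np.asarray(array)
    if any(array.dtype == np.dtype(d) for d in accepted):
        # Already a supported float size: return it without copying.
        return array
    # Anything else is converted (and therefore copied) to the preferred dtype.
    return array.astype(accepted[0])


x32 = np.array([1.0, 2.0], dtype=np.float32)
kept = resolve_float_dtype_sketch(x32)
converted = resolve_float_dtype_sketch(np.array([1, 2]))
```

An estimator that works with any numeric input would skip the cast entirely, which is where the copy is avoided.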

rtlee9 · 2018-02-16T06:16:12Z

Got it. So is there anything else I can do for this pull request? If I'm interpreting your comments correctly, we don't want to reject decimal strings in check_array quite yet.

jnothman

But otherwise, yes, this is reasonable

jnothman · 2018-02-16T06:29:09Z

sklearn/utils/tests/test_validation.py

+    X_byte = [[b'a', b'b'], [b'c', b'd']]
+    assert_warns_message(
+        DeprecationWarning,
+        "arrays of strings will be interpreted as decimal numbers "


I still think we should aim to reject, rather than interpret, bytes

NumPy may convert string list to dtype U1 or S1 so need to accept both deprecation warnings

lesteve · 2018-02-20T09:13:43Z

Maybe I am missing something but this feels like a very partial solution to the original problem:

this only applies to dtype='numeric'. A lot of the estimators specify dtype (hacky awk-based search seems to indicate ~40% of the check_array calls specify dtype outside tests).
I am not really sure why we treat string arrays differently than byte arrays. I think we should strive to raise an error for both eventually.
I really don't understand the warning "Beginning in version 0.22, arrays of strings will be interpreted as decimal numbers if parameter 'dtype' is numeric". I think arrays of strings are already (and probably have been for some time) interpreted as decimal numbers.

jnothman · 2018-02-20T09:35:49Z

I think atm with dtype='numeric' you would get an object array back, and this would cause errors unless otherwise explicitly cast. I object to byte arrays in Python 3 being interpreted as strings because Python 3 makes a principled stand against doing so. We are trying to maintain backwards compatibility for current uses of dtype=float or similar, which currently eval decimal strings (and bytes: yuck!). I don't really like this behaviour, but it's not especially harmful, and it's what some users may expect of numpy. I'd be happy to deprecate it, but for now this takes a more conservative approach. Does that make some sense?

lesteve · 2018-02-20T10:54:10Z

OK that makes sense, thanks a lot for the details! Somehow in my head I wanted this PR to be about silent conversion of strings/bytes to numbers ... I'll try to review in more detail to move this PR forward.

jnothman · 2018-02-20T20:57:05Z

I'm happy to deprecate decimal number string support, but we need to do so in the dtype='float' case too

lesteve · 2018-02-21T09:45:11Z

I have looked at this a bit more and I would be in favour of doing the following things:

not making a difference between arrays of strings and arrays of bytes. I would do the same as numpy which can convert both fine to float64 (for better or worse). Making a difference between array of strings and array of bytes also introduces a slight discrepancy between dtype='numeric' and dtype=np.float64
the warning should probably be a FutureWarning (rather than a DeprecationWarning)
the warning needs advice on how to get rid of it (basically it is strongly recommended to use .astype(np.float64) or .astype(np.float32) before feeding the array into scikit-learn)
possibly adding a whats_new entry
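
Taken together, the points in this list could be sketched as follows. `warn_if_string_like` is a hypothetical helper rather than the merged code, and the message text reuses the wording proposed later in this thread.

```python
import warnings

import numpy as np

# Wording taken from the review discussion, not necessarily the final text.
MESSAGE = (
    "Beginning in version 0.22, arrays of bytes/strings will be "
    "interpreted as decimal numbers if dtype='numeric'. "
    "It is recommended that you convert the array to "
    "a float dtype before using it in scikit-learn, "
    "for example by using "
    "your_array = your_array.astype(np.float64)."
)


def warn_if_string_like(array):
    """Hypothetical check: same FutureWarning for bytes and unicode arrays."""
    array = np.asarray(array)
    if np.issubdtype(array.dtype, np.flexible):
        warnings.warn(MESSAGE, FutureWarning)
    return array


your_array = np.array(["1", "2"])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warn_if_string_like(your_array)                     # string dtype: warns
    warn_if_string_like(your_array.astype(np.float64))  # float dtype: silent
n_warnings = len(caught)
```

Converting up front with .astype(np.float64) is the user-side fix the message recommends, and it is what silences the second call.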

jnothman · 2018-02-21T21:34:19Z

Let's go with that plan. Sorry @rtlee9 for the back and forth. A what's new is a good idea because a few users reported this as a bug

rtlee9 · 2018-02-22T08:52:16Z

No worries -- these all sounded like good suggestions to me. I've added a few commits (one per item in @lesteve 's list) and merged in master to remove conflicts in the whats_new entry. Please let me know if you have any edits, especially around warning/whats_new messaging.

jnothman · 2018-02-22T11:10:26Z

doc/whats_new/v0.20.rst

+Utils
+
+- Fixed a bug in :func:`utils.validation.check_array` to raise a ``FutureWarning``
+  indicating that arrays of type ``np.flexible`` will be interpreted as decimal


Just say strings

lesteve · 2018-02-22T12:38:00Z

sklearn/utils/validation.py

+        # in the future np.flexible dtypes will be handled like object dtypes
+        if dtype_numeric and np.issubdtype(array.dtype, np.flexible):
+            warnings.warn(
+                "Beginning in version 0.22, arrays of strings will be "


The standard user will very likely have little idea what check_array is. This is the wording I am proposing, better suggestions more than welcome:

"Beginning in version 0.22, arrays of bytes/strings will be "
"interpreted as decimal numbers if dtype='numeric'. "
"It is recommended that you convert the array to "
"a float dtype before using it in scikit-learn, "
"for example by using "
"your_array = your_array.astype(np.float64)."

lesteve · 2018-02-22T12:38:55Z

I pushed a tweak in the whats_new entry and in tests. The wording of the warning can maybe be improved but otherwise LGTM.

jnothman · 2018-02-22T13:07:48Z

I'd like to hope not many people see the warning. Thanks @rtlee9

rtlee9 · 2018-02-22T17:17:24Z

Welcome, and thank you both for the reviews

amueller · 2018-07-16T21:49:44Z

I'm confused as to what this will do in the future. I don't understand the message :-/

amueller · 2018-07-16T21:50:41Z

This warning happens in feature selection when we try to transform the feature names btw. via #11570.

amueller · 2018-07-16T21:52:06Z

so the future behavior is to convert them to float? Can we change "interpret" to "convert"?

amueller · 2018-07-16T21:53:08Z

Also we should be clearer what the current behavior is, which is passing through arbitrary strings if dtype='numeric'.

jnothman · 2018-07-17T09:24:34Z

interesting that transforming feature names used to work. should feature selection use dtype=None?

rtlee9 force-pushed the check_array branch from 5ba9497 to eb8deb6 Compare January 18, 2018 03:16

rtlee9 changed the title ~~[WIP] Fix #10229: check_array should fail if array has strings~~ [MRG] Fix #10229: check_array should fail if array has strings Jan 18, 2018

lesteve reviewed Jan 18, 2018


rtlee9 added 3 commits January 18, 2018 19:28

Add deprecation warning to check_array for flexible array w/ dtype=numeric

eadae7c

Use assert_warns_message to test check_array deprecation warning

ccd1ced

Add deprecation starting version to warning message

9d6946e

rtlee9 force-pushed the check_array branch from 4bcfc92 to 9d6946e Compare January 19, 2018 03:33

NumPy may convert string list to dtype U1 or S1 so need to check both in test

19b2bea

jnothman reviewed Jan 26, 2018


Make warning message more clear

2e851ce

jnothman reviewed Feb 16, 2018


Update check_array warning: reject byte arrays if numeric dtype

8a4b74e

jnothman approved these changes Feb 20, 2018


jnothman changed the title ~~[MRG] Fix #10229: check_array should fail if array has strings~~ [MRG+1] Fix #10229: check_array should fail if array has strings Feb 20, 2018

Update test to account for numpy conversion of string list to byte array

7734056

NumPy may convert string list to dtype U1 or S1 so need to accept both deprecation warnings

rtlee9 added 5 commits February 21, 2018 18:13

No differentiation between arrays of strings and arrays of bytes

d12630f

Raise FutureWarning instead of DeprecationWarning

5849f5c

Add recommendation to convert array to np.float64 to warning

4489841

Add whats_new bug fix entry

e377cb3

Merge branch 'master' into check_array

8c3bcca

jnothman approved these changes Feb 22, 2018


Tweak tests and whats_new entry

ffb1448

lesteve reviewed Feb 22, 2018


jnothman merged commit 7108d17 into scikit-learn:master Feb 22, 2018

rtlee9 deleted the check_array branch February 22, 2018 17:16

massich mentioned this pull request Jul 16, 2018

[MRG+1] Rework warning in check_array when silent convert string to float #11577

Merged

thomasjpfan mentioned this pull request Sep 29, 2020

MNT Correctly errors in check_array with dtype=numeric and string/bytes #18496

Merged
