[Tabular] Improve feature_metadata verification #3980

Open
Innixma opened this issue Mar 14, 2024 · 0 comments
Labels: bug, enhancement, module: tabular, priority: 2
Innixma commented Mar 14, 2024

When a user specifies a custom feature_metadata object during fit, we should add guardrails that verify the feature metadata is compatible with the user-provided training data. Without them, cryptic errors can occur downstream. There might also be a bug with NaN fills in this scenario.
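A rough sketch of the kind of guardrail this could mean, written against plain pandas rather than AutoGluon internals. The helper names `raw_type` and `check_feature_types` are hypothetical, and the `expected_raw_types` dict is only a stand-in for the raw type map carried by a feature metadata object. The coarse type families used here are an assumption, not AutoGluon's actual mapping.

```python
import pandas as pd


def raw_type(series: pd.Series) -> str:
    """Collapse a concrete dtype (int64, float32, ...) into a coarse family name."""
    if pd.api.types.is_bool_dtype(series):
        return "bool"
    if pd.api.types.is_integer_dtype(series):
        return "int"
    if pd.api.types.is_float_dtype(series):
        return "float"
    if pd.api.types.is_datetime64_any_dtype(series):
        return "datetime"
    if isinstance(series.dtype, pd.CategoricalDtype):
        return "category"
    return "object"


def check_feature_types(df: pd.DataFrame, expected_raw_types: dict):
    """Compare declared feature types against the training data.

    Returns (missing, mismatched):
      missing    - declared features that are absent from df
      mismatched - {feature: (actual, expected)} for differing type families
    """
    missing = [f for f in expected_raw_types if f not in df.columns]
    mismatched = {}
    for feature, expected in expected_raw_types.items():
        if feature in df.columns:
            actual = raw_type(df[feature])
            if actual != expected:
                mismatched[feature] = (actual, expected)
    return missing, mismatched


# Hypothetical training data and declared types, mirroring the loan_id case.
train = pd.DataFrame({"loan_id": [1, 2, 3], "amount": [10.5, 20.0, 7.25]})
expected = {"loan_id": "float", "amount": "float", "region": "object"}

missing, mismatched = check_feature_types(train, expected)
print("missing:", missing)
print("mismatched:", mismatched)
```

Running a check like this before fit would let the mismatch fail fast with a readable message, or at least be reported in one place, instead of surfacing later as a casting error at predict time.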

AutoGluon Version: 1.x

Example fit log with a custom feature_metadata:

Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
			WARNING: Actual dtype differs from dtype in FeatureMetadata for feature "loan_id". Actual dtype: int | Expected dtype: float

At test time, when NaN is present, predict raises:

---------------------------------------------------------------------------
IntCastingNaNError                        Traceback (most recent call last)
Cell In[7], line 1
----> 1 predictor.predict(test_data)

File ~/SageMaker/autogluon_1_0_0_python_39/lib/python3.9/site-packages/autogluon/tabular/predictor/predictor.py:1931, in TabularPredictor.predict(self, data, model, as_pandas, transform_features, decision_threshold)
   1929 if decision_threshold is None:
   1930     decision_threshold = self.decision_threshold
-> 1931 return self._learner.predict(X=data, model=model, as_pandas=as_pandas, transform_features=transform_features, decision_threshold=decision_threshold)

File ~/SageMaker/autogluon_1_0_0_python_39/lib/python3.9/site-packages/autogluon/tabular/learner/abstract_learner.py:208, in AbstractTabularLearner.predict(self, X, model, as_pandas, inverse_transform, transform_features, decision_threshold)
    206     decision_threshold = 0.5
    207 X_index = copy.deepcopy(X.index) if as_pandas else None
--> 208 y_pred_proba = self.predict_proba(
    209     X=X, model=model, as_pandas=False, as_multiclass=False, inverse_transform=False, transform_features=transform_features
    210 )
    211 problem_type = self.label_cleaner.problem_type_transform or self.problem_type
    212 y_pred = get_pred_from_proba(y_pred_proba=y_pred_proba, problem_type=problem_type, decision_threshold=decision_threshold)

File ~/SageMaker/autogluon_1_0_0_python_39/lib/python3.9/site-packages/autogluon/tabular/learner/abstract_learner.py:188, in AbstractTabularLearner.predict_proba(self, X, model, as_pandas, as_multiclass, inverse_transform, transform_features)
    186 else:
    187     if transform_features:
--> 188         X = self.transform_features(X)
    189     y_pred_proba = self.load_trainer().predict_proba(X, model=model)
    190 y_pred_proba = self._post_process_predict_proba(
    191     y_pred_proba=y_pred_proba, as_pandas=as_pandas, index=X_index, as_multiclass=as_multiclass, inverse_transform=inverse_transform
    192 )

File ~/SageMaker/autogluon_1_0_0_python_39/lib/python3.9/site-packages/autogluon/tabular/learner/abstract_learner.py:464, in AbstractTabularLearner.transform_features(self, X)
    462 def transform_features(self, X):
    463     for feature_generator in self.feature_generators:
--> 464         X = feature_generator.transform(X)
    465     return X

File ~/SageMaker/autogluon_1_0_0_python_39/lib/python3.9/site-packages/autogluon/features/generators/abstract.py:351, in AbstractFeatureGenerator.transform(self, X)
    349 if self._pre_astype_generator:
    350     X = self._pre_astype_generator.transform(X)
--> 351 X_out = self._transform(X)
    352 if self._post_generators:
    353     X_out = self._transform_generators(X=X_out, generators=self._post_generators)

File ~/SageMaker/autogluon_1_0_0_python_39/lib/python3.9/site-packages/autogluon/features/generators/bulk.py:175, in BulkFeatureGenerator._transform(self, X)
    173 feature_df_list = []
    174 for generator in generator_group:
--> 175     feature_df_list.append(generator.transform(X))
    177 if not feature_df_list:
    178     X = DataFrame(index=X.index)

File ~/SageMaker/autogluon_1_0_0_python_39/lib/python3.9/site-packages/autogluon/features/generators/abstract.py:351, in AbstractFeatureGenerator.transform(self, X)
    349 if self._pre_astype_generator:
    350     X = self._pre_astype_generator.transform(X)
--> 351 X_out = self._transform(X)
    352 if self._post_generators:
    353     X_out = self._transform_generators(X=X_out, generators=self._post_generators)

File ~/SageMaker/autogluon_1_0_0_python_39/lib/python3.9/site-packages/autogluon/features/generators/astype.py:157, in AsTypeFeatureGenerator._transform(self, X)
    151             X[with_null_features] = X[with_null_features].fillna(0)
    153     if self._type_map_real_opt:
    154         # TODO: Confirm this works with sparse and other feature types!
    155         # FIXME: Address situation where test-time invalid type values cause crash:
    156         #  https://stackoverflow.com/questions/49256211/how-to-set-unexpected-data-type-to-na?noredirect=1&lq=1
--> 157         X = X.astype(self._type_map_real_opt)
    158 return X

File ~/SageMaker/autogluon_1_0_0_python_39/lib/python3.9/site-packages/pandas/core/generic.py:6513, in NDFrame.astype(self, dtype, copy, errors)
   6511 else:
   6512     try:
-> 6513         res_col = col.astype(dtype=cdt, copy=copy, errors=errors)
   6514     except ValueError as ex:
   6515         ex.args = (
   6516             f"{ex}: Error while type casting for column '{col_name}'",
   6517         )

File ~/SageMaker/autogluon_1_0_0_python_39/lib/python3.9/site-packages/pandas/core/generic.py:6534, in NDFrame.astype(self, dtype, copy, errors)
   6530     results = [ser.astype(dtype, copy=copy) for _, ser in self.items()]
   6532 else:
   6533     # else, only a single dtype is given
-> 6534     new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
   6535     res = self._constructor_from_mgr(new_data, axes=new_data.axes)
   6536     return res.__finalize__(self, method="astype")

File ~/SageMaker/autogluon_1_0_0_python_39/lib/python3.9/site-packages/pandas/core/internals/managers.py:414, in BaseBlockManager.astype(self, dtype, copy, errors)
    411 elif using_copy_on_write():
    412     copy = False
--> 414 return self.apply(
    415     "astype",
    416     dtype=dtype,
    417     copy=copy,
    418     errors=errors,
    419     using_cow=using_copy_on_write(),
    420 )

File ~/SageMaker/autogluon_1_0_0_python_39/lib/python3.9/site-packages/pandas/core/internals/managers.py:354, in BaseBlockManager.apply(self, f, align_keys, **kwargs)
    352         applied = b.apply(f, **kwargs)
    353     else:
--> 354         applied = getattr(b, f)(**kwargs)
    355     result_blocks = extend_blocks(applied, result_blocks)
    357 out = type(self).from_blocks(result_blocks, self.axes)

File ~/SageMaker/autogluon_1_0_0_python_39/lib/python3.9/site-packages/pandas/core/internals/blocks.py:616, in Block.astype(self, dtype, copy, errors, using_cow)
    596 """
    597 Coerce to the new dtype.
    598 
   (...)
    612 Block
    613 """
    614 values = self.values
--> 616 new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
    618 new_values = maybe_coerce_values(new_values)
    620 refs = None

File ~/SageMaker/autogluon_1_0_0_python_39/lib/python3.9/site-packages/pandas/core/dtypes/astype.py:238, in astype_array_safe(values, dtype, copy, errors)
    235     dtype = dtype.numpy_dtype
    237 try:
--> 238     new_values = astype_array(values, dtype, copy=copy)
    239 except (ValueError, TypeError):
    240     # e.g. _astype_nansafe can fail on object-dtype of strings
    241     #  trying to convert to float
    242     if errors == "ignore":

File ~/SageMaker/autogluon_1_0_0_python_39/lib/python3.9/site-packages/pandas/core/dtypes/astype.py:183, in astype_array(values, dtype, copy)
    180     values = values.astype(dtype, copy=copy)
    182 else:
--> 183     values = _astype_nansafe(values, dtype, copy=copy)
    185 # in pandas we don't store numpy str dtypes, so convert to object
    186 if isinstance(dtype, np.dtype) and issubclass(values.dtype.type, str):

File ~/SageMaker/autogluon_1_0_0_python_39/lib/python3.9/site-packages/pandas/core/dtypes/astype.py:101, in _astype_nansafe(arr, dtype, copy, skipna)
     96     return lib.ensure_string_array(
     97         arr, skipna=skipna, convert_na_value=False
     98     ).reshape(shape)
    100 elif np.issubdtype(arr.dtype, np.floating) and dtype.kind in "iu":
--> 101     return _astype_float_to_int_nansafe(arr, dtype, copy)
    103 elif arr.dtype == object:
    104     # if we have a datetime/timedelta array of objects
    105     # then coerce to datetime64[ns] and use DatetimeArray.astype
    107     if lib.is_np_dtype(dtype, "M"):

File ~/SageMaker/autogluon_1_0_0_python_39/lib/python3.9/site-packages/pandas/core/dtypes/astype.py:146, in _astype_float_to_int_nansafe(values, dtype, copy)
    142 """
    143 astype with a check preventing converting NaN to an meaningless integer value.
    144 """
    145 if not np.isfinite(values).all():
--> 146     raise IntCastingNaNError(
    147         "Cannot convert non-finite values (NA or inf) to integer"
    148     )
    149 if dtype.kind == "u":
    150     # GH#45151
    151     if not (values >= 0).all():

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer: Error while type casting for column 'loan_id'
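The pandas-level failure at the bottom of the traceback can be reproduced in isolation. This is a minimal sketch with made-up values for a `loan_id` column, not the actual AutoGluon code path. It shows the failing cast and two possible ways around it: pandas' nullable `Int64` dtype, or filling NaN before the cast (as the `fillna(0)` line in astype.py does for the features it tracks).

```python
import numpy as np
import pandas as pd

# Hypothetical test-time column: float dtype because it contains NaN.
test_col = pd.Series([101.0, np.nan, 103.0], name="loan_id")

# Casting float data containing NaN to a NumPy integer dtype fails.
# IntCastingNaNError is a ValueError subclass, so ValueError is caught here.
try:
    test_col.astype("int64")
    cast_failed = False
except ValueError as err:
    cast_failed = True
    print("cast failed:", err)

# Option 1: the nullable integer dtype keeps the missing value as <NA>.
nullable = test_col.astype("Int64")

# Option 2: fill missing values first, then cast to a plain integer dtype.
filled = test_col.fillna(0).astype("int64")

print(nullable.dtype, int(nullable.isna().sum()))
print(filled.tolist())
```

Which option is appropriate depends on whether downstream models can consume nullable dtypes; filling with 0 silently changes the data, so it is only reasonable when that substitution is intended.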
Innixma added the enhancement, module: tabular, and priority: 2 labels Mar 14, 2024
Innixma added this to the Feature Backlog milestone Mar 14, 2024
Innixma added the bug label Mar 14, 2024
Innixma modified the milestones: Feature Backlog, 1.1 Release Mar 14, 2024
Innixma modified the milestones: 1.1 Release, 1.2 Release Apr 12, 2024
Innixma modified the milestones: 1.2 Release, 2025 Tracker Nov 2, 2024