Embeddings Layer with Pipeline #30369
terencechow asked this question in Q&A
I'm using a Pipeline to transform my data and fit a custom neural-net estimator written in PyTorch. I'm doing this to take advantage of both Pipeline and GridSearchCV to understand hyperparameter changes. A simplified version of my setup is below.
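A minimal sketch of what I mean (the `TorchNetEstimator` class, column names, and grid values are all illustrative placeholders, not my real code):

```python
import numpy as np
import torch
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class TorchNetEstimator(BaseEstimator, RegressorMixin):
    """Placeholder for my custom PyTorch estimator."""

    def __init__(self, hidden_size=64, lr=1e-3, epochs=100):
        self.hidden_size = hidden_size
        self.lr = lr
        self.epochs = epochs

    def fit(self, X, y):
        X = torch.as_tensor(np.asarray(X), dtype=torch.float32)
        y = torch.as_tensor(np.asarray(y), dtype=torch.float32).reshape(-1, 1)
        self.model_ = torch.nn.Sequential(
            torch.nn.Linear(X.shape[1], self.hidden_size),
            torch.nn.ReLU(),
            torch.nn.Linear(self.hidden_size, 1),
        )
        opt = torch.optim.Adam(self.model_.parameters(), lr=self.lr)
        for _ in range(self.epochs):
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(self.model_(X), y)
            loss.backward()
            opt.step()
        return self

    def predict(self, X):
        X = torch.as_tensor(np.asarray(X), dtype=torch.float32)
        with torch.no_grad():
            return self.model_(X).numpy().ravel()


num_cols = ["age", "income"]        # placeholder column names
cat_cols = ["city", "device_type"]  # placeholder column names

preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    # Dense output so the torch estimator receives a plain array.
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat_cols),
])

pipe = Pipeline([("preprocess", preprocess), ("net", TorchNetEstimator())])

search = GridSearchCV(pipe, param_grid={"net__hidden_size": [32, 64],
                                        "net__lr": [1e-3, 1e-2]}, cv=3)
```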
This works fine for one-hot-encoded categories, since the cat transformer can use OneHotEncoder and X is simply transformed into a wider X. However, if I want to use embeddings for the categorical columns, the result isn't just a wider X; each categorical column needs its own lookup table. X itself doesn't change (beyond encoding the categories as integers), but I want the preprocessing step to pass some of that information on to the estimator.
For example, I'd need to know which indices in X are supposed to be embedded so I can retrieve the relevant embedding layer. I'd also like to know the number of categories in each embedded column, since that dictates the dimensions of the embedding layers I need to create. I'd like to do this as part of the pipeline, so that my grid search can easily change which columns to embed versus one-hot encode, and the estimator can determine its embedding layers dynamically, roughly as in the sketch below.
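Concretely, inside the estimator's fit I'd want to do something like this (a sketch; `cat_idx` and `cat_cardinalities` are the metadata I'd need handed over, and the values here are made up):

```python
import torch

# Metadata I'd need from the preprocessing step (values illustrative):
cat_idx = [3, 4]              # which columns of the transformed X hold ordinal codes
cat_cardinalities = [12, 37]  # number of distinct categories per embedded column

# One embedding table per categorical column; a common heuristic for the
# embedding dimension is min(50, (cardinality + 1) // 2).
embeddings = torch.nn.ModuleList(
    torch.nn.Embedding(card, min(50, (card + 1) // 2))
    for card in cat_cardinalities
)

def embed_cats(X):
    # X: float tensor of shape (batch, n_features); the columns in cat_idx hold
    # integer codes produced by something like OrdinalEncoder.
    parts = [emb(X[:, i].long()) for emb, i in zip(embeddings, cat_idx)]
    return torch.cat(parts, dim=1)
```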
I think it would be something like:

ColumnTransformer -> cat_transformer -> estimator requests the metadata during fit, etc.
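For illustration, the metadata itself is easy to read off a fitted encoder; it's the hand-off to the next pipeline step I can't find (the Pipeline wiring in the trailing comment is hypothetical, and `TorchNetEstimator` is the placeholder estimator from above):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X_cat = pd.DataFrame({"city": ["nyc", "sf", "nyc", "la"],
                      "device_type": ["ios", "android", "ios", "web"]})

enc = OrdinalEncoder()
enc.fit(X_cat)

# The metadata I need already exists on the fitted encoder...
cat_cardinalities = [len(cats) for cats in enc.categories_]  # [3, 3]

# ...but I can't find a supported way for a Pipeline to route it onward, i.e.
# something like this (hypothetical, not a real scikit-learn API):
#
#   pipe = Pipeline([("cat", enc), ("net", TorchNetEstimator())])
#   pipe.fit(X, y)  # -> TorchNetEstimator.fit(Xt, y, cat_cardinalities=[...])
```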
However, I'm not seeing any way to update metadata on the pipeline from a column transformer so that it flows to later steps in the pipeline. Is this possible / supported? If not, are there any suggestions?