Embeddings Layer with Pipeline #30369
terencechow asked this question in Q&A
I'm using a Pipeline to transform my data and fit a custom neural-net estimator written in PyTorch. I'm doing this to take advantage of both Pipeline and GridSearchCV to understand hyperparameter changes. A simplified version of my setup is below.
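A minimal sketch of what I mean (the `TorchNetEstimator` class, column names, and grid values are all illustrative placeholders, not my real code):

```python
import numpy as np
import torch
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class TorchNetEstimator(BaseEstimator, RegressorMixin):
    """Placeholder for my custom PyTorch estimator."""

    def __init__(self, hidden_size=64, lr=1e-3, epochs=100):
        self.hidden_size = hidden_size
        self.lr = lr
        self.epochs = epochs

    def fit(self, X, y):
        X = torch.as_tensor(np.asarray(X), dtype=torch.float32)
        y = torch.as_tensor(np.asarray(y), dtype=torch.float32).reshape(-1, 1)
        self.model_ = torch.nn.Sequential(
            torch.nn.Linear(X.shape[1], self.hidden_size),
            torch.nn.ReLU(),
            torch.nn.Linear(self.hidden_size, 1),
        )
        opt = torch.optim.Adam(self.model_.parameters(), lr=self.lr)
        for _ in range(self.epochs):
            opt.zero_grad()
            loss = torch.nn.functional.mse_loss(self.model_(X), y)
            loss.backward()
            opt.step()
        return self

    def predict(self, X):
        X = torch.as_tensor(np.asarray(X), dtype=torch.float32)
        with torch.no_grad():
            return self.model_(X).numpy().ravel()


num_cols = ["age", "income"]        # placeholder column names
cat_cols = ["city", "device_type"]  # placeholder column names

preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    # Dense output so the torch estimator receives a plain array.
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), cat_cols),
])

pipe = Pipeline([("preprocess", preprocess), ("net", TorchNetEstimator())])

search = GridSearchCV(pipe, param_grid={"net__hidden_size": [32, 64],
                                        "net__lr": [1e-3, 1e-2]}, cv=3)
```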
This works fine for one-hot-encoded categories, since the cat transformer can use OneHotEncoder and X is simply transformed into a wider X. However, if I want to use embeddings for the categorical columns, the result isn't just a wider X; each categorical column needs its own lookup table. X itself doesn't change (beyond encoding the categories as integers), but I want the preprocessing step to pass some of that information on to the estimator.
For example, I'd need to know which indices in X are supposed to be embedded so I can retrieve the relevant embedding layer. I'd also like to know the number of categories in each embedded column, since that dictates the dimensions of the embedding layers I need to create. I'd like to do this as part of the pipeline, so that my grid search can easily change which columns to embed versus one-hot encode, and the estimator can determine its embedding layers dynamically, roughly as in the sketch below.
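Concretely, inside the estimator's fit I'd want to do something like this (a sketch; `cat_idx` and `cat_cardinalities` are the metadata I'd need handed over, and the values here are made up):

```python
import torch

# Metadata I'd need from the preprocessing step (values illustrative):
cat_idx = [3, 4]              # which columns of the transformed X hold ordinal codes
cat_cardinalities = [12, 37]  # number of distinct categories per embedded column

# One embedding table per categorical column; a common heuristic for the
# embedding dimension is min(50, (cardinality + 1) // 2).
embeddings = torch.nn.ModuleList(
    torch.nn.Embedding(card, min(50, (card + 1) // 2))
    for card in cat_cardinalities
)

def embed_cats(X):
    # X: float tensor of shape (batch, n_features); the columns in cat_idx hold
    # integer codes produced by something like OrdinalEncoder.
    parts = [emb(X[:, i].long()) for emb, i in zip(embeddings, cat_idx)]
    return torch.cat(parts, dim=1)
```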
I think it would be something like:

ColumnTransformer -> cat_transformer -> estimator requests the metadata during fit, etc.
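For illustration, the metadata itself is easy to read off a fitted encoder; it's the hand-off to the next pipeline step I can't find (the Pipeline wiring in the trailing comment is hypothetical, and `TorchNetEstimator` is the placeholder estimator from above):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X_cat = pd.DataFrame({"city": ["nyc", "sf", "nyc", "la"],
                      "device_type": ["ios", "android", "ios", "web"]})

enc = OrdinalEncoder()
enc.fit(X_cat)

# The metadata I need already exists on the fitted encoder...
cat_cardinalities = [len(cats) for cats in enc.categories_]  # [3, 3]

# ...but I can't find a supported way for a Pipeline to route it onward, i.e.
# something like this (hypothetical, not a real scikit-learn API):
#
#   pipe = Pipeline([("cat", enc), ("net", TorchNetEstimator())])
#   pipe.fit(X, y)  # -> TorchNetEstimator.fit(Xt, y, cat_cardinalities=[...])
```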
However, I'm not seeing any way to update metadata on the pipeline from a column transformer so that it flows to later steps in the pipeline. Is this possible / supported? If not, are there any suggestions?