How to transform string data to numerical values when using make_batch_reader?

My Parquet dataset consists of two files:
```
  item_name  price
0    laptop   10.0
1      book   20.0
2       cup   30.0
  item_name  price
0     phone   11.0
1     dress   22.0
```

Since `make_batch_reader` only supports loading scalar data types, I tried to use `TransformSpec` to convert the `item_name` field to a one-hot encoding, using the following function:
```
import pandas as pd

def encode_and_bind(original_dataframe, feature_to_encode):
    # One-hot encode the column, append the dummy columns, then drop the original.
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return res
```
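One thing worth checking with this function: `pd.get_dummies` only creates columns for the values present in the DataFrame it receives, so each file (or row group) would produce a different set of `item_name_*` columns. Below is a minimal sketch of a variant that always emits the same columns, assuming the full vocabulary is known up front. `ITEM_NAMES` and `encode_with_fixed_categories` are illustrative names that are not part of the original code, and the two DataFrames just mirror the sample files above.

```python
import pandas as pd

# Assumption: the complete set of item names, taken from the two sample files.
ITEM_NAMES = ["book", "cup", "dress", "laptop", "phone"]

def encode_with_fixed_categories(df, column="item_name", categories=ITEM_NAMES):
    """One-hot encode `column` so that every batch yields the same columns."""
    # A categorical dtype with explicit categories makes get_dummies emit
    # one column per category, including categories absent from this batch.
    as_categorical = df[column].astype(pd.CategoricalDtype(categories=categories))
    dummies = pd.get_dummies(as_categorical, prefix=column, dtype="int64")
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)

# Stand-ins for the two Parquet files shown above.
first_file = pd.DataFrame({"item_name": ["laptop", "book", "cup"],
                           "price": [10.0, 20.0, 30.0]})
second_file = pd.DataFrame({"item_name": ["phone", "dress"],
                            "price": [11.0, 22.0]})

encoded_first = encode_with_fixed_categories(first_file)
encoded_second = encode_with_fixed_categories(second_file)
```

Both results have the columns `price`, `item_name_book`, `item_name_cup`, `item_name_dress`, `item_name_laptop`, `item_name_phone`, with zeros for the items that do not occur in that file.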


My code is as follows:

```
from petastorm import make_batch_reader
from petastorm.pytorch import BatchedDataLoader

dataset_url = "hdfs://my_data/parquet_dataset"
reader_epochs = 1
B_SIZE = 2

for training_epoch in range(1):
    with BatchedDataLoader(
        make_batch_reader(
            dataset_url,
            num_epochs=reader_epochs,
            schema_fields=[
                           "item_name_cup",
                           "item_name_book",
                           "price",
                           "item_name_laptop",
                           "item_name_dress",
                           "item_name_phone"],
            transform_spec=transform,
            seed=1,
            shuffle_rows=False,
            shuffle_row_groups=False),
        batch_size=B_SIZE
    ) as train_loader:

        for batch_idx, row in enumerate(train_loader):
            print(f"batch_idx:{batch_idx}")
            print(f"row:{row}")
            break
```

But I got `KeyError: "None of [Index(['item_name'], dtype='object')] are in the [columns]"`. How can I resolve this? I was expecting to get the following schema:
```
"price",  --> float
"item_name_cup",  --> int (0 or 1)
"item_name_book",  --> int (0 or 1)
"item_name_laptop",  --> int (0 or 1)
"item_name_dress",  --> int (0 or 1)
"item_name_phone"  --> int (0 or 1)
```
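A possible explanation for the `KeyError`, offered as a guess rather than a confirmed diagnosis: `schema_fields` selects columns from the stored Parquet schema before the transform runs, so `item_name` is never loaded and the transform function cannot find it. The new columns would instead be declared through the `edit_fields` and `removed_fields` arguments of `TransformSpec`. The sketch below builds those declarations. `ITEM_NAMES`, `EDIT_FIELDS`, `REMOVED_FIELDS`, and `build_transform_spec` are hypothetical names, and the `np.int64` dtype and non-nullable flag are assumptions.

```python
import numpy as np

# Assumption: the complete set of item names, taken from the sample data.
ITEM_NAMES = ["book", "cup", "dress", "laptop", "phone"]

# Each edit_fields entry is a 4-tuple: (name, numpy_dtype, shape, is_nullable).
# An empty shape () declares a scalar column.
EDIT_FIELDS = [(f"item_name_{name}", np.int64, (), False) for name in ITEM_NAMES]

# The original string column is dropped from the output schema.
REMOVED_FIELDS = ["item_name"]

def build_transform_spec(func):
    """Wrap `func` (DataFrame in, DataFrame out) in a TransformSpec.

    `func` has to return exactly the declared columns for every row group.
    """
    # Imported lazily so the declarations above can be inspected without petastorm.
    from petastorm import TransformSpec
    return TransformSpec(func, edit_fields=EDIT_FIELDS, removed_fields=REMOVED_FIELDS)
```

With this in place, `schema_fields` would either be omitted or list only the stored columns (`"item_name"` and `"price"`), and the result of `build_transform_spec(...)` would be passed as `transform_spec`.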

