How to transform the string data to numerical when using make_batch_reader? #788
My Parquet dataset consists of two files:

File 1:
   item_name  price
0     laptop   10.0
1       book   20.0
2        cup   30.0

File 2:
   item_name  price
0      phone   11.0
1      dress   22.0
Since make_batch_reader only supports loading scalar data types, I tried to use TransformSpec to convert the item_name field to a one-hot encoding matrix, using the following function:
import pandas as pd

def encode_and_bind(original_dataframe, feature_to_encode):
    # One-hot encode the column, append the dummy columns, then drop the original.
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return res
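Because the two files contain different item names, calling pd.get_dummies on each batch separately yields a different set of columns per batch. The following is a minimal sketch of one way to keep the columns stable, assuming the full list of item names is known in advance. ITEM_NAMES and encode_with_fixed_categories are hypothetical names introduced here for illustration, not part of the original code or of petastorm.

```python
import pandas as pd

# Assumption: the full vocabulary is known up front (hypothetical constant).
ITEM_NAMES = ["book", "cup", "dress", "laptop", "phone"]


def encode_with_fixed_categories(df, column, categories):
    """One-hot encode `column`, always emitting one column per category."""
    typed = df[column].astype(pd.CategoricalDtype(categories=categories))
    dummies = pd.get_dummies(typed, prefix=column, dtype="int64")
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)


first_file = pd.DataFrame(
    {"item_name": ["laptop", "book", "cup"], "price": [10.0, 20.0, 30.0]}
)
second_file = pd.DataFrame({"item_name": ["phone", "dress"], "price": [11.0, 22.0]})

encoded_first = encode_with_fixed_categories(first_file, "item_name", ITEM_NAMES)
encoded_second = encode_with_fixed_categories(second_file, "item_name", ITEM_NAMES)
# Both frames share the same columns, regardless of which names they contain.
print(encoded_first.columns.tolist() == encoded_second.columns.tolist())
```

Casting to a categorical dtype first makes get_dummies emit an all-zero column for every name missing from the batch, so every batch matches one declared schema.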
My code is as follows:
from petastorm import make_batch_reader
from petastorm.pytorch import BatchedDataLoader

dataset_url = "hdfs://my_data/parquet_dataset"
reader_epochs = 1
B_SIZE = 2
# `transform` is the TransformSpec wrapping encode_and_bind (its definition is not shown here).
for training_epoch in range(1):
    with BatchedDataLoader(
        make_batch_reader(
            dataset_url,
            num_epochs=reader_epochs,
            schema_fields=[
                "item_name_cup",
                "item_name_book",
                "price",
                "item_name_laptop",
                "item_name_dress",
                "item_name_phone"],
            transform_spec=transform,
            seed=1,
            shuffle_rows=False,
            shuffle_row_groups=False),
        batch_size=B_SIZE
    ) as train_loader:
        for batch_idx, row in enumerate(train_loader):
            print(f"batch_idx:{batch_idx}")
            print(f"row:{row}")
            break
But I got the following error:
KeyError: "None of [Index(['item_name'], dtype='object')] are in the [columns]"
How may I resolve this? I was expecting the following schema:
"price"            --> float
"item_name_cup"    --> int (0 or 1)
"item_name_book"   --> int (0 or 1)
"item_name_laptop" --> int (0 or 1)
"item_name_dress"  --> int (0 or 1)
"item_name_phone"  --> int (0 or 1)
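One likely reading of the KeyError, not confirmed in this thread: schema_fields selects which stored columns are loaded, so when only "price" matches a stored column, the transform function receives a DataFrame without item_name. The sketch below reproduces that error in plain pandas and shows how a TransformSpec could declare the new columns through its edit_fields and removed_fields arguments. ITEM_NAMES, ONE_HOT_COLUMNS, one_hot_item_name and build_transform_spec are hypothetical names, and the fixed vocabulary is an assumption.

```python
import numpy as np
import pandas as pd

# Assumption: the complete set of item names is known ahead of time.
ITEM_NAMES = ["cup", "book", "laptop", "dress", "phone"]
ONE_HOT_COLUMNS = [f"item_name_{name}" for name in ITEM_NAMES]


def one_hot_item_name(df):
    """Replace item_name with one 0/1 column per entry in ITEM_NAMES."""
    dummies = pd.get_dummies(df["item_name"], prefix="item_name", dtype=np.int64)
    # Add all-zero columns for names absent from this batch, in a fixed order.
    dummies = dummies.reindex(columns=ONE_HOT_COLUMNS, fill_value=0)
    return pd.concat([df.drop(columns=["item_name"]), dummies], axis=1)


def build_transform_spec():
    """Describe the schema change to petastorm (imported lazily here)."""
    from petastorm.transform import TransformSpec

    return TransformSpec(
        one_hot_item_name,
        # (name, numpy dtype, shape, nullable) for every column the function adds.
        edit_fields=[(col, np.int64, (), False) for col in ONE_HOT_COLUMNS],
        removed_fields=["item_name"],
    )


# The same KeyError appears in plain pandas when item_name was never loaded.
price_only = pd.DataFrame({"price": [10.0, 20.0, 30.0]})
try:
    price_only[["item_name"]]
    missing_column_error = None
except KeyError as exc:
    missing_column_error = exc

second_file = pd.DataFrame({"item_name": ["phone", "dress"], "price": [11.0, 22.0]})
encoded = one_hot_item_name(second_file)
print(encoded.columns.tolist())
```

Under this reading, the reader call would list the stored columns, for example schema_fields=["item_name", "price"] (or omit schema_fields entirely), and pass transform_spec=build_transform_spec(); the one-hot column names would then appear only in the spec, not in schema_fields.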