Seeing worse model performance from using petastorm vs normal pytorch dataloader

Hello! I'm using an open-source template of skip-gram to train ID embeddings from parquet data (stored in parquet format, with a few columns all int64 type) with `petastorm` but I'm getting far lower quality / worse embedding representations from using this library in comparison to just loading data directly from a PyTorch's non-distributed dataloader (with everything held constant, such as batch size, learning rate, data, and so on). 

I'm wondering if I'm simply loading data w/ the library in an incorrect way. Here's a snippet of my code that uses `petastorm.`

```
...
        for epoch in range(starting_epoch, total_epochs):

            print(f'Beginning epoch: {epoch + 1}/{total_epochs}')
            running_loss = 0.0 
            num_examples = 0
            with DataLoader(make_batch_reader(self.data_path), batch_size=self.args.batch_size) as train_loader:
                for batch_idx, row in enumerate(train_loader):
                # Unpack data
                    center, context, neg1, neg2, neg3 = row['id1_mapped'], row['id2_mapped'], row['id3_mapped'], row['id4_mapped'], row['id5_mapped']
                    center = torch.unsqueeze(center, 1)
                    context = torch.unsqueeze(context, 1)
                    neg1 = torch.unsqueeze(neg1, 1)
                    neg2 = torch.unsqueeze(neg2, 1)
                    neg3 = torch.unsqueeze(neg3, 1)
                    
                    center, context = center.to(self.args.device), context.to(self.args.device)
                    
                    # Remove accumulated gradients
                    self.optim.zero_grad()
                    # Get context vectors
                    center_embed, context_embed = self.model(center, context)
                    # print(center_embed.shape, context_embed.shape)
                    # Calc loss: SGNS
                    loss = self.sgns(center_embed, context_embed, neg1, neg2, neg3)
                    # break
                    # Backprop and update
                    loss.backward()
                    self.optim.step()

                    running_loss += loss.item()
                    global_step += 1
                    num_examples += len(center)  # Last batch's size may not equal args.batch_size

                norm = (batch_idx + 1) * num_examples

                self.evaluate_embeddings(epoch)
                self.log_and_save_epoch(epoch, running_loss / norm)

                self.log_step(epoch, global_step, running_loss / norm)#, testing_loss / norm)
...

```
My only hunch at this point is that I'm not properly shuffling the data between epochs, which will cause the data within each row group to have the same sequential order unlike PyTorch dataloader which shuffles entire dataset (from a csv file) before each epoch. Is there a straightforward way to add epoch shuffling in petastorm? (since all the data is stored in multiple parquet files).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Seeing worse model performance from using petastorm vs normal pytorch dataloader #793

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Seeing worse model performance from using petastorm vs normal pytorch dataloader #793

Description

Activity

Data-drone commented on Apr 23, 2023

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions