torch.save fails with pickler.dump for bytes #1094

Closed
@zhangguanheng66

Description

❓ Questions and Help

This is probably not an issue with PyTorch itself. I am trying to torch.save an op whose pybind11 pickle state contains bytes. Here is the pybind11 registration link.

This is how I torch.save the op:

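import torch
import torchtext
from torchtext.experimental.transforms import PRETRAINED_SP_MODEL, load_sp_model

# Download the pretrained SentencePiece model and load it through the pybind11 binding.
sp_model_path = torchtext.utils.download_from_url(PRETRAINED_SP_MODEL['text_unigram_25000'])
sp_model = load_sp_model(sp_model_path)

# Use a context manager so the file is closed even when torch.save raises.
with open("temp.pt", "wb") as f:
    torch.save(sp_model, f)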

The error message I got:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-7-c66cdf260b7d> in <module>
      1 f = open("temp.pt", "wb")
----> 2 torch.save(sp_model, f)

~/tmp/PyTorch/pytorch/torch/serialization.py in save(obj, f, pickle_module, pickle_protocol, _use_new_zipfile_serialization)
    370         if _use_new_zipfile_serialization:
    371             with _open_zipfile_writer(opened_file) as opened_zipfile:
--> 372                 _save(obj, opened_zipfile, pickle_module, pickle_protocol)
    373                 return
    374         _legacy_save(obj, opened_file, pickle_module, pickle_protocol)

~/tmp/PyTorch/pytorch/torch/serialization.py in _save(obj, zip_file, pickle_module, pickle_protocol)
    474     pickler = pickle_module.Pickler(data_buf, protocol=pickle_protocol)
    475     pickler.persistent_id = persistent_id
--> 476     pickler.dump(obj)
    477     data_value = data_buf.getvalue()
    478     zip_file.write_record('data.pkl', data_value, len(data_value))

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 119: invalid start byte
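
For reference, the same class of error can be triggered in plain Python by decoding arbitrary binary data as UTF-8. This is only a minimal sketch, not the torchtext code path: `blob` is a made-up stand-in for serialized model content, which is binary and not guaranteed to be valid UTF-8.

```python
# Hypothetical stand-in for a serialized binary blob (not real model data).
blob = b"\x0a\x05hello\xc0\x01"

try:
    # Treating binary content as text fails as soon as an invalid byte is hit.
    blob.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

The byte 0xc0 can never start a valid UTF-8 sequence, which matches the "invalid start byte" wording in the traceback above.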

cc @mruberry

Activity

zhangguanheng66 (Contributor, Author) commented on Dec 7, 2020

This is about the usage of PyBind11. See https://pybind11.readthedocs.io/en/stable/advanced/cast/strings.html#return-c-strings-without-conversion

Changing the binding to the following works:
https://github.com/mthrok/text/blob/b0f88ba56603590444a5c10e216573ebce0740ad/torchtext/csrc/register_bindings.cpp#L52-L74
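
To illustrate why exposing the state as bytes helps, here is a minimal Python-side sketch. Per the pybind11 documentation linked above, a C++ std::string return value is decoded as UTF-8 into a Python str by default, while wrapping it in py::bytes hands the raw bytes to Python unchanged. The `blob` and `state` names are hypothetical and stand in for whatever the binding actually returns.

```python
import pickle

# Hypothetical stand-in for serialized model content (binary, not valid UTF-8).
blob = b"\x0a\x05hello\xc0\x01"

# State as a binding might return it once the content is exposed as bytes.
state = (blob,)

# pickle stores bytes verbatim, so the round trip needs no text decoding.
restored = pickle.loads(pickle.dumps(state))
print(restored[0] == blob)
```

Because the content never passes through a str, no UTF-8 decoding is attempted at any point in the round trip.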

Thanks @mthrok, how about the torchbind one? I saw a similar issue when saving and loading the torchbind SP model. As you found the solution, do you want to send out a PR and land the pybind pickle support for SP model?

mthrok (Contributor) commented on Dec 8, 2020

@zhangguanheng66

I saw a similar issue when saving and loading the torchbind SP model.

Is there an issue reported that I can look at?

zhangguanheng66 (Contributor, Author) commented on Dec 8, 2020

@zhangguanheng66

I saw a similar issue when saving and loading the torchbind SP model.

Is there an issue reported that I can look at?

Yup. Here is a code snippet that reproduces the serialization issue with the torchbind SentencePiece model:

import torchtext
import torch
from torchtext.experimental.transforms import PRETRAINED_SP_MODEL, sentencepiece_tokenizer
sp_model_path = torchtext.utils.download_from_url(PRETRAINED_SP_MODEL['text_unigram_25000'])
sp_model = sentencepiece_tokenizer(sp_model_path).to_ivalue()
torch.save(sp_model, "temp.pt")

The same issue is observed for the pybind one.

mthrok (Contributor) commented on Dec 8, 2020

Okay, so I believe this is not a PyTorch issue, and I am transferring the issue to torchtext.

transferred this issue from pytorch/pytorch on Dec 8, 2020
self-assigned this on Dec 8, 2020

mthrok (Contributor) commented on Dec 23, 2020

Addressed in #1104
