
Encoding issue? #608

Closed
@ardumont

Description

Hello,

I have a repository whose ref names and commit messages are in Cyrillic:

>>> import os
>>> list(os.walk(b'refs/heads'))
[(b'refs/heads', [], [b'\xcd\xee\xe2\xe0\xff\xe2\xe5\xf2\xea\xe01', b'master'])]
>>> s = b'\xcd\xee\xe2\xe0\xff\xe2\xe5\xf2\xea\xe01'
>>> s.decode('latin1')
'Íîâàÿâåòêà1'  # seems like rubbish
>>> s.decode('cp1251')
'Новаяветка1'  # looks like Russian

Google Translate renders 'Новаяветка1' as `newlight1`, so the branch name appears to be Russian text encoded as cp1251.

Somehow, this makes dulwich break:

$ python3
>>> from dulwich.repo import Repo
>>> r = Repo('.')
>>> r.refs
DiskRefsContainer('.')
>>> r.refs.allkeys()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/dulwich/refs.py", line 470, in allkeys
    sys.getfilesystemencoding())
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 11-20: surrogates not allowed
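
The message is consistent with a `surrogateescape` round trip that is only applied in one direction. The sketch below reproduces the same kind of error in plain Python, without dulwich. It assumes the ref path was first decoded from bytes with `errors='surrogateescape'` (which is what Python does for undecodable filesystem names on POSIX) and then re-encoded strictly as UTF-8; that second step is a guess about what `refs.py` does, not something the traceback confirms. The `refs/heads/` prefix is also an assumption, chosen because it is 11 characters long and would put the ten non-ASCII bytes at positions 11-20.

```python
# Raw ref path as bytes: an 11-byte prefix followed by the cp1251 branch name.
raw = b'refs/heads/' + b'\xcd\xee\xe2\xe0\xff\xe2\xe5\xf2\xea\xe01'

# Decoding invalid UTF-8 with surrogateescape maps each undecodable byte
# to a lone surrogate (U+DC80..U+DCFF) instead of raising.
decoded = raw.decode('utf-8', errors='surrogateescape')

# Strict re-encoding rejects those lone surrogates.
try:
    decoded.encode('utf-8')
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc

# Re-encoding with the same error handler restores the original bytes.
roundtrip = decoded.encode('utf-8', errors='surrogateescape')

print(strict_error)
print(roundtrip == raw)
```

If that assumption holds, encoding with `errors='surrogateescape'` (or using `os.fsencode`) instead of a strict encode would preserve the original bytes, whatever encoding the branch name was written in.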

Based on my understanding of the docs, I don't think this is the expected behavior.

Do you know how I could work around this?

Thanks for your help.

Cheers,
