Closed
Hi
I have two RTF files in Cyrillic. I can read both without issue in MS Word.
The first seems to work fine with:
from striprtf.striprtf import rtf_to_text
with open(fullpath) as infile:
    content = infile.read()
text = rtf_to_text(content, errors='ignore')
The second (bad.zip) gets turned into Chinese characters.
Sample output from the good one:
>>> tabtext =text.split("|||")
>>> print(tabtext[0])
Таблиця розподілу номерного ресурсу
Кіровоградська область|
Код зони - 52
Sample output from the bad one:
>>> tabtext =text.split("|")
>>> print(tabtext[0])
亦犭桷 痤顼钿畴 眍戾痦钽 疱耋瘃
它獬怦赅 钺豚耱鼃
暑 珙龛 - 32
If I leave out the "ignore", I get:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xff in position 6: illegal multibyte sequence
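A minimal sketch of what might be going on, assuming the bad file actually holds Windows-1251 (Cyrillic) bytes that end up being decoded as GBK. I have not confirmed this against the file itself. The word is taken from the good sample output and the variable names are only for illustration:

```python
# Assumption: the text is stored as Windows-1251 bytes, but is decoded as GBK.
raw = "Таблиця".encode("cp1251")  # 7 bytes, one per Cyrillic letter

# The last letter, "я", is byte 0xFF in cp1251, which is not valid in GBK,
# so a strict decode fails on it.
try:
    raw.decode("gbk")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# With errors="ignore" the invalid byte is dropped and the remaining
# bytes pair up into CJK characters instead of Cyrillic letters.
garbled = raw.decode("gbk", errors="ignore")

# Decoding the same bytes as cp1251 gives the intended text back.
restored = raw.decode("cp1251")

print(strict_error)
print(garbled)
print(restored)
```

If that assumption holds, the file probably declares (or is treated as having) the wrong code page, and the "ignore" mode silently loses any bytes that GBK cannot decode.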
Any idea how I can work around this?