#3313 closed defect (fixed)
unicode error when invoking lyx2lyx
Reported by: | Uwe Stöhr | Owned by: | José Matos |
---|---|---|---|
Priority: | high | Milestone: | 1.5.5 |
Component: | lyx2lyx | Version: | 1.5.0svn |
Severity: | major | Keywords: | |
Cc: | jamatos@…, j.spitzmueller@…, georg.baum@…, anek@… |
Description
Try to export the attached LyX-file to LyX 1.3.x or 1.4.x-format. Result is this
error:
An error occurred whilst running python -tt "C:/Program Files (x86)/LyX 1.5beta2
-xx
Traceback (most recent call last):
  File "C:/Program Files (x86)/LyX 1.5beta2-xx-02-2007/Resources/lyx2lyx/lyx2lyx", line 101, in <module>
    sys.exit(main(sys.argv))
  File "C:/Program Files (x86)/LyX 1.5beta2-xx-02-2007/Resources/lyx2lyx/lyx2lyx", line 95, in main
    file.write()
  File "C:\Program Files (x86)\LyX 1.5beta2-xx-02-2007\Resources\lyx2lyx\LyX.py", line 274, in write
    self.output.write(line.encode(self.encoding)+"\n")
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2044' in position 3: ordinal not in range(256)
Error: Cannot convert file
Attachments (8)
Change History (67)
by , 17 years ago
comment:1 by , 17 years ago
Milestone: | → 1.5.0 |
---|
comment:2 by , 17 years ago
That happens if a character cannot be encoded in the encoding of the old file
format. We can probably not do much here, only give a better error message.
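A friendlier message could be produced by catching the exception where the line is encoded. The following is only a sketch in present-day Python 3, not the lyx2lyx code; the helper name `encode_line` and the wording of the message are assumptions made for illustration.

```python
def encode_line(line, encoding):
    """Encode one line; report the offending character if it does not fit."""
    try:
        return line.encode(encoding)
    except UnicodeEncodeError as err:
        # err.start is the index of the first character the codec rejected.
        bad = line[err.start]
        raise ValueError(
            "cannot write character U+%04X in encoding %r; replace it or "
            "choose another document encoding" % (ord(bad), encoding)
        ) from err
```

A caller would still have to decide whether to abort the conversion or to substitute the character; the sketch only makes the failure readable.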
comment:3 by , 17 years ago
The case of an en-dash gives an error:
...
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position 32: character maps to <undefined>
Error: Cannot convert file
Would it be possible to catch this error (and the one for an em-dash) and convert to -- (---)?
For viewing and PostScript export, the problem was solved in #3565.
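The dash substitution asked for here can be sketched as a plain character mapping applied before the line is encoded. This is an illustration in Python 3 only; the table `DASHES` and the function `replace_dashes` are hypothetical names, not part of lyx2lyx.

```python
# Hypothetical mapping: EN DASH -> "--", EM DASH -> "---".
DASHES = {0x2013: "--", 0x2014: "---"}

def replace_dashes(line):
    """Replace en and em dashes by their ASCII spellings."""
    return line.translate(DASHES)
```

After the substitution, a line that contained only these two dashes as non-Latin-1 characters can be encoded without error.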
comment:4 by , 17 years ago
Yes. See revert_misc_unicode() in
http://www.lyx.org/trac/browser/lyx-devel/branches/personal/baum/playground/lib/lyx2lyx/lyx_1_5.py
for a start. There are several Unicode characters that can be converted to LyX
insets (or, more trivially, to simple characters as in the dash case).
comment:5 by , 17 years ago
I tried to make a function based on Georg's code. This seems to work.
I have never worked with any part of the LyX code, and never with unicode, so the code may be very
ugly, but it works (at least for me).
Some problems:
- Relative paths (to unicodesymbols) don't work for me (perhaps because I presume the wrong parent
directory). How do I get around that?
- I'm not sure where in the conversion hub to call the function. Calling it at convert 246 and revert 270
works for me, but this was a random choice.
Further, I have not implemented the case of preamble flags. It shouldn't be difficult, but I don't know
where in the document.header to put them and I don't know if there are other things that need to be
considered (e.g. document class). The current code captures most "simple cases" like registered trade
mark, en dash, e acute, etc.
So here's the code. Please feel free to slash it. By the way, is there a style guide for LyX Python scripts?
def revert_unicode(document):
    # Transform unicode symbols according to the unicode list
    # Absolute path used. Should be changed.
    fp = open('/Users/anek/LyX_dev/nightly/LyX_May_25.app/Contents/Resources/unicodesymbols', 'r')
    spec_chars = {}
    for line in fp.readlines():
        if line[0] != '#':
            line = line.replace('"', '')  # remove all quotation marks
            try:
                # flag1 and flag2 are preamble flags. NOTE! Not implemented
                # flag1=# -> no flags, flag2=# -> only flag 1
                [ucs4, command, flag1, flag2] = line.split(None, 3)
                ert_intro = '\n\n\\begin_inset ERT\nstatus collapsed\n\\begin_layout Standard\n\\backslash\n'
                ert_outro = '\n\\end_layout\n\\end_inset\n\n'
                if command[0:2] == '\\\\':
                    command = command.replace('\\\\', ert_intro)
                    command = command + ert_outro
                spec_chars[unichr(eval(ucs4))] = [command, flag1, flag2]
            except:
                pass
    fp.close()
    for i in range(len(document.body)):
        for j in range(len(document.body[i])):
            if spec_chars.has_key(document.body[i][j]):
                document.body[i] = document.body[i][0:j] + spec_chars[document.body[i][j]][0] + document.body[i][j+1:]
comment:7 by , 17 years ago
This is even more than I had in mind. Some notes:
- You can find python code to read the unicodesymbols file also in
development/tools/unicodesymbols.py
- This function should only be called for revert, not for convert. The best
place is at the 249->248 step.
- In general, it needs some safety checks added, and it needs to behave
differently if the character is already in ERT.
- Some characters could be converted to native LyX insets, e.g. the different
quotes. That would have the advantage that a round trip would not add ERT.
If you want to have this included, get Jose to have a look at the remaining
problems.
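For readers who want to experiment with reading the unicodesymbols file, a line can be tokenized with the standard `shlex` module. This is a sketch in Python 3 and is not the code in development/tools/unicodesymbols.py; it assumes a line shape of a hexadecimal code point, a double-quoted command with a doubled backslash, optional quoted flags and a trailing `#` comment. The function name `parse_symbol_line` is hypothetical.

```python
import shlex

def parse_symbol_line(line):
    """Return (character, command, flags) or None for blank/comment lines."""
    # comments=True drops everything after '#'; inside double quotes,
    # shlex collapses a doubled backslash into a single one.
    fields = shlex.split(line, comments=True)
    if len(fields) < 2:
        return None
    char = chr(int(fields[0], 16))   # accepts the "0x" prefix
    return char, fields[1], fields[2:]
```

Using a tokenizer instead of stripping all quotation marks keeps empty quoted fields apart from missing ones, which matters once the preamble flags are implemented.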
comment:8 by , 17 years ago
attachments.isobsolete: | 0 → 1 |
---|
comment:9 by , 17 years ago
Cc: | added |
---|---|
Keywords: | patch added |
--> (http://bugzilla.lyx.org/attachment.cgi?id=1846&action=view)
Improved function to replace unicode to be included in lyx_1_5.py
Please send it to the lyx-devel list so that it can be approved by other
developers and then be committed to the LyX sources.
by , 17 years ago
Attachment: | revert_unicode_070530.txt added |
---|
Patch previously posted to the lyx-devel list
comment:10 by , 17 years ago
attachments.isobsolete: | 0 → 1 |
---|
comment:11 by , 17 years ago
hello, (i'm no expert)
i had this problem too. i wanted to convert from LyX 1.5.0rc1 lyxformat 271 to
lyxformat 245. in my file i had quote signs in a footnote. after i removed them
manually the conversion worked.
comment:13 by , 17 years ago
The status is that the current patch works and has been stable for me, but since I am not
familiar with the LyX document format there may of course be mistakes. I tested for any such as well
as I could.
I didn't get any response on whether to add commands to the preamble, so I haven't done anything there.
This means that the conversion to 1.4 will work but LyX will complain about a missing package for some of
the special characters. Still, I think this is better than the current failure of the conversion.
Finally, for a unicode character not in the unicodesymbols list, the conversion will fail as before.
comment:14 by , 17 years ago
Is someone working on something better here?
Otherwise I would suggest that the patch goes in (unless someone found errors in the code) so that there
is a little time for feedback before RC2 (and to make life a little bit easier for us working with colleagues
who use 1.4 ;-)
comment:15 by , 17 years ago
Anders, José applied your patch:
http://www.lyx.org/trac/changeset/18890
Can I mark this bug as fixed or are there some issues left?
comment:16 by , 17 years ago
is it possible to change the revised name of the translated document to
xxx_14.lyx (instead of xxx.lyx14). The current version requires a manual
renaming on the Mac, which is a mess.
Please open a new bug report for this.
Tested the patched LyX and found a bug I should have seen (since I am Swedish I
should have tested for ä and ö...).
Please send this patch to the lyx devel-list.
the test case I submitted to this bug has problems in LyX1.5. It complains
that some characters are not representable in the chosen encoding.
The problem is that you used the character 2264 "≤" in a formula. But there you
must create this character using the command "\le" or by using the math toolbar.
comment:17 by , 17 years ago
Please send this patch to the lyx devel-list.
Done
Please open a new bug report for this.
Done -- #3934
The problem is that you used the character 2264 "≤" in a formula. But there you
must create this character using the command "\le" or by using the math toolbar.
Yes, that was the point. I copied the character from a unicode text. Since it pastes, you would assume
that the character is converted to a command in the process when it is converted to a pdf (as it is when
it is converted to LyX1.4). In other words, the end-user would expect anything that is possible to paste
to come out in print.
comment:18 by , 17 years ago
Subject: Re: unicode error when invoking lyx2lyx
In other words, the end-user would expect anything that is possible to paste to come out in print.
I see, so another bug report is needed for this. Could you do this please?
comment:19 by , 17 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
The last part to fix this is in:
http://www.lyx.org/trac/changeset/18918
comment:20 by , 17 years ago
Resolution: | fixed |
---|---|
Severity: | normal → blocker |
Status: | closed → reopened |
Not fixed: Open the UserGuide and try to export it to LyX 1.4.x format:
  File "C:\Program Files (x86)\LyX 1.5rc2-27-05-2007\Resources\lyx2lyx\LyX.py", line 278, in write
    self.output.write(line.encode(self.encoding)+"\n")
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u03bc' in position 0: ordinal not in range(256)
Error: Cannot convert file
I expect this should be easy to fix, Anders?
A blocker for LyX 1.5.0.
comment:21 by , 17 years ago
Does it get as far as revert_unicode before crashing? I get in the console
File "/Users/anek/Desktop/LyX_15rc2.app/Contents/Resources/lyx2lyx/lyx2lyx", line 101, in ?
sys.exit(main(sys.argv))
File "/Users/anek/Desktop/LyX_15rc2.app/Contents/Resources/lyx2lyx/lyx2lyx", line 95, in main
file.write()
File "/Users/anek/Desktop/LyX_15rc2.app/Contents/Resources/lyx2lyx/LyX.py", line 278, in write
self.output.write(line.encode(self.encoding)+"\n")
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u03bc' in position 0: ordinal not in range(256)
Error: Cannot convert file
If it gets to the revert_unicode function, the conversion will fail if the symbol is not in the
unicodesymbols list.
A search for \u03bc gives
<!ENTITY mu CDATA "μ" -- greek small letter mu, U03BC ISOgrk3 -->
the closest I find in the unicodesymbols list is
0x00b5 "\\textmu" "textcomp" "force" # µ MICRO SIGN
which I guess is not the same (as you understand, I am far from an expert in unicode). So I guess this is
why the script failed.
There was a discussion about Greek characters on the list before and, if I remember correctly,
the consensus was that these should not be added to the unicodesymbols list. Instead the text should be
marked as Greek.
So, provided the error is what I am guessing here, I see three options: expand the unicodesymbols list
(a lot of work), mark the relevant text section as Greek or input it as a symbol (should be easy), or add a
fail-safe check in lyx2lyx that translates remaining non-translated unicode symbols to question marks
or similar (preferable I guess, but probably tricky since I guess the implementation will depend on the
encoding to which the document is translated).
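The fail-safe option may be less encoding-dependent than feared, because the codec machinery can do the substitution itself. A minimal sketch in Python 3 follows; the helper name `encode_lossy` is hypothetical and this is not what lyx2lyx does.

```python
def encode_lossy(line, encoding):
    """Encode a line, writing '?' for every character the codec cannot represent."""
    return line.encode(encoding, errors="replace")
```

The same call works for any target encoding, at the price of silently losing the original character, so it would only be suitable as a last resort after the unicodesymbols lookup.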
comment:23 by , 17 years ago
Cc: | added |
---|
The problematic character *is* the correct µ symbol, which is in the
unicodesymbols list and does get output as \textmu in LaTeX. So something else
must go wrong.
BTW it doesn't crash on Linux (it only outputs an error message).
comment:24 by , 17 years ago
Sorry for my vagueness (it was quite late...). LyX doesn't crash on the Mac either. The conversion script
fails and results in an error message.
Removing µ resolves this. And I can replicate the error by just creating a new document and pasting (or
typing) µ. I can't understand why this is. Taking a similar character like ± the conversion works (as
known, I need to add \usepackage{textcomp} manually to the Preamble to get any pdf output).
What I wonder about is the difference between the character we have:
http://www.fileformat.info/info/unicode/char/03bc/index.htm
and the micro sign
http://www.fileformat.info/info/unicode/char/00b5/index.htm
The unicodesymbols list should be for the latter.
Adding
0x03bc "\\textmu" "textcomp" "force" # µ MICRO SIGN
to the unicodesymbols list solves the problem.
The document still won't open in LyX1.4 though (worse, it gets into an eternal loop...).
I think this could be related to
#3404
although in this case the console claims some missing \end_layout (and not too many). There's no
mention of unicode errors in the error message.
So my suggestion is to add an entry for 0x03bc to the unicodesymbols file (I don't dare to submit a patch
with the line above since I don't know if that is the correct command to use).
comment:25 by , 17 years ago
What I wonder about is the difference between the character we have:
http://www.fileformat.info/info/unicode/char/03bc/index.htm
and the micro sign
http://www.fileformat.info/info/unicode/char/00b5/index.htm
OK, this is the bug: "03bc" is the Greek character, not the µ character used in e.g. "µm". I'll correct
this in the manuals.
The unicodesymbols list should be for the latter.
Adding
0x03bc "\\textmu" "textcomp" "force" # µ MICRO SIGN
to the unicodesymbols list solves the problem.
We agreed not to translate Greek characters. By the way, 03bc looks a bit different from "00b5".
Thanks for investigating.
comment:26 by , 17 years ago
OK, this is the bug: "03bc" is the Greek character, not the µ character used in
e.g. "µm". I'll correct this in the manuals.
This does not fix it for me. As I wrote in comment 26: this is already the
correct character, but it somehow gets interpreted as the greek mu.
Testcase:
- new document
- M-x unicode-insert 0x00b5
- export to 1.4
=> same error.
comment:27 by , 17 years ago
This does not fix it for me.
Not even if you modify the unicodesymbols file? (OK, it's technically not a fix, but does that make the
conversion work?)
As I wrote in comment 26: this is already the correct character,
but it somehow gets interpreted as the greek mu.
So somewhere between the input and the call to revert_unicode the µ character gets re-interpreted.
I guess this should be filed as a new bug.
comment:28 by , 17 years ago
Not even if you modify the unicodesymbols file?
I didn't try. But this does only hide the bug anyway.
So somewhere between the input and the call to revert_unicode the µ character
gets re-interpreted.
Looks like it.
I guess this should be filed as a new bug.
No. I think this is the bug.
comment:29 by , 17 years ago
Cc: | added |
---|
I'm pretty sure this is a python bug:
MICRO is misread as '\u03bc' (GREEK SMALL LETTER MU). The same error occurs for
OHM, which is parsed as '\u03a9' (GREEK CAPITAL LETTER OMEGA).
I don't know what we can do about that. Georg?
comment:30 by , 17 years ago
The character is changed in revert_accent, probably by the call of
unicodedata.normalize. I don't think that this is special python behaviour. I
rather suspect that 0x03bc and 0x00b5 are considered equal by the unicode
standard.
If that is the case then they should not produce different output in LyX
either.
comment:31 by , 17 years ago
I rather suspect that 0x03bc and 0x00b5 are considered equal by
the unicode standard.
So it seems, see:
http://www.cs.tut.fi/~jkorpela/chars/si.html
(search for micro)
If that is the case then they should not produce different output in
LyX either.
Then I guess the easiest fix is to add the compatibility characters to the unicode symbols list with
identical commands?
comment:32 by , 17 years ago
Then I guess the easiest fix is to add the compatibility characters to the
unicode symbols list with
identical commands?
No, I still think we should distinguish the unit symbols from the greek
characters, if possible.
comment:33 by , 17 years ago
No, I still think we should distinguish the unit symbols from the
greek characters, if possible.
Replacing
unicodedata.normalize("NFKD", ...
with
unicodedata.normalize("NFD", ...
and
unicodedata.normalize("NFKC", ...
with
unicodedata.normalize("NFC", ...
seems to do the trick, though I have no idea whether this causes unwanted side effects.
comment:34 by , 17 years ago
I just found that out as well. However, it only works for MICRO, not for OHM.
comment:35 by , 17 years ago
However, it only works for MICRO, not for OHM.
Strange...
I guess this hints at a key question:
Can we be sure that (now or in the future) no conversion takes place anywhere in a Python, Qt or other
routine that is out of our control?
AFAIU this would be completely legitimate according to the unicode standard.
comment:36 by , 17 years ago
However, it only works for MICRO, not for OHM.
Strange...
No, it's not strange: MICRO/GREEK MU are defined "compatible", while OHM/OMEGA
are defined as "canonical equivalent".
NFKC/NFKD is "Compatibility Decomposition", so two compatible signs as MICRO/MU
are "normalized". NFD/NFC, in contrast, is "Canonical Decomposition", so the
compatible signs are left untouched, but the "canonical" (such as OHM/OMEGA)
are still normalized. I guess there's no normalization that leaves OHM/OMEGA
untouched.
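The difference between the two kinds of equivalence can be checked directly with Python's `unicodedata` module. The snippet below is an illustrative Python 3 check, not code from lyx2lyx; the helper `survives` is a hypothetical name.

```python
import unicodedata

MICRO, MU = "\u00b5", "\u03bc"    # MICRO SIGN, GREEK SMALL LETTER MU
OHM, OMEGA = "\u2126", "\u03a9"   # OHM SIGN, GREEK CAPITAL LETTER OMEGA

def survives(form, char):
    """Return True if normalizing with `form` leaves `char` unchanged."""
    return unicodedata.normalize(form, char) == char

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, "keeps MICRO:", survives(form, MICRO),
          "| keeps OHM:", survives(form, OHM))
```

MICRO SIGN survives the two canonical forms and is folded to GREEK SMALL LETTER MU only by the compatibility forms, while OHM SIGN is replaced by GREEK CAPITAL LETTER OMEGA under all four.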
comment:37 by , 17 years ago
No, it's not strange: MICRO/GREEK MU are defined "compatible", while OHM/OMEGA
are defined as "canonical equivalent".
NFKC/NFKD is "Compatibility Decomposition", so two compatible signs as MICRO/MU
are "normalized". NFD/NFC, in contrast, is "Canonical Decomposition", so the
compatible signs are left untouched, but the "canonical" (such as OHM/OMEGA)
are still normalized. I guess there's no normalization that leaves OHM/OMEGA
untouched.
Correct (see http://www.unicode.org/reports/tr15/)
That does also mean that NFC/NFD should be used instead of NFKC/NFKD, both
in lyx2lyx and LyX (normalize_kc should be replaced by normalize_c).
An easy way to prevent normalization of OHM to OMEGA would be a wrapper
around the normalization function: It could split the input in chunks,
taking the critical characters as boundaries, and then only pass the
chunks without the critical characters to the real normalizer.
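The chunking idea can be sketched with a regular-expression split, so that only the text between protected characters reaches the real normalizer. This is a Python 3 illustration under stated assumptions, not the wrapper that was eventually committed: the set of protected characters (`KEEP`, here only OHM SIGN) and the name `normalize_keeping` are hypothetical.

```python
import re
import unicodedata

# Assumption: only OHM SIGN is protected; a real list may be longer.
KEEP = "\u2126"
_BOUNDARY = re.compile("([" + re.escape(KEEP) + "])")

def normalize_keeping(form, text):
    """Normalize `text` with `form`, leaving the characters in KEEP untouched."""
    parts = _BOUNDARY.split(text)
    # With one capture group, odd indices hold the protected characters.
    return "".join(
        part if index % 2 else unicodedata.normalize(form, part)
        for index, part in enumerate(parts)
    )
```

One caveat of splitting at these boundaries is that a combining mark directly after a protected character is normalized as the start of the next chunk, so it is no longer treated as part of a sequence with that character.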
comment:38 by , 17 years ago
I rather suspect that 0x03bc and 0x00b5 are considered equal by the unicode
standard.
Not here on Windows: 0x03bc and 0x00b5 look different in Arial and Times New Roman. Furthermore, some
LaTeX fonts don't support Greek characters, only the 0x00b5 µ.
We cannot translate 0x03bc in unicodesymbols because then Greek users will see an ugly mu, since
\textmu uses a different font than the original Greek fonts.
by , 17 years ago
change normalization from compatibility to canonical decomposition
comment:39 by , 17 years ago
Severity: | blocker → critical |
---|
This patch does the change proposed in comment 40 and fixes the reversion of
the MICRO sign. The wrapper idea is not yet implemented, so OHM is still a
problem.
But the patch is the first step and makes the UserGuide processable. As this is
a major bug I propose to put the patch in and I replace the OHM sign in the docs
by \textohm until the wrapper is ready.
comment:40 by , 17 years ago
I propose to put the patch in and I replace the OHM sign in the docs
by \textohm until the wrapper is ready.
Wouldn't it be better to have "both ohms" in the unicodesymbols file (with command \textohm) until the
wrapper is ready? In this way you won't need to revise the User's Guide and you also have a solution that
works for all documents.
comment:41 by , 17 years ago
But the patch is the first step and makes the UserGuide processable. As
this is a major bug I propose to put the patch in and I replace the OHM
sign in the docs by \textohm until the wrapper is ready.
I would consider that an ugly hack, and I don't see the need for it:
Implementing that wrapper does not take longer than 'fixing' the userguide
and reverting that at a later stage. This is really easy text processing,
and it does not need to be very efficient either, since it is only called
on explicit user request.
comment:42 by , 17 years ago
Implementing that wrapper does not take longer than 'fixing' the userguide
and reverting that at a later stage.
If you are in the mood, I would appreciate if you could come up with a
prototype. I do not really have an idea how that wrapper should look like (but
I agree on your point about the procedure).
comment:43 by , 17 years ago
If you are in the mood, I would appreciate if you could come up with a
prototype. I do not really have an idea how that wrapper should look like
(but I agree on your point about the procedure).
In python it would look like
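import unicodedata

def normalize(text):
    keep_characters = [0x03bc,...]
    result = ''
    convert = ''
    for i in text:
        if ord(i) in keep_characters:
            # flush the pending chunk through the real normalizer first
            if len(convert) > 0:
                result = result + unicodedata.normalize('NFC', convert)
                convert = ''
            # copy the protected character unchanged
            result = result + i
        else:
            # collect characters that may be normalized
            convert = convert + i
    if len(convert) > 0:
        result = result + unicodedata.normalize('NFC', convert)
    return result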
This is untested, the call of unicodedata.normalize is probably wrong, and
maybe also the string concatenation, but I guess you get the idea.
Georg
comment:44 by , 17 years ago
blocked: | → 3976 |
---|
comment:45 by , 17 years ago
Keywords: | patch removed |
---|
The normalization change is in (rev. 18991), so the docs should no longer fail on
this.
The OHM/OMEGA issue is still open.
Perhaps someone with more Python knowledge wants to have a go at the wrapper.
comment:46 by , 17 years ago
Note that the error message can still occur even if the ohm/micro problem is
fixed, see #3976#c11 for details.
comment:47 by , 17 years ago
Severity: | critical → major |
---|
The docs no longer cause problems thanks to the fix in rev. 18991, so I am lowering the
severity to major.
comment:48 by , 17 years ago
blocked: | → 4021 |
---|
comment:49 by , 17 years ago
This one got worse again after revision 19113: both MU and OHM are replaced
by '???' in the reversion.
comment:50 by , 17 years ago
comment 51 is wrong: the mu is reverted correctly, only the OHM isn't, because
it is swapped with OMEGA. so the status of this bug remains the same as before
revision 19113.
comment:51 by , 17 years ago
attachments.isobsolete: | 0 → 1 |
---|
comment:52 by , 17 years ago
Milestone: | 1.5.0 → 1.5.1 |
---|
comment:53 by , 17 years ago
Cc: | removed |
---|
comment:54 by , 17 years ago
Milestone: | 1.5.1 → 1.5.2 |
---|
1.5.1 is an emergency release. Moving all to 1.5.2
comment:55 by , 17 years ago
Milestone: | 1.5.2 → 1.5.3 |
---|
1.5.2 is frozen. This bug has to be postponed to 1.5.3.
comment:57 by , 16 years ago
Milestone: | 1.5.4 → 1.5.5 |
---|
Soft freeze for 1.5.4: from now on, only critical bugs and regressions are
allowed.
comment:58 by , 16 years ago
Cc: | added |
---|---|
Keywords: | patch added |
Hello Georg,
could you please have a look at whether this Unicode lyx2lyx patch is correct?
#3313#c57
comment:59 by , 16 years ago
Keywords: | patch removed |
---|---|
Resolution: | → fixed |
Status: | reopened → closed |
the wrapper is in:
http://www.lyx.org/trac/changeset/23227
http://www.lyx.org/trac/changeset/23228
comment:60 by , 16 years ago
the wrapper is in:
You are too fast for me to follow :-) But it indeed looks correct.
Georg
comment:61 by , 16 years ago
You are too fast for me to follow :-)
José had a look at the patch.
But it indeed looks correct.
Thanks. Good to know.
LyX-testfile