Opened 17 years ago

Closed 16 years ago

Last modified 16 years ago

#3313 closed defect (fixed)

unicode error when invoking lyx2lyx

Reported by: Uwe Stöhr
Owned by: José Matos
Priority: high
Milestone: 1.5.5
Component: lyx2lyx
Version: 1.5.0svn
Severity: major
Keywords:
Cc: jamatos@…, j.spitzmueller@…, georg.baum@…, anek@…

Description

Try to export the attached LyX file to the LyX 1.3.x or 1.4.x format. The result is this
error:

An error occurred whilst running python -tt "C:/Program Files (x86)/LyX 1.5beta2-xx
Traceback (most recent call last):
  File "C:/Program Files (x86)/LyX 1.5beta2-xx-02-2007/Resources/lyx2lyx/lyx2lyx", line 101, in <module>
    sys.exit(main(sys.argv))
  File "C:/Program Files (x86)/LyX 1.5beta2-xx-02-2007/Resources/lyx2lyx/lyx2lyx", line 95, in main
    file.write()
  File "C:\Program Files (x86)\LyX 1.5beta2-xx-02-2007\Resources\lyx2lyx\LyX.py", line 274, in write
    self.output.write(line.encode(self.encoding)+"\n")
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2044' in position 3: ordinal not in range(256)
Error: Cannot convert file

Attachments (8)

test3.lyx (40.8 KB ) - added by Uwe Stöhr 17 years ago.
LyX-testfile
test2.lyx (1.8 KB ) - added by anek@… 17 years ago.
Better test file?
test_export6.lyx (1.0 KB ) - added by anek@… 17 years ago.
Test case for Lyx1.4 export
patch (5.2 KB ) - added by anek@… 17 years ago.
Patch previously posted to the lyx-devel list
revert_unicode_070530.txt (4.5 KB ) - added by anek@… 17 years ago.
Improved function to replace unicode to be included in lyx_1_5.py
new-patch (696 bytes ) - added by anek@… 17 years ago.
Fixes characters with diaeresis (like ä)
3313.diff (3.7 KB ) - added by Juergen Spitzmueller 17 years ago.
change normalization from compatibility to canonical decomposition
3313.2.diff (2.4 KB ) - added by Juergen Spitzmueller 16 years ago.
the wrapper

Change History (67)

by Uwe Stöhr, 17 years ago

Attachment: test3.lyx added

LyX-testfile

comment:1 by Uwe Stöhr, 17 years ago

Milestone: 1.5.0

comment:2 by Georg Baum, 17 years ago

That happens if a character cannot be encoded in the encoding of the old file
format. We can probably not do much here, only give a better error message.
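
The failure mode described here can be reproduced in isolation. The following is a minimal sketch in Python 3 (the ticket itself predates it); the helper name `unencodable_chars` is hypothetical and not part of lyx2lyx. It lists the characters of a line that a given target encoding cannot represent, which is the information a better error message would need:

```python
def unencodable_chars(text, encoding):
    """Return the characters of `text` that `encoding` cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


# U+2044 FRACTION SLASH is the character from the traceback in the description.
line = "1\u20442 cup"
for ch in unencodable_chars(line, "latin-1"):
    print("cannot encode U+%04X in latin-1" % ord(ch))
```

Checking character by character is slow for large files, but it only has to run once the normal encode call has already failed.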

comment:3 by anek@…, 17 years ago

The case of an en-dash gives an error:
...

return codecs.charmap_encode(input,errors,encoding_map)

UnicodeEncodeError: 'charmap' codec can't encode character u'\u2013' in position 32: character maps to <undefined>
Error: Cannot convert file

Would it be possible to catch this error (and the one for an em-dash) and convert to -- (---)?
For viewing and PostScript export, the problem was solved in #3565.
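
The replacement asked about here could look roughly like the sketch below (Python 3). The names `DASHES` and `revert_dashes` are hypothetical, and the sketch assumes that plain `--` and `---` are an acceptable stand-in for the two dash characters in the older file format:

```python
# Hypothetical mapping from Unicode dashes to their ASCII spellings.
DASHES = {
    "\u2013": "--",   # EN DASH
    "\u2014": "---",  # EM DASH
}


def revert_dashes(line):
    """Replace en and em dashes so the line can be written in a legacy encoding."""
    for char, replacement in DASHES.items():
        line = line.replace(char, replacement)
    return line


print(revert_dashes("pages 3\u20135"))
```

After the replacement the line contains only ASCII in place of the dashes, so encoding it with a legacy codec no longer raises for these two characters.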

comment:4 by Georg Baum, 17 years ago

Yes. See revert_misc_unicode() in
http://www.lyx.org/trac/browser/lyx-devel/branches/personal/baum/playground/lib/lyx2lyx/lyx_1_5.py
for a start. There are several unicode characters that can be converted to LyX
insets (or, more trivially, to simple characters as in the dash case).

comment:5 by anek@…, 17 years ago

I tried to make a function based on Georg's code. This seems to work.
I have never worked with any part of the LyX code, and never with unicode, so the code may be very
ugly, but it works (at least for me).
Some problems:

  • Relative paths (to unicodesymbols) don't work for me (perhaps because I presume the wrong parent
    directory). How do I get around that?
  • I'm not sure where in the conversion hub to call the function. Calling it at convert 246 and
    revert 270 works for me, but this was a random choice.

Further, I have not implemented the case of preamble flags. It shouldn't be difficult, but I don't know
where in the document.header to put them and I don't know if there are other things that need to be
considered (e.g. document class). The current code captures most "simple cases" like registered trade
mark, en dash, e acute, etc.

So here's the code. Please feel free to slash it. Btw is there a style guide for LyX Python scripts?

def revert_unicode(document):
    # Transform unicode symbols according to the unicode list
    # Absolute path used. Should be changed.
    fp = open('/Users/anek/LyX_dev/nightly/LyX_May_25.app/Contents/Resources/unicodesymbols', 'r')
    spec_chars = {}
    for line in fp.readlines():
        if line[0] != '#':
            line = line.replace('"', '')  # remove all quotation marks
            try:
                # flag1 and flag2 are preamble flags NOTE! Not implemented
                # flag1=# -> no flags, flag2=# -> only flag 1
                [ucs4, command, flag1, flag2] = line.split(None, 3)
                ert_intro = '\n\n\\begin_inset ERT\nstatus collapsed\n\\begin_layout Standard\n\\backslash\n'
                ert_outro = '\n\\end_layout\n\\end_inset\n\n'
                # Commands in unicodesymbols start with a doubled backslash
                if command[0:2] == '\\\\':
                    command = command.replace('\\\\', ert_intro)
                    command = command + ert_outro
                spec_chars[unichr(eval(ucs4))] = [command, flag1, flag2]
            except:
                pass
    fp.close()
    # Replace every listed character in the body by its command
    for i in range(len(document.body)):
        for j in range(len(document.body[i])):
            if spec_chars.has_key(document.body[i][j]):
                document.body[i] = document.body[i][0:j] + spec_chars[document.body[i][j]][0] + document.body[i][j+1:]

comment:6 by anek@…, 17 years ago

Forgot to mention that I put the function in lyx_1_5.py

comment:7 by Georg Baum, 17 years ago

This is even more than I had in mind. Some notes:

  • You can find python code to read the unicodesymbols file also in
    development/tools/unicodesymbols.py
  • This function should only be called for revert, not for convert. The best
    place is at the 249->248 step.
  • In general, it needs some safety checks added, and it needs to behave
    differently if the character is already in ERT.
  • Some characters could be converted to native LyX insets, e.g. the different
    quotes. That would have the advantage that a round trip would not add ERT.

If you want to have this included, get José to have a look at the remaining
problems.
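
One of the safety checks mentioned above concerns how the unicodesymbols file is read. The following sketch (Python 3) parses a single entry without `eval`. It assumes the layout quoted elsewhere in this ticket: a hexadecimal code point followed by up to three double-quoted fields, with `#` starting a comment and backslashes doubled inside the quotes. The function name `parse_unicodesymbols_line` is hypothetical; this is not the code in development/tools/unicodesymbols.py.

```python
import shlex


def parse_unicodesymbols_line(line):
    """Return (character, command, preamble, flags), or None for blank/comment lines."""
    # comments=True drops everything after '#'; POSIX quoting rules turn the
    # doubled backslash inside double quotes into a single one.
    fields = shlex.split(line, comments=True)
    if len(fields) < 2:
        return None
    codepoint = int(fields[0], 16)  # accepts the "0x" prefix
    command = fields[1]
    preamble = fields[2] if len(fields) > 2 else ""
    flags = fields[3] if len(fields) > 3 else ""
    return chr(codepoint), command, preamble, flags


sample = r'0x00b5 "\\textmu" "textcomp" "force" # MICRO SIGN'
print(parse_unicodesymbols_line(sample))
```

Using `int(..., 16)` instead of `eval` means a malformed first field raises a `ValueError` rather than executing arbitrary text from the file.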

by anek@…, 17 years ago

Attachment: test_export6.lyx added

Test case for Lyx1.4 export

comment:8 by anek@…, 17 years ago

attachments.isobsolete: 0 → 1

comment:9 by Uwe Stöhr, 17 years ago

Cc: anek@… added
Keywords: patch added

--> (http://bugzilla.lyx.org/attachment.cgi?id=1846&action=view)

Improved function to replace unicode to be included in lyx_1_5.py

Please send it to the lyx-devel list so that it can be approved by other
developers and then be committed to the LyX sources.

by anek@…, 17 years ago

Attachment: revert_unicode_070530.txt added

Patch previously posted to the lyx-devel list

comment:10 by anek@…, 17 years ago

attachments.isobsolete: 0 → 1

comment:11 by ablage_p@…, 17 years ago

hello, (i'm no expert)
i had this problem too. i wanted to convert from LyX 1.5.0rc1 lyxformat 271 to
lyxformat 245. in my file i had quote signs in a footnote. after i removed them
manually the conversion worked.

comment:12 by Uwe Stöhr, 17 years ago

Cc: jamatos@… added

Anders, José, what about the patch now?

comment:13 by anek@…, 17 years ago

The status is that the current patch works and has been stable for me, but since I am not
familiar with the LyX document format there may of course be mistakes. I tested for such
mistakes as best I could.
I didn't get any response on whether to add commands to the preamble, so I haven't done anything there.
This means that the conversion to 1.4 will work, but LyX will complain about missing packages for some of
the special characters. Still, I think this is better than the current failure of the conversion.
Finally, for a unicode character not in the unicodesymbols list, the conversion will fail as before.

comment:14 by anek@…, 17 years ago

Is someone working on something better here?
Otherwise I would suggest that the patch goes in (unless someone found errors in the code) so that there
is a little time for feedback before RC2 (and to make life a little bit easier for us working with colleagues
who use 1.4 ;-)

comment:15 by Uwe Stöhr, 17 years ago

Anders, José applied your patch:
http://www.lyx.org/trac/changeset/18890

Can I mark this bug as fixed or are there some issues left?

by anek@…, 17 years ago

Attachment: new-patch added

Fixes characters with diaeresis (like ä)

comment:16 by Uwe Stöhr, 17 years ago

is it possible to change the revised name of the translated document to
xxx_14.lyx (instead of xxx.lyx14). The current version requires a manual
renaming on the Mac, which is a mess.

Please open a new bug report for this.

Tested the patched LyX and found a bug I should have seen (since I am Swedish I
should have tested for ä and ö...).

Please send this patch to the lyx devel-list.

the test case I submitted to this bug has problems in LyX1.5. It complains
that some characters are not representable in the chosen encoding.

The problem is that you used the character 2264 "≤" in a formula. But there you
must create this character using the command "\le" or by using the math toolbar.

comment:17 by anek@…, 17 years ago

Please send this patch to the lyx devel-list.

Done

Please open a new bug report for this.

Done -- #3934

The problem is that you used the character 2264 "≤" in a formula. But there you
must create this character using the command "\le" or by using the math toolbar.

Yes, that was the point. I copied the character from a unicode text. Since it pastes, you would assume
that the character is converted to a command in the process when it is converted to a pdf (as it is when
it is converted to LyX1.4). In other words, the end-user would expect anything that is possible to paste
to come out in print.

comment:18 by Uwe Stöhr, 17 years ago

Subject: Re: unicode error when invoking lyx2lyx

In other words, the end-user would expect anything that is possible to paste to come out in print.

I see, so another bug report is needed for this. Could you do this please?

comment:19 by Uwe Stöhr, 17 years ago

Resolution: fixed
Status: new → closed

The last part to fix this is in:
http://www.lyx.org/trac/changeset/18918

comment:20 by Uwe Stöhr, 17 years ago

Resolution: fixed
Severity: normal → blocker
Status: closed → reopened

Not fixed: Open the UserGuide and try to export it to LyX 1.4.x format:

  File "C:\Program Files (x86)\LyX 1.5rc2-27-05-2007\Resources\lyx2lyx\LyX.py", line 278, in write
    self.output.write(line.encode(self.encoding)+"\n")
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u03bc' in position 0: ordinal not in range(256)
Error: Die Datei kann nicht konvertiert werden [Cannot convert file]

I expect this should be easy to fix, Anders?

A blocker for LyX 1.5.0.

comment:21 by anek@…, 17 years ago

Does it get as far as revert_unicode before crashing? I get this in the console:

  File "/Users/anek/Desktop/LyX_15rc2.app/Contents/Resources/lyx2lyx/lyx2lyx", line 101, in ?
    sys.exit(main(sys.argv))
  File "/Users/anek/Desktop/LyX_15rc2.app/Contents/Resources/lyx2lyx/lyx2lyx", line 95, in main
    file.write()
  File "/Users/anek/Desktop/LyX_15rc2.app/Contents/Resources/lyx2lyx/LyX.py", line 278, in write
    self.output.write(line.encode(self.encoding)+"\n")
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u03bc' in position 0: ordinal not in range(256)
Error: Cannot convert file

If it gets to the revert_unicode function, the conversion will fail if the symbol is not in the
unicodesymbols list.
A search for \u03bc gives
<!ENTITY mu CDATA "&#956;" -- greek small letter mu, U03BC ISOgrk3 -->
The closest I find in the unicodesymbols list is
0x00b5 "\\textmu" "textcomp" "force" # µ MICRO SIGN
which I guess is not the same (as you understand, I am far from an expert in unicode). So I guess this is
why the script failed.
There was a discussion about Greek characters on the list before, and if I remember correctly,
the consensus was that these should not be added to the unicodesymbols list. Instead the text should be
marked as Greek.
So, provided the error is what I am guessing here, I see three options: expand the unicodesymbols list
(a lot of work); mark the relevant text section as Greek or input it as a symbol (should be easy); or add a
fail-safe check in lyx2lyx that translates remaining untranslated unicode symbols to question marks
or similar (preferable I guess, but probably tricky since I guess the implementation will depend on
the encoding to which the document is translated).
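
For the third option, Python's codec error handlers may already cover the encoding-dependent part. A minimal sketch in Python 3 follows; the function name `encode_with_fallback` is hypothetical, and this is not necessarily what lyx2lyx ended up doing:

```python
def encode_with_fallback(line, encoding):
    """Encode `line`, substituting '?' for anything `encoding` cannot represent."""
    # errors="replace" is a standard codec error handler; it works the same
    # way for every target encoding.
    return line.encode(encoding, errors="replace")


print(encode_with_fallback("5 \u03bcm", "latin-1"))  # b'5 ?m'
```

The substitution is lossy and silent, so a converter using it would probably still want to warn the user about which characters were replaced.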

comment:22 by anek@…, 17 years ago

The crash is caused by the µ in table 3 in section 6.8.3.

comment:23 by Juergen Spitzmueller, 17 years ago

Cc: j.spitzmueller@… added

The problematic character *is* the correct µ symbol, which is in the
unicodesymbols list and does get output as \textmu in LaTeX. So something else
must go wrong.
BTW it doesn't crash on Linux (only output an error message).

comment:24 by anek@…, 17 years ago

Sorry for my vagueness (it was quite late...). LyX doesn't crash on the Mac either. The conversion script
fails and results in an error message.
Removing µ resolves this. And I can replicate the error by just creating a new document and pasting (or
typing) µ. I can't understand why this is. Taking a similar character like ± the conversion works (as
known, I need to add \usepackage{textcomp} manually to the Preamble to get any pdf output).
What I wonder about is the difference between the character we have:
http://www.fileformat.info/info/unicode/char/03bc/index.htm
and the micro sign
http://www.fileformat.info/info/unicode/char/00b5/index.htm
The unicodesymbols list should be for the latter.
Adding
0x03bc "\\textmu" "textcomp" "force" # µ MICRO SIGN
to the unicodesymbols list solves the problem.

The document still won't open in LyX1.4 though (worse, it gets into an eternal loop...).
I think this could be related to
#3404
although in this case the console reports some missing \end_layout (and not too many). There's no
mention of unicode errors in the error message.

So my suggestion is to add an entry for 0x03bc to the unicodesymbols (I don't dare to put a patch with
the line above since I don't know if that is the correct command to use).

comment:25 by Uwe Stöhr, 17 years ago

What I wonder about is the difference between the character we have:
http://www.fileformat.info/info/unicode/char/03bc/index.htm
and the micro sign
http://www.fileformat.info/info/unicode/char/00b5/index.htm

OK, this is the bug: "03bc" is the Greek character, not the µ character for e.g. "µm". I'll correct
this in the manuals.

The unicodesymbols list should be for the latter.
Adding
0x03bc "\\textmu" "textcomp" "force" # µ MICRO SIGN
to the unicodesymbols list solves the problem.

We agreed not to translate Greek characters. 03bc looks btw. a bit different than "00b5".

Thanks for investigating.

comment:26 by Juergen Spitzmueller, 17 years ago

OK, this is the bug: "03bc" is the Greek character, not the µ character for
e.g. "µm". I'll correct this in the manuals.

This does not fix it for me. As I wrote in comment 23: this is already the
correct character, but it somehow gets interpreted as the Greek mu.

Testcase:

  • new document
  • M-x unicode-insert 0x00b5
  • export to 1.4

=> same error.

comment:27 by anek@…, 17 years ago

This does not fix it for me.

Not even if you modify the unicodesymbols file? (OK, it's technically not a fix, but does that make the
conversion work?)

As I wrote in comment 23: this is already the correct character,
but it somehow gets interpreted as the Greek mu.

So somewhere between the input and the call to revert_unicode the µ character gets re-interpreted.
I guess this should be filed as a new bug.

comment:28 by Juergen Spitzmueller, 17 years ago

Not even if you modify the unicodesymbols file?

I didn't try. But this does only hide the bug anyway.

So somewhere between the input and the call to revert_unicode the µ character
gets re-interpreted.

Looks like it.

I guess this should be filed as a new bug.

No. I think this is the bug.

comment:29 by Juergen Spitzmueller, 17 years ago

Cc: georg.baum@… added

I'm pretty sure this is a python bug:
MICRO is misread as '\u03bc' (GREEK SMALL LETTER MU). The same error occurs for
OHM, which is parsed as '\u03a9' (GREEK CAPITAL LETTER OMEGA).

I don't know what we can do about that. Georg?

comment:30 by Georg Baum, 17 years ago

The character is changed in revert_accent, probably by the call of
unicodedata.normalize. I don't think that this is special python behaviour. I
rather suspect that 0x03bc and 0x00b5 are considered equal by the unicode
standard.
If that is the case then they should not produce different output in LyX
either.
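
The suspicion about `unicodedata.normalize` can be checked directly against the Unicode character database that ships with Python. A minimal sketch in Python 3; the helper name `describe` is hypothetical:

```python
import unicodedata


def describe(ch):
    """Return (code point, Unicode name, decomposition mapping) for one character."""
    return ("U+%04X" % ord(ch), unicodedata.name(ch), unicodedata.decomposition(ch))


print(describe("\u00b5"))  # MICRO SIGN carries a <compat> mapping to 03BC
print(describe("\u03bc"))  # GREEK SMALL LETTER MU has no decomposition

# Compatibility normalization therefore rewrites MICRO SIGN as GREEK MU.
print(unicodedata.normalize("NFKD", "\u00b5") == "\u03bc")
```

The `<compat>` tag means the two characters are compatibility equivalents, which is weaker than being the same character.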

comment:31 by anek@…, 17 years ago

I rather suspect that 0x03bc and 0x00b5 are considered equal by
the unicode standard.

So it seems, see:
http://www.cs.tut.fi/~jkorpela/chars/si.html
(search for micro)

If that is the case then they should not produce different output in
LyX either.

Then I guess the easiest fix is to add the compatibility characters to the unicode symbols list with
identical commands?

comment:32 by Juergen Spitzmueller, 17 years ago

Then I guess the easiest fix is to add the compatibility characters to the
unicode symbols list with
identical commands?

No, I still think we should distinguish the unit symbols from the greek
characters, if possible.

comment:33 by anek@…, 17 years ago

No, I still think we should distinguish the unit symbols from the

greek characters, if possible.

Replacing
unicodedata.normalize("NFKD", ...
with
unicodedata.normalize("NFD", ...
and
unicodedata.normalize("NFKC", ...
with
unicodedata.normalize("NFC", ...
seems to do the trick, though I have no idea if this causes unwanted side-effects?

comment:34 by Juergen Spitzmueller, 17 years ago

I just found that out as well. However, it only works for MICRO, not for OHM.

comment:35 by anek@…, 17 years ago

However, it only works for MICRO, not for OHM.

Strange...
I guess this hints at a key question:
Can we be sure that (now or in the future) no conversion takes place anywhere in a Python, QT or other
routine that is out of our control?
AFAIU this would be completely legitimate according to the unicode standard.

comment:36 by Juergen Spitzmueller, 17 years ago

However, it only works for MICRO, not for OHM.

Strange...

No, it's not strange: MICRO/GREEK MU are defined "compatible", while OHM/OMEGA
are defined as "canonical equivalent".
NFKC/NFKD is "Compatibility Decomposition", so two compatible signs as MICRO/MU
are "normalized". NFD/NFC, in contrast, is "Canonical Decomposition", so the
compatible signs are left untouched, but the "canonical" (such as OHM/OMEGA)
are still normalized. I guess there's no normalization that leaves OHM/OMEGA
untouched.
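
The difference between the two cases can be tabulated with a short Python 3 sketch; the helper name `survives` is hypothetical:

```python
import unicodedata

MICRO = "\u00b5"  # MICRO SIGN, compatibility equivalent of GREEK SMALL LETTER MU
OHM = "\u2126"    # OHM SIGN, canonical equivalent of GREEK CAPITAL LETTER OMEGA


def survives(ch):
    """Map each normalization form to True if it leaves `ch` unchanged."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return {form: unicodedata.normalize(form, ch) == ch for form in forms}


print("MICRO", survives(MICRO))
print("OHM  ", survives(OHM))
```

MICRO survives the canonical forms (NFC, NFD) but not the compatibility forms, while OHM survives none of the four, which matches the observation that switching to NFC/NFD only fixes MICRO.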

comment:37 by Georg Baum, 17 years ago

No, it's not strange: MICRO/GREEK MU are defined "compatible", while OHM/OMEGA
are defined as "canonical equivalent".
NFKC/NFKD is "Compatibility Decomposition", so two compatible signs as MICRO/MU
are "normalized". NFD/NFC, in contrast, is "Canonical Decomposition", so the
compatible signs are left untouched, but the "canonical" (such as OHM/OMEGA)
are still normalized. I guess there's no normalization that leaves OHM/OMEGA
untouched.

Correct (see http://www.unicode.org/reports/tr15/)

That does also mean that NFC/NFD should be used instead of NFKC/NFKD, both
in lyx2lyx and LyX (normalize_kc should be replaced by normalize_c).
An easy way to prevent normalization of OHM to OMEGA would be a wrapper
around the normalization function: It could split the input in chunks,
taking the critical characters as boundaries, and then only pass the
chunks without the critical characters to the real normalizer.

comment:38 by Uwe Stöhr, 17 years ago

I rather suspect that 0x03bc and 0x00b5 are considered equal by the unicode
standard.

Not here on Windows: 0x03bc and 0x00b5 look different in Arial and Times New Roman. Furthermore, some
LaTeX fonts don't support Greek characters, only the 0x00b5 µ.
We cannot translate 0x03bc in unicodesymbols because then Greek users will see an ugly mu, since
\textmu uses a different font than the original Greek fonts.

by Juergen Spitzmueller, 17 years ago

Attachment: 3313.diff added

change normalization from compatibility to canonical decomposition

comment:39 by Uwe Stöhr, 17 years ago

Severity: blocker → critical

This patch does the change proposed in comment 37 and fixes the reversion of
the MICRO sign. The wrapper idea is not yet implemented, so OHM is still a
problem.

But the patch is the first step and makes the UserGuide processable. As this is
a major bug I propose to put the patch in and I replace the OHM sign in the docs
by \textohm until the wrapper is ready.

comment:40 by anek@…, 17 years ago

I propose to put the patch in and I replace the OHM sign in the docs
by \textohm until the wrapper is ready.

Wouldn't it be better to have "both ohms" in the unicodesymbols file (with command \textohm) until the
wrapper is ready? In this way you won't need to revise the User's Guide and you also have a solution that
works for all documents.

comment:41 by Georg Baum, 17 years ago

But the patch is the first step and makes the UserGuide processable. As
this is a major bug I propose to put the patch in and I replace the OHM
sign in the docs by \textohm until the wrapper is ready.

I would consider that an ugly hack, and I don't see the need for it:
Implementing that wrapper does not take longer than 'fixing' the userguide
and reverting that at a later stage. This is really easy text processing,
and it does not need to be very efficient either, since it is only called
on explicit user request.

comment:42 by Juergen Spitzmueller, 17 years ago

Implementing that wrapper does not take longer than 'fixing' the userguide
and reverting that at a later stage.

If you are in the mood, I would appreciate it if you could come up with a
prototype. I do not really have an idea what that wrapper should look like (but
I agree with your point about the procedure).

comment:43 by Georg Baum, 17 years ago


If you are in the mood, I would appreciate it if you could come up with a
prototype. I do not really have an idea what that wrapper should look like
(but I agree with your point about the procedure).

In python it would look like

def normalize(text):
    keep_characters = [0x03bc,...]
    result = ''
    convert = ''
    for i in text:
        if ord(i) in keep_characters:
            if len(convert) > 0:
                result = result + unicodedata.normalize(convert, NFC)
                convert = ''
            result = result + i
    if len(convert) > 0:
        result = result + unicodedata.normalize(convert, NFC)
    return result

This is untested, the call of unicodedata.normalize is probably wrong, and
maybe also the string concatenation, but I guess you get the idea.

Georg
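
A runnable variant of the chunking idea, as a sketch in Python 3 and not the wrapper that was eventually committed. The names `KEEP` and `normalize_keeping` are hypothetical, and the set of protected characters (here only OHM SIGN) is an assumption. Note that `unicodedata.normalize` takes the form name as a string in its first argument:

```python
import re
import unicodedata

# Assumption: characters that must pass through normalization untouched.
KEEP = "\u2126"  # OHM SIGN


def normalize_keeping(text, form="NFC", keep=KEEP):
    """Normalize `text` chunk by chunk, leaving the characters in `keep` as they are."""
    if not keep:
        return unicodedata.normalize(form, text)
    # With a capturing group, re.split returns the ordinary chunks at even
    # indices and the protected characters themselves at odd indices.
    pieces = re.split("([" + re.escape(keep) + "])", text)
    return "".join(
        piece if index % 2 else unicodedata.normalize(form, piece)
        for index, piece in enumerate(pieces)
    )


text = "10 k\u2126, a\u0308"  # OHM SIGN, then 'a' + COMBINING DIAERESIS
print(normalize_keeping(text) == "10 k\u2126, \u00e4")
```

One consequence of splitting at the protected characters is that a combining mark placed directly after one of them is normalized on its own and not together with its base character.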

comment:44 by Uwe Stöhr, 17 years ago

blocked: 3976

comment:45 by Juergen Spitzmueller, 17 years ago

Keywords: patch removed

The normalization change is in (rev. 18991), so the docs should no longer fail on
this.

The OHM/OMEGA issue is still open.

Perhaps someone with more python knowledge wants to have a go at the wrapper.

comment:46 by Georg Baum, 17 years ago

Note that the error message can still occur even if the ohm/micro problem is
fixed, see #3976#c11 for details.

comment:47 by Uwe Stöhr, 17 years ago

Severity: critical → major

The docs no longer cause problems thanks to the fix in rev. 18991, so I am setting the severity
down to major.

comment:48 by Juergen Spitzmueller, 17 years ago

blocked: 4021

comment:49 by Juergen Spitzmueller, 17 years ago

This one got worse again after revision 19113: both MU and OHM are replaced
by '???' in the reversion.

comment:50 by Juergen Spitzmueller, 17 years ago

Comment 49 is wrong: the mu is reverted correctly, only the OHM isn't, because
it is swapped with OMEGA. So the status of this bug remains the same as before
revision 19113.

comment:51 by Uwe Stöhr, 17 years ago

attachments.isobsolete: 0 → 1

comment:52 by Juergen Spitzmueller, 17 years ago

Milestone: 1.5.0 → 1.5.1

comment:53 by Georg Baum, 17 years ago

Cc: georg.baum@… removed

comment:54 by lasgouttes, 17 years ago

Milestone: 1.5.1 → 1.5.2

1.5.1 is an emergency release. Moving all to 1.5.2.

comment:55 by Juergen Spitzmueller, 17 years ago

Milestone: 1.5.2 → 1.5.3

1.5.2 is frozen. This bug has to be postponed to 1.5.3.

comment:56 by Juergen Spitzmueller, 16 years ago

Milestone: 1.5.3 → 1.5.4

1.5.3 is released.

comment:57 by Juergen Spitzmueller, 16 years ago

Milestone: 1.5.4 → 1.5.5

Soft freeze for 1.5.4: from now on, only critical bugs and regressions are
allowed.

by Juergen Spitzmueller, 16 years ago

Attachment: 3313.2.diff added

the wrapper

comment:58 by Uwe Stöhr, 16 years ago

Cc: georg.baum@… added
Keywords: patch added

Hello Georg,
could you please have a look at whether this Unicode lyx2lyx patch is correct?
#3313#c57

comment:59 by Juergen Spitzmueller, 16 years ago

Keywords: patch removed
Resolution: fixed
Status: reopened → closed

comment:60 by Georg Baum, 16 years ago


the wrapper is in:

You are too fast for me to follow :-) But it indeed looks correct.

Georg

comment:61 by Juergen Spitzmueller, 16 years ago

You are too fast for me to follow :-)

José had a look at the patch.

But it indeed looks correct.

Thanks. Good to know.
