Opened 2 years ago

Closed 2 years ago

#1487801 closed Bugs (fixed)

iconv/mbstring encoding conversion issues

Reported by: mrjoel Owned by:
Priority: 5 Milestone: 0.6-beta
Component: Core functionality Version: 0.5.1
Severity: normal Keywords:
Cc:

Description

I have received several emails over the last year which triggered errors in character set conversion, so I'm reporting the issue which occurred again with the 0.5.1 release. Ticket 1486375 looks like it was addressing a very similar issue - I'm not sure if it's the same or close, so I'm opening the new ticket instead of reopening the old one.

The error message received is:

PHP Error: Could not convert string from GBK to UTF-8. Make sure iconv/mbstring is installed or lib/utf8.class is available. in (...)/roundcubemail-0.5.1/program/include/main.inc on line 309

This is running using Debian Lenny PHP packages (libapache2-mod-php5 version 5.2.6.dfsg.1-1+lenny9) which has iconv and mbstring enabled (checked with phpinfo()).

The biggest issue this causes isn't being unable to read the message itself, but that it generates a "Server Error! (OK)" message and ends up blocking access to all other messages in that view window.

Attachments (4)

BrokenMessage (3.5 KB) - added by mrjoel 2 years ago.
Sanitized email which triggers the error
BrokenMessage2 (3.3 KB) - added by mrjoel 2 years ago.
multipart-mime-encoding-patch (5.3 KB) - added by mrjoel 2 years ago.
Patch for spanned mime encoded parts
JumboEncodingTestMessage (1.6 KB) - added by mrjoel 2 years ago.
Test message for spanned MIME encoded parts

Download all attachments as: .zip

Change History (22)

Changed 2 years ago by mrjoel

Sanitized email which triggers the error

comment:1 Changed 2 years ago by alec

  • Milestone changed from later to 0.6-beta

"Server Error! (OK)" is triggered only in Internet Explorer. All probably, because the message contains some characters which doesn't match GBK charset. We are using rc_utf8_clean function to clean up such characters, but maybe there's a bug.

comment:2 Changed 2 years ago by mrjoel

If the "Server Error! (OK)" is expected to only appear in IE, then there's a bug. The above message was just tested with Firefox 4.0 betas, but has occurred in Firefox 3.6, IE, and Safari.

comment:3 Changed 2 years ago by alec

Hmm... works for me using FF 3.6.13. Are you sure, you're using 0.5.1?

comment:4 Changed 2 years ago by mrjoel

Yep, I'm using 0.5.1 (the non-GPL package so I don't have to pull the other dependencies separately). I've had the issue recur several times (from 0.4 or so at least), but am just now getting around to reporting it. I also just purged the Maildir folder on the IMAP side and put the attached message as the only mail in the Maildir to ensure it was attached properly and got the error.

Sounds like an environmental difference - aside from the PHP version and method (mod_php), confirming that iconv and mbstring are enabled via phpinfo(), and RC 0.5.1, what other factors should I provide? Is there a simpler stand-alone test I can run in my environment?

comment:5 Changed 2 years ago by alec

It works for me with IE too. I'm using PHP 5.3.5. You could make tests with disabled mbstring first and then with disabled both mbstring and iconv.

comment:6 Changed 2 years ago by mrjoel

I've updated to PHP 5.3.3 (Debian package 5.3.3-7), and confirmed that the issue is still occurring with the base package as well as with PHP rebuilt to exclude iconv and mbstring (test 1), as well as mbstring excluded and iconv included (test 2).

Doing some further testing, I added statements in ./program/include/main.c to capture when the rcube_error_handler was triggered, and have two different instances:

(errno: 8, errstr: iconv(): Detected an incomplete multibyte character in input string).

(errno: 8, errstr: iconv(): Detected an illegal character in input string)

I got the same errors with and without mbstring enabled (obviously iconv enabled).

The encoding being triggered (the $from parameter to rcube_charset_convert) is indeed GB2312, as expected from the message. It is also not matching anything from mbstring, and falling through to the error reporting at the end of the function.

Further checking (with logs of debug_backtrace() inserted) shows that it's failing converting the MIME string (decode_mime_string -> _decode_mime_string_part -> rcube_charset_convert).

I'm not familiar with the call structure, but in addition to resolving the charset decoding issue, is there something that can be done so that failure decoding a single message doesn't make the entire mailbox unviewable?
For reference (mine as much as anything), the base64 decoded bytes of the message in hexdump are:

00000000  30 37 34 35 33 36 37 33  30 31 31 38 38 33 35 34  |0745367301188354|
00000010  36 cd c6 bc f6 d2 bb bf  ee c3 e2 b7 d1 b5 e7 ca  |6...............|
00000020  d3 b2 fa c6 b7 20 d6 d5  c9 ed c3 e2 b7           |..... .......|

Changed 2 years ago by mrjoel

comment:7 Changed 2 years ago by mrjoel

After posting the above hex dump I realized it was from a different message that also triggers the error, so it's been attached as BrokenMessage2. Base 64 decoding of the first part of the MIME encoded subject (MDc0NTM2NzMwMTE4ODM1NDbNxrz20ru/7sPit9G158rTsvrGtyDW1cntw+K3) is what gives the above hex dump (and is what iconv was throwing the error on based on the stacktrace argument values).

According to Wikipedia (https://secure.wikimedia.org/wikipedia/en/wiki/GB_2312), the EUC-CN encoding of the GB2312 character set uses "Two bytes ... to represent every character not found in ASCII". If that's the case, then the odd number of high-bytes does indeed seem to indicate that the last character should be considered incomplete, since it would start with the 0xb7, but doesn't have a second byte for that character (there is a stray 0x20 in there as a single byte character, at least appears correctly aligned on two byte character boundary). It's late and I'm not too savvy on character encodings so I could be off on all of this, but it looks like the message as delivered has an incorrect encoding (intentional or not, I'm not to say).

That said, it seems to me even more that RC should have some means of indicating a single message as "unable to decode" (whether a faulty message or coding error) or some such, and continue processing, as opposed to allowed what easily could be a message with intentionally broken encoding from stopping display of all mail.

comment:8 Changed 2 years ago by mrjoel

Ok, in looking further, it looks like the original MUA base64 encoded the string, then broke the string into multiple segments (two for the subject in the provided messages) which happened to not be character aligned in the original encoding, so when decode_mime_string calls _decode_mime_string_part it only ends up with a single part which is misaligned. However, when I concatenate the base64 parts of each of the two subject parts, then decode and convert, it works correctly (odd with the 07453673011883546 sequence at the beginning and end, but hey, it's spam).

Looking it up, this is allowed, since RFC 2047 states that

An 'encoded-word' may not be more than 75 characters long, including
'charset', 'encoding', 'encoded-text', and delimiters.  If it is
desirable to encode more text than will fit in an 'encoded-word' of
75 characters, multiple 'encoded-word's (separated by CRLF SPACE) may
be used.

It only references the 'encoded-word' portion, which is the base64-encoded text. I haven't found anything (quick search only) that indicates that it needs to align with the original encoding, which is the assumption that RoundCube seems to be making.

So, I suggest making two changes (may be useful to split the ticket once confirmed):

  1. Modify decode_mime_string to not have a separate call to _decode_mime_string_part (and thus the potential for broken iconv calls) for each segment, but rather append all parts together and then call rcube_charset_convert on the entire part. I'll send a patch for this in the next few days.
  1. Modify the handling around decoding/formatting so that the decoding of a single message doesn't prevent other correct messages from being displayed in that view. Otherwise (even with 1. fixed), I could send an intentionally broken message (e.g. only the first part of these messages) and prevent a user from seeing other messages. I'm not as familiar with the message handling on this, but will try and take a look at something for this as well.

I'm still missing how the message is working in your environments though, what is displayed for the message line subject in the browser? What does execution of the following commands provide for the subject strings? For me, iconv complains: "iconv: incomplete character or shift sequence at end of buffer".

echo -n "MDc0NTM2NzMwMTE4ODM1NDbNxrz20ru/7sPit9G158rTsvrGtyDW1cntw+K3" | base64 -d - | iconv -f GB2312

comment:9 Changed 2 years ago by mrjoel

And reading further in RFC 2047, the MUA behavior for breaking it across boundaries is clearly broken:

Each 'encoded-word' MUST represent an integral number of characters.
A multi-octet character may not be split across adjacent 'encoded-
word's.

So it's broken, but I still suggest handling this case, even if for "be liberal in what you accept".

The relevant RFC section for what to do when RC can't handle the message is:

If the mail reader does not support the character set used, it may
(a) display the 'encoded-word' as ordinary text (i.e., as it appears
in the header), (b) make a "best effort" to display using such
characters as are available, or (c) substitute an appropriate message
indicating that the decoded text could not be displayed.

comment:10 Changed 2 years ago by mrjoel

Actually, the more appropriate section from the RFC section for what to do when
RoundCube can't handle the message is (case 2 being applicable for the two attached messages):

It is possible that an 'encoded-word' that is legal according to the
syntax defined in section 2, is incorrectly formed according to the
rules for the encoding being used.   For example:

(1) An 'encoded-word' which contains characters which are not legal
    for a particular encoding (for example, a "-" in the "B"
    encoding, or a SPACE or HTAB in either the "B" or "Q" encoding),
    is incorrectly formed.

(2) Any 'encoded-word' which encodes a non-integral number of
    characters or octets is incorrectly formed.

A mail reader need not attempt to display the text associated with an
'encoded-word' that is incorrectly formed.  However, a mail reader
MUST NOT prevent the display or handling of a message because an
'encoded-word' is incorrectly formed.

I'll create a patch shortly (few days) for both issues.

comment:11 Changed 2 years ago by alec

Feel free to provide patches. For now I just want to says that ajax response for these two messages contains following json_encoded subjects

"07453673011883546\u01bc\u04bb\u0475\u04f2\u01b7 \u057f\u022b\u04fd\u013f07453673011883546"
"728231524286809\u01bc\u04bb\u0475\u04f2\u01b7 \u057f\u022b\u04fd\u013f728231524286809"

And it works for me using IE8.

comment:12 follow-up: Changed 2 years ago by mrjoel

Thanks for providing that, the escaped unicode points look incorrect though (really stopped looking after the first one, although again I'm far from an encoding expert). Looking at the first character - U+01bc, is a Latin capital letter tone five, which while I don't know what it is, isn't expected in a Chinese charset, and further isn't even defined in a GB2312 to Unicode mapping Link 1, Link 2. Also, the number of escaped unicode points seems very low/odd - 5 to the left of the space and 4 to the right out of 20 and 26 bytes respectively. What were you using to capture the response?

Assuming the EUC encoding for GB2312 (fixed two bytes for non-ASCII characters), the first non-ASCII byte sequence is 0xcdc6 as found by:

echo "MDc0NTM2NzMwMTE4ODM1NDbNxrz20ru/7sPit9G158rTsvrGtyDW1cntw+K30crVv7TIq8fytefK073axL+joTA3NDUzNjczMDExODgzNTQ2" | base64 -d | hexdump -C

00000000  30 37 34 35 33 36 37 33  30 31 31 38 38 33 35 34  |0745367301188354|
00000010  36 cd c6 bc f6 d2 bb bf  ee c3 e2 b7 d1 b5 e7 ca  |6...............|
00000020  d3 b2 fa c6 b7 20 d6 d5  c9 ed c3 e2 b7 d1 ca d5  |..... ..........|
00000030  bf b4 c8 ab c7 f2 b5 e7  ca d3 bd da c4 bf a3 a1  |................|
00000040  30 37 34 35 33 36 37 33  30 31 31 38 38 33 35 34  |0745367301188354|
00000050  36

Taking off the high bits (subtracting 0x8080) gives the GB2312 character of 0x4d46, which according to the Unicode mapping in Link 2 above, is Unicode character U+63a8, which looks visually identical to what I get in the console when replacing "hexdump -C" with "iconv -f GB2312" in the above command.

If you don't mind, what do you get when running the test command of only the first string part (at the end of comment 8) on the host where you're testing the message and getting it displayed? Also, what platform are you running on, with what libc or equivalent (iconv --version)? I'm on Debian with eglibc 2.11.1-20.

Finally, when displaying in the browser where you have it working, does it only display the seven characters (with a space in between) or does it show the full string as expected:

728231524286809推荐一款免费电视产品 终身免费收看全球电视节目!728231524286809

comment:13 in reply to: ↑ 12 Changed 2 years ago by alec

Replying to mrjoel:

What were you using to capture the response?

Set debug_level=9, which enables browser console or use Firebug or sth.

If you don't mind, what do you get when running the test command of only the first string part (at the end of comment 8) on the host where you're testing the message and getting it displayed? Also, what platform are you running on, with what libc or equivalent (iconv --version)? I'm on Debian with eglibc 2.11.1-20.

# iconv --version
iconv (GNU libc) 2.11.2
# LC_ALL=C; echo -n "MDc0NTM2NzMwMTE4ODM1NDbNxrz20ru/7sPit9G158rTsvrGtyDW1cntw+K3" | base64 -d - | iconv -f GB2312
07453673011883546iconv: illegal input sequence at position 17
# LC_ALL=pl_PL.utf8; echo -n "MDc0NTM2NzMwMTE4ODM1NDbNxrz20ru/7sPit9G158rTsvrGtyDW1cntw+K3" | base64 -d - | iconv -f GB2312
07453673011883546推荐一款免费电视产品 终身免iconv: niekompletny znak lub sekwencja sterująca na końcu bufora

Finally, when displaying in the browser where you have it working, does it only display the seven characters (with a space in between) or does it show the full string as expected:

728231524286809推荐一款免费电视产品 终身免费收看全球电视节目!728231524286809

The full string, proper numbers but "garbage" characters inside.

comment:14 follow-up: Changed 2 years ago by mrjoel

Ok, at least one environmental difference I just found is that I had debug_level set to 5 when I was getting the errors. Removing the level 4 from the bitfield allows the messages to be displayed in the index and selected to display the message.

Oddly, however, the subject displayed in the message index listing is different (matches the JSON escaped string in comment 11, with 5 chars before the middle space, and four after) than what is shown when actually viewing the message. When viewing the message, the subject looks closer to what it should be by the number of characters, but it doesn't actually display the characters, but a lot of placeholder <?> characters. This is in Firefox where the above comments are correctly visible. The message body is also displayed using actual glyphs instead of the placeholders.

comment:15 in reply to: ↑ 14 Changed 2 years ago by alec

Replying to mrjoel:

Oddly, however, the subject displayed in the message index listing is different (matches the JSON escaped string in comment 11, with 5 chars before the middle space, and four after) than what is shown when actually viewing the message.

It's because subject on the list is sent via AJAX and need to be cleaned up. If not cleaned IE will display error and will not display the list.

comment:16 follow-up: Changed 2 years ago by mrjoel

I'm attaching a patch that makes the MIME decoding more robust for broken mailers which split encoded-text between underlying characters. I have also attached a test message that was used to test the new parsing and test permutations.

One style question is that the existing decode_mime_string function used a combination of iterative and recursive approach, iterative for successive encoded parts, and one recursive call for trailing ASCII portions. Based on that and the existing comment I've made the new approach iterative as well except for the optional final trailing ASCII portion.

As for performance of the unpatched vs. patched approach, non-MIME encoded strings are identical since no processing is done in either case. I ran some basic performance tests against the test batch of messages (including the artificially crafted bad one) and against a few page views of my spam box. I didn't do formal averages of multiple runs of each, but just basic function timing, and the results looks comparable (fewer function calls, more string concat work).

The one thing that is still a lingering concern is that you say it is working for you before the patch. If you could please just confirm that you actually see the 'prepatch' subject displayed below, and not the 'patched' subject even without the patch, that would resolve that concern. While characters were previously displayed, they weren't actually the right characters, just garbage.

'prepatch' subject display:

Message:  07453673011883546ƼһѵӲƷ տȫӽĿ07453673011883546
Message2: 728231524286809ƼһѵӲƷ տȫӽĿ728231524286809

'patched' subject display:

Message : 07453673011883546推荐一款免费电视产品 终身免费收看全球电视节目!07453673011883546
Message2: 728231524286809推荐一款免费电视产品 终身免费收看全球电视节目!728231524286809

The issue of having debug_level&4 set is still unresolved. This patch fixes some instances of broken encodings causing it, but a partial message (only one encoded string which has an incomplete multi-byte character) will still cause the issue (I have an example in my spam corpus which I had to double check to make sure it wasn't an issue caused with my patch). It's likely to be caused by other issues as well. In any case, not sure if you know off hand what is causing it or if I should create a separate ticket for that (with a sample message to break it), please let me know which you'd prefer.

The log message corresponding to the decoding error that highlights where the error is occurring is

PHP Warning:  Cannot modify header information - headers already sent in ./roundcubemail-0.5.1/program/include/rcube_json_output.php on line 233

Changed 2 years ago by mrjoel

Patch for spanned mime encoded parts

Changed 2 years ago by mrjoel

Test message for spanned MIME encoded parts

comment:17 in reply to: ↑ 16 Changed 2 years ago by alec

Replying to mrjoel:

The one thing that is still a lingering concern is that you say it is working for you before the patch. If you could please just confirm that you actually see the 'prepatch' subject displayed below,

Message:  07453673011883546ƼһѵӲƷ տȫӽĿ07453673011883546
Message2: 728231524286809ƼһѵӲƷ տȫӽĿ728231524286809

Yes. I see prepatch version of the subject.

The issue of having debug_level&4 set is still unresolved.

In my opinion this level shouldn't be never used, it has other side effects too.

comment:18 Changed 2 years ago by alec

  • Resolution set to fixed
  • Status changed from new to closed

Your patch was ok, but I've written this again from scratch in a clearer way. The performance of both is better that previous version (at least for headers with few encoded words). Fixed in [8df56e61].

Note: See TracTickets for help on using tickets.