Ticket #1484961 (reopened Bugs)

Opened 8 months ago

Last modified 6 weeks ago

Default charset is ignored

Reported by: joungkyun
Owned by:
Priority: 8
Milestone: 0.2-stable
Component: IMAP connection
Version: svn-trunk
Severity: normal
Keywords: charset
Cc:

Description

If there is no charset in the mail header or the multipart header, WU-imapd returns US-ASCII (via the iil_C_FetchStructureString function). The charset is therefore not null (it is US-ASCII), and the default_charset configuration in main.inc.php is ignored.

To fix this problem I attached a patch file and raw mail data. The patch applies to SVN revision 1246.

Attachments

roundcubemail-1246-default-charset.patch (1.5 kB) - added by joungkyun 8 months ago.
fixed case of ignored default_charset
row-mail-data.txt (2.0 kB) - added by joungkyun 8 months ago.
example of raw mail data that has no charset
korean-msg.PNG (36.6 kB) - added by tensor 2 months ago.
korean-message-list.PNG (15.6 kB) - added by tensor 2 months ago.
rcube-list-default-chatset.patch (1.0 kB) - added by joungkyun 2 months ago.
fixed broken list if Subject has no CHARSET, for rcube 0.1.1
korean-list.PNG (13.2 kB) - added by tensor 2 months ago.
rcube-svn1941-broken-list.patch (504 bytes) - added by joungkyun 2 months ago.
rcube_0.2_broken_list.jpg (34.4 kB) - added by joungkyun 2 months ago.
rcube_0.2_good_list.jpg (33.9 kB) - added by joungkyun 2 months ago.
rcube_0.2_broken_body_with_cjk.patch (1.1 kB) - added by joungkyun 2 months ago.
patch for SVN revision 1941
broken_body_on_list_page.jpg (79.9 kB) - added by joungkyun 2 months ago.
broken_body_on_opening_message.jpg (102.7 kB) - added by joungkyun 2 months ago.
no_charset_header_on_body_structure.txt (1.8 kB) - added by joungkyun 2 months ago.
main.inc.php (15.4 kB) - added by joungkyun 2 months ago.
rcube_imap.patch (1.9 kB) - added by alec 7 weeks ago.
use rc_detect_encoding()
mb_detect_encoding.jpg (52.5 kB) - added by joungkyun 7 weeks ago.

Change History

Changed 8 months ago by joungkyun

fixed case of ignored default_charset

Changed 8 months ago by joungkyun

example of raw mail data that has no charset

  Changed 8 months ago by thomasb

  • severity changed from major to normal
  • milestone changed from 0.1.1 to later

In case the default charset is ISO-8859-1 or UTF-8 this could work since US-ASCII is a subset of them. But we should not in general replace all US-ASCII messages with the default charset. Unfortunately we cannot see whether the message was really sent with US-ASCII or if this was added by the IMAP server.

  Changed 8 months ago by joungkyun

In Korea, mail uses the euc-kr or utf-8 charset (euc-kr is more widely used than utf-8).

If the mail header or multipart header has no charset, as in the attached raw mail data, the iil_C_FetchStructureString function issues an IMAP command to get the charset and other information, and the IMAP server returns the US-ASCII charset. (I used imap-2006f.)

In this case RoundCube sets the charset to US-ASCII and converts from US-ASCII to UTF-8, even though I configured the default charset as euc-kr in main.inc.php. The result is garbled characters, and I cannot read the mail.

If there is no charset, I expect conversion from the default charset (euc-kr) to UTF-8, but RoundCube actually converts from US-ASCII (ISO-8859-1) to UTF-8. Many Korean users are in this situation. The problem is the sender's fault, but a lot of e-mail is sent this way. So please add an exception for this case, as in my patch.
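For illustration, a minimal PHP sketch of the mismatch described in this comment. It assumes the mbstring extension is loaded and, as the comment suggests, that an unlabelled body reported as US-ASCII ends up being converted as if it were ISO-8859-1. The variable names are hypothetical and this is not RoundCube code.

```php
<?php
// The Korean greeting used elsewhere in this ticket, as UTF-8 source text.
$utf8 = '안녕하세요';

// What the sender put on the wire: EUC-KR bytes with no charset label.
$raw = mb_convert_encoding($utf8, 'EUC-KR', 'UTF-8');

// Assumed current behaviour: the bytes are treated as ISO-8859-1.
$wrong = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-1');

// Behaviour the reporter expects: decode with default_charset (EUC-KR).
$right = mb_convert_encoding($raw, 'UTF-8', 'EUC-KR');

var_dump($wrong === $utf8); // bool(false): the text is garbled
var_dump($right === $utf8); // bool(true): the text round-trips
```

The bytes are identical in both conversions; only the source charset passed to the converter differs.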

follow-up: ↓ 4   Changed 8 months ago by thomasb

  • status changed from new to closed
  • resolution set to wontfix

Again, RoundCube has to trust whatever the IMAP server responds. The default charset is only used if there's NO charset specified by the IMAP server. In your case there is one.

in reply to: ↑ 3   Changed 8 months ago by joungkyun

Replying to thomasb:

Again, RoundCube has to trust whatever the IMAP server responds. The default charset is only used if there's NO charset specified by the IMAP server. In your case there is one.

I don't understand your answer. The IMAP server gives a wrong charset when the mail has no charset, and the mail is displayed broken.

Anyway, I see that this patch is of no use to you, but many multibyte mails are still broken in RoundCube. Thanks.

  Changed 7 months ago by alec

  • milestone changed from later to 0.1.2

follow-up: ↓ 8   Changed 2 months ago by tensor

  • milestone changed from 0.2-alpha to 0.2-stable

In trunk it appears to be fixed for the message body. I tested with default_charset of EUC-KR. joungkyun, see the screenshot and tell us whether the text is Korean. There is an issue, though: the subject and sender are not decoded properly in the message list.

  Changed 2 months ago by tensor

  • status changed from closed to reopened
  • type changed from Patches to Bugs
  • resolution deleted

Changed 2 months ago by tensor

Changed 2 months ago by tensor

Changed 2 months ago by joungkyun

fixed broken list if Subject has no CHARSET, for rcube 0.1.1

in reply to: ↑ 6   Changed 2 months ago by joungkyun

I tested with default_charset of EUC-KR. joungkyun, see the screenshot and tell us whether the text is Korean. There is an issue, though: the subject and sender are not decoded properly in the message list.

The message body is displayed as proper Korean, but the list is broken. It will need another patch :), which I have attached. Sorry, that patch is for 0.1.1; in a few hours I will attach a patch for 0.2.

Thanks.

  Changed 2 months ago by tensor

No patch necessary. That was a caching issue on my side or something was fixed in the trunk during the last several days. I haven't merged upstream changes for some time. Use a tool in #1485434 to reset the cache.

Changed 2 months ago by tensor

  Changed 2 months ago by alec

  • status changed from reopened to closed
  • resolution set to worksforme

joungkyun, please check the current svn-trunk version. Charset detection was added some time ago, and maybe this fixes your issue. Closing, as this works for me too.

  Changed 2 months ago by joungkyun

  • status changed from closed to reopened
  • resolution deleted

I tested with SVN revision 1941. Sadly, Korean strings are still broken in some cases, and Korean UTF-8 filenames are lost.

I am checking these problems now and will report again. Thanks.

  Changed 2 months ago by joungkyun

I found 2 problems in 0.2-beta and SVN trunk revision 1941.

One is broken attachment file names that use RFC 2231 encoding. I reported this problem at http://trac.roundcube.net/ticket/1485468 and attached a patch file there.

And there is another problem.

In the mail list, when a mail whose subject has a CHARSET (Subject: =?UTF-8?B?xxxxxxx?=) is followed by a mail whose subject has no CHARSET (Subject: 안녕하세요), the second subject is broken. This case is shown in the attached file 'rcube_0.2_broken_list.jpg'.

So I have attached rcube-svn1941-broken-list.patch.

Changed 2 months ago by joungkyun

Changed 2 months ago by joungkyun

Changed 2 months ago by joungkyun

Changed 2 months ago by joungkyun

patch for SVN revision 1941

follow-up: ↓ 15   Changed 2 months ago by joungkyun

I found 1 more problem.

If the mail body has no charset, the IMAP server returns its own default charset, US-ASCII or X-UNKNOWN. But almost no mail from countries that use CJK (Chinese, Japanese, Korean) charsets is really US-ASCII or X-UNKNOWN. So in a CJK environment, almost all mail without a charset header is broken.

So, if default_charset is not ISO-8859-1, the US-ASCII (or X-UNKNOWN) value returned by the iil_C_FetchStructureString function needs to be replaced with default_charset.

I have added rcube_0.2_broken_body_with_cjk.patch.

follow-up: ↓ 34   Changed 2 months ago by tensor

Issue for rcube-svn1941-broken-list.patch confirmed.

in reply to: ↑ 13 ; follow-up: ↓ 17   Changed 2 months ago by tensor

Replying to joungkyun:

And, add rcube_0.2_broken_body_with_cjk.patch

Please implement it as an option.

Does this issue occur for the message list or when opening a message?

Something like this should go into main.inc.php:

    // Some IMAP servers return BODYSTRUCTURE with US-ASCII (IMAP-2006f)
    // or X-UNKNOWN (IMAP-2007b) charset when no charset is specified in the message.
    // This setting allows you to force decoding of headers using default_charset.
    $rcmail_config['override_bodystructure_charset'] = array(
        'X-UNKNOWN',
        // 'US-ASCII',
    );
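As a rough sketch of how such an option might be consulted when choosing the charset for decoding: the function name resolve_charset() and the $config array layout are hypothetical, and this is not the attached patch or existing RoundCube code.

```php
<?php
// Hypothetical helper: decide which charset to decode a part with.
function resolve_charset($reported, array $config)
{
    $override = isset($config['override_bodystructure_charset'])
        ? $config['override_bodystructure_charset']
        : array();

    // No charset reported by the server: nothing to trust, use the default.
    if ($reported === null || $reported === '') {
        return $config['default_charset'];
    }

    // Charset names are case-insensitive, so compare in upper case.
    if (in_array(strtoupper($reported), array_map('strtoupper', $override), true)) {
        return $config['default_charset'];
    }

    // Anything else is taken from the server as reported.
    return $reported;
}

$config = array(
    'default_charset' => 'EUC-KR',
    'override_bodystructure_charset' => array('X-UNKNOWN'),
);

echo resolve_charset('X-UNKNOWN', $config), "\n"; // EUC-KR
echo resolve_charset('US-ASCII', $config), "\n";  // US-ASCII
echo resolve_charset(null, $config), "\n";        // EUC-KR
```

With this shape, an admin in a CJK environment could add 'US-ASCII' to the list, while other installations keep trusting the server for that value.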

follow-up: ↓ 18   Changed 2 months ago by alec

  • component changed from Core functionality to IMAP connection

I think using default_charset in such a case makes no sense in an international environment. Please attach the whole BODYSTRUCTURE reply from your server.

in reply to: ↑ 15   Changed 2 months ago by joungkyun

Replying to tensor:

Replying to joungkyun:

And, add rcube_0.2_broken_body_with_cjk.patch

Please implement it as an option. Does this issue occur for the message list or when opening a message?

This issue occurs in both situations. See also broken_body_on_list_page.jpg and broken_body_on_opening_message.jpg.

Something like this should go into main.inc.php:

    // Some IMAP servers return BODYSTRUCTURE with US-ASCII (IMAP-2006f)
    // or X-UNKNOWN (IMAP-2007b) charset when no charset is specified in the message.
    // This setting allows you to force decoding of headers using default_charset.
    $rcmail_config['override_bodystructure_charset'] = array(
        'X-UNKNOWN',
        // 'US-ASCII',
    );

I think that is a good idea, thanks :-)

Changed 2 months ago by joungkyun

Changed 2 months ago by joungkyun

Changed 2 months ago by joungkyun

in reply to: ↑ 16   Changed 2 months ago by joungkyun

Replying to alec:

I think using default_charset in such a case makes no sense in an international environment. Please attach the whole BODYSTRUCTURE reply from your server.

I already attached row-mail-data.txt, and I have now attached a new file, 'no_charset_header_on_body_structure.txt'.

row-mail-data.txt uses base64 encoding and no_charset_header_on_body_structure.txt uses quoted-printable encoding.

With imap 2006f, the IMAP server returns the following.

* 288 FETCH (BODYSTRUCTURE (("TEXT" "HTML" ("CHARSET" "US-ASCII") NIL NIL "QUOTED-PRINTABLE" 1302 34 NIL NIL NIL NIL) "ALTERNATIVE" ("BOUNDARY" "246.4C780F__DDC20") NIL NIL NIL)).
F1247 OK FETCH completed.

With imap 2007b, the IMAP server returns the following.

* 288 FETCH (BODYSTRUCTURE (("TEXT" "HTML" ("CHARSET" "X-UNKNOWN") NIL NIL "QUOTED-PRINTABLE" 1302 34 NIL NIL NIL NIL) "ALTERNATIVE" ("BOUNDARY" "246.4C780F__DDC20") NIL NIL NIL)).
F1247 OK FETCH completed.
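To make the difference between the two responses concrete, here is a small PHP sketch that pulls the CHARSET parameter out of a BODYSTRUCTURE line. The helper name bodystructure_charset() is hypothetical, the third sample line is made up to show the no-parameter case, and a real client would parse the whole parenthesized list rather than pattern-match one parameter.

```php
<?php
// Hypothetical helper: return the reported charset in upper case, or null
// when the response carries no CHARSET parameter at all.
function bodystructure_charset($line)
{
    if (preg_match('/"CHARSET"\s+"([^"]+)"/i', $line, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// The two responses quoted in this comment.
$imap2006f = '* 288 FETCH (BODYSTRUCTURE (("TEXT" "HTML" ("CHARSET" "US-ASCII") NIL NIL "QUOTED-PRINTABLE" 1302 34 NIL NIL NIL NIL) "ALTERNATIVE" ("BOUNDARY" "246.4C780F__DDC20") NIL NIL NIL))';
$imap2007b = '* 288 FETCH (BODYSTRUCTURE (("TEXT" "HTML" ("CHARSET" "X-UNKNOWN") NIL NIL "QUOTED-PRINTABLE" 1302 34 NIL NIL NIL NIL) "ALTERNATIVE" ("BOUNDARY" "246.4C780F__DDC20") NIL NIL NIL))';

// Made-up example of a server that sends NIL instead of a parameter list.
$noParams = '* 288 FETCH (BODYSTRUCTURE ("TEXT" "PLAIN" NIL NIL NIL "7BIT" 10 1 NIL NIL NIL NIL))';

var_dump(bodystructure_charset($imap2006f)); // string(8) "US-ASCII"
var_dump(bodystructure_charset($imap2007b)); // string(9) "X-UNKNOWN"
var_dump(bodystructure_charset($noParams));  // NULL
```

In the first two cases the client receives a non-empty charset even though the message itself declared none, so a check for an empty charset alone never reaches default_charset.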

follow-ups: ↓ 20 ↓ 21   Changed 2 months ago by tensor

Cannot reproduce for no_charset_header_on_body_structure.txt. It properly picks default_charset of EUC-KR when opening mail even without a patch.

What is your default_charset?

in reply to: ↑ 19   Changed 2 months ago by joungkyun

Replying to tensor:

Cannot reproduce for no_charset_header_on_body_structure.txt. It properly picks default_charset of EUC-KR when opening mail even without a patch. What is your default_charset?

$rcmail_config['default_charset'] = 'EUC-KR';

Changed 2 months ago by joungkyun

in reply to: ↑ 19 ; follow-up: ↓ 22   Changed 2 months ago by joungkyun

Replying to tensor:

Cannot reproduce for no_charset_header_on_body_structure.txt. It properly picks default_charset of EUC-KR when opening mail even without a patch. What is your default_charset?

I attached my main.inc.php.

in reply to: ↑ 21   Changed 2 months ago by joungkyun

Replying to joungkyun:

Replying to tensor:

Cannot reproduce for no_charset_header_on_body_structure.txt. It properly picks default_charset of EUC-KR when opening mail even without a patch. What is your default_charset?

I attached my main.inc.php.

Maybe it is a difference in PHP build options between your server and mine?

follow-ups: ↓ 24 ↓ 25   Changed 2 months ago by tensor

Running Debian/lenny, php 5.2.6, latest Courier.

in reply to: ↑ 23   Changed 2 months ago by joungkyun

Replying to tensor:

Running Debian/lenny, php 5.2.6, latest Courier.

My PHP build options are as follows:

./configure --prefix=/usr --sysconfdir=/etc/php.d --with-config-file-path=/etc/php.d --with-config-file-scan-dir=/etc/php.d/apache --disable-debug --disable-hash --disable-xmlreader --disable-xmlwriter --disable-json --with-exec-dir=/var/lib/php/bin --with-regex=php --with-mod_charset --with-zend-multibyte --with-zlib --with-zlib-dir=/usr --enable-sigchild --enable-safe-mode --enable-inline-optimization --enable-magic-quotes --enable-track-vars --enable-debugger --enable-sysvsem --enable-sysvshm --enable-sysvmsg --enable-libxml --enable-mbstring=all --enable-mbregex --enable-mbregex-backtrack --with-libmbfl --with-apxs=/usr/sbin/apxs --disable-cli --disable-cgi --with-gd=shared --enable-gd-native-ttf --with-jpeg-dir=/usr --with-png-dir=/usr --with-freetype-dir=/usr --with-sqlite=shared --with-sqlite-utf8 --enable-pdo=shared --with-pdo-sqlite=shared --with-iconv=shared --with-openssl=shared

PHP 5.2.6 Glibc 2.2.4

in reply to: ↑ 23 ; follow-up: ↓ 26   Changed 2 months ago by joungkyun

Replying to tensor:

Running Debian/lenny, php 5.2.6, latest Courier.

If the message has no charset, what charset does Courier return? I guess Courier returns no charset.

In this case, Cyrus-Imap and Wu-imap return US-ASCII or X-UNKNOWN.

in reply to: ↑ 25   Changed 7 weeks ago by joungkyun

Replying to joungkyun:

Replying to tensor:

Running Debian/lenny, php 5.2.6, latest Courier.

If the message has no charset, what charset does Courier return? I guess Courier returns no charset. In this case, Cyrus-Imap and Wu-imap return US-ASCII or X-UNKNOWN.

Hmm, so in the end, do I need to patch the IMAP server to fix this problem?

  Changed 7 weeks ago by tensor

Courier returns NIL instead of "body parameter parenthesized list" when there is no charset defined in the headers.

I vote for the patch at RoundCube side, as there may be other IMAP servers with such problem.

I think it is safe to treat X-UNKNOWN as default_charset. All others should be at the discretion of RoundCube admin.

  Changed 7 weeks ago by alec

In my opinion, we should use rc_detect_encoding() for messages with NIL or X-UNKNOWN charset.

Changed 7 weeks ago by alec

use rc_detect_encoding()

follow-up: ↓ 33   Changed 7 weeks ago by alec

As I said, I have good results using rc_detect_encoding(), so test attached rcube_imap.patch, please.

follow-up: ↓ 31   Changed 7 weeks ago by alec

There is one problem, in example message EUC-KR is detected as BIG5, so 'EUC-KR' must be added before 'BIG5' in rc_detect_encoding's charsets array. It would be nice to improve detection implementing mozilla's charset detector http://www.mozilla.org/projects/intl/chardet.html

in reply to: ↑ 30 ; follow-up: ↓ 32   Changed 7 weeks ago by joungkyun

Replying to alec:

There is one problem, in example message EUC-KR is detected as BIG5, so 'EUC-KR' must be added before 'BIG5' in rc_detect_encoding's charsets array. It would be nice to improve detection implementing mozilla's charset detector http://www.mozilla.org/projects/intl/chardet.html

In some cases, EUC-KR is detected as SJIS. The quality of mb_detect_encoding is not good. See also the attached mb_detect_encoding.jpg.

Changed 7 weeks ago by joungkyun

in reply to: ↑ 31   Changed 7 weeks ago by joungkyun

Replying to joungkyun:

Replying to alec:

There is one problem, in example message EUC-KR is detected as BIG5, so 'EUC-KR' must be added before 'BIG5' in rc_detect_encoding's charsets array. It would be nice to improve detection implementing mozilla's charset detector http://www.mozilla.org/projects/intl/chardet.html

In some cases, EUC-KR is detected as SJIS. The quality of mb_detect_encoding is not good. See also the attached mb_detect_encoding.jpg.

In my opinion, the first member of the $enc array should be set to $failover.

    $failover = ! $failover ? $GLOBALS['CONFIG']['default_charset'] : $failover;
    $enc = array(
        $failover, 'SJIS', 'BIG5', 'GB2312', 'UTF-8',
        'ISO-8859-1', 'ISO-8859-2', 'ISO-8859-3', 'ISO-8859-4',
        'ISO-8859-5', 'ISO-8859-6', 'ISO-8859-7', 'ISO-8859-8', 'ISO-8859-9',
        'ISO-8859-10', 'ISO-8859-13', 'ISO-8859-14', 'ISO-8859-15', 'ISO-8859-16',
        'WINDOWS-1252', 'WINDOWS-1251', 'EUC-JP', 'EUC-TW', 'KOI8-R',
        'ISO-2022-KR', 'ISO-2022-JP'
    );
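The ordering idea above can be sketched in standalone PHP as follows. The helper names candidate_charsets() and detect_or_default() are hypothetical, mb_detect_encoding() in strict mode stands in for rc_detect_encoding(), and the candidate list is shortened. What mb_detect_encoding() returns for a given text depends on the PHP version, so no detection result is claimed here.

```php
<?php
// Hypothetical helper: put the failover charset first and drop duplicates
// while keeping the remaining order.
function candidate_charsets($failover, array $candidates)
{
    array_unshift($candidates, $failover);
    return array_values(array_unique(array_map('strtoupper', $candidates)));
}

// Hypothetical helper: try detection, fall back to $failover if it fails.
function detect_or_default($text, $failover, array $candidates)
{
    $list = candidate_charsets($failover, $candidates);
    $found = mb_detect_encoding($text, $list, true); // true = strict mode
    return $found !== false ? $found : $failover;
}

$defaultCharset = 'EUC-KR'; // stands in for default_charset from main.inc.php
$others = array('SJIS', 'UTF-8', 'EUC-KR', 'ISO-8859-1');

// EUC-KR now comes first and appears only once in the list.
print_r(candidate_charsets($defaultCharset, $others));

// EUC-KR bytes of the sample subject from this ticket.
$raw = mb_convert_encoding('안녕하세요', 'EUC-KR', 'UTF-8');
echo detect_or_default($raw, $defaultCharset, $others), "\n";
```

Putting the configured default first only helps where the detector honours list order, which is one reason a plain fallback to default_charset may be the more predictable choice.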

in reply to: ↑ 29   Changed 7 weeks ago by tensor

Replying to alec:

As I said, I have good results using rc_detect_encoding(), so test attached rcube_imap.patch, please.

mb_detect_encoding() fails to detect Windows-1251 texts; it returns the first 'ISO-8859-X' encoding in the argument list. See #1485450.

in reply to: ↑ 14 ; follow-up: ↓ 35   Changed 6 weeks ago by joungkyun

Replying to tensor:

Issue for rcube-svn1941-broken-list.patch confirmed.

What about this patch? I checked SVN revision 2000, but the patch has not been applied, and a subject that has no charset is still broken when it follows a subject that has a charset.

in reply to: ↑ 34   Changed 6 weeks ago by tensor

Replying to joungkyun:

Replying to tensor:

Issue for rcube-svn1941-broken-list.patch confirmed.

What about this patch? I checked SVN revision 2000, but the patch has not been applied, and a subject that has no charset is still broken when it follows a subject that has a charset.

Yes, it is not applied. I previously checked the patch rcube-svn1941-broken-list.patch and it appears to be good. Anyone with svn access, please test and commit.

Steps to reproduce:

  1. Create a new folder.
  2. Import the attached message with no charset into the created folder.
  3. Copy any message with charset set in headers to the created folder.
  4. Try to sort the messages by date both asc and desc.

Trying to detect the charset for headers may help, but it often fails (see above). Falling back to the default is better in almost all cases.

Also see #1485451.
