#1485450 closed Bugs (duplicate)
importing .vcf with "Windows-1251" encoding gives incorrect entries
| Reported by: | tensor | Owned by: | |
|---|---|---|---|
| Priority: | 5 | Milestone: | 0.2-stable |
| Component: | Addressbook | Version: | 0.2-beta |
| Severity: | minor | Keywords: | |
| Cc: | | | |
Description
I tried to import a vCard file exported from Outlook. Outlook wrote the contact in the Windows-1251 code page, and as a result the names are displayed incorrectly.
It may be wise to use a default encoding when importing contacts.
Or, even better, to let the user pick the encoding from a list when uploading.
Attachments (3)
Change History (12)
comment:1 Changed 5 years ago by tensor
comment:2 Changed 5 years ago by alec
Please, attach sample file for testing.
comment:3 Changed 5 years ago by alec
- Type changed from Feature Requests to Bugs
Changed 5 years ago by tensor
comment:4 Changed 5 years ago by tensor
To get the sample imported with the correct charset, I used the following changes (with default_charset set to "Windows-1251"):
=== program/include/rcube_shared.inc
==================================================================
--- program/include/rcube_shared.inc (revision 1986)
+++ program/include/rcube_shared.inc (local)
@@ -556,9 +556,10 @@
'ISO-2022-KR', 'ISO-2022-JP'
);
- $result = mb_detect_encoding($string, $enc);
-
- return $result ? $result : $failover;
+ //$result = mb_detect_encoding($string, $enc);
+
+ //return $result ? $result : $failover;
+ return '';
}
?>
=== program/include/rcube_vcard.php
==================================================================
--- program/include/rcube_vcard.php (revision 1986)
+++ program/include/rcube_vcard.php (local)
@@ -412,8 +412,9 @@
| \xF4[\x80-\x8F][\x80-\xBF]{2}
)*\z/xs', substr($string, 0, 2048)))
return 'UTF-8';
-
- return 'ISO-8859-1'; # fallback to Latin-1
+
+ //TODO: 'ISO-8859-1' better be define()d as RCMAIL_FALLBACK_CHARSET or with similar name
+ return rcmail::get_instance()->config->get('default_charset', 'ISO-8859-1');
}
}
On my system mb_detect_encoding($vcf, "Windows-1251") returns an empty string even though the .vcf file contains Windows-1251 characters.
Also note that a .vcf file contains per-field charset hints in the field definitions. Using them would be helpful.
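As a rough illustration of using those hints: vCard 2.1 files can carry a CHARSET parameter on individual properties (for example `FN;CHARSET=windows-1251:...`). The sketch below pulls that parameter out of a single, already unfolded line. The function name vcard_line_charset() is made up for this example and is not part of Roundcube, and the sketch assumes parameter values never contain a colon.

```php
<?php
// Hypothetical helper, not Roundcube code: returns the CHARSET parameter
// of one unfolded vCard line, or null if the line does not declare one.
function vcard_line_charset(string $line): ?string
{
    // Parameters live between the property name and the first colon.
    $colon = strpos($line, ':');
    if ($colon === false) {
        return null; // not a "NAME;PARAMS:VALUE" line
    }
    $head = substr($line, 0, $colon);

    // Match ";CHARSET=<value>" case-insensitively, up to the next ";" or the end.
    if (preg_match('/;CHARSET=([^;:]+)/i', $head, $m)) {
        return trim($m[1], '"');
    }
    return null;
}

$hint   = vcard_line_charset('FN;CHARSET=windows-1251;ENCODING=QUOTED-PRINTABLE:=C8=E2=E0=ED');
$noHint = vcard_line_charset('FN:John Doe');
```

A caller could try this hint first and fall back to detection or default_charset only when it returns null.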
comment:5 Changed 5 years ago by tensor
- Severity changed from normal to minor
comment:6 Changed 5 years ago by tensor
Oh well, mb_detect_encoding() does not yet support Russian as advertised.
comment:7 Changed 5 years ago by tensor
There are two ways to solve this issue.
- Test for UTF-8 or any other encoding that can be recognized from the first few bytes, and fall back to default_charset if detection fails. Do not use mb_detect_encoding at all, as it should really be named mb_guess_encoding() :)
- OR -
- Provide an explicit dropdown to choose the charset when uploading, with an optional confirmation step that shows the names from the .vcf as they were recognized. The user would confirm that they were recognized correctly and commit the changes to the address book.
comment:8 Changed 5 years ago by tensor
Patch for way 1:
Index: web/program/include/rcube_vcard.php
===================================================================
--- web.orig/program/include/rcube_vcard.php 2008-10-05 04:23:17.000000000 +0400
+++ web/program/include/rcube_vcard.php 2008-10-05 04:45:04.000000000 +0400
@@ -396,9 +396,6 @@
if (substr($string, 0, 2) == "\xFF\xFE") return 'UTF-16LE'; // Little Endian
if (substr($string, 0, 3) == "\xEF\xBB\xBF") return 'UTF-8';
- if ($enc = rc_detect_encoding($string))
- return $enc;
-
// No match, check for UTF-8
// from http://w3.org/International/questions/qa-forms-utf-8.html
if (preg_match('/\A(
@@ -412,10 +409,8 @@
| \xF4[\x80-\x8F][\x80-\xBF]{2}
)*\z/xs', substr($string, 0, 2048)))
return 'UTF-8';
-
- return 'ISO-8859-1'; # fallback to Latin-1
+
+ return rcmail::get_instance()->config->get('default_charset', 'ISO-8859-1');
}
}
-
-
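Taken out of its class context, the approach in this patch can be sketched as a standalone function. This is only an approximation under stated assumptions: guess_vcard_charset() is a hypothetical name, the $default argument stands in for the default_charset config value, and the UTF-8 check uses PCRE's /u modifier on the whole string instead of the W3C regular expression applied to the first 2048 bytes in rcube_vcard.php.

```php
<?php
// Hypothetical standalone sketch, not the actual rcube_vcard method.
function guess_vcard_charset(string $data, string $default = 'ISO-8859-1'): string
{
    // A byte order mark identifies the Unicode encodings unambiguously.
    if (substr($data, 0, 3) === "\xEF\xBB\xBF") return 'UTF-8';
    if (substr($data, 0, 2) === "\xFE\xFF")     return 'UTF-16BE';
    if (substr($data, 0, 2) === "\xFF\xFE")     return 'UTF-16LE';

    // With the /u modifier PCRE refuses subjects that are not valid UTF-8,
    // so preg_match() returns 1 only for well-formed UTF-8 (plain ASCII included).
    if (preg_match('//u', $data) === 1) return 'UTF-8';

    // Nothing recognizable: trust the configured default instead of guessing.
    return $default;
}

$bytes   = "\xC8\xE2\xE0\xED"; // four Cyrillic letters encoded as Windows-1251
$charset = guess_vcard_charset($bytes, 'Windows-1251');
```

The design choice is that the function never tries to tell single-byte charsets apart; anything that is not provably Unicode is assumed to be in the site's default charset.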
comment:9 Changed 5 years ago by thomasb
- Component changed from Core functionality to Addressbook
- Resolution set to duplicate
- Status changed from new to closed
Charset detection is not easy! Marking this as a duplicate of #1485542.

The real problem is that mb_detect_encoding() returns ISO-8859-1 for a Windows-1251 .vcf file on my Debian/lenny.
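One way to see why detection is hard here: the same bytes decode without error under both single-byte charsets, so nothing in the data itself says which one was meant. A minimal sketch, assuming the mbstring extension is loaded; the sample bytes are made up for illustration and are not taken from the attached file.

```php
<?php
// Requires the mbstring extension.
// Four bytes that are meaningful in both Windows-1251 and ISO-8859-1.
$bytes = "\xC8\xE2\xE0\xED";

// Read as Windows-1251 these are the Cyrillic letters "Иван".
$asCyrillic = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1251');

// Read as ISO-8859-1 the same bytes are the accented Latin letters "Èâàí".
$asLatin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

echo $asCyrillic, "\n", $asLatin1, "\n";
```

Both conversions succeed, which is why a byte-level detector can only guess between such charsets and why an explicit hint or a configured default is needed.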