Ticket #1485450 (closed Bugs: duplicate)

Opened 3 months ago

Last modified 7 weeks ago

importing .vcf with "Windows-1251" encoding gives incorrect entries

Reported by: tensor Owned by:
Priority: 5 Milestone: 0.2-stable
Component: Addressbook Version: 0.2-beta
Severity: minor Keywords:
Cc:

Description

I tried to import a vCard file exported from Outlook. Outlook wrote the contact using the Windows-1251 code page, and as a result the names are displayed incorrectly.

It may be wise to use a default encoding when importing contacts or, even better, to let the user pick the encoding from a list when uploading.

Attachments

Вася Пупкин.vcf (2.6 kB) - added by tensor 3 months ago.
pupkin-incorrect.PNG (3.7 kB) - added by tensor 3 months ago.
incorrect import
pupkin-correct.PNG (4.9 kB) - added by tensor 3 months ago.
correct charset used

Change History

Changed 3 months ago by tensor

The real problem is that mb_detect_encoding() returns ISO-8859-1 for a Windows-1251 .vcf file on my Debian/lenny.
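
A minimal sketch of why the detection is ambiguous here. The byte string is illustrative ("Вася" typed out as Windows-1251 bytes), not read from the attached file, and the script assumes the mbstring extension is loaded. Every byte is a well-formed character in both single-byte charsets, so a detector cannot reject ISO-8859-1 on validity alone; only UTF-8 can be ruled out structurally.

```php
<?php
// "Вася" as Windows-1251 bytes (illustrative sample, not the attached .vcf).
$bytes = "\xC2\xE0\xF1\xFF";

// Both conversions succeed: the bytes are well-formed in either charset,
// so validity checks alone cannot tell the two apart.
$as_cp1251 = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1251');
$as_latin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

// UTF-8 is the only candidate that can be excluded by structure:
// 0xC2 must be followed by a continuation byte, and 0xE0 is not one.
$is_utf8 = mb_check_encoding($bytes, 'UTF-8');

echo $as_cp1251, "\n";   // Вася
echo $as_latin1, "\n";   // Âàñÿ
var_dump($is_utf8);      // bool(false)
```

The second line of output is the kind of garbage shown in pupkin-incorrect.PNG when the file is treated as Latin-1.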

Changed 3 months ago by alec

Please attach a sample file for testing.

Changed 3 months ago by alec

  • type changed from Feature Requests to Bugs

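Changed 3 months ago by tensor

  • attachment Вася Пупкин.vcf added

Changed 3 months ago by tensor

  • attachment pupkin-incorrect.PNG added

incorrect import

Changed 3 months ago by tensor

  • attachment pupkin-correct.PNG added

correct charset used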

Changed 3 months ago by tensor

To produce the correct-charset screenshot I used the following changes, with default_charset set to "Windows-1251":

=== program/include/rcube_shared.inc
==================================================================
--- program/include/rcube_shared.inc    (revision 1986)
+++ program/include/rcube_shared.inc    (local)
@@ -556,9 +556,10 @@
        'ISO-2022-KR', 'ISO-2022-JP'
     );
 
-    $result = mb_detect_encoding($string, $enc);
-
-    return $result ? $result : $failover;
+    //$result = mb_detect_encoding($string, $enc);
+    
+    //return $result ? $result : $failover;
+    return '';
 }
 
 ?>
=== program/include/rcube_vcard.php
==================================================================
--- program/include/rcube_vcard.php     (revision 1986)
+++ program/include/rcube_vcard.php     (local)
@@ -412,8 +412,9 @@
         | \xF4[\x80-\x8F][\x80-\xBF]{2}
         )*\z/xs', substr($string, 0, 2048)))
       return 'UTF-8';
-
-    return 'ISO-8859-1'; # fallback to Latin-1
+        
+    //TODO: 'ISO-8859-1' better be define()d as RCMAIL_FALLBACK_CHARSET or with similar name
+    return rcmail::get_instance()->config->get('default_charset', 'ISO-8859-1');
   }
 
 }

On my system mb_detect_encoding($vcf, "Windows-1251") returns an empty string even though the .vcf file contains Windows-1251 characters.

Also note that the .vcf file contains per-field charset hints in its field definitions. Using them would be helpful.
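
A rough sketch of how such hints could be used, assuming the file follows vCard 2.1 and marks fields with a CHARSET parameter (for example `FN;CHARSET=windows-1251;ENCODING=QUOTED-PRINTABLE:...`). Both function names are hypothetical and not part of Roundcube, the sample line is made up rather than copied from the attachment, and the fallback argument stands in for whatever default_charset is configured.

```php
<?php
// Hypothetical helper (not a Roundcube function): return the CHARSET
// parameter of one unfolded vCard line, or null if none is declared.
function vcard_line_charset($line)
{
    // Parameters sit between the property name and the first ':'.
    $pos = strpos($line, ':');
    if ($pos === false) {
        return null;
    }
    $head = substr($line, 0, $pos);
    if (preg_match('/;CHARSET=([^;:]+)/i', $head, $m)) {
        return trim($m[1], '"');
    }
    return null;
}

// Hypothetical helper: decode the value of one line to UTF-8, preferring
// the per-field hint and using $fallback only when no hint is present.
// Note: mb_convert_encoding() rejects charset names it does not know.
function vcard_line_value_utf8($line, $fallback = 'ISO-8859-1')
{
    $pos = strpos($line, ':');
    if ($pos === false) {
        return '';
    }
    $head  = substr($line, 0, $pos);
    $value = substr($line, $pos + 1);

    if (stripos($head, 'ENCODING=QUOTED-PRINTABLE') !== false) {
        $value = quoted_printable_decode($value);
    }

    $charset = vcard_line_charset($line);
    return mb_convert_encoding($value, 'UTF-8', $charset !== null ? $charset : $fallback);
}

// Made-up sample line in the style of a vCard 2.1 export.
$line = 'FN;CHARSET=windows-1251;ENCODING=QUOTED-PRINTABLE:=C2=E0=F1=FF';
echo vcard_line_value_utf8($line), "\n";
```

With a hint present no guessing is needed at all; detection or default_charset would only matter for fields that carry no CHARSET parameter.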

Changed 3 months ago by tensor

  • severity changed from normal to minor

Changed 3 months ago by tensor

Oh well, mb_detect_encoding() does not support Russian yet, despite being advertised as doing so.

http://bugs.php.net/bug.php?id=38138

Changed 3 months ago by tensor

There are two ways to solve this issue.

1. Test only for UTF-8 and other encodings that can be recognized reliably by inspecting the first few bytes, and fall back to default_charset if that detection fails. Do not use mb_detect_encoding() at all, since it really ought to be named mb_guess_encoding() :)

- OR -

2. Provide an explicit dropdown for choosing the charset when uploading, optionally followed by a confirmation step that shows the names from the .vcf as they were recognized. The user would confirm that they were recognized properly and then commit the changes to the address book.

Changed 3 months ago by tensor

Patch for way 1:

Index: web/program/include/rcube_vcard.php
===================================================================
--- web.orig/program/include/rcube_vcard.php    2008-10-05 04:23:17.000000000 +0400
+++ web/program/include/rcube_vcard.php 2008-10-05 04:45:04.000000000 +0400
@@ -396,9 +396,6 @@
     if (substr($string, 0, 2) == "\xFF\xFE")     return 'UTF-16LE';  // Little Endian
     if (substr($string, 0, 3) == "\xEF\xBB\xBF") return 'UTF-8';
 
-    if ($enc = rc_detect_encoding($string))
-      return $enc;
-
     // No match, check for UTF-8
     // from http://w3.org/International/questions/qa-forms-utf-8.html
     if (preg_match('/\A(
@@ -412,10 +409,8 @@
         | \xF4[\x80-\x8F][\x80-\xBF]{2}
         )*\z/xs', substr($string, 0, 2048)))
       return 'UTF-8';
-
-    return 'ISO-8859-1'; # fallback to Latin-1
+        
+    return rcmail::get_instance()->config->get('default_charset', 'ISO-8859-1');
   }
 
 }
-
-

Changed 7 weeks ago by thomasb

  • status changed from new to closed
  • resolution set to duplicate
  • component changed from Core functionality to Addressbook

Charset detection is not easy! Marking as a duplicate of #1485542.