Opened 5 years ago

Closed 5 years ago

Last modified 3 years ago

#1485450 closed Bugs (duplicate)

importing .vcf with "Windows-1251" encoding gives incorrect entries

Reported by: tensor
Owned by:
Priority: 5
Milestone: 0.2-stable
Component: Addressbook
Version: 0.2-beta
Severity: minor
Keywords:
Cc:

Description

I tried to import a vCard file exported from Outlook. Outlook wrote the contact using the Windows-1251 code page, and as a result the names are displayed incorrectly after import.

It may be wise to fall back to a default encoding when importing contacts.
Or, even better, to let the user pick the encoding from a list when uploading.

Attachments (3)

Вася Пупкин.vcf (2.6 KB) - added by tensor 5 years ago.
pupkin-incorrect.PNG (3.7 KB) - added by tensor 5 years ago.
incorrect import
pupkin-correct.PNG (4.9 KB) - added by tensor 5 years ago.
correct charset used


Change History (12)

comment:1 Changed 5 years ago by tensor

The real problem is that mb_detect_encoding() returns ISO-8859-1 for a Windows-1251 .vcf file on my Debian/lenny.
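
One plausible reason for this result: every byte value is a valid ISO-8859-1 character, so Windows-1251 text always decodes "successfully" as Latin-1, only to the wrong characters. The sketch below illustrates this in Python (not Roundcube's PHP). It assumes the contact name matches the attachment's file name, and it does not show how mb_detect_encoding() works internally.

```python
# Cyrillic text as it would be stored in a Windows-1251 vCard.
# The sample name is an assumption taken from the attachment's file name.
raw = "Вася Пупкин".encode("cp1251")

# Decoding as ISO-8859-1 never raises: all 256 byte values are mapped,
# so a byte-level validity check cannot rule Latin-1 out.
as_latin1 = raw.decode("iso-8859-1")

# Decoding with the code page the file was written in restores the name.
as_cp1251 = raw.decode("cp1251")

print(as_latin1)
print(as_cp1251)
```

Both decodes succeed, so validity alone cannot tell the two charsets apart. A detector needs a statistical guess or an outside hint such as a configured default.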

comment:2 Changed 5 years ago by alec

Please attach a sample file for testing.

comment:3 Changed 5 years ago by alec

  • Type changed from Feature Requests to Bugs

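Changed 5 years ago by tensor

  • Attachment Вася Пупкин.vcf added

Changed 5 years ago by tensor

  • Attachment pupkin-incorrect.PNG added

incorrect import

Changed 5 years ago by tensor

  • Attachment pupkin-correct.PNG added

correct charset used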

comment:4 Changed 5 years ago by tensor

To produce the "correct charset" screenshot I used the following local changes (with default_charset set to "Windows-1251"):

=== program/include/rcube_shared.inc
==================================================================
--- program/include/rcube_shared.inc    (revision 1986)
+++ program/include/rcube_shared.inc    (local)
@@ -556,9 +556,10 @@
        'ISO-2022-KR', 'ISO-2022-JP'
     );
 
-    $result = mb_detect_encoding($string, $enc);
-
-    return $result ? $result : $failover;
+    //$result = mb_detect_encoding($string, $enc);
+    
+    //return $result ? $result : $failover;
+    return '';
 }
 
 ?>
=== program/include/rcube_vcard.php
==================================================================
--- program/include/rcube_vcard.php     (revision 1986)
+++ program/include/rcube_vcard.php     (local)
@@ -412,8 +412,9 @@
         | \xF4[\x80-\x8F][\x80-\xBF]{2}
         )*\z/xs', substr($string, 0, 2048)))
       return 'UTF-8';
-
-    return 'ISO-8859-1'; # fallback to Latin-1
+        
+    //TODO: 'ISO-8859-1' better be define()d as RCMAIL_FALLBACK_CHARSET or with similar name
+    return rcmail::get_instance()->config->get('default_charset', 'ISO-8859-1');
   }
 
 }

On my system mb_detect_encoding($vcf, "Windows-1251") returns an empty string even though the .vcf file contains Windows-1251 characters.

Also note that the .vcf file contains per-field charset hints in the field definitions. Using them would be helpful.
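
A minimal sketch of using such per-field hints, written in Python for illustration rather than in Roundcube's PHP. It assumes the hint is the vCard 2.1 style `CHARSET=` parameter, which appears before the first colon of a line. The function name `decode_vcard_line` and its fallback behaviour are hypothetical, and QUOTED-PRINTABLE values and folded lines are deliberately ignored.

```python
import re

# Matches a CHARSET parameter in the part of a vCard line before the first ':'.
CHARSET_PARAM = re.compile(rb";CHARSET=([^;:]+)", re.IGNORECASE)


def decode_vcard_line(line: bytes, default_charset: str = "iso-8859-1") -> str:
    """Decode one raw vCard line, preferring the line's own CHARSET parameter.

    Hypothetical helper: falls back to default_charset when no hint is
    present, the charset label is unknown, or the bytes do not fit it.
    """
    header = line.split(b":", 1)[0]
    match = CHARSET_PARAM.search(header)
    if match:
        try:
            return line.decode(match.group(1).decode("ascii"))
        except (LookupError, UnicodeDecodeError):
            pass  # unusable hint: fall through to the default
    return line.decode(default_charset, errors="replace")


hinted = b"N;CHARSET=Windows-1251:" + "Пупкин;Вася".encode("cp1251")
print(decode_vcard_line(hinted))
```

Because the hint is attached to each field, this approach would still work for files that mix charsets between fields, which a single file-level guess cannot handle.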

comment:5 Changed 5 years ago by tensor

  • Severity changed from normal to minor

comment:6 Changed 5 years ago by tensor

Oh well, mb_detect_encoding() does not yet support Russian as advertised.

http://bugs.php.net/bug.php?id=38138

comment:7 Changed 5 years ago by tensor

There are two ways to solve this issue.

  1. Check only for UTF-8 and other encodings that can be recognized reliably from the first few bytes (a byte order mark or the bit patterns of multibyte sequences), and fall back to default_charset if that detection fails. Do not use mb_detect_encoding() at all; it should really be named mb_guess_encoding() :)
  - OR -
  2. Provide an explicit dropdown for choosing the charset when uploading, optionally followed by a confirmation step that shows the names from the .vcf as they were recognized. The user would confirm that they look right before the changes are committed to the address book.
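
The confirmation step of the second option could be backed by something like the following Python sketch (an illustration only, not Roundcube code). The helper name `preview_names` and the candidate list are hypothetical. It decodes the same raw bytes under each charset offered in the dropdown so the user can see which rendering looks right.

```python
from typing import Dict, List, Optional


def preview_names(raw: bytes, charsets: List[str]) -> Dict[str, Optional[str]]:
    """Decode raw under each candidate charset; None marks a failed decode."""
    previews: Dict[str, Optional[str]] = {}
    for charset in charsets:
        try:
            previews[charset] = raw.decode(charset)
        except (LookupError, UnicodeDecodeError):
            previews[charset] = None
    return previews


# Sample bytes as they would appear in a Windows-1251 export.
sample = "Вася".encode("cp1251")
for charset, text in preview_names(sample, ["utf-8", "windows-1251", "iso-8859-1"]).items():
    print(charset, "->", text)
```

Candidates that fail to decode can be hidden or greyed out, which leaves the user to choose only among the renderings that are at least byte-valid.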

comment:8 Changed 5 years ago by tensor

Patch for way 1:

Index: web/program/include/rcube_vcard.php
===================================================================
--- web.orig/program/include/rcube_vcard.php    2008-10-05 04:23:17.000000000 +0400
+++ web/program/include/rcube_vcard.php 2008-10-05 04:45:04.000000000 +0400
@@ -396,9 +396,6 @@
     if (substr($string, 0, 2) == "\xFF\xFE")     return 'UTF-16LE';  // Little Endian
     if (substr($string, 0, 3) == "\xEF\xBB\xBF") return 'UTF-8';
 
-    if ($enc = rc_detect_encoding($string))
-      return $enc;
-
     // No match, check for UTF-8
     // from http://w3.org/International/questions/qa-forms-utf-8.html
     if (preg_match('/\A(
@@ -412,10 +409,8 @@
         | \xF4[\x80-\x8F][\x80-\xBF]{2}
         )*\z/xs', substr($string, 0, 2048)))
       return 'UTF-8';
-
-    return 'ISO-8859-1'; # fallback to Latin-1
+        
+    return rcmail::get_instance()->config->get('default_charset', 'ISO-8859-1');
   }
 
 }
-
-
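
The detection order that results from this patch can be sketched as follows, in Python for illustration rather than the actual PHP. The function name `detect_vcard_charset` is hypothetical, only the two byte order mark checks visible in the diff context are reproduced, and a strict UTF-8 decode of the whole input stands in for the regular expression that the patch applies to the first 2048 bytes.

```python
import codecs


def detect_vcard_charset(data: bytes, default_charset: str = "ISO-8859-1") -> str:
    """Return a charset name: BOM first, then UTF-8 validity, then the default."""
    # Byte order marks identify these encodings unambiguously.
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE"
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"

    # No BOM: accept UTF-8 only if every byte sequence is well formed.
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        # Nothing reliable was found, so trust the configured default
        # (default_charset in the patch) instead of guessing.
        return default_charset


print(detect_vcard_charset("Вася".encode("cp1251"), "Windows-1251"))
```

The design choice is to make only the checks that cannot produce a false positive on typical single-byte text, and to leave everything else to configuration.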

comment:9 Changed 5 years ago by thomasb

  • Component changed from Core functionality to Addressbook
  • Resolution set to duplicate
  • Status changed from new to closed

Charset detection is not easy! Marking this as a duplicate of #1485542.
