Opened 7 years ago

Closed 6 years ago

#1484119 closed Bugs (fixed)

Charset bug in contacts

Reported by: b@…
Owned by: thomasb
Priority: 5
Milestone: 0.1-rc1
Component: Client Scripts
Version: git-master
Severity: normal
Keywords:
Cc:

Description

When I add the sender of a message to my contacts, the name is saved with a broken charset.
On the message view screen I press + to add the person to my contacts, but in my address book the name looks like http://forum.roundcube.ru/index.php?act=Attach&type=post&id=16 (first line).

Change History (6)

comment:1 Changed 7 years ago by b@…

  • Summary changed from Chaset bug in contacts to Charset bug in contacts

comment:2 Changed 7 years ago by tcat

  • Version changed from 0.1-beta2 to svn-trunk

Yes! I have the same problem with Hungarian characters. I think the problem is in rcube_imap.inc. PHP can't handle multibyte Unicode characters (not sure, but you can compile 5.2 with full Unicode support).

str_replace, strpos, stripslashes and preg_replace (used many times in rcube_imap.inc in _parse_address_list() and decode_address_list(), which are called from steps/mail/addcontact.inc) work on bytes and can split multibyte characters. Maybe custom functions could be used instead of the built-in ones.
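To illustrate what splitting a multibyte character looks like, here is a minimal sketch. It assumes the name arrives as UTF-8; the sample name and the helper is_valid_utf8() are hypothetical and not part of RoundCube.

```php
<?php
// Hypothetical sample: a Hungarian name stored as UTF-8 bytes.
// "Erdős": the "ő" is U+0151, encoded as the two bytes C5 91.
$name = "Erd\xC5\x91s";

// strlen() counts bytes, so it reports 6 although there are 5 characters.
$byteLength = strlen($name);

// Cutting after 4 bytes lands in the middle of "ő" and leaves a lone lead byte.
$cut = substr($name, 0, 4);

// preg_match() with the /u modifier fails on invalid UTF-8 input, which makes
// it usable as a validity check without the mbstring extension.
function is_valid_utf8($s) {
  return preg_match('//u', $s) === 1;
}
```

Here is_valid_utf8($name) holds while is_valid_utf8($cut) does not, which is the kind of damage a byte-based cut can cause.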

I wrote some examples:

/**
 * Count the amount of characters in a UTF-8 string. This is less than or
 * equal to the byte count.
 */
function unicode_strlen($text) {
  if (function_exists('mb_strlen')) {
    // Pass the encoding explicitly; otherwise mb_strlen() uses the internal encoding.
    return mb_strlen($text, 'UTF-8');
  }
  else {
    // Do not count UTF-8 continuation bytes.
    return strlen(preg_replace("/[\x80-\xBF]/", '', $text));
  }
}
/**
 * Cut off a piece of a string based on character indices and counts. Follows
 * the same behaviour as PHP's own substr() function.
 *
 * Note that for cutting off a string at a known character/substring
 * location, the usage of PHP's normal strpos/substr is safe and
 * much faster.
 */
function unicode_substr($text, $start, $length = NULL) {
  if (function_exists('mb_substr')) {
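    // Pass the encoding explicitly; with no $length, read up to the end of the string.
    return $length === NULL ? mb_substr($text, $start, mb_strlen($text, 'UTF-8'), 'UTF-8') : mb_substr($text, $start, $length, 'UTF-8');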
  }
  else {
    $strlen = strlen($text);
    // Find the starting byte offset ($bytes stays 0 when $start == 0)
    $bytes = 0;
    if ($start > 0) {
      // Count all the continuation bytes from the start until we have found
      // $start characters
      $bytes = -1; $chars = -1;
      while ($bytes < $strlen && $chars < $start) {
        $bytes++;
        $c = ord($text[$bytes]);
        if ($c < 0x80 || $c >= 0xC0) {
          $chars++;
        }
      }
    }
    else if ($start < 0) {
      // Count all the continuation bytes from the end until we have found
      // abs($start) characters
      $start = abs($start);
      $bytes = $strlen; $chars = 0;
      while ($bytes > 0 && $chars < $start) {
        $bytes--;
        $c = ord($text[$bytes]);
        if ($c < 0x80 || $c >= 0xC0) {
          $chars++;
        }
      }
    }
    $istart = $bytes;

    // Find the ending byte offset
    if ($length === NULL) {
      $bytes = $strlen - 1;
    }
    else if ($length > 0) {
      // Count all the continuation bytes from the starting index until we have
      // found $length + 1 characters. Then backtrack one byte.
      $bytes = $istart; $chars = 0;
      while ($bytes < $strlen && $chars < $length) {
        $bytes++;
        $c = ord($text[$bytes]);
        if ($c < 0x80 || $c >= 0xC0) {
          $chars++;
        }
      }
      $bytes--;
    }
    else if ($length < 0) {
      // Count all the continuation bytes from the end until we have found
      // abs($length) characters
      $length = abs($length);
      $bytes = $strlen - 1; $chars = 0;
      while ($bytes >= 0 && $chars < $length) {
        $c = ord($text[$bytes]);
        if ($c < 0x80 || $c >= 0xC0) {
          $chars++;
        }
        $bytes--;
      }
    }
    else {
      // $length == 0: return an empty string, as substr() does.
      return '';
    }
    $iend = $bytes;

    return substr($text, $istart, max(0, $iend - $istart + 1));
  }
}


/**
 * Decode all HTML entities (including numerical ones) to regular UTF-8 bytes.
 * Double-escaped entities will only be decoded once ("&amp;lt;" becomes "&lt;", not "<").
 *
 * @param $text
 *   The text to decode entities in.
 * @param $exclude
 *   An array of characters which should not be decoded. For example,
 *   array('<', '&', '"'). This affects both named and numerical entities.
 */
function decode_entities($text, $exclude = array()) {
  static $table;
  // We store named entities in a table for quick processing.
  if (!isset($table)) {
    // Get all named HTML entities, requesting UTF-8 explicitly (PHP 5.3.4+) so
    // the result does not depend on the PHP version's default charset.
    $table = array_flip(get_html_translation_table(HTML_ENTITIES, ENT_COMPAT, 'UTF-8'));
    // Add apostrophe (XML)
    $table['&apos;'] = "'";
  }
  $newtable = array_diff($table, $exclude);

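  // Use a regexp to select all entities in one pass, to avoid decoding double-escaped entities twice.
  // preg_replace_callback() is used because the /e modifier was removed in PHP 7.
  return preg_replace_callback('/&(#x?)?([A-Za-z0-9]+);/', function ($m) use ($newtable, $exclude) {
    return _decode_entities($m[1], $m[2], $m[0], $newtable, $exclude);
  }, $text);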
}

/**
 * Helper function for decode_entities
 */
function _decode_entities($prefix, $codepoint, $original, &$table, &$exclude) {
  // Named entity
  if (!$prefix) {
    if (isset($table[$original])) {
      return $table[$original];
    }
    else {
      return $original;
    }
  }
  // Hexadecimal numerical entity
  if ($prefix == '#x') {
    $codepoint = hexdec($codepoint);
  }
  // Decimal numerical entity (the int cast always reads the digits as base 10)
  else {
    $codepoint = (int) $codepoint;
  }
  // Encode codepoint as UTF-8 bytes
  if ($codepoint < 0x80) {
    $str = chr($codepoint);
  }
  else if ($codepoint < 0x800) {
    $str = chr(0xC0 | ($codepoint >> 6))
         . chr(0x80 | ($codepoint & 0x3F));
  }
  else if ($codepoint < 0x10000) {
    $str = chr(0xE0 | ( $codepoint >> 12))
         . chr(0x80 | (($codepoint >> 6) & 0x3F))
         . chr(0x80 | ( $codepoint       & 0x3F));
  }
  else if ($codepoint < 0x200000) {
    $str = chr(0xF0 | ( $codepoint >> 18))
         . chr(0x80 | (($codepoint >> 12) & 0x3F))
         . chr(0x80 | (($codepoint >> 6)  & 0x3F))
         . chr(0x80 | ( $codepoint        & 0x3F));
  }
  else {
    // Codepoint is too large to encode; leave the entity as it is.
    return $original;
  }
  // Check for excluded characters
  if (in_array($str, $exclude)) {
    return $original;
  }
  else {
    return $str;
  }
}

Replacing the built-in functions can fix your problem, but without the mbstring extension these functions will run much slower than the built-in ones.
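As a worked example of the bit arithmetic in _decode_entities(), the sketch below isolates the two-byte case. The helper name utf8_two_byte() is hypothetical and only handles codepoints from 0x80 to 0x7FF.

```php
<?php
// Hypothetical helper mirroring the two-byte branch of _decode_entities().
// Only valid for codepoints from 0x80 to 0x7FF.
function utf8_two_byte($codepoint) {
  $lead = 0xC0 | ($codepoint >> 6);    // 110xxxxx: the top 5 bits
  $cont = 0x80 | ($codepoint & 0x3F);  // 10xxxxxx: the low 6 bits
  return chr($lead) . chr($cont);
}

// "ő" is U+0151: 0x151 >> 6 = 0x05 and 0x151 & 0x3F = 0x11,
// so the encoded bytes are 0xC5 0x91.
$encoded = utf8_two_byte(0x151);
```

The second byte always falls in the range 0x80 to 0xBF, which is why the fallback functions can recognise continuation bytes by that range alone.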

comment:3 Changed 6 years ago by thomasb

  • Milestone set to 0.1-rc1
  • Owner set to thomasb
  • Status changed from new to assigned

Related: #1484217 (with Screenshots)

comment:4 Changed 6 years ago by mattenklicker

I added the following to program/steps/mail/addcontact.inc after line 29 ($contact = $contact_arr[1];):

        if ($contact['name'])
        {
                $contact['name']=utf8_decode($contact['name']);
        }

This works for me.
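A note on this workaround: utf8_decode() converts UTF-8 to ISO-8859-1, so it can only preserve characters that exist in that charset. The sketch below uses hypothetical sample strings to show the difference; utf8_decode() is deprecated as of PHP 8.2, so this is an illustration rather than a recommendation.

```php
<?php
// "á" (U+00E1) exists in ISO-8859-1 and becomes the single byte 0xE1.
$latin = utf8_decode("\xC3\xA1");

// "ő" (U+0151) does not exist in ISO-8859-1 and is replaced by "?".
$lost = utf8_decode("\xC5\x91");
```

So names containing only Western European accents may survive, while Hungarian "ő" and "ű" would be lost.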

comment:5 Changed 6 years ago by thomasb

Related bug: #1484329

comment:6 Changed 6 years ago by thomasb

  • Resolution set to fixed
  • Status changed from assigned to closed

Fixed in trunk ([f1154163])
