Encoding UTF8/ISO8859-1

From MapbenderWiki

interested parties

User:Christoph Baudson

User:Uli Rothstein

Abstract

Mapbender still handles incoming and outgoing data in varying encodings inconsistently. To improve stability, we need more systematic encoding handling.

handling in PHP

Example: loading an XML document (such as a capabilities document)

xml_parser_create

Most of the XML parsing in Mapbender is done by this function. However, the optional encoding parameter of xml_parser_create behaves differently in PHP 4 and PHP 5. From php.net:

"The optional encoding specifies the character encoding for the input/output in PHP 4. 
Starting from PHP 5, the input encoding is automatically detected, so that the encoding 
parameter specifies only the output encoding. In PHP 4, the default output encoding is 
the same as the input charset. If empty string is passed, the parser attempts to 
identify which encoding the document is encoded in by looking at the heading 3 or 4 
bytes. In PHP 5.0.0 and 5.0.1, the default output charset is ISO-8859-1, while in PHP 
5.0.2 and upper is UTF-8. The supported encodings are ISO-8859-1, UTF-8 and US-ASCII."

But even in PHP 5 this works only in theory. When I load an XML document in UTF-8 and want to convert it to ISO-8859-1, the XML parser halts with the error message

Invalid character in line xxx

I have to convert it via utf8_decode() first.
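
For reference, the behaviour described in the php.net quote can be exercised with a small script. This is a minimal sketch, not Mapbender code: the function name parse_to_latin1 and the sample document are assumptions made for illustration. It requests ISO-8859-1 output through xml_parser_set_option() with XML_OPTION_TARGET_ENCODING and leaves input detection to the parser. Whether it avoids the error reported above is likely to depend on the PHP version and the underlying XML library.

```php
<?php
// Hypothetical helper (not part of Mapbender): parse an XML string and return
// the parsed values in ISO-8859-1, letting the parser detect the input encoding.
function parse_to_latin1($xml) {
    $parser = xml_parser_create();  // no encoding argument: input is auto-detected in PHP 5+
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'ISO-8859-1');
    $values = array();
    $index = array();
    if (!xml_parse_into_struct($parser, $xml, $values, $index)) {
        // Build a message of the same kind as the one quoted in the text
        $msg = xml_error_string(xml_get_error_code($parser))
             . ' in line ' . xml_get_current_line_number($parser);
        return array('error' => $msg, 'values' => array());
    }
    return array('error' => null, 'values' => $values);
}

// Sample UTF-8 document; "\xC3\xBC" is the UTF-8 byte sequence for "ü"
$xml = '<?xml version="1.0" encoding="UTF-8"?><Title>M' . "\xC3\xBC" . 'nchen</Title>';
$result = parse_to_latin1($xml);
if ($result['error'] === null) {
    // A hex dump makes the byte encoding of the value visible
    echo bin2hex($result['values'][0]['value']), "\n";
} else {
    echo $result['error'], "\n";
}
```

If the element value comes back with a single byte for "ü", the parser has done the conversion itself and no separate utf8_decode() call is needed for that document.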

utf8_decode

This works. I have written some functions in class_administration.php; char_encode($data) is supposed to detect the encoding of $data and convert it to CHARSET (as set in mapbender.conf).

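// Returns 1 if $string contains at least one well-formed multi-byte UTF-8
// sequence, 0 otherwise. Pure ASCII yields 0, and because the pattern is not
// anchored, invalid bytes elsewhere in $string go unnoticed.
function is_utf8_string($string) {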
 return preg_match('%(?:
 [\xC2-\xDF][\x80-\xBF]               # non-overlong 2-byte
 |\xE0[\xA0-\xBF][\x80-\xBF]          # excluding overlongs
 |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte
 |\xED[\x80-\x9F][\x80-\xBF]          # excluding surrogates
 |\xF0[\x90-\xBF][\x80-\xBF]{2}       # planes 1-3
 |[\xF1-\xF3][\x80-\xBF]{3}           # planes 4-15
 |\xF4[\x80-\x8F][\x80-\xBF]{2}       # plane 16
 )+%xs', $string);
}
	
// Returns 1 if the XML declaration names UTF-8 (single or double quotes).
function is_utf8_xml($xml) {
 return preg_match('/<\?xml[^>]+encoding\s*=\s*["\']utf-8["\'][^>]*\?>/is', $xml);
}
	
function is_utf8($data) {
 return ($this->is_utf8_xml($data) || $this->is_utf8_string($data));
}
	 
function char_encode($data) {
 if (CHARSET == "UTF-8") {
  if (!$this->is_utf8($data)) {
   $e = new mb_notice("Conversion: ISO-8859-1 to UTF-8");
   return utf8_encode($data);
  }
 }
 else {
  if ($this->is_utf8($data)) {
   $e = new mb_notice("Conversion: UTF-8 to ISO-8859-1");
   return utf8_decode($data);
  }
 }
 $e = new mb_notice("no conversion: is " . CHARSET);
 return $data;
}

Is there an easier way to determine the encoding of an XML document?
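
One possibly simpler approach to this question, sketched under assumptions: read the encoding name from the XML declaration if there is one, and otherwise test whether the data is valid UTF-8. The function name detect_xml_encoding is hypothetical, and the fallback to ISO-8859-1 is an assumption that only holds if UTF-8 and ISO-8859-1 are the only candidates.

```php
<?php
// Hypothetical helper: guess the encoding of an XML string.
function detect_xml_encoding($xml) {
    // 1. Trust the XML declaration if it names an encoding
    //    (optional UTF-8 byte order mark, single or double quotes).
    $pattern = '/^(?:\xEF\xBB\xBF)?\s*<\?xml[^>]*\sencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']/';
    if (preg_match($pattern, $xml, $m)) {
        return strtoupper($m[1]);
    }
    // 2. No declared encoding: with the /u modifier preg_match() fails on
    //    input that is not valid UTF-8, so a match means valid UTF-8.
    if (preg_match('//u', $xml) === 1) {
        return 'UTF-8';
    }
    // 3. Assumption: everything else is treated as ISO-8859-1.
    return 'ISO-8859-1';
}

echo detect_xml_encoding('<?xml version="1.0" encoding="ISO-8859-1"?><a/>'), "\n";
echo detect_xml_encoding("<a>\xC3\xA4</a>"), "\n";
echo detect_xml_encoding("<a>\xE4</a>"), "\n";
```

Pure ASCII input without a declaration is reported as UTF-8, which is harmless because ASCII bytes are identical in both encodings.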

iconv

I tried this:

$data = iconv("UTF-8", "ISO-8859-1", $data);

The XML parser halts with the following error message

Invalid document end in line xxx
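
One possible explanation, which is an assumption and has not been verified against the failing document: after iconv() the bytes are ISO-8859-1 while the XML declaration still announces UTF-8, and iconv() may fail or return a shortened string when it meets a character that ISO-8859-1 cannot represent. A minimal sketch that checks the result and rewrites the declaration (the function name xml_utf8_to_latin1 is hypothetical):

```php
<?php
// Hypothetical helper: convert a UTF-8 XML string to ISO-8859-1 and keep the
// XML declaration in sync with the new byte encoding.
// Returns the converted string, or false if the conversion was incomplete.
function xml_utf8_to_latin1($xml) {
    $converted = @iconv('UTF-8', 'ISO-8859-1', $xml);
    if ($converted === false) {
        return false;
    }
    // ISO-8859-1 uses one byte per character, so a complete conversion has as
    // many bytes as the UTF-8 input has characters.
    if (strlen($converted) !== iconv_strlen($xml, 'UTF-8')) {
        return false;
    }
    // Rewrite the first encoding attribute of the XML declaration, if present.
    return preg_replace(
        '/(<\?xml[^>]*\sencoding\s*=\s*["\'])[^"\']+(["\'])/i',
        '${1}ISO-8859-1${2}',
        $converted,
        1
    );
}

$utf8 = '<?xml version="1.0" encoding="UTF-8"?><a>' . "\xC3\xA4" . '</a>';
$latin1 = xml_utf8_to_latin1($utf8);
echo $latin1 === false ? "conversion failed\n" : bin2hex($latin1) . "\n";
```

A return value of false tells the caller to keep the original data instead of handing a damaged document to the XML parser.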

multi-byte string functions

http://de2.php.net/manual/en/function.mb-convert-encoding.php

I haven't tried it yet.
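
For comparison, a sketch of the same conversion with the multi-byte string functions. It has not been tested inside Mapbender and assumes the mbstring extension is loaded; the helper name to_charset is hypothetical. Note that mb_convert_encoding() takes its arguments in the order (data, target encoding, source encoding), the reverse of iconv().

```php
<?php
// Hypothetical helper: convert $data to $target, assuming the input is either
// UTF-8 or ISO-8859-1 (the two encodings discussed on this page).
function to_charset($data, $target) {
    // mb_check_encoding() returns true if $data is valid in the given encoding
    $source = mb_check_encoding($data, 'UTF-8') ? 'UTF-8' : 'ISO-8859-1';
    if ($source === $target) {
        return $data;  // already in the target encoding
    }
    return mb_convert_encoding($data, $target, $source);
}

$latin1 = to_charset("K\xC3\xB6ln", 'ISO-8859-1');  // UTF-8 input ("Köln")
$utf8   = to_charset("K\xF6ln", 'UTF-8');           // ISO-8859-1 input ("Köln")
echo bin2hex($latin1), "\n", bin2hex($utf8), "\n";
```

This mirrors what char_encode() does with utf8_encode() and utf8_decode(), with the target encoding passed as a parameter instead of being read from CHARSET.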


(Read more about xml_parser_create here)

demo data for testing

UTF-8 capabilities document

http://wms1.ccgis.de/cgi-bin/mapserv410?map=/data/umn/germany/germany_utf8_group.map&REQUEST=GetCapabilities&SERVICE=WMS&VERSION=1.1.1

ISO-8859-1 capabilities document

http://wms1.ccgis.de/cgi-bin/mapserv?map=/data/umn/germany/germany.map&&VERSION=1.1.1&REQUEST=GetCapabilities&SERVICE=WMS


Estimated number of calls to functions that are not multi-byte clean

  • strlen 98
  • strpos 30
  • strrpos 6
  • substr 239
  • strtolower 57
  • strtoupper 234
  • ereg 6
  • eregi 2
  • ereg_replace 6
  • eregi_replace 8
  • split 149
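
To illustrate why these calls matter: the byte-oriented functions count and cut bytes, so their results differ from those of the mb_* counterparts as soon as a UTF-8 string contains a non-ASCII character. A small sketch, assuming the mbstring extension is available and passing the encoding explicitly instead of relying on ini settings (the function name compare_byte_and_mb is hypothetical):

```php
<?php
// Hypothetical helper: run byte-oriented and multi-byte functions on the same
// UTF-8 string and return the results side by side (string results as hex).
function compare_byte_and_mb($text) {
    return array(
        'strlen'        => strlen($text),                              // number of bytes
        'mb_strlen'     => mb_strlen($text, 'UTF-8'),                  // number of characters
        'substr'        => bin2hex(substr($text, 0, 1)),               // first byte only
        'mb_substr'     => bin2hex(mb_substr($text, 0, 1, 'UTF-8')),   // first character
        'mb_strtoupper' => bin2hex(mb_strtoupper($text, 'UTF-8')),     // upper-cased, still UTF-8
    );
}

// "öäü" written as UTF-8 byte escapes so the result does not depend on the
// encoding this file is saved in
$text = "\xC3\xB6\xC3\xA4\xC3\xBC";
foreach (compare_byte_and_mb($text) as $name => $value) {
    echo $name, ': ', $value, "\n";
}
```

For the regular expression functions in the list, note that ereg, eregi, ereg_replace, eregi_replace and split were removed in PHP 7; the preg_* functions with the /u modifier are the usual multi-byte aware replacement.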

Test case for multi-byte functions

<?php
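// The literals below are stored in whatever encoding this file is saved in,
// so the byte counts reported by strlen() depend on the file encoding.
$texts = array('öäü', 'aä');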
$encodings = array('ISO-8859-1', 'UTF-8', 'ASCII');
foreach ($encodings as $enc) {
  ini_set('default_charset', $enc);
  ini_set('mbstring.internal_encoding', $enc);
  ini_set('iconv.internal_encoding', $enc);
  echo "default_charset: ".ini_get('default_charset')."\n";
  echo "mbstring.internal_encoding: ".ini_get('mbstring.internal_encoding').
    "\n";
  echo "iconv.internal_encoding: ".ini_get('iconv.internal_encoding')."\n";
  foreach ($texts as $text){
    echo "strlen($text): ".strlen($text)."\n";
    echo "mb_strlen($text): ".mb_strlen($text)."\n";
    echo "iconv_strlen($text): ".iconv_strlen($text)."\n";
  }
  echo "\n";
}
?>