Encoding UTF8/ISO8859-1
From MapbenderWiki
interested parties
Abstract
Incoming and outgoing data arrives in varying encodings and is not yet handled consistently. To improve stability, Mapbender needs more robust encoding handling.
handling in PHP
Example: loading an XML document (such as a capabilities document)
xml_parser_create
Most of the XML parsing in Mapbender is done by this function. However, the optional encoding parameter of xml_parser_create behaves differently in PHP 4 and PHP 5. From php.net:
"The optional encoding specifies the character encoding for the input/output in PHP 4. Starting from PHP 5, the input encoding is automatically detected, so that the encoding parameter specifies only the output encoding. In PHP 4, the default output encoding is the same as the input charset. If empty string is passed, the parser attempts to identify which encoding the document is encoded in by looking at the heading 3 or 4 bytes. In PHP 5.0.0 and 5.0.1, the default output charset is ISO-8859-1, while in PHP 5.0.2 and upper is UTF-8. The supported encodings are ISO-8859-1, UTF-8 and US-ASCII."
But even in PHP 5 this works only in theory. When I load a UTF-8 XML document and want it converted to ISO-8859-1, the XML parser halts with the error message:
Invalid character in line xxx
As a workaround, I have to convert the document with utf8_decode() first.
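A possible alternative to passing the encoding to xml_parser_create is to create the parser without an encoding argument and request the output encoding through xml_parser_set_option with XML_OPTION_TARGET_ENCODING. The following is a minimal sketch under that assumption. The sample document and variable names are made up for illustration, and it has not been tested against Mapbender's parsing code or on PHP 4.

```php
<?php
// Hypothetical sample: a UTF-8 document whose title contains "ö" (bytes C3 B6).
$xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Title>K\xC3\xB6ln</Title>";

// No encoding argument here; the output encoding is requested separately.
$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'ISO-8859-1');

$values = array();
$index = array();
$ok = xml_parse_into_struct($parser, $xml, $values, $index);

if ($ok !== 1) {
    // Same kind of message as the parser errors quoted in this section.
    $title = null;
    $error = xml_error_string(xml_get_error_code($parser))
        . ' in line ' . xml_get_current_line_number($parser);
} else {
    // Character data is handed back in the requested target encoding.
    $title = $values[0]['value'];
    $error = null;
}
```

If this behaves as expected on the PHP versions Mapbender supports, the explicit utf8_decode() call before parsing might become unnecessary, but that would need to be verified with the demo capabilities documents.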
utf8_decode
This works. I have written some functions in class_administration.php; char_encode($data) is supposed to detect the encoding of $data and convert it to CHARSET (defined in mapbender.conf).
// Returns true if $string contains at least one valid multi-byte UTF-8 sequence.
// Heuristic only: pure ASCII input returns false, although it is also valid UTF-8.
function is_utf8_string($string) {
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]               # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]          # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]          # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}       # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}           # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}       # plane 16
    )+%xs', $string) === 1;
}

// Returns true if the XML declaration names UTF-8.
// Accepts single or double quotes and optional whitespace around "=".
function is_utf8_xml($xml) {
    return preg_match('/<\?xml[^>]+encoding\s*=\s*["\']utf-8["\'][^>]*\?>/is', $xml) === 1;
}

function is_utf8($data) {
    return ($this->is_utf8_xml($data) || $this->is_utf8_string($data));
}

// Converts $data to CHARSET, assuming it is either UTF-8 or ISO-8859-1.
function char_encode($data) {
    if (CHARSET == "UTF-8") {
        if (!$this->is_utf8($data)) {
            $e = new mb_notice("Conversion: ISO-8859-1 to UTF-8");
            return utf8_encode($data);
        }
    }
    else {
        if ($this->is_utf8($data)) {
            $e = new mb_notice("Conversion: UTF-8 to ISO-8859-1");
            return utf8_decode($data);
        }
    }
    // Data is already in the target encoding.
    $e = new mb_notice("no conversion: is " . CHARSET);
    return $data;
}
Is there an easier way to determine the encoding of an XML document?
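One simpler option might be to read the encoding name from the XML declaration instead of inspecting byte patterns. The sketch below assumes a hypothetical helper, detect_xml_encoding(), which is not part of Mapbender. It falls back to UTF-8 when no encoding is declared, which is the default the XML specification gives for documents without a declaration or byte order mark. It does not handle UTF-16 input.

```php
<?php
// Hypothetical helper (not part of Mapbender): returns the encoding named in
// the XML declaration in upper case, or "UTF-8" if none is declared.
function detect_xml_encoding($xml) {
    // Optional UTF-8 byte order mark, then <?xml ... encoding="name" or 'name'.
    $pattern = '/^(?:\xEF\xBB\xBF)?\s*<\?xml\b[^>]*?\bencoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\1/';
    if (preg_match($pattern, $xml, $m) === 1) {
        return strtoupper($m[2]);
    }
    return 'UTF-8';
}

// Usage with made-up sample documents.
$declared = detect_xml_encoding('<?xml version="1.0" encoding="ISO-8859-1"?><a/>');
$fallback = detect_xml_encoding('<a/>');
```

This trusts the declaration, so a document that declares one encoding but contains bytes in another would still be misreported. A byte-level check such as is_utf8_string() could be kept as a second opinion.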
iconv
I tried this:
$data = iconv("UTF-8", "ISO-8859-1", $data);
The XML parser then halts with the following error message:
Invalid document end in line xxx
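The cause of this error has not been confirmed. Two guesses, both assumptions: iconv() may have failed on a byte sequence it could not convert (it then returns false, and some PHP versions are documented as returning a truncated string), or the converted document still declared encoding="UTF-8" while its bytes were ISO-8859-1. The sketch below guards against both. convert_xml_encoding() is a hypothetical helper, not an existing Mapbender function.

```php
<?php
// Hypothetical helper (not part of Mapbender): convert an XML string with
// iconv and keep the encoding named in its declaration consistent.
// Returns false if iconv cannot convert the input.
function convert_xml_encoding($xml, $from, $to) {
    // Suppress the E_NOTICE; failure is reported through the return value.
    $converted = @iconv($from, $to, $xml);
    if ($converted === false) {
        return false;
    }
    // Rewrite encoding="..." in the declaration, if there is one.
    return preg_replace(
        '/^(<\?xml\b[^>]*?\bencoding\s*=\s*)(["\'])[^"\']*\2/',
        '${1}${2}' . $to . '${2}',
        $converted,
        1
    );
}

// Usage with a made-up UTF-8 sample document.
$utf8 = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<Title>K\xC3\xB6ln</Title>";
$latin1 = convert_xml_encoding($utf8, 'UTF-8', 'ISO-8859-1');
```

Checking the return value before handing the data to the parser at least separates a conversion failure from a genuine parse error.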
multi-byte string functions
http://de2.php.net/manual/en/function.mb-convert-encoding.php
I have not tried this yet.
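For reference, a conversion with the mbstring extension could look like the following sketch. It is untested in Mapbender and assumes, like char_encode(), that the input is either UTF-8 or ISO-8859-1. to_charset() is a hypothetical name.

```php
<?php
// Hypothetical helper (not part of Mapbender): convert $data to $target,
// choosing between UTF-8 and ISO-8859-1 as the source encoding.
function to_charset($data, $target) {
    // Any byte sequence is acceptable as ISO-8859-1, so UTF-8 validity
    // is the only test that can tell the two apart.
    $source = mb_check_encoding($data, 'UTF-8') ? 'UTF-8' : 'ISO-8859-1';
    if ($source === $target) {
        return $data;
    }
    return mb_convert_encoding($data, $target, $source);
}

// Usage with made-up sample strings.
$toLatin1 = to_charset("K\xC3\xB6ln", 'ISO-8859-1'); // input is valid UTF-8
$toUtf8   = to_charset("K\xF6ln", 'UTF-8');          // input is not valid UTF-8
```

Unlike is_utf8_string(), mb_check_encoding() validates the whole string, so pure ASCII input counts as UTF-8 and is passed through unchanged when the target is UTF-8.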
(Read more about xml_parser_create here)
demo data for testing
UTF-8 capabilities document
ISO-8859-1 capabilities document
Estimated number of calls to functions that are not multi-byte safe
- strlen 98
- strpos 30
- strrpos 6
- substr 239
- strtolower 57
- strtoupper 234
- ereg 6
- eregi 2
- ereg_replace 6
- eregi_replace 8
- split 149
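To illustrate why these calls matter, the sketch below compares a few of the byte-oriented functions with their mbstring counterparts on a UTF-8 string. The sample string is made up. Passing the encoding explicitly avoids depending on mbstring.internal_encoding.

```php
<?php
$text = "K\xC3\xB6ln"; // "Köln" encoded as UTF-8: 4 characters, 5 bytes

$byteLength = strlen($text);                 // counts bytes
$charLength = mb_strlen($text, 'UTF-8');     // counts characters

$byteCut = substr($text, 0, 2);              // cuts through the 2-byte "ö"
$charCut = mb_substr($text, 0, 2, 'UTF-8');  // cuts on a character boundary

$upper = mb_strtoupper($text, 'UTF-8');      // upper-cases "ö" as well

// Regular expressions: the /u modifier makes "." match one character
// instead of one byte.
$fourBytes = preg_match('/^.{4}$/', $text);
$fourChars = preg_match('/^.{4}$/u', $text);
```

The ereg family and split were removed in PHP 7, so those calls need replacing with preg_* functions regardless of encoding. Using the /u modifier there is one possible way to keep them multi-byte aware when CHARSET is UTF-8.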
Test case for multi-byte functions
<?php
// Note: the byte content of these literals depends on the encoding this file is saved in.
$texts = array('öäü', 'aä');
$encodings = array('ISO-8859-1', 'UTF-8', 'ASCII');

foreach ($encodings as $enc) {
    // Switch all three settings to the encoding under test.
    ini_set('default_charset', $enc);
    ini_set('mbstring.internal_encoding', $enc);
    ini_set('iconv.internal_encoding', $enc);

    echo "default_charset: ".ini_get('default_charset')."\n";
    echo "mbstring.internal_encoding: ".ini_get('mbstring.internal_encoding')."\n";
    echo "iconv.internal_encoding: ".ini_get('iconv.internal_encoding')."\n";

    // Compare the byte count (strlen) with the character counts (mb_strlen, iconv_strlen).
    foreach ($texts as $text) {
        echo "strlen($text): ".strlen($text)."\n";
        echo "mb_strlen($text): ".mb_strlen($text)."\n";
        echo "iconv_strlen($text): ".iconv_strlen($text)."\n";
    }
    echo "\n";
}
?>

