UTF-8 and strlen()
Trying to find out the length of a string and wondering why the values are often wrong?
UTF-8 characters can be multi-byte, and strlen() returns the length of the string in bytes, which means the string ååå would actually have a “length” of 6.
One solution is to use the multibyte function mb_strlen() instead. It requires PHP’s mbstring extension, which is not part of every build but seems to be included in most recent packaged versions.
E.g.
$value = "ÅØÆbob";
echo strlen($value); // 9
echo mb_strlen($value, 'UTF-8'); // 6
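If you can't be sure mbstring is available, a guarded wrapper is one option. The following is only a sketch: utf8Length() is a made-up helper name, and the regex fallback is my own assumption about a reasonable substitute, not something PHP provides under that name.

```php
<?php
// Hypothetical helper (not part of PHP): character count of a UTF-8 string.
function utf8Length(string $s): int
{
    if (extension_loaded('mbstring')) {
        return mb_strlen($s, 'UTF-8');
    }

    // Fallback: count code points with a UTF-8 aware regex.
    // preg_match_all() returns false if the input is not valid UTF-8.
    $count = preg_match_all('/./us', $s);
    if ($count === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $count;
}

// Assumes this file is saved as UTF-8.
$length = utf8Length("ÅØÆbob"); // 6
```

Note that the two paths treat invalid UTF-8 differently: the fallback throws, while mb_strlen() still returns a number.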
Comments(12)
The same goes for substr, strpos, etc.
All have mb_* counterparts for multibyte strings.
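A quick side-by-side of the byte-based functions and their mb_* counterparts, as a rough sketch. It assumes the file is saved as UTF-8 and that mbstring is loaded; the variable names are just for illustration.

```php
<?php
// Assumes this file is saved as UTF-8 and the mbstring extension is loaded.
$value = "ÅØÆbob";

// Byte-oriented functions: each of Å, Ø and Æ takes two bytes in UTF-8.
$byteLength = strlen($value);        // 9
$bytePos    = strpos($value, 'bob'); // 6
$byteSlice  = substr($value, 0, 3);  // "Å" plus the first byte of "Ø" (broken UTF-8)

// mb_* counterparts work in characters when told the encoding.
$charLength = mb_strlen($value, 'UTF-8');           // 6
$charPos    = mb_strpos($value, 'bob', 0, 'UTF-8'); // 3
$charSlice  = mb_substr($value, 0, 3, 'UTF-8');     // "ÅØÆ"
```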
Another point – the Symfony validators take the above into account, so if you have a max length for example it will be correctly validated.
The best way to validate string length is the isset() function, not strlen().
Hi Mike,
I’m not sure why you think that would work, since isset() only returns true or false.
http://php.net/isset
Perhaps you have misunderstood? Or maybe there is another way to use isset() which you can elaborate on?
Hi Russ,
strlen() is one of the most expensive functions, so if you want to find out whether a string has more than 100 chars you can do:
if (isset($string{100})) {
    echo 'String is longer than 100 chars';
}
isset() is not a function, it’s a language construct, so it is faster than a regular function call, and isset() doesn’t have to count the whole $string character by character.
How does that work with multi-byte characters though?
I would assume that isset($string{100}) is counting bytes, which would cause problems with our UTF-8 dilemma.
Perhaps combined with a call to utf8_decode() it could work, but then that could be more expensive than using mb_strlen()…
All interesting though, I’ve never seen isset() used like that, and I’m surprised that it is not mentioned in the comments on the PHP site on either isset() or strlen().
Interesting use of isset(), like Russ said. Never seen that one before. It made me have a look and try.
I tested with my document saved as ISO-8859-15 (Norwegian) and as UTF-8. The results:
$fu = 'æøå';
echo strlen($fu); // 8859-15: length 3, UTF-8: length 6
echo mb_strlen($fu); // 8859-15: length 3, UTF-8: length 6 (<-- shouldn't this be 3?)
echo (isset($fu[4])) ? 'yes' . $fu[4] : 'no'; // 8859-15: 'no', UTF-8: 'yesÃ'
So, honestly, it didn’t work either way … Maybe my PHP isn’t compiled with mb_*** support?
btw, I tried with both (isset($fu{4})) and (isset($fu[4])), same results.
You should try
echo mb_strlen($fu, 'UTF-8');
I had to specify the encoding because it didn’t automatically recognise it…
Yeah, I figured as much, just didn’t want to spam your comments
mb_strlen works, but with multibyte strings, using isset() is a no go.
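For what it’s worth, the two ideas can be combined. Since a UTF-8 string never has fewer bytes than characters, isset() can serve as a cheap early exit before the real character count. This is only a sketch: isLongerThan() is a made-up helper, and it assumes $n is not negative and that mbstring is loaded.

```php
<?php
// Hypothetical helper (not part of PHP or mbstring).
// Returns true if $s holds more than $n UTF-8 characters. Assumes $n >= 0.
function isLongerThan(string $s, int $n): bool
{
    // Bytes >= characters, so a string of $n bytes or fewer
    // cannot contain more than $n characters.
    if (!isset($s[$n])) {
        return false;
    }
    // Enough bytes are present, so count the actual characters.
    return mb_strlen($s, 'UTF-8') > $n;
}

// Assumes this file is saved as UTF-8: 3 characters, 6 bytes.
$fu = 'æøå';

$byteCheck = isset($fu[4]);        // true: byte offset 4 exists
$charCheck = isLongerThan($fu, 4); // false: there are only 3 characters
```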
thanks, man! phew, finally fixed my length!
isset() is a language construct and not a function, so it is 2-10 times faster than the strlen() function.
$foo = 'same_string';
isset($foo{4}) == isset($foo[4]), but in PHP6 the correct form is isset($foo[4]).