UTF-8 and strlen()

Trying to find out the length of a string and wondering why the values are often wrong?

UTF-8 characters can be multi-byte, and strlen() returns the length of the string in bytes, which means the string ååå would actually have a “length” of 6.

One solution is to use the multibyte function mb_strlen() instead. PHP needs to be built with the mbstring extension for this to be available, but it seems to be a default in later versions.

E.g.

$value = "ÅØÆbob";
echo strlen($value);
// 9
 
echo mb_strlen($value, 'UTF-8');
// 6

12 Comments so far

  1. Bert-Jan on April 16th, 2008

    The same goes for substr, strpos, etc.
    All have mb_* counterparts for multibyte strings.
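
A minimal sketch of the point above, reusing the string from the post. It assumes the mbstring extension is loaded and that the file is saved as UTF-8; the variable names are only for illustration.

```php
<?php
// Assumes mbstring is available and this file is saved as UTF-8.
$value = "ÅØÆbob";

// Byte-oriented functions: offsets and lengths are counted in bytes.
$bytePos   = strpos($value, 'bob');   // 6, because three 2-byte characters come first
$byteSlice = substr($value, 0, 3);    // "Å" plus the first byte of "Ø" (not valid UTF-8)

// Multibyte counterparts: offsets and lengths are counted in characters.
$charPos   = mb_strpos($value, 'bob', 0, 'UTF-8');   // 3
$charSlice = mb_substr($value, 0, 3, 'UTF-8');       // "ÅØÆ"
```

Cutting with substr() in the middle of a character is the usual way broken characters end up in truncated output, so the mb_* versions are the safer choice whenever the offset is meant in characters.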

  2. Russ on April 16th, 2008

    Another point – the Symfony validators take the above into account, so if you have a max length for example it will be correctly validated.

  3. Mike on April 16th, 2008

    The best way to validate string length is with isset(), not strlen().

  4. Russ on April 16th, 2008

    Hi Mike,

    I’m not sure why you think that would work, since isset() only returns true or false.

    http://php.net/isset

    Perhaps you have misunderstood? Or maybe there is another way to use isset() which you can elaborate on :o)

  5. Mike on April 16th, 2008

    Hi Russ,

    strlen() is one of the most expensive functions, so if you want to find out whether a string has more than 100 chars you can do:

    if (isset($string{100})) {
        echo 'String is longer than 100 chars';
    }

    isset() is not a function, it’s a language construct, so it runs faster than ordinary functions, and it doesn’t have to count through $string one char at a time.

  6. Russ on April 16th, 2008

    How does that work with multi-byte characters though?

    I would assume that isset($string{100}) is counting bytes, which would cause problems with our UTF-8 dilemma.

    Perhaps combined with a call to utf8_decode() it could work, but then that could be more expensive than using mb_strlen()…

    All interesting though, I’ve never seen isset() used like that, and I’m surprised that it is not mentioned in the comments on the PHP site on either isset() or strlen().

  7. zegenie on April 17th, 2008

    Interesting use of isset(), like Russ said. Never seen that one before. It made me have a look and try.

    I tested with my document saved as ISO-8859-15 (Norwegian) and as UTF-8. The results:

    $fu = 'æøå';
    echo strlen($fu); // 8859-15: length 3, UTF-8: length 6
    echo mb_strlen($fu); // 8859-15: length 3, UTF-8: length 6 (<-- shouldn't this be 3?)
    echo (isset($fu[4])) ? 'yes' . $fu[4] : 'no'; // 8859-15: 'no', UTF-8: 'yesÃ'

    So, honestly, it didn’t work either way … Maybe my PHP isn’t compiled with mb_*** support?

  8. zegenie on April 17th, 2008

    btw, I tried with both (isset($fu{4})) and (isset($fu[4])), same results.

  9. Russ on April 18th, 2008

    You should try

    echo mb_strlen($fu, 'UTF-8');

    I had to specify the encoding because it didn’t automatically recognise it…
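
A small sketch of why the encoding argument matters, assuming the mbstring extension is loaded. The string is written with \u{...} escapes (PHP 7 and later) so the result does not depend on how the file itself is saved. When the second argument is left out, mb_strlen() uses whatever mb_internal_encoding() currently reports.

```php
<?php
$fu = "\u{00E6}\u{00F8}\u{00E5}"; // "æøå" as UTF-8 bytes

// Explicit encoding on each call.
$explicit = mb_strlen($fu, 'UTF-8');        // 3

// The same bytes read as a single-byte encoding: every byte counts as one character.
$asLatin1 = mb_strlen($fu, 'ISO-8859-1');   // 6

// Alternatively, set the default once; later calls without an encoding use it.
mb_internal_encoding('UTF-8');
$implicit = mb_strlen($fu);                 // 3
```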

  10. zegenie on April 18th, 2008

    Yeah, I figured as much, just didn’t want to spam your comments 😉 mb_strlen works, but with multibyte strings, using isset() is a no go.
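
For comparison, here is one way a character-aware version of the "is it longer than N?" check might look. The helper name is_longer_than() is hypothetical and not something from this thread, and the sketch assumes mbstring is loaded and PHP 7 or later (for the type declarations and \u{...} escapes).

```php
<?php
// Hypothetical helper: true if $string has more than $limit characters,
// counted as UTF-8 characters rather than bytes.
function is_longer_than(string $string, int $limit): bool
{
    return mb_strlen($string, 'UTF-8') > $limit;
}

$fu = "\u{00E6}\u{00F8}\u{00E5}"; // "æøå": 3 characters, 6 bytes in UTF-8

var_dump(isset($fu[4]));           // bool(true): byte offset 4 exists
var_dump(is_longer_than($fu, 4));  // bool(false): there are only 3 characters
var_dump(is_longer_than($fu, 2));  // bool(true)
```

The isset() check answers a question about bytes, so it only matches the character count when every character in the string is a single byte.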

  11. Plamenator on June 25th, 2008

    thanks, man! phew, finally fixed my length! 😀

  12. praca on September 2nd, 2008

    isset() is a language construct and not a function, so it is 2-10 times faster than the strlen() function.

    $foo = 'same_string';

    isset($foo{4}) == isset($foo[4]), but in PHP6 the correct form is isset($foo[4]).
