Perl and UTF-8

As with Python, handling [[!wikipedia UTF-8]] strings in Perl is not straightforward, since the language was not originally designed to handle Unicode strings.

Correct length of UTF-8 string

[[!wikipedia ASCII]] characters are coded on 7 bits and stored in one octet. That explains the limited number of different characters and the many different encodings that exist depending on the language. Unicode tries to solve this, and its [[!wikipedia UTF-8]] encoding uses a variable number of octets depending on the character. Some characters are therefore coded on a single octet (the ones present in the [[!wikipedia ASCII]] set), others on two, three or four octets.
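To make the variable width concrete, here is a minimal sketch that counts how many octets a few characters need once encoded in UTF-8. The helper name utf8_octets and the sample code points are only illustrative; encode comes from the standard Encode module.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: number of octets needed to store one character
# in UTF-8. $codepoint is a Unicode code point given as an integer.
sub utf8_octets {
    my ($codepoint) = @_;
    return length(encode("UTF-8", chr($codepoint)));
}

# Sample code points: "e", "e" with an acute accent, the euro sign, an emoji
for my $codepoint (0x65, 0xE9, 0x20AC, 0x1F600) {
    printf "U+%04X: %d octet(s)\n", $codepoint, utf8_octets($codepoint);
}
```

The four lines printed report 1, 2, 3 and 4 octets respectively: only the first character belongs to the ASCII set.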

By default Perl treats a string as a sequence of octets, so the length function only counts the number of octets. That fits perfectly when every character is coded on one octet, but in many languages some commonly used characters are coded on several octets. The reported length is then larger than the number of characters, for example in the following script:

#! /usr/bin/env perl
my $string = "Ceci est une chaîne avec trois caractères codés sur deux octets";
print length($string) . "\n";

$ ./test.pl
66

In fact there are 63 characters in this string.

To get the real string length, the string has to be converted first. The Encode module is used to decode the [[!wikipedia UTF-8]] octets into characters before measuring the length:

#! /usr/bin/env perl
use strict;
use warnings;
use Encode;
my $string = "Ceci est une chaîne avec trois caractères codés sur deux octets";
# Decode the UTF-8 octets into characters before measuring the length
print length(Encode::decode("utf-8", $string)) . "\n";

$ ./test.pl
63

This time the correct size of the string is returned.
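The decoding step can be wrapped in a small function. The sketch below is only one possible way to do it: char_length is a hypothetical name, and passing Encode::FB_CROAK is a choice made here so that malformed input makes decode die instead of being silently replaced.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: count the characters in a string of UTF-8 octets.
# With a CHECK value such as FB_CROAK, decode() may modify its argument,
# which is why it is given the local copy $octets and not the caller's string.
sub char_length {
    my ($octets) = @_;
    return length(decode("UTF-8", $octets, Encode::FB_CROAK));
}

my $octets = "cha\xC3\xAEne";   # "chaîne" written as raw UTF-8 octets
print "octets:     ", length($octets), "\n";
print "characters: ", char_length($octets), "\n";
```

Here the string holds 7 octets but only 6 characters, since the accented letter takes two octets.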

Wide character in print

When printing a [[!wikipedia UTF-8]] string on the console, the following warning may appear:

Wide character in print at test.pl line 3.

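To get rid of it, declare the encoding of the output at the beginning of your script. The encoding pragma was once used for this, but it has been deprecated since Perl 5.18, so setting the layer of STDOUT with binmode is preferable:

binmode(STDOUT, ":encoding(UTF-8)");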
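As a minimal sketch of where the warning comes from and how an output layer silences it, the script below builds a string containing a character above 0xFF and sets a UTF-8 layer on STDOUT with binmode. The variable name and the sample text are only illustrative, and the terminal is assumed to expect UTF-8.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Encode everything printed on STDOUT as UTF-8
binmode(STDOUT, ":encoding(UTF-8)");

# chr(0x20AC) is the euro sign, a character above 0xFF. Without the
# binmode call above, printing it triggers "Wide character in print".
my $price = "42 " . chr(0x20AC);
print $price, "\n";
```

The string is four characters long for Perl, and the layer takes care of turning the euro sign into three octets on output.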

Resources

Wide character in print… warning in Perl