Perl: Unicode Tutorial 🐪

By Xah Lee.

If you are not familiar with Unicode, first see: Unicode Basics: Character Set, Encoding, UTF-8

use bytes; # Larry can take Unicode and shove it up his ass sideways.
            # Perl 5.8.0 causes us to start getting incomprehensible
            # errors about UTF-8 all over the place without this.

from the source code of WebCollage (1998), by Jamie W Zawinski (born 1968)

Starting about Perl v5.12 (~2010), Unicode support is very good.

When you run a script that processes Unicode, pass the -C option on the command line. For example, perl -CS script.pl sets stdin, stdout, and stderr to UTF-8.
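
A script can also check for itself whether -C was given. The following is a minimal sketch, assuming you only care about stdout: it reads the special variable ${^UNICODE}, which holds the -C flags as a bit mask, and falls back to binmode when the "O" bit (value 2) is not set. The variable name $stdout_is_utf8 is made up for this example.

```perl
# -*- coding: utf-8 -*-
# perl

use strict;
use warnings;
use utf8;

# ${^UNICODE} reflects the -C switch (or PERL_UNICODE) as a bit mask.
# Bit value 2 is the "O" flag: STDOUT is treated as UTF-8.
my $stdout_is_utf8 = (${^UNICODE} & 2) ? 1 : 0;

# If the script was started without -C, set the output layer here instead.
binmode(STDOUT, ':encoding(UTF-8)') unless $stdout_is_utf8;

print "I ♥ you\n";
```

Either way, the heart is written as UTF-8 and Perl does not complain about a wide character.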

If your Perl script is encoded in UTF-8, declare it like this:

use utf8;

With that declaration, you can use Unicode characters in string literals, and also in variable and function names.

# -*- coding: utf-8 -*-
# perl

use strict;
use utf8; # necessary if you want to use Unicode in function or var names

# processing Unicode string
my $s = 'I ★ you';
$s =~ s/★/♥/;
print "$s\n";

# variable with Unicode char
my $愛 = 4;
print "$愛\n";

# function with Unicode char
sub f愛 { return 2;}
print f愛();
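
To see what use utf8 changes for string literals, here is a small sketch that compares the character count of a string with the byte count of its UTF-8 encoding. It uses the core Encode module. The variable names are only for illustration.

```perl
# -*- coding: utf-8 -*-
# perl

use strict;
use warnings;
use utf8;                       # literals in this file are decoded as characters
use Encode qw(encode);

my $chars = 'I ★ you';                  # a string of characters
my $bytes = encode('UTF-8', $chars);    # the same text as UTF-8 bytes

binmode(STDOUT, ':encoding(UTF-8)');
printf "%s: %d characters, %d bytes\n", $chars, length($chars), length($bytes);
# prints: I ★ you: 7 characters, 9 bytes
```

The star takes 3 bytes in UTF-8. Without use utf8, Perl would treat the literal as 9 separate bytes, and length($chars) would be 9.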

Unicode in Perl Identifiers

An identifier cannot contain arbitrary Unicode characters.

# -*- coding: utf-8 -*-
# perl v5.18.2

use strict;
use utf8;

# identifier cannot be arbitrary unicode char

my $😂 = 3;

# error
# Unrecognized character \x{1f602}

# -*- coding: utf-8 -*-
# perl v5.18.2

use strict;
use utf8;

# identifier cannot be arbitrary unicode char

my $♥ = 3;

# error
# Unrecognized character \x{2665}

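The exact rule is complicated. Basically, if Unicode classifies the character as a letter, it is OK. More precisely, under use utf8 an identifier must begin with an underscore or a character that has the XID_Start property, and continue with characters that have the XID_Continue property (see perldata). The heart (U+2665: BLACK HEART SUIT) and the summation sign (U+2211: N-ARY SUMMATION) are symbols, not letters.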
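
You can check a candidate name with a regex before using it. This is a rough sketch that approximates the rule with the XID_Start and XID_Continue properties. It is not the exact pattern Perl's parser uses, and the function name is_identifier_like is made up for this example.

```perl
# -*- coding: utf-8 -*-
# perl

use strict;
use warnings;
use utf8;

# Hypothetical helper: returns 1 if $name looks like a valid identifier
# (underscore or XID_Start first, then XID_Continue), else 0.
# This approximates the rule in perldata, it does not replace it.
sub is_identifier_like {
    my ($name) = @_;
    return $name =~ /\A[_\p{XID_Start}]\p{XID_Continue}*\z/ ? 1 : 0;
}

binmode(STDOUT, ':encoding(UTF-8)');
for my $name ('愛', 'f愛', '♥', '😂', '∑') {
    printf "%s => %s\n", $name, is_identifier_like($name) ? 'ok' : 'not ok';
}
```

The first two names pass. The heart, the emoji, and the summation sign are rejected, which matches the errors shown above.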

Perl Unicode tips from Tom Christiansen

Here are some Unicode tips, gathered from Tom Christiansen's answer at [Why does modern Perl avoid UTF-8 by default? http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129 ].

• Declare that this source code file is encoded as UTF‑8.

# -*- coding: utf-8 -*-
# perl

use utf8;

• Demand a particular Perl version, 5.12 or later. Like this:

# -*- coding: utf-8 -*-
# perl

use v5.12; # minimal for Unicode string feature
use v5.14; # optimal for Unicode string feature
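
What the version line buys you is the unicode_strings feature. Here is a small sketch of the difference, using uc on é (U+00E9). The variable names are only for illustration.

```perl
# -*- coding: utf-8 -*-
# perl

use v5.12;      # turns on strict and the unicode_strings feature
use warnings;

my $e_acute = "\xE9";    # é, a single character below 256

# With unicode_strings, uc follows Unicode rules: é becomes É (U+00C9).
my $with_feature = uc($e_acute);

# Without it, uc leaves this string alone, because only ASCII rules apply.
my $without_feature;
{
    no feature 'unicode_strings';
    $without_feature = uc($e_acute);
}

printf "with: U+%04X, without: U+%04X\n",
    ord($with_feature), ord($without_feature);
# prints: with: U+00C9, without: U+00E9
```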

• Set your PERL_UNICODE environment variable to AS. The A makes all Perl scripts decode @ARGV as UTF‑8 strings, and the S sets the encoding of stdin, stdout, and stderr to UTF‑8. Both of these are global effects.
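
If you cannot set the environment variable, a script can do the @ARGV part by hand. This is a minimal sketch, assuming the arguments arrive as UTF-8 bytes. The helper name decode_args is made up for this example. It checks ${^UNICODE} first so that arguments are not decoded twice.

```perl
# -*- coding: utf-8 -*-
# perl

use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a list of raw byte strings as UTF-8.
sub decode_args {
    my @raw = @_;
    return map { decode('UTF-8', $_) } @raw;
}

# Bit value 32 of ${^UNICODE} is the "A" flag. If it is set, Perl has
# already decoded @ARGV, and decoding again would be a mistake.
my @args = (${^UNICODE} & 32) ? @ARGV : decode_args(@ARGV);

# Bit value 2 is the "O" flag for STDOUT.
binmode(STDOUT, ':encoding(UTF-8)') unless ${^UNICODE} & 2;
printf "%s has %d characters\n", $_, length($_) for @args;
```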

• Enable warnings.

# -*- coding: utf-8 -*-
# perl

use warnings;
use warnings qw( FATAL utf8 );
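
Here is a small sketch of what the FATAL line changes. It prints a character above 255 to an in-memory filehandle that has no encoding layer. Normally that only produces the warning "Wide character in print". With FATAL utf8, it becomes an exception that eval can catch.

```perl
# -*- coding: utf-8 -*-
# perl

use strict;
use warnings;
use warnings qw( FATAL utf8 );

# An in-memory filehandle with no encoding layer, so it expects bytes.
open(my $fh, '>', \my $buffer) or die "cannot open in-memory handle: $!";

# The heart (U+2665) does not fit in a byte. The resulting utf8 warning
# is fatal in this scope, so the print dies inside the eval.
my $ok = eval { print {$fh} "\x{2665}"; 1 };
my $error = $ok ? '' : $@;
close($fh);

print $ok ? "printed without error\n" : "caught: $error";
```

This way an encoding mistake stops the program at the point where it happens, instead of scrolling by as a warning.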

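• Declare that anything that opens a filehandle within this lexical scope should assume the stream is encoded in UTF‑8, unless you tell it otherwise. The :std part applies the same encoding to stdin, stdout, and stderr.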

# -*- coding: utf-8 -*-
# perl

use open qw( :encoding(UTF-8) :std );
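
A quick round trip shows the effect. This sketch writes a line to a temporary file and reads it back, with no encoding given in either open call. It uses the core File::Temp module only to get a scratch file name.

```perl
# -*- coding: utf-8 -*-
# perl

use strict;
use warnings;
use utf8;
use open qw( :encoding(UTF-8) :std );
use File::Temp qw(tempfile);

# File::Temp opens its handle in its own scope, so close that one and
# reopen the path here, where "use open" applies.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close($tmp_fh);

open(my $out, '>', $path) or die "cannot write $path: $!";
print {$out} "I ♥ you\n";
close($out);

open(my $in, '<', $path) or die "cannot read $path: $!";
my $line = <$in>;
close($in);
chomp($line);

printf "%s: %d characters\n", $line, length($line);
# prints: I ♥ you: 7 characters
```

The line comes back as 7 characters, not 9 bytes, because both handles got the UTF-8 layer from the pragma.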

• If you have a DATA handle, you must explicitly set its encoding. If you want this to be UTF‑8, then say: binmode(DATA, ":encoding(UTF-8)");

• Perl lets you write Unicode characters by name. Use the “charnames” pragma, like this:

# -*- coding: utf-8 -*-
# perl

use utf8;
use v5.12; # minimal for Unicode string feature
use charnames qw( :full ); # allow Unicode chars to be written by name, for example, \N{CHARNAME}

print "\N{GREEK SMALL LETTER ALPHA}"; # same as "α"
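
The charnames pragma also works in the other direction. Here is a small sketch using charnames::viacode (code point to name) and charnames::vianame (name to code point). The variable names are only for illustration.

```perl
# -*- coding: utf-8 -*-
# perl

use strict;
use warnings;
use charnames qw( :full );

my $alpha = "\N{GREEK SMALL LETTER ALPHA}";

# Look up the name from the code point, and the code point from the name.
my $name = charnames::viacode(ord $alpha);
my $code = charnames::vianame('GREEK SMALL LETTER ALPHA');

binmode(STDOUT, ':encoding(UTF-8)');
printf "%s U+%04X %s\n", $alpha, $code, $name;
# prints: α U+03B1 GREEK SMALL LETTER ALPHA
```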

Reference