Perl: Unicode Tutorial 🐪

By Xah Lee.

If you are not familiar with Unicode, first see: Unicode Basics: What's Character Set, Character Encoding, UTF-8?

use bytes; # Larry can take Unicode and shove it up his ass sideways.
            # Perl 5.8.0 causes us to start getting incomprehensible
            # errors about UTF-8 all over the place without this.

               —from the source code of WebCollage (1998),
                by Jamie W Zawinski (born 1968)

Starting around Perl v5.12 (released in 2010), Unicode support is very good.

When you run a script that processes Unicode, pass the -C option on the command line, for example: perl -C myscript.pl

If your Perl script is encoded in UTF-8, you should declare it by putting use utf8; near the top of the script. You can then use Unicode characters in string literals, and also in variable and function names.

# -*- coding: utf-8 -*-
# perl

use strict;
use utf8; # necessary if you want to use Unicode in function or var names

# processing Unicode string
my $s = 'I ★ you';
$s =~ s/★/♥/; # for example, replace the star with a heart
print "$s\n";

# variable with Unicode char
my $愛 = 4;
print "$愛\n";

# function with Unicode char
sub f愛 { return 2;}
print f愛();
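If you cannot pass -C on the command line, the script can set the output encoding itself. Here is a minimal sketch of that, assuming the script file is saved as UTF-8 and the terminal shows UTF-8. The sub name heart_it is made up for this example.

```perl
# -*- coding: utf-8 -*-
# perl

use strict;
use warnings;
use utf8;

# encode everything printed to STDOUT as UTF-8,
# so the script does not depend on the -C option
binmode(STDOUT, ':encoding(UTF-8)');

# heart_it is a hypothetical helper, not part of any library
sub heart_it {
    my ($text) = @_;
    (my $copy = $text) =~ s/★/♥/g;
    return $copy;
}

my $result = heart_it('I ★ you');
print "$result\n";
print length($result), "\n"; # length counts characters, not bytes
```

Without the binmode line (and without -C), Perl prints a “Wide character in print” warning.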

Perl Unicode tips from Tom Christiansen

Here are some Unicode tips, gathered from Tom Christiansen's Stack Overflow answer at 〔Why does modern Perl avoid UTF-8 by default?〕.

• Declare that this source code file is encoded as UTF‑8.

# -*- coding: utf-8 -*-
# perl

use utf8;

• Demand a particular Perl version, 5.12 or later. Like this:

# -*- coding: utf-8 -*-
# perl

use v5.12; # minimum version for the Unicode string feature
# use v5.14; # use this line instead if you can: optimal for the Unicode string feature
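To see what the Unicode string feature changes, here is a small sketch. It compares uc on the single character U+00E9 (é) with the feature on and off. The variable names are just for illustration.

```perl
# perl

use v5.12; # turns on the unicode_strings feature (and strict)
use warnings;

my $e_acute = "\xE9"; # é, a code point below 256

# with unicode_strings, uc follows Unicode rules
my $with_feature = uc($e_acute);

my $without_feature;
{
    # switch the feature off for this block only
    no feature 'unicode_strings';
    $without_feature = uc($e_acute); # left unchanged
}

printf "with feature: U+%04X, without: U+%04X\n",
    ord($with_feature), ord($without_feature);
# prints: with feature: U+00C9, without: U+00E9
```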

• Set your PERL_UNICODE environment variable to AS. This makes all Perl scripts decode @ARGV as UTF‑8 strings, and sets the encoding of all three of stdin, stdout, and stderr to UTF‑8. Both of these effects are global: they apply to every Perl script you run, not just one.
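A script can check what PERL_UNICODE (or -C) turned on by looking at the read-only variable ${^UNICODE}. Below is a rough sketch. The bit values are the ones listed in perlrun for the -C letters, and the helper unicode_flags is made up for this example.

```perl
# perl

use strict;
use warnings;

# bit values of the -C / PERL_UNICODE letters, per perlrun
my %FLAG = (
    I => 1,   # STDIN is UTF-8
    O => 2,   # STDOUT is UTF-8
    E => 4,   # STDERR is UTF-8
    i => 8,   # UTF-8 is the default layer for input streams
    o => 16,  # UTF-8 is the default layer for output streams
    A => 32,  # @ARGV is decoded as UTF-8
    L => 64,  # make the above conditional on the locale
    a => 256, # UTF-8 cache debugging mode
);

# unicode_flags is a hypothetical helper: number in, letters out
sub unicode_flags {
    my ($bits) = @_;
    return join '', grep { $bits & $FLAG{$_} } qw(I O E i o A L a);
}

print "active flags: ", unicode_flags(${^UNICODE}), "\n";
```

With PERL_UNICODE set to AS, this should report IOEA, because S is shorthand for I, O and E together.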

• Enable warnings, and make UTF‑8 warnings fatal.

# -*- coding: utf-8 -*-
# perl

use warnings;
use warnings qw( FATAL utf8 );
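Here is a small sketch of what the FATAL part does. It prints a wide character to a handle that has no encoding layer. Normally that is only a “Wide character” warning, but here it becomes an exception that eval can catch. The null device is used so nothing gets written. The variable names are just for illustration.

```perl
# perl

use strict;
use warnings;
use warnings qw( FATAL utf8 );
use File::Spec;

# a handle with no encoding layer
open(my $out, '>:raw', File::Spec->devnull) or die "cannot open: $!";

# U+2605 does not fit in one byte, so this print triggers the
# "Wide character" warning, which is now fatal
my $ok = eval { print {$out} "\x{2605}\n"; 1 };
my $error = $ok ? '' : $@;
close($out);

print $ok ? "print succeeded\n" : "print died: $error";
```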

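• Declare that any filehandle opened within this lexical scope is assumed to be encoded in UTF‑8, unless you say otherwise. The :std part applies the same encoding to stdin, stdout, and stderr.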

# -*- coding: utf-8 -*-
# perl

use open qw( :encoding(UTF-8) :std );
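Here is a small round-trip sketch of that declaration. It writes a string to a temporary file and reads it back, with no encoding layer given in either open call. File::Temp is only used to get a scratch file name; the variable names are just for illustration.

```perl
# perl

use strict;
use warnings;
use open qw( :encoding(UTF-8) :std );
use File::Temp qw( tempfile );

# tempfile() only supplies a scratch file here. We reopen it ourselves,
# so the handles are created inside the scope of "use open".
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close($tmp_fh);

my $text = "I \x{2605} you"; # 7 characters, 9 bytes in UTF-8

open(my $out, '>', $path) or die "cannot write $path: $!";
print {$out} $text;
close($out);

open(my $in, '<', $path) or die "cannot read $path: $!";
my $read_back = <$in>;
close($in);

my $size_on_disk = -s $path;
print length($read_back), " characters, $size_on_disk bytes\n";
# prints: 7 characters, 9 bytes
```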

• If you have a DATA handle, you must explicitly set its encoding. If you want this to be UTF‑8, then say: binmode(DATA, ":encoding(UTF-8)");

• Perl supports representing Unicode chars by name. Use the “charnames” pragma, like this:

# -*- coding: utf-8 -*-
# perl

use utf8;
use v5.12; # minimal for Unicode string feature
use charnames qw( :full ); # allows a Unicode char to be written by name, for example, \N{CHARNAME}

print "\N{GREEK SMALL LETTER ALPHA}"; # same as "α"
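The charnames module can also go the other way, from code point to name. Here is a short sketch using charnames::viacode and charnames::vianame, plus the \N{U+...} form. The variable names are just for illustration.

```perl
# perl

use strict;
use warnings;
use charnames qw( :full );

my $alpha = "\N{GREEK SMALL LETTER ALPHA}"; # char by name
my $same  = "\N{U+03B1}";                   # same char by code point
my $name  = charnames::viacode(0x03B1);     # code point to name
my $code  = charnames::vianame('GREEK SMALL LETTER ALPHA'); # name to code point

binmode(STDOUT, ':encoding(UTF-8)');
printf "%s U+%04X %s\n", $alpha, $code, $name;
# prints: α U+03B1 GREEK SMALL LETTER ALPHA
```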