Perl Unicode Tutorial 🐫


If you are not familiar with Unicode, first see: Unicode Basics: What's Character Set, Character Encoding, UTF-8?

use bytes; # Larry can take Unicode and shove it up his ass sideways.
            # Perl 5.8.0 causes us to start getting incomprehensible
            # errors about UTF-8 all over the place without this.

               —from the source code of WebCollage (1998),
                by Jamie W Zawinski (born 1968)

Starting about Perl v5.12 (≈2010), Unicode support is very good.

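When you run a script that processes Unicode, pass the -C option to perl on the command line. Used by itself, -C is equivalent to -CSDL: it turns on UTF-8 for the standard streams and for newly opened file handles, but only if the locale environment variables indicate a UTF-8 locale.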

If your Perl script is saved as UTF-8, declare that with use utf8;. You can then use Unicode characters in string literals, and also in variable and function names. Example:

# -*- coding: utf-8 -*-
# perl

use strict;
use utf8; # source is UTF-8; needed for Unicode in string literals and in function or var names

# processing Unicode string
my $s = 'I ★ you';
$s =~ s/★/♥/; # replace the star with a heart
print "$s\n";

# variable with Unicode char
my $愛 = 4;
print "$愛\n";

# function with Unicode char
sub f愛 { return 2;}
print f愛();
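
To see what use utf8; actually changes, here is a minimal sketch that compiles the same literal with and without the pragma. It assumes the file is saved as UTF-8; the two function names are made up for this illustration.

```perl
use strict;
use warnings;

# The utf8 pragma is lexically scoped, so each sub below
# compiles its string literal under a different setting.

sub literal_length_decoded {
    use utf8;    # the literal is decoded into characters
    return length('I ★ you');
}

sub literal_length_raw {
    no utf8;     # the literal stays as raw UTF-8 bytes
    return length('I ★ you');
}

print literal_length_decoded(), "\n";    # 7 (characters)
print literal_length_raw(),     "\n";    # 9 (bytes, the star takes 3)
```

Without the pragma, Perl treats each byte of the star as a separate character, which is why regexes and length give surprising results on undeclared UTF-8 source.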

Perl Unicode tips from Tom Christiansen

Here are some Unicode tips, gathered from Tom Christiansen's answer to the Stack Overflow question 〔Why does modern Perl avoid UTF-8 by default?〕.

• Declare that this source code file is encoded as UTF‑8.

# -*- coding: utf-8 -*-
# perl

use utf8;

• Demand a minimum Perl version, 5.12 or later. Only one of the two lines below is needed:

# -*- coding: utf-8 -*-
# perl

use v5.12; # minimal for Unicode string feature
use v5.14; # optimal for Unicode string feature

• Set your PERL_UNICODE environment variable to AS. This makes every Perl script decode @ARGV as UTF‑8 strings, and sets the encoding of all three of stdin, stdout, and stderr to UTF‑8. Both of these are global effects, not lexical ones.
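
A script can check which of these settings are in effect by reading the ${^UNICODE} variable, which holds the -C / PERL_UNICODE flags as a bit mask. The sketch below decodes that mask; the function name unicode_flags and the labels are made up for this illustration, while the bit values are the ones listed for -C in perlrun.

```perl
use strict;
use warnings;

# Turn the ${^UNICODE} bit mask into readable labels.
sub unicode_flags {
    my ($bits) = @_;
    my @table = (
        [ 1,  'STDIN' ],
        [ 2,  'STDOUT' ],
        [ 4,  'STDERR' ],
        [ 8,  'default input layer' ],
        [ 16, 'default output layer' ],
        [ 32, '@ARGV' ],
        [ 64, 'only if locale is UTF-8' ],
    );
    return map { $_->[1] } grep { $bits & $_->[0] } @table;
}

# Prints an empty line when neither -C nor PERL_UNICODE is set.
print join(', ', unicode_flags(${^UNICODE})), "\n";
```

With PERL_UNICODE set to AS the mask is 39 (S is 1+2+4, A is 32), so the line printed is: STDIN, STDOUT, STDERR, @ARGV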

• Enable warnings.

# -*- coding: utf-8 -*-
# perl

use warnings;
use warnings qw( FATAL utf8 );

• Declare that anything that opens a filehandle within this lexical scope, but not elsewhere, should assume the stream is encoded in UTF‑8 unless you tell it otherwise. That way you do not affect other modules' or other programs' code.

# -*- coding: utf-8 -*-
# perl

use open qw( :encoding(UTF-8) :std );
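
As a rough illustration of the lexical default, the sketch below writes a string to a temporary file and reads it back without naming an encoding in either open call. The function name round_trip is made up for this example, and :std is left out here so that the standard streams are untouched.

```perl
use strict;
use warnings;
use open qw( :encoding(UTF-8) );    # default layer for open() in this file only
use File::Temp qw( tempfile );

# Write $text to a temp file, read it back, and report the size on disk.
sub round_trip {
    my ($text) = @_;
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;

    open(my $out, '>', $path) or die "cannot write $path: $!";
    print {$out} $text;             # encoded to UTF-8 by the default layer
    close $out;

    open(my $in, '<', $path) or die "cannot read $path: $!";
    my $got = <$in>;                # decoded back into characters
    close $in;

    return ($got, -s $path);
}

my ($text, $bytes) = round_trip("\x{611B} \x{2605}");
printf "%d characters read back, %d bytes on disk\n", length($text), $bytes;
# prints: 3 characters read back, 7 bytes on disk
```

The two non-ASCII characters take 3 bytes each in UTF-8, so 3 characters become 7 bytes in the file and come back as 3 characters.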

• If you have a DATA handle, you must explicitly set its encoding. If you want this to be UTF‑8, then say: binmode(DATA, ":encoding(UTF-8)");

• Perl supports representing Unicode chars by name. Use the “charnames” pragma, like this:

# -*- coding: utf-8 -*-
# perl

use utf8;
use v5.12; # minimal for Unicode string feature
use charnames qw( :full ); # allow a Unicode char to be represented by name, e.g. \N{CHARNAME}

print "\N{GREEK SMALL LETTER ALPHA}"; # same as "α"
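
The charnames module also works in the other direction. The sketch below looks up a character's name from its code point with charnames::viacode and its code point from a name with charnames::vianame; the helper name char_info is made up for this illustration.

```perl
use strict;
use warnings;
use charnames qw( :full );

# Describe a single character as "U+XXXX NAME".
sub char_info {
    my ($char) = @_;
    my $code = ord $char;
    my $name = charnames::viacode($code);    # undef if the code point has no name
    return sprintf 'U+%04X %s', $code, defined $name ? $name : '(no name)';
}

print char_info("\N{GREEK SMALL LETTER ALPHA}"), "\n";
# prints: U+03B1 GREEK SMALL LETTER ALPHA

# Name to code point, resolved at run time.
printf "%d\n", charnames::vianame('GREEK SMALL LETTER ALPHA');    # 945
```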
