Perl: Unicode Tutorial 🐪
If you are not familiar with Unicode, first see: Unicode Basics: Character Set, Encoding, UTF-8
use bytes;  # Larry can take Unicode and shove it up his ass sideways.
            # Perl 5.8.0 causes us to start getting incomprehensible
            # errors about UTF-8 all over the place without this.
from the source code of WebCollage (1998), by Jamie W Zawinski (born 1968)
Starting about Perl v5.12 (~2010), Unicode support is very good.
When running a script that processes Unicode, pass the -C option on the command line. It controls whether the standard streams and @ARGV are treated as UTF-8.
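To see which -C settings are in effect, a script can read the special variable ${^UNICODE}. The following is a minimal sketch, not part of the original tutorial: the bit values are the ones listed in perlrun for the -C switch, and the helper name unicode_flags is made up for this illustration.

```perl
use strict;
use warnings;

# Bit values for the -C switch, as documented in perlrun.
# (S is shorthand for I+O+E, D is shorthand for i+o.)
my @FLAG_BITS = (
    [ I => 1   ],   # STDIN is UTF-8
    [ O => 2   ],   # STDOUT is UTF-8
    [ E => 4   ],   # STDERR is UTF-8
    [ i => 8   ],   # UTF-8 is the default layer for input streams
    [ o => 16  ],   # UTF-8 is the default layer for output streams
    [ A => 32  ],   # @ARGV is decoded as UTF-8
    [ L => 64  ],   # apply the above only if the locale says UTF-8
    [ a => 256 ],   # debugging aid
);

# Hypothetical helper: turn a ${^UNICODE} bitmask into its flag letters.
sub unicode_flags {
    my ($mask) = @_;
    return join '', map { $_->[0] } grep { $mask & $_->[1] } @FLAG_BITS;
}

printf "\${^UNICODE} = %d (%s)\n",
    ${^UNICODE}, unicode_flags(${^UNICODE}) || 'none';
```

Running this as perl -CSA script.pl should report the flags I, O, E, and A; running it with no -C option and no PERL_UNICODE set should report none.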
If your Perl script is encoded in UTF-8, then you should declare it, like this:
use utf8;
You can use Unicode characters in strings, and also in variable or function names.
# -*- coding: utf-8 -*-
# perl
use strict;
use utf8; # necessary if you want to use Unicode in function or var names

# processing Unicode string
my $s = 'I ★ you';
$s =~ s/★/♥/;
print "$s\n";

# variable with Unicode char
my $愛 = 4;
print "$愛\n";

# function with Unicode char
sub f愛 { return 2;}
print f愛();
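What use utf8 changes can be seen by measuring the same string literal with and without it. This is a small sketch that assumes the file is saved as UTF-8; the two sub names are invented for the illustration. The pragma is lexically scoped, so each sub can choose its own setting.

```perl
use strict;
use warnings;

# With "use utf8" the literal is decoded into one character.
sub star_length_with_utf8 {
    use utf8;
    return length('★');
}

# Without it, Perl sees the three raw UTF-8 bytes of the star.
sub star_length_without_utf8 {
    no utf8;
    return length('★');
}

printf "with use utf8:    %d\n", star_length_with_utf8();     # 1
printf "without use utf8: %d\n", star_length_without_utf8();  # 3
```

The second count is 3 because U+2605 is encoded as three bytes in UTF-8, and without the pragma each byte is treated as its own character.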
Unicode in Perl Identifiers
An identifier cannot contain arbitrary Unicode characters.
# -*- coding: utf-8 -*-
# perl v5.18.2
use strict;
use utf8;

# identifier cannot be arbitrary unicode char
my $😂 = 3; # error
# Unrecognized character \x{1f602}

# -*- coding: utf-8 -*-
# perl v5.18.2
use strict;
use utf8;

# identifier cannot be arbitrary unicode char
my $♥ = 3; # error
# Unrecognized character \x{2665}
The exact rule is complicated, but basically a character is allowed if Unicode classifies it as a letter. The heart ♥ (U+2665: BLACK HEART SUIT) and the summation sign ∑ (U+2211: N-ARY SUMMATION) are symbols, not letters, so they are rejected.
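The letter test can be approximated with Unicode properties in a regex. The sketch below follows the rule as described in perldata (the first character must be an underscore or match both \p{XID_Start} and \w); it is an approximation for illustration, not the parser's actual code, and the sub name can_start_identifier is made up here.

```perl
use strict;
use warnings;

# Approximate check: may this single character begin an identifier
# when "use utf8" is in effect?
sub can_start_identifier {
    my ($char) = @_;
    return $char =~ /\A[\p{XID_Start}_]\z/ && $char =~ /\A\w\z/ ? 1 : 0;
}

# U+611B is 愛, U+2665 is ♥, U+2211 is ∑, U+1F602 is 😂
for my $cp (0x611B, 0x2665, 0x2211, 0x1F602) {
    printf "U+%04X %s\n", $cp,
        can_start_identifier(chr $cp) ? 'ok' : 'not allowed';
}
```

Only U+611B is reported as ok. The other three are in the symbol categories, which matches the errors shown above.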
Perl Unicode tips from Tom Christiansen
Here are some Unicode tips, gathered from Tom Christiansen's answer at [Why does modern Perl avoid UTF-8 by default? http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129 ].
• Declare that this source code file is encoded as UTF‑8.
# -*- coding: utf-8 -*-
# perl
use utf8;
• Demand a particular Perl version, 5.12 or later. Like this:
# -*- coding: utf-8 -*-
# perl
use v5.12; # minimal for Unicode string feature
use v5.14; # optimal for Unicode string feature
• Set your PERL_UNICODE environment variable to AS. This makes all Perl scripts decode @ARGV as UTF‑8 strings, and sets the encoding of all three of stdin, stdout, and stderr to UTF‑8. Both these are global effects.
• Enable warnings.
# -*- coding: utf-8 -*-
# perl
use warnings;
use warnings qw( FATAL utf8 );
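• Declare that anything that opens a filehandle within this lexical scope should assume the stream is encoded in UTF‑8 unless told otherwise. The :std flag applies the same layers to STDIN, STDOUT, and STDERR.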
# -*- coding: utf-8 -*-
# perl
use open qw( :encoding(UTF-8) :std );
• If you have a DATA handle, you must explicitly set its encoding. If you want this to be UTF‑8, then say: binmode(DATA, ":encoding(UTF-8)");
• Perl supports representing Unicode chars by name. Use the package “charnames”, like this:
# -*- coding: utf-8 -*-
# perl
use utf8;
use v5.12; # minimal for Unicode string feature
use charnames qw( :full ); # allow Unicode char be represented by name, for example, \N{CHARNAME}

print "\N{GREEK SMALL LETTER ALPHA}"; # same as "α"
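Several of the tips above can be combined into one script header. The following sketch is one possible arrangement, not a prescribed template: it writes a string containing a named character to a temporary file and reads it back, relying on the open pragma rather than explicit layers. The sub name round_trip and the use of File::Temp are choices made for this illustration.

```perl
use strict;
use utf8;
use v5.12;
use warnings;
use warnings qw( FATAL utf8 );
use open qw( :encoding(UTF-8) :std );
use charnames qw( :full );
use File::Temp qw( tempfile );

# Write $text to a temporary file and read it back.
# Returns the text that was read and the file size in bytes.
sub round_trip {
    my ($text) = @_;
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;

    # No explicit layer here: "use open" supplies :encoding(UTF-8).
    open(my $out, '>', $path) or die "cannot write $path: $!";
    print {$out} $text;
    close $out;

    open(my $in, '<', $path) or die "cannot read $path: $!";
    my $read = do { local $/; <$in> };
    close $in;

    return ($read, -s $path);
}

my ($text, $bytes) = round_trip("I \N{BLACK HEART SUIT} you");
printf "%d characters, %d bytes on disk\n", length $text, $bytes;
```

The string has 7 characters but occupies 9 bytes on disk, because the heart is stored as three UTF-8 bytes and decoded back into a single character when read.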