Perl: Unicode Tutorial 🐪

By Xah Lee.

If you are not familiar with Unicode, first see: Unicode Basics: What's Character Set, Character Encoding, UTF-8?

use bytes; # Larry can take Unicode and shove it up his ass sideways.
            # Perl 5.8.0 causes us to start getting incomprehensible
            # errors about UTF-8 all over the place without this.

               —from the source code of WebCollage (1998),
                by Jamie W Zawinski (born 1968)

Starting about Perl v5.12 (~2010), Unicode support is very good.

When running a script that processes Unicode, call it with the -C option on the command line.

If your Perl script is encoded in UTF-8, then you should declare it, like this: use utf8;.

You can have Unicode characters in strings, and also in variable and function names.

# -*- coding: utf-8 -*-
# perl

use strict;
use utf8; # necessary if you want to use Unicode in function or var names

# processing Unicode string
my $s = 'I ★ you';
$s =~ s/★/♥/; # replace the star with a heart
print "$s\n";

# variable with Unicode char
my $愛 = 4;
print "$愛\n";

# function with Unicode char
sub f愛 { return 2;}
print f愛();
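
To see what use utf8 changes for strings, here is a small sketch. It assumes the file is saved as UTF-8 and uses the core Encode module to turn the character string into bytes. The variable names are just for illustration.

```perl
use strict;
use warnings;
use utf8;                  # the source text below is UTF-8
use Encode qw( encode );

my $s = 'I ★ you';
my $bytes = encode('UTF-8', $s); # explicit conversion to a byte string

print length($s), "\n";     # 7, counts characters
print length($bytes), "\n"; # 9, the star takes 3 bytes in UTF-8
```

With use utf8, Perl sees the star as one character. After encoding, the same text is a sequence of bytes, ready to be written to a file or socket.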

Unicode in Perl Identifiers

An identifier cannot contain an arbitrary Unicode character.

# -*- coding: utf-8 -*-
# perl v5.18.2

use strict;
use utf8;

# identifier cannot be arbitrary unicode char

my $😂 = 3;

# error
# Unrecognized character \x{1f602}

# -*- coding: utf-8 -*-
# perl v5.18.2

use strict;
use utf8;

# identifier cannot be arbitrary unicode char

my $♥ = 3;

# error
# Unrecognized character \x{2665}

The exact rule is complicated. But basically, if the Unicode character is considered a letter, then it's ok. The heart ♥ and the summation sign ∑ are not letters.
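
You can test a character before using it in a name. The following is a rough sketch, not Perl's exact rule: it only checks whether a character has the Unicode property XID_Start (or is an underscore), which is approximately what Perl requires for the first character of an identifier. The function name can_start_identifier is made up for this example.

```perl
use strict;
use warnings;
use utf8;

# Rough approximation of Perl's rule for the first character of an identifier.
sub can_start_identifier {
    my ($char) = @_;
    return $char =~ /\A[\p{XID_Start}_]\z/ ? 1 : 0;
}

binmode(STDOUT, ':encoding(UTF-8)'); # avoid "Wide character" warnings

for my $char ('愛', 'α', '♥', '∑', '😂') {
    printf "%s  U+%04X  %s\n", $char, ord($char),
        can_start_identifier($char) ? 'ok' : 'not allowed';
}
```

The two letters 愛 and α pass. The heart, the summation sign, and the emoji are symbols, so they fail.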

Perl Unicode tips from Tom Christiansen

Here are some Unicode tips, gathered from Tom Christiansen's answer at [Why does modern Perl avoid UTF-8 by default? http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6163129#6163129].

• Declare that this source code file is encoded as UTF‑8.

# -*- coding: utf-8 -*-
# perl

use utf8;

• Demand a particular Perl version, 5.12 or later. Like this:

# -*- coding: utf-8 -*-
# perl

# pick one of the two:
use v5.12; # minimal for Unicode string feature
use v5.14; # optimal for Unicode string feature

• Set your PERL_UNICODE environment variable to AS. This makes all Perl scripts decode @ARGV as UTF‑8 strings (the A flag), and sets the encoding of all three of stdin, stdout, and stderr to UTF‑8 (the S flag). Both of these are global effects.
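
To check from inside a script whether these settings are active, you can look at the special variable ${^UNICODE}, which holds the -C / PERL_UNICODE flags as a bit mask. This is a minimal sketch; the bit values come from the perlrun documentation, and the helper name flags_enabled is made up for this example. It only looks at the flags covered by AS.

```perl
use strict;
use warnings;

# Bit values for the individual flags, as listed in perlrun.
# S is shorthand for I + O + E (stdin, stdout, stderr).
my %flag = (I => 1, O => 2, E => 4, A => 32);

# Return the letters of the flags that are set in the given mask.
sub flags_enabled {
    my ($mask) = @_;
    return join '', grep { $mask & $flag{$_} } qw(I O E A);
}

print "PERL_UNICODE flags active: ",
    flags_enabled(${^UNICODE}) || '(none)', "\n";
```

When the script runs with PERL_UNICODE set to AS, the report should list all four letters.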

• Enable warnings, and make the UTF‑8 warnings fatal.

# -*- coding: utf-8 -*-
# perl

use warnings;
use warnings qw( FATAL utf8 );

• Declare that anything that opens a filehandle within this lexical scope should assume that the stream is encoded in UTF‑8, unless you tell it otherwise.

# -*- coding: utf-8 -*-
# perl

use open qw( :encoding(UTF-8) :std );
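
Here is a small sketch of the effect. It writes a line to a scratch file and reads it back, without giving any encoding in the open calls. The scratch file comes from the core module File::Temp and exists only for this demonstration.

```perl
use strict;
use warnings;
use utf8;
use open qw( :encoding(UTF-8) :std );
use File::Temp qw( tempfile );

# Scratch file for the demonstration, removed when the script exits.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# No explicit layer here: "use open" supplies :encoding(UTF-8).
open(my $out, '>', $path) or die "cannot write $path: $!";
print {$out} "I ♥ Perl\n";
close $out;

open(my $in, '<', $path) or die "cannot read $path: $!";
my $line = <$in>;
close $in;
chomp $line;

# length counts characters, because the line was decoded on read.
print length($line), " characters: $line\n";
```

Note that the pragma only affects open calls in its own lexical scope. The handle that tempfile opens inside File::Temp is not affected, which is why the sketch closes it and opens the file again.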

• If you have a DATA handle, you must explicitly set its encoding. If you want this to be UTF‑8, then say: binmode(DATA, ":encoding(UTF-8)");

• Perl supports representing Unicode chars by name. Use the package “charnames”, like this:

# -*- coding: utf-8 -*-
# perl

use utf8;
use v5.12; # minimal for Unicode string feature
use charnames qw( :full ); # allow a Unicode char to be represented by name, for example, \N{CHARNAME}

print "\N{GREEK SMALL LETTER ALPHA}"; # same as "α"
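
The charnames package also works in the other direction. This short sketch uses charnames::viacode to get the name of a character from its code point, and charnames::vianame to get the code point from a name.

```perl
use strict;
use warnings;
use charnames qw( :full );

my $alpha = "\N{GREEK SMALL LETTER ALPHA}";

# code point to name
my $name = charnames::viacode(ord $alpha);

# name to code point
my $code = charnames::vianame('GREEK SMALL LETTER ALPHA');

printf "U+%04X %s\n", $code, $name;
```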
