Perl: Sort Misc

By Xah Lee. Date: .

Sorting in Perl

here's a example of sort (Perl 5.8):

#-*- coding: utf-8 -*-
# perl

@li = (1,9,2,3);

@li2 = sort {$a <=> $b} @li; # original list is not changed

print join(' ', @li2); # 1 2 3 9

In Perl, sort is a function. It returns the sorted result as another list. This is very simple and nice.

It works like this: sort takes the form sort {…} @myList. Inside the enclosing braces is the body of the ordering function, where variables $a and $b inside are predefined by the language to represent two elements in the list. The operator <=> returns -1 if left operand is less than the right operand. If equal, it returns 0, else 1. It is equivalent to Python's “cmp” function.

Perl provides another comparison operator “cmp”, which treats its operands as strings. So, for example:

print "3" <=> "20"; # prints -1
print "3" cmp "20"; # prints 1

In Perl, numbers and strings are mutually automatically converted if needed.

Another form of sort is sort orderFunctionName @list, which uses a function name in place of the comparison block

Just for completeness, let's do a Perl equivalent of a Python example above.

#-*- coding: utf-8 -*-
#  perl

@li = (
'my283.jpg',
'my23i.jpg',
'web7-s.jpg',
'fris88large.jpg',
);

# sorts a list of strings by their embedded number

@li2 = sort { ($a=~m/(\d+)/)[0] <=> ($b=~m/(\d+)/)[0];} @li;

print join(' ', @li2);  # prints web7-s.jpg my23i.jpg fris88large.jpg my283.jpg

Side Note: Perl also has pure function construct. In Perl, a pure function is simply written as sub {…}, and applied to argument by the form pureFunction->(arg). For example, a function that squares a number and applied to 3 is written as: sub {$_^2} -> (3).

Unlike Python, Perl has no limitation in its pure function construct. Because of this, Perl supports better functional programing than Python (albeit still limited). Here's a simple example:

$result = sub($) {$_[0]+1}->(3);
print $result; # prints 4
# a value obtained by applying a pure function to 3.
# the pure function returns its argument+1

Perl, like Python, whose compiler is not very smart. In sorting, the ordering function is called unnecessarily repetitiously. Like Python, Perlers have sought means to avoid this penalty. If the programer knew in advance that his matrix is huge or knew in advance his ordering function, then he can code his sort by writing it using a workaround.

Here's how this work around works: Suppose you want to sort a huge list with a expensive ordering function. If you simply do sort orderingFunction @myList, then Perl is going to call your orderingFunction wantonly, and you lose. To beat Perl, you first generate a copy newList, whose elements are pairs (x,y), where x is from the original list, and y is the sorting key extracted from x. Then, you call Perl's sort function on this newList and using a ordering function based on the second list element (the y). Once this newList is sorted, you construct your original list by extracting the first element from this sorted newList. Here's the code example:

@li=('my283.jpg','my23i.jpg','web7-s.jpg','fris88large.jpg');

# the following two lines are equivalent
@li2 = sort { ($a=~m/(\d+)/)[0] <=> ($b=~m/(\d+)/)[0];} @li;
@li2 = map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [ $_, ($_=~m/(\d+)/)[0] ] } @li;

In the above Perl code, the part map { [ $_, ($_=~m/(\d+)/)[0] ] } @li; generates the intermediate newList. Then, sort is applied to it, then, another map map { $_->[0] } gets the original list sorted.

In this way, the cost of the internals of the ordering function is avoided. (it runs on each list element once) However, your huge list is copied 1 extra time. So, there are pros and cons. Because this work around is inordinately unusual among imperative languages, in both its semantics and syntax, it has acquired a coolness factor among monkey coders, and is given the name Schwartzian Transform.

It is interesting to note what compiler flaws can do to imperative languages and its people. In Python, the language syntax is tainted. In Perl, a complex construct is invented. In both camps, the basic mathematics of sorting and its implementation aspects are completely belied.

perldoc -f sort

Language and Sort Optimization: decorate-sort-dedecorate

There are many algorithms for sorting. Each language can chose which to use. See wikipedia for a detailed list and explanation: Sorting algorithm .

However, does not matter which algorithm is used, ultimately it will need the order-deciding function on the items to be sorted. Suppose your items are (a,b,c,d,…), and your order-deciding function is F. Various algorithms will try to minimize the number of times F is called, but nevertheless, F will be applied to a particular element in the list multiple times. For example, F(a,b) may be called to see which of “a” or “b” comes first. Then, later the algorithm might need to call F(m,a), or F(a,z). The point here is that, F will be called many times on arbitrary two items in your list, even if one of the element has been compared to others before.

Now suppose, you are sorting some polygons in 3D space, or personal records by the person's address's distance from a location, or sorting matrixes by their eigen-values in some math application, or ordering files by number of occurrences of some text in the file.

In general, when you define your decision function F(x,y), you will need to extract some property from the elements to be sorted. For example, when sorting points in space by a criterion of distance, one will need to compute the distance for the point. When sorting personal records from database by the person's location, the decision function will need to retrieve the person's address from the database, then find the coordinate of that address, that compute the distance from there to a given coordinate. In sorting matrixes in math by eigen-values, the order-decision will first compute the eigen-value of the matrix. A common theme from all of the above is that they all need to do some non-trivial computation on each element.

As we can see, the order-decision function F may need to do some expensive computation on each element first, and this is almost always the case when sorting elements other than simple numbers. Also, we know that a sorting algorithm will need to call F(x,y) many times, even if one of x or y has been compared to others before. So, this may result high inefficiency. For example, you need to order people by their location to your home. So when F(Mary,Jane) is called, Mary's address is first retrieved from a database across a network, the coordinate of her address is looked up in another database. Then the distance to your home are computed using spherical geometry. The exact same thing is done for Jane. But later on, it may call F(Mary,Chrissy), F(Mary,Jonesy), F(Mary,Pansy) and so on, and the entire record retrieval for Mary is repeated many times.

One solution, is to do the expensive extraction one time for each element, then associate that with the corresponding elements. Suppose this expensive extraction function is called “gist()”. So, you create a new list ([Mary,gist(Mary)], [Jane,gist(Jane)], [John,gist(John)], [Jenny,gist(Jenny)], …) and sort this list instead, when done, remove associated gist. This technique is sometimes called decorate-sort-dedecorate.

In Perl programing, this decorate-sort-dedecorate technique is sillily known as Schwartzian Transform as we have demonstrated previously. In Python, they tried to incorporate this technique into the language, by adding the “key” optional parameter, which is our “gist()” function.