Python & Perl: Sort Matrix, Object, Math of Sorting

, ,

This page shows you how to sort a list in Python & Perl and also discuss some important math of ordering, and a optimization technique.

Sorting in Python

To sort a list in Python, use the “sort” method. For example:

# -*- coding: utf-8 -*-
# python

li = [1,9,2,3]
li.sort()
print li                        # [1, 2, 3, 9]

Note that sort is a method, and the variable is changed in-place.

Sorting a Matrix

Here's a example of sorting a matrix by second column:

# -*- coding: utf-8 -*-
# python

# sort a matrix

li = [[2,6],[1,3],[5,4]]

li.sort(lambda x, y: cmp(x[1],y[1]))

print li;                       #  [[1, 3], [5, 4], [2, 6]]

The argument to sort is a function of two arguments, that returns -1, 0, 1. This function is a decision function that tells sort() how to decide the order of any two elements in your list. If the first argument is “less” then second argument, the function should return -1. If equal, then 0. Else, 1.

If you want to sort without modifying the variable, use “sorted()” (Python 2.4).

Sort by Custom Order

Here's a more complex example. Suppose you have a list of strings.

'my283.jpg'
'my23i.jpg'
'web7-s.jpg'
'fris88large.jpg'
…

You want to sort them by the number embedded in them.

You need to define a ordering function, and pass it to sort. The function should takes two strings, and compare the integer inside the string. Here's the solution:

# -*- coding: utf-8 -*-
# python

# sort by custom order

import re

li = [
'my283.jpg',
'my23i.jpg',
'web7-s.jpg',
'fris88large.jpg',
]

# compare number inside string
def myComp (x,y):
    def getNum(str): return float(re.findall(r'\d+',str)[0])
    return cmp(getNum(x),getNum(y))

li.sort(myComp)

print li # returns ['web7-s.jpg', 'my23i.jpg', 'fris88large.jpg', 'my283.jpg']

Here, we defined a function “myComp” to tell sort about the ordering.

Math of Ordering

In general, the function f used to determine the order of any two element must satisfy some constraints:

If the comparison function does not behave as the above, then it is not consistent. It means that the result ordered list may actually be different depending the order of the original list or how the language happens to implement sort.

The significance of all this is that in real software you may want to sort a list of complex object by a specialized ordering. For example, you may want to sort a list of triangles in 3D space by their orientation. Or, you need to sort graphical user interface window objects in a Word Processor by their width. Or, sorting image selections in a photo-editing program. It is in advanced cases like these, understanding the basic math about ordering is important. Otherwise, you might have a bewildering result yet unable to locate any flaws in your code.

The order-deciding function given by the user is called a “predicate” in computer science. (A predicate function is any function that returns either true or false) And, it is usually defined inline using a pure function construct called “lambda”.

Python's “sort” method's optional parameters: “key” and “reverse”

Most of the time, sorting is done for a list of atomic element such as [3,2,4]. This is simply done by myList.sort() without any argument. Other than simple list, sort is frequently used on matrixes (⁖ [[2,6],[1,3],[5,4]]). For matrixes, almost always a particular column is used for the basis of ordering. For example, if we want to sort by second column, we do: li.sort(lambda x, y: cmp(x[1],y[1])). Since this is frequently used, Python provides a somewhat shorter syntax for it, by specifying the column used as the ordering “key”. For example:

li=[[2,6],[1,3],[5,4]]
li.sort(key=lambda x:x[1] ) # is equivalent to the following
#li.sort(lambda x, y: cmp(x[1],y[1]))
print li; # prints [[1, 3], [5, 4], [2, 6]]

Because the Python compiler is not very refined, this specialized syntax is algorithmically a order of magnitude faster than the general form lambda x, y: cmp(x[1],y[1]).

That is to say, the programer must now learn 2 syntaxes for the ordering function, one general and one specialized, and he must choose the right one otherwise he will be penalized by a magnitude of slow down. This also means that, in order to make the right choice of the syntax, the programer must known in advance of the nature of the list to be sorted.

Another idiosyncratic provision is the optional “reverse” argument. This parameter is necessary when using the “key” parameter. Normally, when using the order comparison function lambda(x,y), the ascending/descending nature can be changed by simply switching the parameters x and y. But now a single key=lambda(x) can't provide that, thus another optional parameter “reverse” is provided. Example:

The following are equivalent:

li.sort(key=lambda x:x[1], reverse=True )
li.sort(lambda x, y: cmp(x[1],y[1]), reverse=True)
li.sort(lambda x, y: cmp(y[1],x[1]))

Of course, one can reverse a list by the reverse() method for lists. But li.sort(…).reverse() is algorithmically another order of magnitude slower since Python compiler is not smart.

The official doc on Python's sort method is at (bottom): http://python.org/doc/2.4/lib/typesseq-mutable.html

Thanks to Lasse Vågsæther Karlsen for informing me the existence of the sorted() function in Python 2.4.

Sorting in Perl

here's a example of sort (Perl 5.8):

#-*- coding: utf-8 -*-
# perl

@li = (1,9,2,3);

@li2 = sort {$a <=> $b} @li; # original list is not changed

print join(' ', @li2); # 1 2 3 9

In Perl, sort is a function. It returns the sorted result as another list. This is very simple and nice.

It works like this: sort takes the form sort {…} @myList. Inside the enclosing braces is the body of the ordering function, where variables $a and $b inside are predefined by the language to represent two elements in the list. The operator <=> returns -1 if left operand is less than the right operand. If equal, it returns 0, else 1. It is equivalent to Python's “cmp” function.

Perl provides another comparison operator “cmp”, which treats its operands as strings. So, for example:

print "3" <=> "20"; # prints -1
print "3" cmp "20"; # prints 1

In Perl, numbers and strings are mutually automatically converted if needed.

Another form of sort is sort orderFunctionName @list, which uses a function name in place of the comparison block

Just for completeness, let's do a Perl equivalent of a Python example above.

#-*- coding: utf-8 -*-
#  perl

@li = (
'my283.jpg',
'my23i.jpg',
'web7-s.jpg',
'fris88large.jpg',
);

# sorts a list of strings by their embedded number

@li2 = sort { ($a=~m/(\d+)/)[0] <=> ($b=~m/(\d+)/)[0];} @li;

print join(' ', @li2);  # prints web7-s.jpg my23i.jpg fris88large.jpg my283.jpg

Side Note: Perl also has pure function construct. In Perl, a pure function is simply written as sub {…}, and applied to argument by the form pureFunction->(arg). For example, a function that squares a number and applied to 3 is written as: sub {$_^2} -> (3).

Unlike Python, Perl has no limitation in its pure function construct. Because of this, Perl supports better functional programing than Python (albeit still limited). Here's a simple example:

$result = sub($) {$_[0]+1}->(3);
print $result; # prints 4
# a value obtained by applying a pure function to 3.
# the pure function returns its argument+1

Perl, like Python, whose compiler is not very smart. In sorting, the ordering function is called unnecessarily repetitiously. Like Python, Perlers have sought means to avoid this penalty. If the programer knew in advance that his matrix is huge or knew in advance his ordering function, then he can code his sort by writing it using a workaround.

Here's how this work around works: Suppose you want to sort a huge list with a expensive ordering function. If you simply do sort orderingFunction @myList, then Perl is going to call your orderingFunction wantonly, and you lose. To beat Perl, you first generate a copy newList, whose elements are pairs (x,y), where x is from the original list, and y is the sorting key extracted from x. Then, you call Perl's sort function on this newList and using a ordering function based on the second list element (the y). Once this newList is sorted, you construct your original list by extracting the first element from this sorted newList. Here's the code example:

@li=('my283.jpg','my23i.jpg','web7-s.jpg','fris88large.jpg');

# the following two lines are equivalent
@li2 = sort { ($a=~m/(\d+)/)[0] <=> ($b=~m/(\d+)/)[0];} @li;
@li2 = map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [ $_, ($_=~m/(\d+)/)[0] ] } @li;

In the above Perl code, the part map { [ $_, ($_=~m/(\d+)/)[0] ] } @li; generates the intermediate newList. Then, sort is applied to it, then, another map map { $_->[0] } gets the original list sorted.

In this way, the cost of the internals of the ordering function is avoided. (it runs on each list element once) However, your huge list is copied 1 extra time. So, there are pros and cons. Because this work around is inordinately unusual among imperative languages, in both its semantics and syntax, it has acquired a coolness factor among monkey coders, and is given the name Schwartzian Transform.

It is interesting to note what compiler flaws can do to imperative languages and its people. In Python, the language syntax is tainted. In Perl, a complex construct is invented. In both camps, the basic mathematics of sorting and its implementation aspects are completely belied.

perldoc -f sort

Language and Sort Optimization: decorate-sort-dedecorate

There are many algorithms for sorting. Each language can chose which to use. See wikipedia for a detailed list and explanation: Sorting algorithm.

However, does not matter which algorithm is used, ultimately it will need the order-deciding function on the items to be sorted. Suppose your items are (a,b,c,d,…), and your order-deciding function is F. Various algorithms will try to minimize the number of times F is called, but nevertheless, F will be applied to a particular element in the list multiple times. For example, F(a,b) may be called to see which of “a” or “b” comes first. Then, later the algorithm might need to call F(m,a), or F(a,z). The point here is that, F will be called many times on arbitrary two items in your list, even if one of the element has been compared to others before.

Now suppose, you are sorting some polygons in 3D space, or personal records by the person's address's distance from a location, or sorting matrixes by their eigen-values in some math application, or ordering files by number of occurrences of some text in the file.

In general, when you define your decision function F(x,y), you will need to extract some property from the elements to be sorted. For example, when sorting points in space by a criterion of distance, one will need to compute the distance for the point. When sorting personal records from database by the person's location, the decision function will need to retrieve the person's address from the database, then find the coordinate of that address, that compute the distance from there to a given coordinate. In sorting matrixes in math by eigen-values, the order-decision will first compute the eigen-value of the matrix. A common theme from all of the above is that they all need to do some non-trivial computation on each element.

As we can see, the order-decision function F may need to do some expensive computation on each element first, and this is almost always the case when sorting elements other than simple numbers. Also, we know that a sorting algorithm will need to call F(x,y) many times, even if one of x or y has been compared to others before. So, this may result high inefficiency. For example, you need to order people by their location to your home. So when F(Mary,Jane) is called, Mary's address is first retrieved from a database across a network, the coordinate of her address is looked up in another database. Then the distance to your home are computed using spherical geometry. The exact same thing is done for Jane. But later on, it may call F(Mary,Chrissy), F(Mary,Jonesy), F(Mary,Pansy) and so on, and the entire record retrieval for Mary is repeated many times.

One solution, is to do the expensive extraction one time for each element, then associate that with the corresponding elements. Suppose this expensive extraction function is called “gist()”. So, you create a new list ([Mary,gist(Mary)], [Jane,gist(Jane)], [John,gist(John)], [Jenny,gist(Jenny)], …) and sort this list instead, when done, remove associated gist. This technique is sometimes called decorate-sort-dedecorate.

In Perl programing, this decorate-sort-dedecorate technique is sillily known as Schwartzian Transform as we have demonstrated previously. In Python, they tried to incorporate this technique into the language, by adding the “key” optional parameter, which is our “gist()” function.

blog comments powered by Disqus