Text Processing: Emacs Lisp vs Perl

By Xah Lee. Date: 2007-10-30

Perl is widely regarded as the best text processing language there is. In this essay, I would like to argue that Emacs Lisp is in fact more powerful for text processing tasks.

I have worked as a Perl programmer since 1998, using it daily in a day job writing web application servers and doing sysadmin work on Solaris. I have used Emacs daily since 1998, and have studied elisp as a hobby since 2005. It was only today, while studying elisp's file and buffer related functions, that I realized elisp can be used as a general text processing language, and in fact is a dedicated language for this task, more powerful and convenient than Perl (or Python).

[see Perl Tutorial]

This realization surprised me, because Perl is well known as the de facto language for text processing, while Emacs Lisp is almost unknown in this role. The surprise was heightened by the fact that Emacs Lisp predates Perl by a few years (GNU Emacs first appeared in 1985, Perl in 1987). (Though Emacs Lisp cannot be used for writing applications outside of Emacs.)

My study of elisp as a text processing tool today reminded me of an article I read in 2000-09, an interview with Dr. Ilya Zakharevich (author of cperl-mode.el and a major contributor to the regex features in Perl):

[Ilya Regularly Expresses 2000-09-20 By Joe Johnston. At ~~http://www.perl.com/lpt/a/2000/09/ilya.html~~ ] (local copy: Perl, Dr. Ilya Zakharevich Regularly Expresses)

In the article, he mentioned something about Perl's lack of text processing primitives that are in Emacs, which I did not understand at the time. (I didn't know elisp then.)

Here's the relevant excerpt:

Let me also mention that classifying the text handling facilities of Perl as “extremely agile” gives me the willies. Perl's regular expressions are indeed more convenient than in other languages. However, the lack of a lot of key text-processing ingredients makes Perl solutions for many averagely complicated tasks either extremely slow, or not easier to maintain than solutions in other languages (and in some cases both).

I wrote a (heuristic-driven) Perlish syntax parser and transformer in Emacs Lisp, and though Perl as a language is incomparably friendlier than Lisps, I would not be even able of thinking about rewriting this tool in Perl: there are just not enough text-handling primitives hardwired into Perl. I will need to code all these primitives first. And having these primitives coded in Perl, the solution would turn out to be (possibly) hundreds times slower than the built-in Emacs operations.

My current conjecture on why people classify Perl as an agile text-handler (in addition to obvious traits of false advertisements) is that most of the problems to handle are more or less trivial (“system maintenance”-type problems). For such problems Perl indeed shines. But between having simple solutions for simple problems and having it possible to solve complicated problems, there is a principle of having moderately complicated solutions for moderately complicated problems. There is no reason for Perl to be not capable of satisfying this requirement, but currently Perl needs improvement in this regard.

Note: Ilya wrote emacs's cperl-mode.el. The elisp source code is close to 9000 lines.

2008-01-29

Why Emacs Lisp Is More Powerful Than Perl For Text Processing

In the following, I give some technical details on why Emacs Lisp is more powerful for text processing than Perl.

Emacs's Buffer Data Type

In Emacs, there's the “buffer” data type and associated infrastructure, which allows the programmer to move a pointer (the “point”, i.e. the cursor) to any place in the text by using high-level functions. For example, you can move the point by a number of characters, or jump to the position of a particular character, string, or text pattern (by regex). You can move the point forward or backward freely, or move up/down by lines. Further, there are over 3 thousand built-in text-processing functions, from various language modes, to do various types of text manipulation (e.g. deleting tags in HTML/XML, or navigating and manipulating matching paired units, as in Lisp source code).
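To make the point-motion idea concrete, here is a minimal sketch in a temporary buffer. The function name my-buffer-navigation-demo and the sample text are made up for illustration; the primitives it calls (goto-char, forward-char, re-search-forward, forward-line) are standard Emacs functions.

```elisp
;; Sketch only: `my-buffer-navigation-demo' is a hypothetical name,
;; and the sample text is invented for this illustration.
(defun my-buffer-navigation-demo ()
  "Return a list of values collected while moving point around sample text."
  (with-temp-buffer
    (insert "first line\nsecond line with KEY=42\nthird line\n")
    (goto-char (point-min))            ; jump to the start of the buffer
    (forward-char 6)                   ; move forward by a number of characters
    (let ((after-six (point))          ; positions are 1-based, so this is 7
          word value line)
      ;; Text from point to the end of the current line.
      (setq word (buffer-substring (point) (line-end-position)))
      ;; Jump to the next match of a regex, anywhere later in the buffer.
      (re-search-forward "KEY=\\([0-9]+\\)")
      (setq value (match-string 1))
      ;; Move up one line; point lands at the beginning of that line.
      (forward-line -1)
      (setq line (buffer-substring (point) (line-end-position)))
      (list after-six word value line))))
```

Note that the regex search crosses line boundaries and the code then walks backward a line, with no bookkeeping of its own. That is the kind of freedom the paragraph above describes.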

In Perl, you typically read in the file one line at a time and process it one line at a time, or read the whole file in one shot into an array but basically still process it one line at a time. Any function you want to apply to the text is applied to only one line at a time, and it can't see what's before or after that line. (Of course, you could code up a buffer in your program by accumulating incoming lines and flushing older lines. Alternatively, you could treat the file as an input stream and read in one char at a time, as well as move the index back and forth, but then you lose all the high-level power of dealing with the data as strings or using regex on it.)

The problem with processing one line at a time is that much text involves nesting to some degree. For example, in Perl, Java, JavaScript, and C, there are mixed nested curly braces and parens, especially in “for” loops. Source code in Lisp, Mathematica, HTML, and XML is nested throughout.

To process text that is not just simple lines, working line by line is almost useless. You need to know what goes on before and after the current line. So, in Perl, the typical solution is to read in the whole file as a single string, and apply regex to the whole content. This puts stress on the regex and drastically reduces what can be done. But more importantly, regex is not capable of parsing even simple nested structures.
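By contrast, a buffer with a suitable syntax table can skip over a balanced group directly. The following is a rough sketch, assuming the braces of interest are not hidden inside strings or comments; my-matching-brace-position is a made-up name, while make-syntax-table, modify-syntax-entry, and forward-sexp are standard Emacs functions.

```elisp
;; Sketch only: `my-matching-brace-position' is a hypothetical helper.
(defun my-matching-brace-position (text)
  "Return the buffer position just past the brace matching the first `{' in TEXT.
Positions are 1-based.  Signals an error if TEXT has no `{' or the
braces are unbalanced."
  (let ((table (make-syntax-table)))
    ;; Declare { and } as a matching open/close pair.
    (modify-syntax-entry ?{ "(}" table)
    (modify-syntax-entry ?} "){" table)
    (with-temp-buffer
      (set-syntax-table table)
      (insert text)
      (goto-char (point-min))
      (search-forward "{")
      (backward-char)                  ; point is now on the opening brace
      (forward-sexp)                   ; skip the whole balanced group
      (point))))
```

The nesting depth does not matter here, because forward-sexp does the counting. A language mode would normally supply the syntax table, so this sketch builds one by hand only to stay self-contained.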

An alternative solution for processing text other than simple lines, such as XML, is to use a proper parser module. However, when using a parser, the nature of the programming ceases to be text processing and becomes structural manipulation. In general, the program becomes more complex. Also, if you use an XML parser or DOM, the formatting of the file will be lost (i.e. the file's placement of line endings and indents will be gone). With an XML parser or DOM, you are no longer doing text processing.

This is a major reason why I think Emacs Lisp is far more versatile: it can read the XML into Emacs's buffer datatype, and the programmer can then move the point back and forth freely, using regex to search or replace text in either direction. For complex XML processing such as tree transformation (e.g. XSLT), an XML/DOM parser/model is still more suitable, but for most simple manipulation (such as processing HTML files), using elisp's buffer and treating it as text is far easier and more flexible. Also, if one so wishes, she can use an XML/DOM parser written in elisp, just as in Perl.
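As a small illustration of treating markup as text, here is a sketch that renames a tag in place and leaves every line ending and indent untouched. The helper name my-rename-tag is hypothetical, and the regex is a deliberate simplification: it assumes tag names are written in a consistent case and does not try to handle comments or CDATA sections.

```elisp
;; Sketch only: `my-rename-tag' is a hypothetical helper, not a library function.
(defun my-rename-tag (text old new)
  "Return TEXT with every <OLD ...> and </OLD> tag renamed to NEW.
Everything else, including whitespace and attributes, is left as is."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (let ((case-fold-search nil))      ; match the tag name case-sensitively
      (while (re-search-forward
              ;; group 1: optional slash; group 2: the char ending the name
              (concat "<\\(/?\\)" (regexp-quote old) "\\([ \t\n>]\\)")
              nil t)
        (replace-match (concat "<\\1" new "\\2") t)))
    (buffer-string)))
```

For example, (my-rename-tag "<p>\n  <b>bold</b>\n</p>\n" "b" "strong") should change only the two b tags, with the indentation and newlines coming through unchanged.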

For concrete examples, see:

2008-12-30

Addendum:

In fairness to Perl, Python, PHP, and similar languages, here are their advantages over elisp for text processing:

(1) Elisp cannot read a file one line at a time; it must load the whole file into buffer memory.
(2) Emacs cannot open files whose number of chars exceeds its largest integer (2^29, or about 536 megabytes, as of Emacs 23.2).
(3) Elisp lacks libraries in comparison to Perl. The libraries elisp lacks are outside of text processing. However, text processing jobs often involve non-text-processing tasks, such as networking, various internet protocols, formats, web dev frameworks, fast language parsers that require embedded C, etc.

Items (1) and (2) together mean that if your text files often run to a few hundred megabytes (e.g. log files), Emacs is not suitable.
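For reference, the whole-file pattern that item (1) describes looks roughly like the sketch below. The function name my-count-matching-lines is made up for illustration; insert-file-contents, forward-line, and eobp are standard. The whole file is in memory before the first line is examined, which is why file size matters.

```elisp
;; Sketch only: `my-count-matching-lines' is a hypothetical helper.
(defun my-count-matching-lines (file regexp)
  "Return the number of lines in FILE that contain a match for REGEXP."
  (with-temp-buffer
    (insert-file-contents file)        ; loads the entire file into the buffer
    (goto-char (point-min))
    (let ((count 0)
          (case-fold-search nil))      ; make the search case-sensitive
      (while (not (eobp))
        ;; Search only up to the end of the current line.
        (when (re-search-forward regexp (line-end-position) t)
          (setq count (1+ count)))
        (forward-line 1))
      count)))
```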

Here are some major examples of elisp as a text processing language. Each of the following is about 10k lines of elisp:

js2-mode [~~http://code.google.com/p/js2-mode/~~] by Steve Yegge [http://steve-yegge.blogspot.com/]. Features a full JavaScript parser that validates js syntax as you type.
ejacs [~~http://code.google.com/p/ejacs/~~]. A JavaScript interpreter written entirely in Emacs Lisp, by Steve Yegge.
nxml-mode [thaiopensource.com/nxml-mode]. An XML mode by James Clark (author of groff and expat). Features a full XML parser that does validation as you type.
CEDET [http://cedet.sourceforge.net/] by Eric Ludlam and JDEE [http://jdee.sourceforge.net/] by Paul Kinnucan. An IDE system for Java development.
SLIME. An IDE for Common Lisp. By Eric Marsden, Luke Gorrie, Helmut Eller.

Note: several major modes, such as c-mode (for C and C++, which also functions as an engine for any language with C-like syntax) and cperl-mode for Perl by Ilya Zakharevich, are also on the order of 10k lines of elisp.

Why Emacs is Still so Useful Today