2017-07-17 accessed from http://groups.google.com/group/comp.emacs/browse_frm/thread/81eeecca2abba32f
--------------------------------------------------
From: Erik Naggum
Subject: Re: 19.35 --- when?
Date: 1997/06/03
Message-ID: <3074322556264875@naggum.no>
X-Deja-AN: 245745730
References: <3074265241506639@naggum.no> <3074308836405187@naggum.no>
mail-copies-to: never
Organization: Naggum Software; +47 2295 0313; http://www.naggum.no
Newsgroups: comp.emacs

* David Kastrup
| How about a long term strategy of having the cake and eating it, too?

I have proposed one solution that would make it possible to store one
particular eight-bit character set using one byte, only.  this isn't even
very hard to implement.

| Would it be possible to make the MULE stuff transparent?

yes, but with a _lot_ of work.  Emacs with MULE is basically built for
Japanese conditions with delusions of international support.  the MULE team
does not understand European needs and habits.  the fact that they think
it's OK to use two bytes in a string or a buffer to represent an eight-bit
character is sufficient evidence of that.

| That is, store buffers and perhaps even strings not as words or even
| longwords, but in some tagged byte format transparently?

you mean 16- or 32-bit characters?  yeah, I have proposed that.  however,
MULE as of today uses 19 bits per character, not 16 like Unicode, so it
would need 24 bits in the worst case.  the coding system used internally in
MULE is unlike anything I have seen before -- and I have worked with
character set stuff for 10 years.  in particular, it is not Unicode.  MULE
is a font-support system, not a large character set support system, and it
exposes the internal encoding stuff to every piece of code that wants to
use it.  e.g., the Latin 1 inverted exclamation mark (¡) is no longer
encoded as character number 161 (#xA1), but as character number 2209
(#x008A1).

let's look at the previously simple operation of inserting a string as
quoted-printable.

(defun insert-quoted-printable-string (string)
  (let ((i 0)
        (max (length string)))
    (while (< i max)
      (let ((char (aref string i)))
        (cond ((= char ?\n)
               (insert char))
              ((or (<= char ?\ ) (>= char 127))
               (insert (upcase (format "=%02x" char))))
              (t
               (insert char))))
      (setq i (1+ i)))))

(insert-quoted-printable-string "¡caramba!")
=A1caramba!

this code turns into the following character-set aware code.  note the use
of `sref' and `char-bytes' to read a whole char in a string and advance
over it.

(defun insert-quoted-printable-mule-string (string)
  (let ((i 0)
        (max (length string)))
    (while (< i max)
      (let* ((mule-char (sref string i))
             (split (split-char mule-char))
             (charset (car split))
             (char (car (cdr split))))
        (cond ((eq charset 'ascii)
               (cond ((= char ?\n)
                      (insert char))
                     ((or (<= char ?\ ) (>= char 127))
                      (insert (upcase (format "=%02x" char))))
                     (t
                      (insert char))))
              ((eq charset 'latin-iso8859-1)
               (insert (upcase (format "=%02x" (+ 128 char)))))
              (t
               (error "Unrecognized character set: %s" charset)))
        (setq i (+ i (char-bytes mule-char)))))))

(insert-quoted-printable-mule-string "\201¡caramba!")
=A1caramba!

as you might guess, the \201 is the leading code for ISO 8859-1, inserted
in front of every previously single-byte, eight-bit character.  never mind
that the codes that the ISO character sets use go unused, and that the
leading codes are all in the range 128-159, reserved by ISO for control
characters.
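a minimal sketch of what the two primitives used above return for that
character, assuming the Emacs 20 MULE build described in this post; the
values shown are inferred from the code above (which adds 128 to recover
#xA1) and may differ in other MULE builds.

(split-char 2209)   ; => (latin-iso8859-1 33)  -- charset plus 7-bit code;
                    ;    33 + 128 = 161 = #xA1, the original Latin 1 code
(char-bytes 2209)   ; => 2  -- two bytes in a buffer or string for one
                    ;    eight-bit character: the leading code \201 plus #xA1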
now, the following is of course what a string taken from a MULE buffer
would produce if it were passed to the former function:

(insert-quoted-printable-string "\201¡caramba!")
=81=A1caramba!

if you have ever processed strings in Emacs or read characters from a
buffer one by one to do something to them, you will find that MULE adds a
whole new dimension to the Chinese curse "may you live in interesting
times".  and if you think MULE is _consistent_ in accepting eight-bit
keyboard input as MULE characters, think again.

when you visit a file, the buffer is first read once into a temporary
buffer as bytes, then scanned with a variety of regexps to locate "coding
systems" and replace the characters found with the internal multi-byte
stuff.  this takes a little time, of course.  visiting my mail archive for
the first quarter of 1997 (a 10.2M file) takes 2.6 seconds on my 50MHz
SPARC 2 in my working version of Emacs 19.35 (Emacs 20.1 built entirely
without MULE support), and 95 seconds with MULE's wondrous multi-byte
support in the working version of Emacs 20.1 built with all the defaults.

if you write an .el file that contains a "multibyte" character, it is
written out in a format that is no longer readable by Emacs 19 or any
other Lisp-reading software (such as etags).  this is because MULE is so
smart that it thinks it's better to read the source file into a buffer, do
the coding system shenanigans alluded to above, and then sick the Lisp
reader on the buffer, where the characters are read as one- to four-byte
things and represented in the internal code.  .elc files are written in
the new, incompatible code which you can't use with Emacs 19, and vice
versa.  yes, this is true for your regular eight-bit characters, too,
because they don't have eight-bit characters in Japan.  MULE adds support
for every language in the world at the cost of REMOVING the support for
European eight-bit character sets.  you gotta love it!

I have wasted many weeks of my life trying to get Emacs 20 to work well
enough for me to be able to use it at all.  I decided to continue on 19.35
by cleaning out all the MULE shit, but too much of the code is written in
a way that makes such cleaning very hard if I want to keep up with the
other changes to Emacs, and that's what I'm trying to do.  at five or six
points I wanted to throw up my arms and leave Emacs and MULE to itself,
but I guess I love working with Emacs too much to really mean it, 'cuz I'm
still here and I'm still trying to undo the damage from the stampeding
MULEs.

| Or would that entail too many unfathomable changes, perhaps with too high
| a speed penalty for operations like search and replace?

turning off the tests for multibyteness in the primitive operations at
compile time does have a noticeable effect on performance compared to
keeping the run-time tests.  however, these changes are minor.  it's all
the support junk that bloats Emacs and slows it down.

| Would it be feasible to generally use byte-encoded files and/or strings,
| but the moment one enters some non-European character, the buffer
| explodes to twice its size.  That way only people using MULE's features
| get the buffer size penalties, yet every Emacs is MULE capable.

yup, thought about that, but that would mean a significant slowdown in
buffer access unless you built a lot of support functions to deal with a
particular buffer and used "virtual functions" in the buffer "class" to do
it.  still, it would be a slow-down.

| Any chances for something like this?
well, I have completed the necessary changes to the C code locally to turn
off the MULE stuff with a simple -DNOMULESHIT, but the Lisp code does all
kinds of unnecessary tests and work, as well.  e.g., it's easy to get Emacs
without MULE back to the old 7-bit Emacs that grew to handle European needs
a few years ago, but it's much harder to get it to work acceptably with
8-bit characters again.  I'm trying to remove the C code's dependency on
setup done in the Lisp code, and to make this stuff work without font sets
and other sources of randomness.

MULE is a long way from being a well-behaved international citizen, but
there's probably no other way to make it well-behaved than to put it in the
middle of the international scene and let it improve its act.  XEmacs has
done this right: offer MULE as an option at build time.

#\Erik
-- 
if we work harder, will obsolescence be farther ahead or closer?