Parallel Programing Problem: asciify-string

By Xah Lee. Date:

Here's a interesting parallel programing problem.

Problem Description

The task is to change this function so it's parallelizable. (code example in emacs lisp)

(defun asciify-string (inputStr)
  "Make Unicode string into equivalent ASCII ones."
  (setq inputStr (replace-regexp-in-string "á\\|à\\|â\\|ä" "a" inputStr))
  (setq inputStr (replace-regexp-in-string "é\\|è\\|ê\\|ë" "e" inputStr))
  (setq inputStr (replace-regexp-in-string "í\\|ì\\|î\\|ï" "i" inputStr))
  (setq inputStr (replace-regexp-in-string "ó\\|ò\\|ô\\|ö" "o" inputStr))
  (setq inputStr (replace-regexp-in-string "ú\\|ù\\|û\\|ü" "u" inputStr))

Here's a more general description of the problem.

You are given a Unicode text file that's a few peta bytes. For certain characters in the file, they need to be changed to different char. (For example of practical application, see: IDN homograph attackDuplicate characters in Unicode.)

One easy solution is to simply use regex, as the above sample code, to search thru the file sequentially, and perform the transform of a particular set of chars, then repeat this on the entire file, for each char that needs to be changed.

But your task is to use a parallelizable-algorithm. That is, in a parallel-algorithm aware language (For example, Fortress), the compiler will automatically span the computation to as many processors as there are (assume a thousand processors).

Refer to Guy Steele's video talk if you haven't seen already. See: Guy Steele on Parallel Programing.

Solution Suggestions

A better way to write it for parallel programing, is to map a char-transform function to each char in the string. Here's a pseudo-code in lisp by Helmut Eller:

(defun asciify-char (c)
  (case c
    ((? ? ? ?) ?a)
    ((? ? ? ?) ?e)
    ((? ? ? ?) ?i)
    ((? ? ? ?) ?o)
    ((? ? ? ?) ?u)
    (t c)))

(defun asciify-string (string) (map 'string #'asciify-string string))

One problem with this is that the function “asciify-char” itself is sequential, and not 100% parallelizable. (we might assume here that there are billions of chars in Unicode that needs to be transformed)

It would be a interesting small project, if someone actually use a parallel-algorithm-aware language to work on this problem, and report on the break-point of file-size of parallel-algorithm vs sequential-algorithm.

Anyone would try it? Perhaps in Fortress, SISAL, Erlang, Ease, Alice, X10, or other? Is Clojure parallel aware?

The Wikipedia article Implicit parallelism lists several other langs containing that feature.

If you have a question, put $5 at patreon and message me.