Parallel Programing Problem: asciify-string

By Xah Lee. Date: . Last updated: .

Here is a interesting parallel programing problem.

Problem Description

The task is to change this function so it's parallelizable. (code example in emacs lisp)

(defun asciify-string (inputStr)
  "Make Unicode string into equivalent ASCII ones."
  (setq inputStr
        (replace-regexp-in-string
         "á\\|à\\|â\\|ä" "a" inputStr))
  (setq inputStr
        (replace-regexp-in-string
         "é\\|è\\|ê\\|ë" "e" inputStr))
  (setq inputStr
        (replace-regexp-in-string
         "í\\|ì\\|î\\|ï" "i" inputStr))
  (setq inputStr
        (replace-regexp-in-string
         "ó\\|ò\\|ô\\|ö" "o" inputStr))
  (setq inputStr
        (replace-regexp-in-string
         "ú\\|ù\\|û\\|ü" "u" inputStr))
  inputStr
  )

Here is a more general description of the problem.

You are given a Unicode text file that's a few peta bytes. For certain characters in the file, they need to be changed to different char. (examples of practical application are IDN Homograph Attack and Duplicate Characters in Unicode. )

One easy solution is to simply use regex, as the above sample code, to search thru the file sequentially, and perform the transform of a particular set of chars, then repeat this on the entire file, for each char that needs to be changed.

But your task is to use a parallelizable algorithm. That is, in a parallel algorithm aware language (For example, Fortress), the compiler will automatically span the computation to as many processors as there are (assume a thousand processors).

Refer to Guy Steele's video talk if you haven't seen already. See: Guy Steele on Parallel Programing .

Solution Suggestions

A better way to write it for parallel programing, is to map a char-transform function to each char in the string. Here's a pseudo-code in lisp by Helmut Eller:

(defun asciify-char (c)
  (case c
    ((? ? ? ?) ?a)
    ((? ? ? ?) ?e)
    ((? ? ? ?) ?i)
    ((? ? ? ?) ?o)
    ((? ? ? ?) ?u)
    (t c)))

(defun asciify-string (string) (map 'string #'asciify-string string))

One problem with this is that the function “asciify-char” itself is sequential, and not 100% parallelizable. (we might assume here that there are billions of chars in Unicode that needs to be transformed)

It would be a interesting project to use a parallel language on this problem, and report on the break-point of file-size of parallel-algorithm vs sequential-algorithm.

For example, languages like Clojure, Fortress, SISAL, Erlang, Ease, Alice, X10.

See also: