Emacs Lisp Power: Text-Soup Automation

By Xah Lee. Date: 2010-11-03

This page shows an example of emacs lisp's power in text-soup processing that requires human interaction.

Problem

I have a favorite movies page. The page contains about 70 amazon links like this:

<a class="amz" href="http://www.amazon.com/dp/B000055Y0X/?tag=xahh-20">amazon</a>

You can't tell what a link points to unless you visit it. I want each of them to have a “title” attribute, like this:

<a class="amz" href="http://www.amazon.com/dp/B000055Y0X/?tag=xahh-20" title="Dr. Strangelove; movie">amazon</a>

It's a thorny problem. The obvious approach is to write a script that fetches each amazon page, parses the result to get the product title, then inserts the title at the right place. Amazon may block crawlers, and even if it doesn't, extracting the title from the complex HTML may take hours to code. You don't even know whether the product title is clearly marked by a specific tag.

Luckily, my page is written so that for each amazon link, the movie title is within the paragraph, preceding the link, and usually in the form of a Wikipedia link. Here's a sample paragraph:

<p><a href="http://en.wikipedia.org/wiki/To_sleep_with_a_vampire">To sleep with a vampire</a>
 (1993) ◇ Director: Adam Friedman.
<a class="amz" href="http://www.amazon.com/dp/B0000648YN/?tag=xahh-20">amazon</a>
</p>

Solution

So, the plan is to write an elisp script. Here are the basic steps:

Open the file.
Find an amazon link.
Search backward for the Wikipedia link that contains the movie title.
Insert the “title” attribute in the amazon link.
Repeat.

This job is perfect for elisp. It can be done interactively, far better than in Perl, Python, or Ruby, thanks to emacs's buffer system. I imagine it's a 20-minute scripting job. Here's the code:

;; -*- coding: utf-8 -*-
;; 2010-11-03
;; add 「title="product title"」 to amazon links on an HTML page.

;; rough steps:
;; find amazon link of the form
;; <a class="amz" href="http://www.amazon.com/dp/B000055Y0X/?tag=xahh-20">amazon</a>

;; find a Wikipedia link above it, of this form
;; <a href="http://en.wikipedia.org/wiki/Dr._Strangelove">Dr. Strangelove</a>
;; extract the movie title

;; insert the attribute
;; title="…"
;; into the amazon link. Like this
;; <a class="amz" href="http://www.amazon.com/dp/B000055Y0X/?tag=xahh-20" title="Dr. Strangelove; movie">amazon</a>

(setq outputBuffer "*xah output*" )
(with-output-to-temp-buffer outputBuffer

  (find-file "~/web/xahlee_org/Periodic_dosage_dir/skina/nelci_skina.html" )
  (goto-char (point-min))

  (while
      (re-search-forward "<a class=\"amz\" href=\"http://www\\.amazon\\.com/dp/[^\"]+?\">amazon</a>"  nil t)

    ;; move back over 「>amazon</a>」 (11 chars), to where the attribute goes
    (backward-char 11)
    (setq amzLinkInsertPoint (point) )

    ;; get title from preceding Wikipedia link.
    ;; the 「...」 matches a 2-letter language code plus the dot, such as 「en.」
    (if (re-search-backward "<a href=\"http://...wikipedia.org/wiki/[^\"]+?\">\\([^<]+?\\)</a>" nil t)
        (progn
          (setq titleText (match-string 1 ) )
          ;; go back to the amazon link before asking, so that answering 「no」
          ;; does not make the next search find this same link again
          (goto-char amzLinkInsertPoint)
          (when (yes-or-no-p titleText)
            (insert (concat " title=\"" titleText "; movie\"")) ) )
      (print "not found") )
    )

  (princ "Done deal!")
  )

Emacs is fantastic!

(In practice, the job took close to one hour to complete, counting all the mistakes and whatnot made when actually coding. For example, in the process i noticed that 2 of the amazon links are preceded by Wikipedia links that are not actually related to the amazon link. This and other miscellaneous irregularities are to be expected. The code above is slightly cleaned up, but is still meant to be one-time-use code. Published code always looks easier than coding it from scratch actually is.)
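One way to guard against the unrelated-Wikipedia-link irregularity is to stop the backward search at the start of the current paragraph. What follows is a minimal sketch, not part of the original script. The function name my-amz-title-in-paragraph is hypothetical, and the sketch assumes that every amazon link sits inside a paragraph that starts with a <p tag, as in the sample paragraph shown earlier.

```elisp
;; Sketch only. `my-amz-title-in-paragraph' is a hypothetical helper,
;; not part of the original script. It assumes each amazon link lives
;; inside a paragraph that begins with "<p>" or "<p ".

(defun my-amz-title-in-paragraph ()
  "Return the text of the nearest Wikipedia link before point, or nil.
The backward search is bounded by the nearest preceding <p tag, so a
link from an earlier paragraph is never returned.  Point is not moved."
  (save-excursion
    (let ((bound (save-excursion
                   ;; start of the enclosing paragraph, else start of buffer
                   (if (re-search-backward "<p[ >]" nil t)
                       (point)
                     (point-min)))))
      ;; BOUND makes the search fail instead of reaching into an
      ;; earlier paragraph; the third argument t means return nil
      ;; on failure instead of signaling an error.
      (when (re-search-backward
             "<a href=\"http://[a-z]+\\.wikipedia\\.org/wiki/[^\"]+\">\\([^<]+\\)</a>"
             bound t)
        (match-string 1)))))
```

Called with point on an amazon link, it returns the title text or nil. A nil result marks a link that needs to be handled by hand.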

There are a few hundred amazon links on my site of 4k pages. They all need a similar fix. The job will be slightly different, because the links are to arbitrary products or books. Typically, the product name is marked like this 〈book title〉 or “song cd” or some other way in the text before the link, but not always. Also, some amazon links may already have a “title” attribute. The point is that it's a text-soup situation that requires human baby-sitting for correct completion, and elisp excels at this. Tomorrow or so, i'll write an elisp script to fix these few hundred amazon links among 4k pages. Total time for the task is expected to be 2 to 4 hours. (For a keyboard macro approach to this kind of task, see: Emacs: Key Macro Example: Add HTML Attribute.)
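The two checks described above can be sketched as small helper functions: one guesses a title from the 〈…〉 or “…” markers in the text before a link, the other detects a link that already has a “title” attribute. This is only an illustration under those assumptions. Both function names are hypothetical, and the guess would still be confirmed by a human before it is inserted.

```elisp
;; Sketch only. Both function names are hypothetical helpers.

(defun my-guess-product-title (text)
  "Return the last phrase in TEXT marked as 〈…〉 or “…”, or nil if none.
TEXT is the prose that comes before an amazon link."
  (let ((start 0)
        (result nil))
    ;; walk over every marked phrase, keeping the one nearest the link
    (while (string-match "〈\\([^〉]+\\)〉\\|“\\([^”]+\\)”" text start)
      (setq result (or (match-string 1 text) (match-string 2 text)))
      (setq start (match-end 0)))
    result))

(defun my-amz-link-has-title-p (link)
  "Return non-nil if the HTML anchor string LINK has a title attribute."
  (string-match-p "<a[^>]* title=\"" link))
```

For example, (my-guess-product-title "I read 〈The Art of War〉 last year.") returns "The Art of War". Links for which my-amz-link-has-title-p returns non-nil can be skipped.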

2010-11-08

Aaron Culich wrote an elisp script that does the same thing but uses several interesting techniques, among them DOM/XPath in elisp to process HTML, and Yahoo Query Language (YQL). I have no experience with either. His code can be seen here: https://github.com/aculich/misc-elisp/blob/master/query-html.el

Here's an excerpt of his comment:

I often find myself having to do some xpath myself and since [I] want to do this sort of thing inside emacs myself from time to time instead of busting out python, so I've been playing around with your Dr. Strangelove movie example (a favorite movie of mine, btw) using emacs and xpath. You can find my results here: https://github.com/aculich/misc-elisp

I tried using 3 methods.

(1) First with pure elisp using the dom/xpath stuff on emacswiki. Unfortunately the processing is broken and at least in some of the cases I tried, gobbled up all available memory. I didn't look at it closely, but I have a feeling the elisp implementation would [be] a fair amount of work to get working and even still would probably not be very fast for large documents. Also, you still probably want to run the input through tidy first so that you're not dealing with broken HTML (which it seems nearly every website in the universe has).

(2) Using a few handy unix utilities not uncommon on most systems: wget, tidy, and xmlstarproc. You'll need to first install those before using this method.

(3) Yahoo's YQL web service is handy for this sort of thing. And the nice thing is that if you need to process a large document, all of it will be done remotely.

#3 is the default method that I use in my elisp code since it only relies on modules that ship with most recent versions of emacs (specifically json.el and url.el and w3m.el) and doesn't require any special binaries to be installed the way #2 does.

Also, since #1 was so broken I did not include any example implementation for it. Anyway, if you find the code useful, let me know.