Elisp: Find String Inside HTML Tag

By Xah Lee. Date: 2011-02-27. Last updated: 2019-01-16.

This page shows an Emacs Lisp script that searches files, similar to Unix grep, but with the condition that the string must occur inside a specific HTML tag.

Problem

I need to list all files that contain a given string, but only if the string is inside a given HTML tag. That is, the condition is that certain strings (the opening and closing tags) must occur before and after the string we want.

This is something grep and Linux shell commands cannot do easily, and it is difficult even with Perl or Python, unless you use an HTML parser (which gets complex).

Solution

Here's the code:

;; -*- coding: utf-8 -*-
;; 2011-02-25, 2019-01-13
;; print files that meet this condition:
;; contains <div class="xnote">…</div>
;; where the text content contains more than one bullet char •

(setq inputDir "c:/Users/xah/web/xahlee_org/wordy/arabian_nights/" ) ; dir should end with a slash

;; need sgml-skip-tag-forward
(require 'sgml-mode)

(defun my-process-file (fPath)
  "Print FPATH once for each xnote div in it that contains more than one bullet."
  (let (p3 p4 (bulletCount 0))

    ;; (print fPath)

    (unless (string-match "/xx" fPath) ; skip some dir

      ;; the temp buffer is killed automatically, even if an error is signaled
      (with-temp-buffer
        (insert-file-contents fPath)
        (goto-char (point-min))
        (while
            (search-forward "<div class=\"xnote\">" nil t)

          (setq p3 (point)) ; beginning of innerText, after <div class="xnote">
          (backward-char) ; step back inside the opening tag
          (sgml-skip-tag-forward 1) ; move to after the matching </div>
          (backward-char 6) ; 6 is the length of </div>
          (setq p4 (point)) ; end of innerText, before </div>

          (setq bulletCount (count-matches "•" p3 p4))

          (when (> bulletCount 1)
            (princ (format "Found: %d %s\n" bulletCount fPath))))))))

(let ((outputBuffer "*my output*"))
  (with-output-to-temp-buffer outputBuffer
    (mapc #'my-process-file
          ;; the backslash must be doubled inside a lisp string;
          ;; \\' matches the end of the file name
          (directory-files-recursively inputDir "\\.html\\'"))
    (princ "Done")))
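
To try the core logic without touching any files, here is a minimal sketch that runs the same search on a string. The function name my-xnote-bullet-counts and the sample HTML are made up for illustration; it assumes every xnote block is closed by a plain </div>, as in the script above.

```elisp
(require 'sgml-mode) ; for sgml-skip-tag-forward

(defun my-xnote-bullet-counts (html)
  "Return a list of bullet counts, one per xnote div in the string HTML.
Hypothetical helper for experimenting, not part of the script above."
  (let ((counts '()))
    (with-temp-buffer
      (insert html)
      (goto-char (point-min))
      (while (search-forward "<div class=\"xnote\">" nil t)
        (let ((p3 (point)) ; beginning of inner text
              p4)
          (backward-char) ; step back inside the opening tag
          (sgml-skip-tag-forward 1) ; move to after the matching </div>
          (backward-char (length "</div>"))
          (setq p4 (point)) ; end of inner text
          (push (count-matches "•" p3 p4) counts))))
    (nreverse counts)))

;; the bullet inside <p> is outside any xnote block, so it is not counted
(my-xnote-bullet-counts
 (concat "<p>• outside</p>"
         "<div class=\"xnote\">• a ⇒ b\n• c ⇒ d</div>"
         "<div class=\"xnote\">• e ⇒ f</div>"))
;; ⇒ (2 1)
```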

Find Replace Applications

Find, or Find Replace, has extensive use in text processing. Here are some examples of variations, all of which i need on a weekly basis and have several elisp scripts for:

List files that contain a string.
Show adjacent text around a string.
List a file only if it contains more than 1 occurrence of a string. (or more than n, less than n, exactly n.)
List a file if it contains a given set of strings.
Replace text based on the file's name.
List a file only if its HTML title and heading don't match.
Find/Report/Replace only if the string is at a particular position in the file. (e.g. near top, near bottom.)
List a file only if the string is inside a tag.
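
As an illustration of one of these variations (list a file only if it contains more than n occurrences of a string), here is a minimal sketch. The function names are made up for this example, and it is only one possible way to write it.

```elisp
(defun my-count-string-in-file (fPath str)
  "Return the number of times the string STR occurs in the file at FPATH."
  (with-temp-buffer
    (insert-file-contents fPath)
    ;; regexp-quote makes count-matches treat STR as plain text, not a regex
    (count-matches (regexp-quote str) (point-min) (point-max))))

(defun my-files-with-more-than (files str n)
  "Return the members of FILES that contain more than N occurrences of STR."
  (let ((result '()))
    (dolist (f files)
      (when (> (my-count-string-in-file f str) n)
        (push f result)))
    (nreverse result)))

;; possible usage, with inputDir as set in the script above:
;; (my-files-with-more-than
;;  (directory-files-recursively inputDir "\\.html\\'") "•" 1)
```

Note that count-matches follows case-fold-search, so bind that variable to nil around the call if the count must be case-sensitive.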

Why I Wrote This Code

Here's a little story on why i wrote this one.

I have about 30 works of classic literature with annotations. For example: The Arabian Nights.

Each annotation is in a <div class="xnote">…</div> tag. e.g.

<div class="xnote">• provaunt ⇒ provide. Provant is a verb meaning: To supply with provender or provisions.</div>

However, some “xnote” blocks contain multiple annotations in one. e.g.

<div class="xnote">• stint ⇒ a fixed amount or share work.
• might and main ⇒ with all effort and strength.
• skein ⇒ A quantity of yarn, thread, or the like, put up together, after it is taken from the reel.
• buffet ⇒ hit, beat, especially repeatedly.
• fain ⇒ with joy; satisfied; contented.
</div>

Each annotation is marked by a bullet “•” symbol, followed by a word. Each word corresponds to the same word in the main text, marked by <span class="xnt">…</span>.

This annotation system is not perfect. It is static HTML/CSS. Recently i've been thinking of making it more dynamic with JavaScript. With JavaScript, it's possible to have features such as hiding or showing an annotation when the mouse is over the word.

[see JavaScript in Depth]

To make that possible, i need to make sure of a few things:

My custom markup must have precise semantics.
The syntax should be as simple as possible. (else the JavaScript will have to do more work.)
The HTML annotation markup must follow strict form. (else JavaScript will fail silently)

With my current system, an annotation block is contained in a “xnote” tag, and within that block, each annotation is marked by a bullet. The semantics are precise, but not simple enough. If i want JavaScript to automatically highlight the annotation text when the user mouses over an annotated word, the js will have to do some parsing of the text in the “xnote” block.

It would be simpler, if each “xnote” block contains just ONE annotation. This means, i will first change all my files that contain multi-annotation blocks to make them 1-annotation per xnote block. This is a text processing job. (Hello emacs lisp!)

Before doing text transformation on the xnote blocks, first i need to make sure the text has correct syntax. e.g. make sure that each “xnote” does indeed contain at least one bullet symbol, and make sure that each <span class="xnt">…</span> has a corresponding <div class="xnote">…</div>.
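
Here is a rough sketch of the second check, working on a string. It assumes each annotation has the form “• word ⇒ explanation” as in the samples above, and that the word in the span is spelled exactly like the word after the bullet. The function names are made up for illustration, and the bullet regex does not check that the bullet is actually inside a xnote block.

```elisp
(defun my-collect-matches (regex str)
  "Return a list of capture group 1 for every match of REGEX in STR."
  (let ((result '())
        (start 0))
    (while (string-match regex str start)
      (push (match-string 1 str) result)
      (setq start (match-end 0)))
    (nreverse result)))

(defun my-unannotated-words (html)
  "Return the xnt-marked words in HTML that have no bullet entry."
  (let ((marked (my-collect-matches
                 "<span class=\"xnt\">\\([^<]+\\)</span>" html))
        ;; assumption: an entry looks like “• word ⇒ explanation”
        (annotated (my-collect-matches "• \\([^⇒\n]+?\\) ⇒" html))
        (missing '()))
    (dolist (w marked)
      (unless (member w annotated)
        (push w missing)))
    (nreverse missing)))

(my-unannotated-words
 (concat "a <span class=\"xnt\">stint</span> and <span class=\"xnt\">fain</span>"
         "<div class=\"xnote\">• stint ⇒ a fixed amount.</div>"))
;; ⇒ ("fain")
```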

So, that's why i wrote this script. I wanted to get some idea of which files have “xnote” blocks that contain multiple annotations, and how many.

The elisp code to split a xnote block into multiple blocks is very similar to the elisp code here.
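
As a rough idea of the splitting step, here is a sketch that takes the inner text of one xnote block and returns one xnote block per annotation. It is not the actual script i used. The function name is made up, and it assumes every annotation starts with a bullet.

```elisp
(defun my-split-xnote-inner (innerText)
  "Return a string with one xnote div per bullet entry found in INNERTEXT."
  ;; split at each bullet, trim surrounding whitespace, drop empty pieces
  (let ((entries (split-string innerText "•" t "[ \t\n]+")))
    (mapconcat
     (lambda (entry)
       (concat "<div class=\"xnote\">• " entry "</div>"))
     entries
     "\n")))

(my-split-xnote-inner "• stint ⇒ a.\n• might ⇒ b.\n")
;; returns a string with these 2 lines:
;; <div class="xnote">• stint ⇒ a.</div>
;; <div class="xnote">• might ⇒ b.</div>
```

In a full script, the positions p3 and p4 from the code above give the inner text, and the original block, including its tags, would be replaced by the result.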