Python: Find/Replace Multiple Pairs of Strings

By Xah Lee.

The following script lets you do multiple find/replace pairs for all HTML files in a directory.

# -*- coding: utf-8 -*-
# python

# script for multiple find/replace pairs for all HTML files in a directory.

import os

inputDir = "/home/jane/web"

findreplace = [
(u'<p>"' , u'<p>“'),
(u'" "' , u'” “'),
(u'!"' , u'!”'),
(u'?"' , u'?”'),
(u'\n"' , u'\n“'),
(u', "' , u', “'),
(u': "' , u': “'),
(u'."' , u'.”'),
(u',"' , u',”'),
(u'<p>' , u'\n<p>')
]

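# the code below requires Python 3 (open with encoding, os.replace, os.walk)

def replaceStringInFile(filePath):
    "applies every pair in findreplace, in list order, to the file filePath"
    tempName = filePath + '~~~'

    # read as UTF-8; newline="" keeps the file's line endings unchanged
    with open(filePath, encoding="utf-8", newline="") as inputFile:
        fContent = inputFile.read()

    # each pair works on the result of the previous one, so order matters
    for findStr, repStr in findreplace:
        fContent = fContent.replace(findStr, repStr)

    # write to a temp file first, then swap it in place of the original
    with open(tempName, 'w', encoding="utf-8", newline="") as outputFile:
        outputFile.write(fContent)

    print("processed {}".format(filePath))
    os.replace(tempName, filePath)

def processDir(dirPath):
    "runs replaceStringInFile on every .html file under dirPath, recursively"
    for thisDir, subDirs, fileNames in os.walk(dirPath):
        for thisChild in fileNames:
            childPath = os.path.join(thisDir, thisChild)
            if '.html' == os.path.splitext(thisChild)[1] and os.path.isfile(childPath):
                replaceStringInFile(childPath)

processDir(inputDir)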

Novel texts often use straight double quotes instead of curly quotes. I wrote this script to replace the straight double quotes with curly ones. The algorithm is a heuristic based on adjacent characters.
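To see what the heuristic does without touching any files, here is a small sketch, assuming Python 3, that applies the quote pairs from the script to a string in memory. The names apply_pairs and sample are made up for this illustration, and the last pair of the script (the one that adds a newline before `<p>`) is left out so the output is easier to read.

```python
# The quote pairs from the script, applied in order to a string in memory.
# apply_pairs and sample are illustrative names, not part of the script.

pairs = [
    ('<p>"', '<p>“'),
    ('" "', '” “'),
    ('!"', '!”'),
    ('?"', '?”'),
    ('\n"', '\n“'),
    (', "', ', “'),
    (': "', ': “'),
    ('."', '.”'),
    (',"', ',”'),
]

def apply_pairs(text, pairs):
    "return text with each (find, replace) pair applied in list order"
    for find_str, rep_str in pairs:
        text = text.replace(find_str, rep_str)
    return text

sample = '<p>"Stop!" she said. "Who is there?"</p>\n<p>He answered, "Nobody."</p>'
result = apply_pairs(sample, pairs)
print(result)
```

In this sample, the quote before "Who" stays straight, because no pair in the list covers a quote that follows a period and a space. Every other quote is converted.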

You can use this script for other tasks. For example, replacing the names of Greek letters and other symbols with the actual characters: alpha with α, beta with β, gamma with γ, Pi with π, Infinity with ∞, and so on.
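For the Greek letter case, a plain string replace would probably be too eager: "alpha" also occurs inside "alphabet". The sketch below, assuming Python 3, swaps in a different technique from the script, namely regular expressions with word boundaries (`\b`) from the standard `re` module. The names symbol_pairs and replace_whole_words are hypothetical, chosen for this example only.

```python
import re

# name -> symbol pairs, taken from the examples in the text
symbol_pairs = [
    ('alpha', 'α'),
    ('beta', 'β'),
    ('gamma', 'γ'),
    ('Pi', 'π'),
    ('Infinity', '∞'),
]

def replace_whole_words(text, pairs):
    "replace each name only where it stands as a whole word"
    for name, symbol in pairs:
        # \b matches at a word boundary, so 'alpha' inside 'alphabet' is skipped
        text = re.sub(r'\b' + re.escape(name) + r'\b', symbol, text)
    return text

sample = 'alpha and beta are in the alphabet; Pi is not Pizza.'
converted = replace_whole_words(sample, symbol_pairs)
print(converted)
```

With the sample above, "alphabet" and "Pizza" are left alone, while the standalone "alpha", "beta", and "Pi" become α, β, and π.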

Note: turning straight quotes into curly quotes is not a mechanical process, partly because, by the convention of novel printing, the closing quote is sometimes omitted when a quotation runs a paragraph or longer. Therefore, you cannot assume that the quotes are matched. Even when they are, it is a bad assumption to rely on, because a single missing quote would throw off every quote after it. So, instead, the script uses a heuristic: it looks at the adjacent characters and guesses whether each straight quote is an opening or a closing quote. Proofreading still needs to be done afterwards.
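To illustrate why assuming matched quotes is risky, here is a sketch, assuming Python 3, of the naive approach the script avoids: turn straight quotes into opening and closing quotes alternately. The function alternate_quotes and the sample text are made up for this demonstration.

```python
def alternate_quotes(text):
    "naive approach: assume quotes are matched and alternate between “ and ”"
    out = []
    opening = True
    for ch in text:
        if ch == '"':
            out.append('“' if opening else '”')
            opening = not opening
        else:
            out.append(ch)
    return ''.join(out)

# A speech spanning two paragraphs: the first paragraph has no closing quote.
sample = '"First paragraph of a long speech.\n"Second paragraph." Then he left. "Bye," she said.'
naive = alternate_quotes(sample)
print(naive)
```

Only the very first quote comes out right. The unmatched quote at the start of the second paragraph flips the alternation, so every quote after it points the wrong way, including the ones around "Bye," which are far from the omitted quote.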