Programing Exercise, Validate Matching Brackets

By Xah Lee. Date: . Last updated: .

This is a preliminary report on scripts of several languages to validate matching brackets.

Programing Challenge: Matching Pairs Validation

The problem is to write a script that can check a dir of text files (and all subdirs) and reports if a file has any mismatched matching brackets.

Here is a example of mismatched bracket: ([)], (“[”), ((, etc. (and yes, the brackets may be nested. There are usually text between these chars.)

I'll be writing a emacs lisp solution and post in 2 days. Ι welcome other lang implementations. In particular, perl, python, php, ruby, tcl, lua, Haskell, Ocaml. I'll also be able to eval common lisp (clisp) and Scheme lisp (scsh), Java. Other lang such as Clojure, Scala, C, C++, or any others, are all welcome, but i won't be able to eval it. JavaScript implementation will be very interesting too, but please indicate which and where to install the command line version.

I hope you'll find this a interesting “challenge”. This is a parsing problem. I haven't studied parsers except some Wikipedia reading, so my solution will probably be naive. I hope to see and learn from your solution too.

i hope you'll participate. Just post solution here. Thanks.

Solutions

Emacs Lisp

Emacs Lisp: Batch Script to Validate Matching Brackets .

Jon Snader (jcs) wrote 2 versions in elisp. See: [Xah's Challenge (Part 2) By Jon Snader. At http://irreal.org/blog/?p=165 , accessed on 2011-07-23 ]. Ι haven't studied it yet.

Python

# python 3

# spec at http://xahlee.org/comp/validate_matching_brackets.html
# 2011-07-17
# by Raymond Hettinger

input_dir = "/Users/xah/web/xahlee_info/comp/"

import os, re

def check_balance(characters):
    '''Return -1 if all delimiters are balanced or
       the char number of the first delimiter mismatch.

    '''
    openers = {
        '(': ')',
        '{': '}',
        '[': ']',
        '“': '”',
        '‹': '›',
        '«': '»',
        '【': '】',
        '〈': '〉',
        '《': '》',
        '「': '」',
        '『': '』',
        }
    closers = set(openers.values())
    stack = []
    for i, c in enumerate(characters, start=1):
        if c in openers:
            stack.append(openers[c])
        elif c in closers:
            if not stack or c != stack.pop():
                return i
    if stack:
        return i
    return -1

def scan(directory, encoding='utf-8'):
    for dirpath, dirnames, filenames in os.walk(directory):
        for filename in filenames:
            fullname = os.path.join(dirpath, filename)
            if re.search(r'\.txt$', filename, re.U) and os.path.isfile(fullname):
                print ( "processing:" + fullname)
                with open(fullname, 'r', encoding=encoding) as f:
                    try:
                        characters = f.read()
                    except UnicodeDecodeError:
                        continue
                position = check_balance(characters)
                if position >= 0:
                    print('{0!r}: {1}'.format(position, fullname))

scan(input_dir)

print ("done")

Original post at [a little parsing challenge ☺ At http://groups.google.com/group/comp.lang.python/browse_frm/thread/c52aee1e4392436e/]

Thanks to the many who have written code and made helpful comments. I may come back to clean this up, in the coming weeks. If you can correct one of the following programs, please comment.

Pending Solutions

Python

This one, so far didn't work, possibly unix/Windows issues.

# -*- coding: utf-8 -*-
# python 3

# spec at http://xahlee.org/comp/validate_matching_brackets.html
# 2011-07-18
# by Thomas Jollans
# https://gist.github.com/1087682
# https://gist.github.com/1087682/2240a0834463d490c29ed0f794ad15128849ff8e

"""
Python 3 solution to Xah Lee's infamous crosspost of July 17 2011.

License; Chicken Dance License version 0.2 or newer.
For the license and dance, see https://github.com/supertunaman/cdl
"""

input_dir = "c:/Users/h3/web/xahlee_org/p/time_machine"

import os
import os.path
from concurrent.futures import ProcessPoolExecutor
from functools import partial
from collections import deque

openers = {
    '(': ')',
    '{': '}',
    '[': ']',
    '“': '”',
    '‹': '›',
    '«': '»',
    '【': '】',
    '〈': '〉',
    '《': '》',
    '「': '」',
    '『': '』',
}

brackets = set(openers.keys()) | set(openers.values())

def checkbracket(arg):
    col, char = arg
    if char in brackets:
        return col, char

def getbrackets(line):
    return [x for x in
            map(checkbracket, enumerate(line))
            if x is not None]

def process_file(filename):
    f = open(filename, 'r', encoding='utf-8')
    stack = deque()
    for line_no, line in enumerate(map(getbrackets, f)):
        for col, bracket in line:
            if bracket in openers:
                stack.append((bracket, line_no, col))
            else: #closer
                in_, l, c = stack.pop()
                if bracket != openers[in_]:
                    print('{0}:{1}:{2}: expecting {3} found {4}'
                            .format(filename, line_no, col,
                                openers[in_], bracket))
                    return False
    if stack:
        in_, l, c = stack.pop()
        print('{0}:{1}:{2}: expecting {3} found {4}'
                .format(filename, line_no, col,
                    openers[in_], 'EOF'))
        return False
    else:
        return True

def main():
    with ProcessPoolExecutor() as ex:
        for dirpath, dirnames, filenames in os.walk(input_dir):
            for filename in filenames:
                fullname = os.path.join(dirpath, filename)
                print (fullname)
                ex.submit(process_file, fullname)

if __name__ == '__main__':
    main()

doesn't do Unicode and doesn't report error position.

# -*- coding: utf-8 -*-
# python 2.6
# spec at http://xahlee.org/comp/validate_matching_brackets.html
# 2011-07-18
# by Billy Mays
# original at http://groups.google.com/group/comp.lang.python/msg/4bfb61c3346ecad5

input_dir = "c:/Users/h3/web/xahlee_org/p/time_machine"

import sys, os

pairs = {
'}':'{',
')':'(',
']':'[',
'>':'<',
'”':'“',
'›':'‹',
}
valid = set( v for pair in pairs.items() for v in pair )

for dirpath, dirnames, filenames in os.walk(input_dir):
     for name in filenames:
         stack = [' ']
         with open(os.path.join(dirpath, name), 'rb') as f:
             chars = (c for line in f for c in line if c in valid)
             for c in chars:
                 if c in pairs and stack[-1] == pairs[c]:
                     stack.pop()
                 else:
                     stack.append(c)
         print ("Good" if len(stack) == 1 else "Bad") + ': %s' % name

doesn't seem to be correct. No report on bad file.

# -*- coding: utf-8 -*-
# spec at http://xahlee.org/comp/validate_matching_brackets.html

'''
Created on 2011-07-18
@author: Thomas 'PointedEars' Lahn <PointedE...@web.de>, based on an idea of
Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9...@myhashismyemail.com>
in <news:j01ph6$knt$1@speranza.aioe.org>
http://groups.google.com/group/comp.lang.python/msg/7f214bd222650c1b
'''

input_dir = "c:/Users/h3/web/xahlee_org/p/time_machine"

import sys, os
pairs = {
u'}': u'{',
u')': u'(',
u']': u'[',
u'”': u'“',
u'›': u'‹',
u'»': u'«',
u'】': u'【',
u'〉': u'〈',
u'》': u'《',
u'」': u'「',
u'』': u'『'}

valid = set(v for pair in pairs.items() for v in pair)
if __name__ == '__main__':
    for dirpath, dirnames, filenames in os.walk(input_dir):
        for name in filenames:
            stack = [' ']
            # you can use chardet etc. instead
            encoding = 'utf-8'
            with open(os.path.join(dirpath, name), 'r') as f:
                reported = False
                chars = ((c, line_no, col) for line_no, line in enumerate(f)
for col, c in enumerate(line.decode(encoding)) if c in valid)
                for c, line_no, col in chars:
                    if c in pairs:
                        if stack[-1] == pairs[c]:
                            stack.pop()
                        else:
                            if not reported:
                                first_bad = (c, line_no + 1, col + 1)
                                reported = True
                    else:
                        stack.append(c)
            if len(stack) != 1: print '%s: %s' % (name, ("bad '%s' at %s:%s" % first_bad))

Ruby

#!/usr/bin/env ruby19
# -*- coding: utf-8 -*-

# spec at http://xahlee.org/comp/validate_matching_brackets.html
# 2011-07-18 by Robert Klemme
# https://gist.github.com/1087583
# http://groups.google.com/group/comp.lang.lisp/msg/6f3f518e6782c3c9

# Approach with tr in order to avoid too much backtracking
# in the regexp and keep runtime realistic.

# my approach summarized
# - traverse a directory tree
# - for each found item of type "file"
# -    read the whole content
# -    throw it at a regexp which is anchored at the beginning
#       and does the recursive parsing
# -    report file if the match is shorter than the file
# Note: special feature for recursive matching is used which Perl's regexp
# engine likely can do as well but many others don't.

require 'find'

ARGV.each do |dir|
  Find.find dir do |f|
    next unless test ?f, f

    content = File.read(f, :encoding => 'UTF-8')

    md = %r{
      \A
      (?<tl>
        [^(){}\[\]“”‹›«»【】〈〉《》「」『』]++
        | \( \g<tl>* \)
        | \{ \g<tl>* \}
        | \[ \g<tl>* \]
        | “ \g<tl>* ”
        | ‹ \g<tl>* ›
        | « \g<tl>* »
        | 【 \g<tl>* 】
        | 〈 \g<tl>* 〉
        | 《 \g<tl>* 》
        | 「 \g<tl>* 」
        | 『 \g<tl>* 』
      )*
    }x.match content

    len = md[0].length
    printf "file %s, pos %d\n", f, len if len < content.length
  end
end

Incomplete. Doesn't read dir. Doesn't do Unicode.

# -*- coding: utf-8 -*-
# Ruby

# 
# spec at http://xahlee.org/comp/validate_matching_brackets.html
# 2011-07-18
# by “WJ” (w_a_x…@yahoo.com)


fences = '(){}[]<>'
fence_pat = Regexp.new( "[" + Regexp.escape( fences ) + "]" )
stack = []
gets( nil ).scan( fence_pat ){|char|
  ((p = fences.index( char)).even?  and  stack.push char)  or
  (o = stack.pop
   raise "Unmatched: #{o} vs. #{char}" unless o == fences[p-1,1])}
puts "Unmatched:", stack  unless stack.empty?

Perl

# perl
# spec at http://xahlee.org/comp/validate_matching_brackets.html
# 2011-07-18
# by Rouslan Korneychuk
# http://groups.google.com/group/comp.lang.python/msg/e5aa0217bd1bd587

use strict;
use utf8;
binmode STDOUT,":utf8";
use File::Find;

my $checkre = qr/(?|
    (\()(?&matched)([\}\]”›»】〉》」』]|$) |
    (\{)(?&matched)([\)\]”›»】〉》」』]|$) |
    (\[)(?&matched)([\)\}”›»】〉》」』]|$) |
    (“)(?&matched)([\)\}\]›»】〉》」』]|$) |
    (‹)(?&matched)([\)\}\]”»】〉》」』]|$) |
    («)(?&matched)([\)\}\]”›】〉》」』]|$) |
    (【)(?&matched)([\)\}\]”›»〉》」』]|$) |
    (〈)(?&matched)([\)\}\]”›»】》」』]|$) |
    (《)(?&matched)([\)\}\]”›»】〉」』]|$) |
    (「)(?&matched)([\)\}\]”›»】〉》』]|$) |
    (『)(?&matched)([\)\}\]”›»】〉》」]|$) |
    (^)(?&matched)([\)\}\]”›»】〉》」』]))
(?(DEFINE)(?<matched>(?:
    \((?&matched)\) |
    \{(?&matched)\} |
    \[(?&matched)\] |
    “(?&matched)” |
    ‹(?&matched)› |
    «(?&matched)» |
    【(?&matched)】 |
    〈(?&matched)〉 |
    《(?&matched)》 |
    「(?&matched)」 |
    『(?&matched)』 |
    [^\(\{\[“‹«【〈《「『\)\}\]”›»】〉》」』]++)*+))
/sx;

sub check_file {
    if(-f && /\.txt$/) {
        if(open(my $fh,'<:encoding(UTF-8)',$_)) {
            undef $/;
            my $data = <$fh>;
            if($data =~ $checkre) {
                if(!length $2) {
                    print "File $File::Find::name has unclosed bracket $1 at position $-[1]\n";
                }
                elsif(!length $1) {
                    print "File $File::Find::name has unopened bracket $2 at position $-[2]\n";
                }
                else {
                    print "File $File::Find::name has mismatched brackets $1 $2 at positions $-[1] and $-[2]\n";
                }
            }
        } else {
            print STDERR "Cannot open $File::Find::name: $!\n";
        }
    }
}

@ARGV = ('.') unless @ARGV;
find(\&check_file,@ARGV);

__END__

[on the regex part]

If the pattern matches, there is a mismatched bracket. $1 is set to the
mismatched opening bracket. $-[1] is its location. $2 is the mismatched
closing bracket or '' if the bracket was never closed. $-[2] is set to
the location of the closing bracket or the end of the string if the
bracket wasn't closed.

I didn't write all that manually; it was generated with this:

my @open = ('\(','\{','\[','“','‹','«','【','〈','《','「','『');
my @close = ('\)','\}','\]','”','›','»','】','〉','》','」','』');

'(?|'.join('|',map
{'('.$open[$_].')(?&matched)(['.join('',@close[0..($_-1),($_+1)..$#close]). ']|$)'}
(0 .. $#open)).')(?(DEFINE)(?<matched>(?:'.join('|',map
{$open[$_].'(?&matched)'.$close[$_]} (0 ..
$#open)).'|[^'.join('',@open,@close).']++)*+))'

Common Lisp

;; common lisp
;; spec at http://xahlee.org/comp/validate_matching_brackets.html
;; 2011-07-17
;; By Madhu <enom...@meer.net>
;; http://groups.google.com/group/comp.lang.lisp/msg/8cdb2b6b55c9af93

;; Just for LOLS (and just using the Common Lisp reader), using CCL as a
;; unicode-capable lisp with the :UTF-8 external-format,
;; i.e. "ccl -K :UTF-8":

(defvar +matching-pair-string+
  (delete #\Space "() {} [] “” ‹› «» 【】 〖〗 〈〉 《》 「」 『』"))

(defvar *matching-delim* nil)
(defvar *offending-position* nil)

(defun delim-reader (stream char &aux (i (position char +matching-pair-string+)))
  (assert (evenp i) nil
          "Unexpected ~C~@[ at file-position ~D~].~@[ Wanted ~C.~]~@[ Start ~D.~]~&"
          char (ignore-errors (file-position stream))
          *matching-delim*  *offending-position*)
  (let ((*matching-delim* (elt +matching-pair-string+ (1+ i)))
        (*offending-position* (ignore-errors (file-position stream))))
    (read-delimited-list *matching-delim* stream t) nil))

(defun lee-readtable (&optional (readtable (copy-readtable nil)))
  (loop for c across ;; 2.1.3 standard characters (overkill, but
                     ;; should include all macro characters...)
        "aAbBcCdDeEfFgGhHiIjJkKlLmM12345nNoOpPqQrRsStTuUvVwWxXyYzZ1234567890!$\"'() ,_-./:;?+<=>#%&*@[\\]{|}`^~"       do
        (set-macro-character c nil t readtable)
        (set-syntax-from-char c #\Space readtable))
  (loop for c across +matching-pair-string+
        do (set-macro-character c 'delim-reader nil readtable))
  readtable)

(defvar $lee-package (make-package "LEEDUMMY"))
(defparameter $lee-readtable (lee-readtable))

(defun lee-read-file (pathname &aux (*readtable* $lee-readtable)
        (*package* $lee-package))
  (with-open-file (stream pathname :external-format :utf-8)
    (read stream)))

#+nil
(with-open-file (stream "/tmp/1" :direction :output :if-exists :supersede
                        :external-format :utf-8)
  (write-string "([)], (“[[”), ((, 】" stream))

#+nil
(lee-read-file "/tmp/1")

;; > Error: Unexpected ) at file-position 3. Wanted ]. Start 2.
#!/usr/bin/clisp -ansi

;; spec at http://xahlee.org/comp/validate_matching_brackets.html
;; 2011-07-18
;; Pascal J Bourguignon
;; http://groups.google.com/group/comp.lang.lisp/msg/c0989e261d6c0ea0
;; using ccl http://www.clozure.com/clozurecl.html

;; You only need a trivial push-down automata:
(loop
   :with parentheses = "(){}[]“”‹›«»【】〈〉《》「」『』"
   :with stack = '()
   :for ch = (read-char *standard-input* nil nil)
   :for pos = (and ch (position ch parentheses))
   :while ch
   :do (cond
         ((null pos))
         ((evenp pos)
          (push ch stack))
         (t
          (if (eql (first stack) (aref parentheses (1- pos)))
              (pop stack)
              (error "Unmatched parenthesis: ~A vs ~A" (first stack) ch))))
   :finally (when stack
              (error "Unmatched parentheses: ~S" stack)))