Programing Exercise, Validate Matching Brackets
This is a preliminary report on scripts of several languages to validate matching brackets.
Programing Challenge: Matching Pairs Validation
The problem is to write a script that can check a dir of text files (and all subdirs) and reports if a file has any mismatched matching brackets.
- The files will be utf-8 encoded, with unix style line ending.
- If a file has mismatched matching-pairs, the script will display the file name, and the line number and column number of the first or last instance where a mismatched bracket occures. (or, just the char position (as in emacs's “point”)) Exactly which position is considered as the “first” or “last” doesn't matter much, as long as it report a char that breaks the nesting matching pair syntax.
- The matching pairs are these and nothing else: () {} [] “” ‹› «» 【】 〈〉 《》 「」 『』 . Note that ‘single curly quote’ is not consider matching pair here.
- You script must be standalone. Must not be using some parser tools. But can call lib that's part of standard distribution in your lang.
Here is a example of mismatched bracket: ([)]
, (“[”)
, ((
, 】
etc. (and yes, the brackets may be nested. There are usually text between these chars.)
I'll be writing a emacs lisp solution and post in 2 days. Ι welcome other lang implementations. In particular, perl, python, php, ruby, tcl, lua, Haskell, Ocaml. I'll also be able to eval common lisp (clisp) and Scheme lisp (scsh), Java. Other lang such as Clojure, Scala, C, C++, or any others, are all welcome, but i won't be able to eval it. JavaScript implementation will be very interesting too, but please indicate which and where to install the command line version.
I hope you'll find this a interesting “challenge”. This is a parsing problem. I haven't studied parsers except some Wikipedia reading, so my solution will probably be naive. I hope to see and learn from your solution too.
i hope you'll participate. Just post solution here. Thanks.
Solutions
Emacs Lisp
Emacs Lisp: Batch Script to Validate Matching Brackets .
Jon Snader (jcs) wrote 2 versions in elisp. See: [Xah's Challenge (Part 2) By Jon Snader. At http://irreal.org/blog/?p=165 , accessed on 2011-07-23 ]. Ι haven't studied it yet.
Python
# python 3 # spec at http://xahlee.org/comp/validate_matching_brackets.html # 2011-07-17 # by Raymond Hettinger input_dir = "/Users/xah/web/xahlee_info/comp/" import os, re def check_balance(characters): '''Return -1 if all delimiters are balanced or the char number of the first delimiter mismatch. ''' openers = { '(': ')', '{': '}', '[': ']', '“': '”', '‹': '›', '«': '»', '【': '】', '〈': '〉', '《': '》', '「': '」', '『': '』', } closers = set(openers.values()) stack = [] for i, c in enumerate(characters, start=1): if c in openers: stack.append(openers[c]) elif c in closers: if not stack or c != stack.pop(): return i if stack: return i return -1 def scan(directory, encoding='utf-8'): for dirpath, dirnames, filenames in os.walk(directory): for filename in filenames: fullname = os.path.join(dirpath, filename) if re.search(r'\.txt$', filename, re.U) and os.path.isfile(fullname): print ( "processing:" + fullname) with open(fullname, 'r', encoding=encoding) as f: try: characters = f.read() except UnicodeDecodeError: continue position = check_balance(characters) if position >= 0: print('{0!r}: {1}'.format(position, fullname)) scan(input_dir) print ("done")
Original post at [a little parsing challenge ☺ At http://groups.google.com/group/comp.lang.python/browse_frm/thread/c52aee1e4392436e/]
Thanks to the many who have written code and made helpful comments. I may come back to clean this up, in the coming weeks. If you can correct one of the following programs, please comment.
Pending Solutions
Python
This one, so far didn't work, possibly unix/Windows issues.
# -*- coding: utf-8 -*- # python 3 # spec at http://xahlee.org/comp/validate_matching_brackets.html # 2011-07-18 # by Thomas Jollans # https://gist.github.com/1087682 # https://gist.github.com/1087682/2240a0834463d490c29ed0f794ad15128849ff8e """ Python 3 solution to Xah Lee's infamous crosspost of July 17 2011. License; Chicken Dance License version 0.2 or newer. For the license and dance, see https://github.com/supertunaman/cdl """ input_dir = "c:/Users/h3/web/xahlee_org/p/time_machine" import os import os.path from concurrent.futures import ProcessPoolExecutor from functools import partial from collections import deque openers = { '(': ')', '{': '}', '[': ']', '“': '”', '‹': '›', '«': '»', '【': '】', '〈': '〉', '《': '》', '「': '」', '『': '』', } brackets = set(openers.keys()) | set(openers.values()) def checkbracket(arg): col, char = arg if char in brackets: return col, char def getbrackets(line): return [x for x in map(checkbracket, enumerate(line)) if x is not None] def process_file(filename): f = open(filename, 'r', encoding='utf-8') stack = deque() for line_no, line in enumerate(map(getbrackets, f)): for col, bracket in line: if bracket in openers: stack.append((bracket, line_no, col)) else: #closer in_, l, c = stack.pop() if bracket != openers[in_]: print('{0}:{1}:{2}: expecting {3} found {4}' .format(filename, line_no, col, openers[in_], bracket)) return False if stack: in_, l, c = stack.pop() print('{0}:{1}:{2}: expecting {3} found {4}' .format(filename, line_no, col, openers[in_], 'EOF')) return False else: return True def main(): with ProcessPoolExecutor() as ex: for dirpath, dirnames, filenames in os.walk(input_dir): for filename in filenames: fullname = os.path.join(dirpath, filename) print (fullname) ex.submit(process_file, fullname) if __name__ == '__main__': main()
doesn't do Unicode and doesn't report error position.
# -*- coding: utf-8 -*- # python 2.6 # spec at http://xahlee.org/comp/validate_matching_brackets.html # 2011-07-18 # by Billy Mays # original at http://groups.google.com/group/comp.lang.python/msg/4bfb61c3346ecad5 input_dir = "c:/Users/h3/web/xahlee_org/p/time_machine" import sys, os pairs = { '}':'{', ')':'(', ']':'[', '>':'<', '”':'“', '›':'‹', } valid = set( v for pair in pairs.items() for v in pair ) for dirpath, dirnames, filenames in os.walk(input_dir): for name in filenames: stack = [' '] with open(os.path.join(dirpath, name), 'rb') as f: chars = (c for line in f for c in line if c in valid) for c in chars: if c in pairs and stack[-1] == pairs[c]: stack.pop() else: stack.append(c) print ("Good" if len(stack) == 1 else "Bad") + ': %s' % name
doesn't seem to be correct. No report on bad file.
# -*- coding: utf-8 -*- # spec at http://xahlee.org/comp/validate_matching_brackets.html ''' Created on 2011-07-18 @author: Thomas 'PointedEars' Lahn <PointedE...@web.de>, based on an idea of Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9...@myhashismyemail.com> in <news:j01ph6$knt$1@speranza.aioe.org> http://groups.google.com/group/comp.lang.python/msg/7f214bd222650c1b ''' input_dir = "c:/Users/h3/web/xahlee_org/p/time_machine" import sys, os pairs = { u'}': u'{', u')': u'(', u']': u'[', u'”': u'“', u'›': u'‹', u'»': u'«', u'】': u'【', u'〉': u'〈', u'》': u'《', u'」': u'「', u'』': u'『'} valid = set(v for pair in pairs.items() for v in pair) if __name__ == '__main__': for dirpath, dirnames, filenames in os.walk(input_dir): for name in filenames: stack = [' '] # you can use chardet etc. instead encoding = 'utf-8' with open(os.path.join(dirpath, name), 'r') as f: reported = False chars = ((c, line_no, col) for line_no, line in enumerate(f) for col, c in enumerate(line.decode(encoding)) if c in valid) for c, line_no, col in chars: if c in pairs: if stack[-1] == pairs[c]: stack.pop() else: if not reported: first_bad = (c, line_no + 1, col + 1) reported = True else: stack.append(c) if len(stack) != 1: print '%s: %s' % (name, ("bad '%s' at %s:%s" % first_bad))
Ruby
#!/usr/bin/env ruby19 # -*- coding: utf-8 -*- # spec at http://xahlee.org/comp/validate_matching_brackets.html # 2011-07-18 by Robert Klemme # https://gist.github.com/1087583 # http://groups.google.com/group/comp.lang.lisp/msg/6f3f518e6782c3c9 # Approach with tr in order to avoid too much backtracking # in the regexp and keep runtime realistic. # my approach summarized # - traverse a directory tree # - for each found item of type "file" # - read the whole content # - throw it at a regexp which is anchored at the beginning # and does the recursive parsing # - report file if the match is shorter than the file # Note: special feature for recursive matching is used which Perl's regexp # engine likely can do as well but many others don't. require 'find' ARGV.each do |dir| Find.find dir do |f| next unless test ?f, f content = File.read(f, :encoding => 'UTF-8') md = %r{ \A (?<tl> [^(){}\[\]“”‹›«»【】〈〉《》「」『』]++ | \( \g<tl>* \) | \{ \g<tl>* \} | \[ \g<tl>* \] | “ \g<tl>* ” | ‹ \g<tl>* › | « \g<tl>* » | 【 \g<tl>* 】 | 〈 \g<tl>* 〉 | 《 \g<tl>* 》 | 「 \g<tl>* 」 | 『 \g<tl>* 』 )* }x.match content len = md[0].length printf "file %s, pos %d\n", f, len if len < content.length end end
Incomplete. Doesn't read dir. Doesn't do Unicode.
# -*- coding: utf-8 -*- # Ruby # # spec at http://xahlee.org/comp/validate_matching_brackets.html # 2011-07-18 # by “WJ” (w_a_x…@yahoo.com) fences = '(){}[]<>' fence_pat = Regexp.new( "[" + Regexp.escape( fences ) + "]" ) stack = [] gets( nil ).scan( fence_pat ){|char| ((p = fences.index( char)).even? and stack.push char) or (o = stack.pop raise "Unmatched: #{o} vs. #{char}" unless o == fences[p-1,1])} puts "Unmatched:", stack unless stack.empty?
Perl
# perl # spec at http://xahlee.org/comp/validate_matching_brackets.html # 2011-07-18 # by Rouslan Korneychuk # http://groups.google.com/group/comp.lang.python/msg/e5aa0217bd1bd587 use strict; use utf8; binmode STDOUT,":utf8"; use File::Find; my $checkre = qr/(?| (\()(?&matched)([\}\]”›»】〉》」』]|$) | (\{)(?&matched)([\)\]”›»】〉》」』]|$) | (\[)(?&matched)([\)\}”›»】〉》」』]|$) | (“)(?&matched)([\)\}\]›»】〉》」』]|$) | (‹)(?&matched)([\)\}\]”»】〉》」』]|$) | («)(?&matched)([\)\}\]”›】〉》」』]|$) | (【)(?&matched)([\)\}\]”›»〉》」』]|$) | (〈)(?&matched)([\)\}\]”›»】》」』]|$) | (《)(?&matched)([\)\}\]”›»】〉」』]|$) | (「)(?&matched)([\)\}\]”›»】〉》』]|$) | (『)(?&matched)([\)\}\]”›»】〉》」]|$) | (^)(?&matched)([\)\}\]”›»】〉》」』])) (?(DEFINE)(?<matched>(?: \((?&matched)\) | \{(?&matched)\} | \[(?&matched)\] | “(?&matched)” | ‹(?&matched)› | «(?&matched)» | 【(?&matched)】 | 〈(?&matched)〉 | 《(?&matched)》 | 「(?&matched)」 | 『(?&matched)』 | [^\(\{\[“‹«【〈《「『\)\}\]”›»】〉》」』]++)*+)) /sx; sub check_file { if(-f && /\.txt$/) { if(open(my $fh,'<:encoding(UTF-8)',$_)) { undef $/; my $data = <$fh>; if($data =~ $checkre) { if(!length $2) { print "File $File::Find::name has unclosed bracket $1 at position $-[1]\n"; } elsif(!length $1) { print "File $File::Find::name has unopened bracket $2 at position $-[2]\n"; } else { print "File $File::Find::name has mismatched brackets $1 $2 at positions $-[1] and $-[2]\n"; } } } else { print STDERR "Cannot open $File::Find::name: $!\n"; } } } @ARGV = ('.') unless @ARGV; find(\&check_file,@ARGV); __END__ [on the regex part] If the pattern matches, there is a mismatched bracket. $1 is set to the mismatched opening bracket. $-[1] is its location. $2 is the mismatched closing bracket or '' if the bracket was never closed. $-[2] is set to the location of the closing bracket or the end of the string if the bracket wasn't closed. I didn't write all that manually; it was generated with this: my @open = ('\(','\{','\[','“','‹','«','【','〈','《','「','『'); my @close = ('\)','\}','\]','”','›','»','】','〉','》','」','』'); '(?|'.join('|',map {'('.$open[$_].')(?&matched)(['.join('',@close[0..($_-1),($_+1)..$#close]). ']|$)'} (0 .. $#open)).')(?(DEFINE)(?<matched>(?:'.join('|',map {$open[$_].'(?&matched)'.$close[$_]} (0 .. $#open)).'|[^'.join('',@open,@close).']++)*+))'
Common Lisp
;; common lisp ;; spec at http://xahlee.org/comp/validate_matching_brackets.html ;; 2011-07-17 ;; By Madhu <enom...@meer.net> ;; http://groups.google.com/group/comp.lang.lisp/msg/8cdb2b6b55c9af93 ;; Just for LOLS (and just using the Common Lisp reader), using CCL as a ;; unicode-capable lisp with the :UTF-8 external-format, ;; i.e. "ccl -K :UTF-8": (defvar +matching-pair-string+ (delete #\Space "() {} [] “” ‹› «» 【】 〖〗 〈〉 《》 「」 『』")) (defvar *matching-delim* nil) (defvar *offending-position* nil) (defun delim-reader (stream char &aux (i (position char +matching-pair-string+))) (assert (evenp i) nil "Unexpected ~C~@[ at file-position ~D~].~@[ Wanted ~C.~]~@[ Start ~D.~]~&" char (ignore-errors (file-position stream)) *matching-delim* *offending-position*) (let ((*matching-delim* (elt +matching-pair-string+ (1+ i))) (*offending-position* (ignore-errors (file-position stream)))) (read-delimited-list *matching-delim* stream t) nil)) (defun lee-readtable (&optional (readtable (copy-readtable nil))) (loop for c across ;; 2.1.3 standard characters (overkill, but ;; should include all macro characters...) "aAbBcCdDeEfFgGhHiIjJkKlLmM12345nNoOpPqQrRsStTuUvVwWxXyYzZ1234567890!$\"'() ,_-./:;?+<=>#%&*@[\\]{|}`^~" do (set-macro-character c nil t readtable) (set-syntax-from-char c #\Space readtable)) (loop for c across +matching-pair-string+ do (set-macro-character c 'delim-reader nil readtable)) readtable) (defvar $lee-package (make-package "LEEDUMMY")) (defparameter $lee-readtable (lee-readtable)) (defun lee-read-file (pathname &aux (*readtable* $lee-readtable) (*package* $lee-package)) (with-open-file (stream pathname :external-format :utf-8) (read stream))) #+nil (with-open-file (stream "/tmp/1" :direction :output :if-exists :supersede :external-format :utf-8) (write-string "([)], (“[[”), ((, 】" stream)) #+nil (lee-read-file "/tmp/1") ;; > Error: Unexpected ) at file-position 3. Wanted ]. Start 2.
#!/usr/bin/clisp -ansi ;; spec at http://xahlee.org/comp/validate_matching_brackets.html ;; 2011-07-18 ;; Pascal J Bourguignon ;; http://groups.google.com/group/comp.lang.lisp/msg/c0989e261d6c0ea0 ;; using ccl http://www.clozure.com/clozurecl.html ;; You only need a trivial push-down automata: (loop :with parentheses = "(){}[]“”‹›«»【】〈〉《》「」『』" :with stack = '() :for ch = (read-char *standard-input* nil nil) :for pos = (and ch (position ch parentheses)) :while ch :do (cond ((null pos)) ((evenp pos) (push ch stack)) (t (if (eql (first stack) (aref parentheses (1- pos))) (pop stack) (error "Unmatched parenthesis: ~A vs ~A" (first stack) ch)))) :finally (when stack (error "Unmatched parentheses: ~S" stack)))
- Unicode Support in Programing Language Function Name and Operator
- One Language to Rule Them All?
- Python, Lambda, Guido: is Language Design Just Solving Puzzles?
- What Are Good Qualities of Computer Language Syntax?
- Programing Language Design: String Syntax
- Guy Steele Says: Don't Iterate, Recurse, and Get rid of lisp cons!
- What is List Comprehension and Why is it Harmful?
- What is Point-free Programing? (point-free function syntax)
- Programing Problem: Normalize a Vector of Any Dimension
- In-place Algorithm, Reverse List in JavaScript, Python, Perl, Lisp, Wolfram Lang
- Programing Problem: Construct a Tree Given Its Edges
- Programing: Decimalize Latitude Longitude
- One Language to Rule Them All? Or, What Language to Use for Find Replace?
- Programing Challenge: Replace String Pairs