Skip to content

Latest commit

 

History

History
195 lines (129 loc) · 5.01 KB

README.md

File metadata and controls

195 lines (129 loc) · 5.01 KB

parsed

Simple PEG parser builder written on Python.

There are many parser generators for python but each of 'em has IMO some disadvantages: e.g. not suitable for stream processing, or does not allow to track parsing and understand where it was failed, or does not support mode w/o skipping whitespaces, or has ugly rules syntax, or does not allow to attach semantic action in-place...

"Parsed" is inspired by "arpeggio" (http://arpeggio.googlecode.com/). This parser generator is interesting and there was the chance I'll start to use it but... maybe I am too lazy to study the code and modify it to correspond my needs or, most probably, it is interesting to write own parser... so I wrote One More parser generator.

Writing new grammar parsers

To create parser and use predefined rules one need to import parsed package.

    from parsed import *

Parser generator function accepts no parameters and is decorated by @rule, it should return a rule, e.g.:

    @rule
    def abc(): return char('abc') > value

As expected result of calling a generator function is a parser. Parser gets its name from generator function name. This is why lambdas etc. are not used. To correctly parse iterable source one should wrap it with the source() function.

    parser = abc()
    src = 'a'
    position, result = parser.parse(src)
    if result != nomatch:
        print "MATCH", result, "@", position
    else:
        print "NO MATCH"

Rules

Character matching rule

  • against single char

      #single char, ignore by default
      @rule
      def is_A(): return char('A')
    
  • against char from iterable

      #matches against LF or CR, on match by default returns
      #matched character
      @rule
      def vspace(): return char('\r\n')
    
  • against predicate

      import string
    
      #boolean predicate to test a char for some condition
      def __is_punct(c): return c in string.punctuation
    
      #any char matching predicate, return a value by default
      @rule
      def is_punctuation(): return char(__is_punct)
    

Sequence

Matching sequence of rules, using operator '+':

    #hash symbol('#'), followed by any 'abcABC'
    @rule
    def hashed_abc(): return char('#') + 'abcABC'

Choice

Short circuiting 'OR':

    #hashed_abc from the example above OR any char from 'abc' set
    @rule
    def hashed_abc_or_abc(): return hashed_abc | 'abc'

Negation

    #any character except 'a', does not consume input
    @rule
    def not_a(): return ~char('a')

Repetition

    #character 'a' repeated 1 or more times
    @rule
    def one_or_more_a(): return char('a')[1:]

    #character 'a' repeated 0 or more times
    @rule
    def zero_or_more_a(): return char('a')[0:]

    #character 'a' or its absence
    @rule
    def maybe_a(): return char('a')[0:1]

Lookahead

Lookahead can be expressed or negating (adding prefix '-' operator) a rule or by appending rule to sequence using bitwise AND ('&') operator.

    #character 'a', matches only if followed by any character from 'abc'
    #set, does not consume input
    @rule
    def a_before_abc(): return char('a') + -char('abc')

    #the same as above, in this case lookahead match is always
    #excluded from parsing results
    @rule
    def also_a_before_abc(): return char('a') & char('abc')

Parsing (semantic) action

    #extract a list of characters from double quoted string
    #consisting from 'abc' characters
    @rule
    def dquoted_abc(): return '"' + char('abc')[1:] + '"' > first


    #compose 2 characters in string like "{CHAR1}&{CHAR2}" if first is
    #from 'abc' set and second - from 'def'. lambda should be enclosed in
    #braces because it has lowest precendence
    @rule
    def abc_def(): return char('abc') + 'def' > (lambda x: x[0] + '&' + x[1])

    #lookahead result can be also included in parsing results if
    #lookahead declared using prefix '-' but not binary '&'
    @rule
    def a_before_abc(): return char('a') + (-char('abc') > value)

Examples

All examples above can be run and tested by running examples/readmy.py

Predefined rules

TODO

Predefined constants

    nomatch #means rule is not matched

    empty #end of input (eof) or ignored result

TODO

Predefined actions

    list2str #string from list

    first, second #first/second element of list

    nth(N) #returns action extracting Nth list element

    ignore #return empty constant

TODO

Parsing options

  • is_trace: if it is set to True, parsing process will be traced into stderr

  • is_remember: if True --- use memoization

  • use_unicode

  • is_stat: gather matching statistics (now only hits and miss count for each rule)

What's next?

  • add option to ignore whitespaces

  • add Abstract Syntax Tree visualization

  • parser optimization

  • more unit tests to avoid regressions

  • ?..