Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect regex translation #32

Open
agentzh opened this issue Mar 27, 2015 · 6 comments
Open

Incorrect regex translation #32

agentzh opened this issue Mar 27, 2015 · 6 comments

Comments

@agentzh
Copy link
Contributor

agentzh commented Mar 27, 2015

Hi Ingy!

I ran into a bug in the Pegex grammar translator that took me a while to figure out. Basically the following Pegex regex

    / 'a' | 'b' /

is translated into the following Perl regex:

    qr/\Ga|b/

while the expected Perl regex should be

qr/\G(?:a|b)/

Adding (: ...) to the original Pegex regex worked around this bug.

@ingydotnet
Copy link
Collaborator

@agentzh I don't see this as a bug. Pegex cannot know when to add parens in a regex.

/ (: 'a' | 'b' ) /

or:

/ (:a|b) /

Would be the proper way to do it.

https://github.com/ingydotnet/pegex-pgx/blob/master/pegex.pgx#L53 is the regex parsing rule.

It just splits up stuff between 2 / into ws, strings, refs and raw. It ignores whitespace, escapes strings, expands refs and leaves raw intact. Then it joins them together and slaps a \G in front (for Perl5).

Maybe a doc patch is in order somewhere?

@agentzh
Copy link
Contributor Author

agentzh commented Mar 27, 2015

@ingydotnet I can understand the implementation complications here but mandating a (: ...) on the user side looks counter-intuitive and error-prone. The simplest / 'a' | 'b' / form becomes a pitfall that is tricky to debug. Alas. IMHO I hope Pegex could do better here instead of patching the documentation :)

@ingydotnet
Copy link
Collaborator

The problem here is scope of concern. I'd have to special case that. And how does that compare to

/ 'a' x 'b' /

To pegex the | is just a raw string. Pegex doesn't try to understand the regex meaning.

Also these 2 are equal:

/ 'a' | 'b' /
/a|b/

Now what could work is this:

'a' | 'b'

The pipe is Pegex syntax here. It's between 2 regexes that can be optimized into:

/(?:a|b)/

The optimizer would know that it must add parens.

Maybe we can add degugging that shows the regex or something.

I'm happy to make Pegex more user friendly, but (unless I don't understand what you are wanting) I see this as leading to a world of special cases.

@agentzh
Copy link
Contributor Author

agentzh commented Mar 27, 2015

@ingydotnet Yeah, it requires more work on the regex translator to get exactly right. I'll try living with it for now :) Not a big deal for me.

@ingydotnet
Copy link
Collaborator

The optimizer should be made to handle this correctly (I now believe).

@agentzh
Copy link
Contributor Author

agentzh commented Apr 1, 2015

@ingydotnet Yay!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants