Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quadratic behavior when parsing smart quotes #521

Open
nwellnhof opened this issue Jan 15, 2024 · 4 comments
Open

Quadratic behavior when parsing smart quotes #521

nwellnhof opened this issue Jan 15, 2024 · 4 comments

Comments

@nwellnhof
Copy link
Contributor

Reproduce with:

python3 -c 'print("\x27x " * 30000 + "\x27\"x" * 30000)' |\
    time build/src/cmark --smart >/dev/null

After unquoting, this input looks like

'x 'x 'x 'x 'x '"x'"x'"x'"x'"x

This also works with emphasis:

python3 -c 'print("\"x " * 30000 + "\"_x" * 30000)' |\
    time build/src/cmark --smart >/dev/null

When matching a pair of smart quotes, we currently don't remove delimiters (emphasis or other smart quotes) between the quotes. This can lead to quadratic behavior when processing deeply nested quotes interspersed with emphasis or different quotes.

A simple way to fix this issue is to delete delimiters between smart quotes but this changes the parsing rules. Another, more complicated approach is to handle single quotes, double quotes and emphasis in separate lists.

This is the only remaining issue with quadratic complexity in the parser that I could find with a specialized fuzzer similar to the "quadratic" fuzzers in cmark-gfm. Note that there are still some issues in the serializers.

@jgm
Copy link
Member

jgm commented Jan 15, 2024

I think removing delims between the quotes makes sense (not sure why I didn't do that originally?).

@nwellnhof
Copy link
Contributor Author

Here's an example how removing delimiters between quotes changes parsing results:

Before:

% echo "'a _b' c_ d" |build/src/cmark --smart
<p>‘a <em>b’ c</em> d</p>

After (the underscore between quotes will be ignored):

% echo "'a _b' c_ d" |build/src/cmark --smart
<p>‘a _b’ c_ d</p>

I also noticed that smart quotes are always converted even if no matching pair is found.

@jgm
Copy link
Member

jgm commented Jan 15, 2024

I guess the question is whether we want to treat smart quotes as at the same precedence level as _ and *, or as having a lower precedence level.

Since it's an extension, this isn't decided one way or the other by the spec.

I suppose one could argue that spec'd behavior should take precedence, so that smart quotes should have a lower precedence level. Not sure how strong that is.

In any case, these are corner cases; it's hard to imagine this would affect anything serious.

@notriddle
Copy link
Contributor

Other than the commonmark-hs smart quotes extension, most implementations don't do that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants
@jgm @nwellnhof @notriddle and others