Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Code blocks buggy in markdown renderer #3592

Open
jmsmkn opened this issue Oct 8, 2024 · 11 comments · May be fixed by #3764
Open

Code blocks buggy in markdown renderer #3592

jmsmkn opened this issue Oct 8, 2024 · 11 comments · May be fixed by #3764
Assignees

Comments

@jmsmkn
Copy link
Member

jmsmkn commented Oct 8, 2024

MarkDown such as:

Screenshot 2024-10-08 at 14 24 24

Should render as:

shouldn't be a code block

first_line = 1
for foo in bar:
  print(foo)

But actual result is:

Screenshot 2024-10-08 at 14 24 49

Note:

  • The first line should not be a code block
  • The first line of the python block for some reason has a space
  • Multiple blank lines at the start of a MarkDown block should not result in a code block
@chrisvanrun
Copy link
Contributor

+1

Wanted to report this:
image

But it's captured by the above issue post.

@chrisvanrun
Copy link
Contributor

Using this:


This line shouln't be a code block

A single space should not be a code block

This line should be a code block: two spaces in front

Now for the fenced style of coding, using ```:

first_line = 1
for foo in bar:
  print(foo)

And another one:

# Typed with Python
first_line = 1
for foo in bar:
  print(foo)

Using ~x3 also works:

first_line = 1
for foo in bar:
  print(foo)

How about we manually introduce an inline code HTML tag:

This should work: Hello world

Inline, using enapsulating single back-ticks (`): var and foo=True


Currently result in:

image

@chrisvanrun
Copy link
Contributor

The first line should not be a code block

Possibly it had 2 spaces? I think I prefer the explicit fenced code for code. So that should go.

Multiple blank lines at the start of a MarkDown block should not result in a code block

Could not reproduce this.

@ammar257ammar
Copy link
Contributor

ammar257ammar commented Dec 18, 2024

I checked this one too. Only from two spaces on the text becomes a codeblock

Screenshot 2024-12-18 at 11 52 46

@chrisvanrun
Copy link
Contributor

The cause of these is we have 3 extensions in place to handle code in markdown -> html:

  • BS4Treeprocessor which adds "codehilite" class to any <code> tags. Not always correct since the inline use of explicit <code> seems broken.
  • markdown.extensions.codehilite
  • markdown.extensions.fenced_code

They interact, sometimes weirdly.

@chrisvanrun
Copy link
Contributor

chrisvanrun commented Dec 19, 2024

Quick update.

Guh, quite the rabbit hole. Trying to fix the last blemish on the <code> rendering:

image

The BS4TreeProcessor is messing up some HTML snippets that the markdown core generates. For instance, the HTMLExtractor parses "<code>Hello</code>" as these two snippets: <code> and </code>. The BeautifulSoup parser, used internal by the BS4TreeProcessor handles incomplete HTML as input but outputs the fully corrected versions. Post cleaning then removes any orphaned end-tags (I think?). So <code class="codehilite"></code>Hello is the final result.

Figuring out why the HTMLExtractor decides that it's two seperate snippets is one solution. However, it's based on a monkey-patched HTML parser and hence quite complex. Marrying MD and HTML in a single parse tree is food for overly complex things, so I like the alternative better:

Add a 'final' pure-HTML parser postprocessor that adds the BS4 classes. It's quite simple to setup, with some added whistles for making things 'safe':

class ExtendTagClasses:
    def __init__(self, tag_classes):
        self.tag_class_dict = tag_classes

    def __call__(self, html):
        input_is_safe = isinstance(html, SafeString)

        soup = BeautifulSoup(html)
        for tag, classes in self.tag_class_dict.items():

            # Make extensions safe
            classes = [escape(c).strip() for c in classes]

            # Add extension to the class attribute
            for element in soup.find_all(tag):
                current_classes = element.get("class", [])
                element["class"] = [*current_classes, *classes]

        new_html = str(soup)

        if input_is_safe:
            new_html = mark_safe(new_html)

        return new_html

It's a nice construct but has the side-effect that the obfuscated email links (i.e. <a href="&#109;&#97;&#105;&#108;&#116;&#111;&#58;&#102;&#97;&#107;&#101;&#46;&#101;&#109;&#97;&#105;&#108;&#64;&#101;&#109;&#97;&#105;&#108;&#46;&#99;&#111;&#109;">&#102;&#97;&#107;&#101;&#46;&#101;&#109;&#97;&#105;&#108;&#64;&#101;&#109;&#97;&#105;&#108;&#46;&#99;&#111;&#109;</a>. are being de-obfuscated because of the way BeautifulSoup default parser reads the HTML source.

Not sure if at this point it is better to:

  • Add a small email obf post-processor.
  • Spent time trying to get the original TreeProcessor to work.

@chrisvanrun
Copy link
Contributor

Do note: the email's are being de-obfuscated by the original TreeParser already if it happens to be within a targeted add-class tag!

@jmsmkn
Copy link
Member Author

jmsmkn commented Dec 19, 2024

Err...

        return mark_safe(new_html)

@chrisvanrun
Copy link
Contributor

chrisvanrun commented Dec 19, 2024

Err...

        return mark_safe(new_html)

Work in progress! It will have tests but indeed: that should not go there!

@jmsmkn
Copy link
Member Author

jmsmkn commented Dec 19, 2024

You can remove a for loop I think:

        for element in soup.find_all([*self.tag_class_dict.keys()]):
            current_classes = element.get("class", [])
            element["class"] = [*current_classes, *self.tag_class_dict[element.name]]

@chrisvanrun chrisvanrun linked a pull request Dec 20, 2024 that will close this issue
@chrisvanrun chrisvanrun linked a pull request Dec 20, 2024 that will close this issue
@chrisvanrun
Copy link
Contributor

chrisvanrun commented Dec 20, 2024

After serveral different appoaches I had the hunch of pre-escaping the character/entity references. Working PR is out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants