Parse GFM Extended Autolinks #57

dillonkearns · 2020-10-02T18:33:33Z

We currently handle the CommonMark autolinks, which are links explicitly wrapped in angle brackets (<>).

However, we are not parsing GitHub-Flavored Markdown's extended autolinks, which are bare links with no explicit delimiter. That the text should be parsed as a link is inferred from its format, for example content starting with https:// and followed by a valid domain.

You can see the current end-to-end spec failures here:

https://github.com/dillonkearns/elm-markdown/blob/master/test-results/failing/GFM/%5Bextension%5D%20Autolinks.md

This issue will be complete when we've made those end-to-end tests pass.

Existing Inline Parsing Code

Note that the inline parsing code does not use elm/parser, because Markdown inline parsing uses a very different algorithm than block parsing and is not well-suited to elm/parser. The details of why are not important for this issue, but it's worth being aware that this code is based on Regex processing.

Here's the current area where CommonMark-style autolinks are handled:

elm-markdown/src/Markdown/InlineParser.elm

Lines 1101 to 1107 in 40f9dc4

    
    angleBracketsToMatch : Token -> Escaped -> List Match -> References -> String -> ( Token, List Token, List Token ) -> Maybe ( List Token, List Match )
    angleBracketsToMatch closeToken escaped matches references rawText ( openToken, _, remainTokens ) =
        let
            result =
                tokenPairToMatch references rawText (\s -> s) CodeType openToken closeToken []
                    |> autolinkToMatch
                    |> ifError emailAutolinkTypeToMatch

Note that this is only applied within angleBracketsToMatch. We can likely reuse some of the autolinkToMatch code, but outside the context of an angle-bracket match.
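As a rough illustration of what detecting a bare link could look like in a Regex-based approach, here is a minimal sketch. It is not taken from the codebase: the module name, bareUrl, and findBareUrls are hypothetical, and the pattern only approximates the GFM rule that an extended autolink must appear at the start of a line, after whitespace, or after one of *, _, ~, or (. It does not validate the domain or trim trailing punctuation.

```elm
module BareUrlCandidates exposing (findBareUrls)

import Regex exposing (Regex)


{-| Hypothetical pattern: an allowed preceding character (or the start of
the input), then http:// or https://, then a run of characters that are
neither whitespace nor '<'.
-}
bareUrl : Regex
bareUrl =
    Regex.fromString "(^|[\\s*_~(])(https?://[^\\s<]+)"
        |> Maybe.withDefault Regex.never


{-| Collect the raw URL candidates found in a piece of inline text.
The second capture group holds the candidate itself.
-}
findBareUrls : String -> List String
findBareUrls rawText =
    Regex.find bareUrl rawText
        |> List.filterMap
            (\found ->
                case found.submatches of
                    [ _, Just url ] ->
                        Just url

                    _ ->
                        Nothing
            )
```

For example, findBareUrls "Visit https://elm-lang.org now (http://example.com)" yields [ "https://elm-lang.org", "http://example.com)" ]. The trailing parenthesis on the second candidate shows why the extra trimming rules from the spec would still be needed on top of a pattern like this.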


stephenreddek · 2020-10-20T20:31:20Z

@dillonkearns I'll work on this one if it's up for grabs!

dillonkearns · 2020-10-20T20:32:47Z

It's all yours, thank you @stephenreddek! 👌 💯

stephenreddek · 2020-10-21T15:25:35Z

@dillonkearns What are your thoughts on how to handle multiple trailing "entity references" per https://github.github.com/gfm/#example-626 ? It only explicitly mentions handling a single, trailing reference, but it sure feels like it should remove multiple of them if they exist.

stephenreddek · 2020-10-21T15:35:42Z

One piece of supporting evidence for the idea of trimming all: the parentheses rule removes all trailing unmatched parentheses.

stephenreddek · 2020-10-21T16:05:16Z

Another question! The spec for URL autolinks only mentions support for the protocols http and https, but there's a test that also shows support for ftp. It's easy enough to add ftp specifically, but I'm unsure how to handle this seemingly conflicting information. Should it support only those 2 or 3? Should it support anything that looks like a protocol?

Thanks for any guidance you have!

dillonkearns · 2020-10-28T18:42:29Z

Hey @stephenreddek!

Good questions. So for the URL schemes, my thinking is that it should either 1) be very specific (only the explicit ones mentioned, http and https), or 2) be completely general (anything in the form of scheme://).

I don't like the idea of hardcoding a specific set when there are so many possible schemes: https://en.wikipedia.org/wiki/List_of_URI_schemes. And indeed, many different possible valid URLs.

Babelmark tends to treat general schemes, like a slack:// link, as plain text (not autolinks):

https://babelmark.github.io/?text=This+is+a+slack+https+link%3A+https%3A%2F%2Fslack.com%2Fapp_redirect%3Fapp%3DA1BES823B%0A%0AThis+is+a+direct+slack+link%3A+slack%3A%2F%2Fopen%3Fteam%3Dmy-team%0A

So let's go with option (1) on this, and only handle the specific cases of http and https 👍
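To make the difference between the two options concrete, here is a small sketch. The names hasHttpScheme and hasAnyScheme are made up for illustration and are not part of the codebase. The general pattern follows the scheme syntax from RFC 3986 (a letter followed by letters, digits, +, -, or .), and matching is case-sensitive here, which is an assumption rather than something decided in this thread.

```elm
module AutolinkScheme exposing (hasAnyScheme, hasHttpScheme)

import Regex exposing (Regex)


{-| Option (1): only the schemes the spec names explicitly.
-}
httpScheme : Regex
httpScheme =
    Regex.fromString "^https?://"
        |> Maybe.withDefault Regex.never


{-| Option (2): anything of the form scheme://.
-}
anyScheme : Regex
anyScheme =
    Regex.fromString "^[A-Za-z][A-Za-z0-9+.-]*://"
        |> Maybe.withDefault Regex.never


hasHttpScheme : String -> Bool
hasHttpScheme candidate =
    Regex.contains httpScheme candidate


hasAnyScheme : String -> Bool
hasAnyScheme candidate =
    Regex.contains anyScheme candidate
```

With these definitions, hasHttpScheme "slack://open?team=my-team" is False while hasAnyScheme accepts it, so option (1) leaves the slack:// link as plain text.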

dillonkearns · 2020-10-28T18:47:45Z

Regarding trailing entity references, it seems right to me that we should remove multiple references. What happens on Babelmark? I often let that be the tie breaker when I'm not sure, with a little extra weight given to the results from the official C implementation of GitHub Flavored Markdown.

stephenreddek · 2020-10-30T22:21:00Z

Yep, the official implementation drops them all so I'll just go with that!
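For reference, a minimal sketch of what "drop them all" could look like. This is an illustration only, not the code from the PR: splitTrailingEntities and its helper are hypothetical names, and the pattern follows the spec's description of an entity-like suffix (an & followed by one or more alphanumeric characters and a closing ;).

```elm
module TrailingEntities exposing (splitTrailingEntities)

import Regex exposing (Regex)


{-| Something that resembles an entity reference at the very end of the
candidate, e.g. "&hl;" or "&amp;".
-}
trailingEntity : Regex
trailingEntity =
    Regex.fromString "&[A-Za-z0-9]+;$"
        |> Maybe.withDefault Regex.never


{-| Split a candidate into ( link, trailing ), where trailing holds every
entity-like suffix that should stay outside the link.
-}
splitTrailingEntities : String -> ( String, String )
splitTrailingEntities candidate =
    splitHelp candidate ""


splitHelp : String -> String -> ( String, String )
splitHelp link trailing =
    case Regex.find trailingEntity link of
        found :: _ ->
            -- Peel off one trailing reference and check the remainder again.
            splitHelp
                (String.left found.index link)
                (found.match ++ trailing)

        [] ->
            ( link, trailing )
```

For the input "http://www.google.com/search?q=commonmark&hl;&amp;" this returns ( "http://www.google.com/search?q=commonmark", "&hl;&amp;" ), so both trailing references end up outside the link. A reference in the middle of the URL, such as the &hl in "...q=commonmark&hl=en", is left untouched because it is not followed by ; at the end.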

dillonkearns added the help wanted and hacktoberfest labels Oct 2, 2020

stephenreddek mentioned this issue Oct 23, 2020

Parse GFM Extended autolinks #71

Draft

LutSa mentioned this issue Apr 12, 2021

Student Proposal: Improve pure Elm markdown parser elm-tooling/gsoc-projects#17

Open
