Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse GFM Extended Autolinks #57

Open
dillonkearns opened this issue Oct 2, 2020 · 8 comments
Open

Parse GFM Extended Autolinks #57

dillonkearns opened this issue Oct 2, 2020 · 8 comments
Labels
hacktoberfest help wanted Extra attention is needed

Comments

@dillonkearns
Copy link
Owner

We currently handle the CommonMark autolinks, which are links with explicit surrounding <>'s.

However, we are not parsing the GitHub-Flavored Markdown's extended autolinks, which are bare links with no explicit token. The fact that it should be parsed as a link is inferred by the format, for example content starting with https:// and followed by a valid domain.

You can see the current end-to-end spec failures here:

https://github.com/dillonkearns/elm-markdown/blob/master/test-results/failing/GFM/%5Bextension%5D%20Autolinks.md

This issue will be complete when we've made those end-to-end tests pass.

Existing Inline Parsing Code

Note that the inline parsing code does not using elm/parser because Markdown inline parsing using a very different algorithm than the block parsing, and it's not well-suited to elm/parser. The details of why are not important in this issue, but it's worth being aware that this code is based on Regex processing.

Here's the current area where CommonMark-style autolinks are handled:

angleBracketsToMatch : Token -> Escaped -> List Match -> References -> String -> ( Token, List Token, List Token ) -> Maybe ( List Token, List Match )
angleBracketsToMatch closeToken escaped matches references rawText ( openToken, _, remainTokens ) =
let
result =
tokenPairToMatch references rawText (\s -> s) CodeType openToken closeToken []
|> autolinkToMatch
|> ifError emailAutolinkTypeToMatch

Note that it is only applying this in the context of angleBracketsToMatch. We can likely reuse some of the autolinkToMatch code, but outside of the context of an angle brackets match.

@dillonkearns dillonkearns added help wanted Extra attention is needed hacktoberfest labels Oct 2, 2020
@stephenreddek
Copy link
Contributor

@dillonkearns I'll work on this one if it's up for grabs!

@dillonkearns
Copy link
Owner Author

It's all yours, thank you @stephenreddek! 👌 💯

@stephenreddek
Copy link
Contributor

@dillonkearns What are your thoughts on how to handle multiple trailing "entity references" per https://github.github.com/gfm/#example-626 ? It only explicitly mentions handling a single, trailing reference, but it sure feels like it should remove multiple of them if they exist.

@stephenreddek
Copy link
Contributor

One piece of supporting evidence for the idea of trimming all: the parentheses rule removes all trailing unmatched parentheses.

@stephenreddek
Copy link
Contributor

Another question! The spec for url autolinks only mentions support he protocols http and https, but there's a test that also shows supporting ftp. It's easy enough to add ftp specifically, but I'm unsure how to handle this seemingly conflicting information. Should it support only those 2 or 3? Should it support anything that looks like a protocol?

Thanks for any guidance you have!

@dillonkearns
Copy link
Owner Author

Hey @stephenreddek!

Good questions. So for the URL schemes, my thinking is that it should either 1) be very specific (only the explicit ones mentioned, http and https), or 2) be completely general (anything in the form of scheme://).

I don't like the idea of hardcoding a specific set when there are so many possible schemes: https://en.wikipedia.org/wiki/List_of_URI_schemes. And indeed, many different possible valid URLs.

Babelmark tends to treat general schemes, like a slack:// link, as plain text (not autolinks):

https://babelmark.github.io/?text=This+is+a+slack+https+link%3A+https%3A%2F%2Fslack.com%2Fapp_redirect%3Fapp%3DA1BES823B%0A%0AThis+is+a+direct+slack+link%3A+slack%3A%2F%2Fopen%3Fteam%3Dmy-team%0A

So let's go with option (1) on this, and only handle the specific cases of http and https 👍

@dillonkearns
Copy link
Owner Author

Regarding trailing entity references, that seems right to me that we should remove multiple references. What happens on babelmark? I often let that by the tie breaker when I'm not sure, with a little extra weight given to the results from the official C implementation for the GitHub Flavored Markdown engine's results.

@stephenreddek
Copy link
Contributor

Yep, the official implementation drops them all so I'll just go with that!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hacktoberfest help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants