
WikiText to HTML parser? #11

Closed
rraallvv opened this issue Mar 4, 2018 · 6 comments

rraallvv commented Mar 4, 2018

Can MwParserFromScratch be used in a WikiText to HTML parser? Thanks.


CXuesong commented Mar 5, 2018

Nope, that's beyond the scope of this project. Due to the inherent messiness of wikitext syntax, caused by its history, the AST generated by this parser can only roughly represent the input wikitext. You would need completely different parsing logic to turn wikitext into HTML: only by replacing the various markups with HTML over and over again can you exactly simulate what the MediaWiki parser does.

My little parser is intended to be used by MediaWiki bots to analyze the structure of wikitext, so something like an AST is handy for that, and a natural way to parse an AST out of the wikitext is to write a recursive descent parser by hand. However, recursive descent parsers work best for context-free grammars, and wikitext is obviously not context-free. So I did some heavy customization on it. Even so, it cannot handle the input in a bullet-proof fashion. (E.g. #1, #8)

Another source of trouble is template expansion (a.k.a. transclusion). The meaning of the same token can vary dramatically depending on how templates expand. For example, the Test in

{{L}} Test {{R}}

is rendered as a link if Template:L is [[ and Template:R is ]]. But if Template:R is def, then the whole line will be rendered as the plain text [[ Test def. If Template:L is {{ and Template:R is }}, then we get yet another template to expand ({{ Test }}). The point is, we cannot make one or two passes over the wikitext and generate the final HTML; we need to parse it over and over again, until all the templates have been expanded. Yes, we can do that, but even after such painful parsing, we still cannot simulate exactly what MediaWiki outputs, because of the first problem I mentioned above.
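The need for repeated passes can be sketched as a fixpoint loop. This is a toy illustration, not MediaWiki's actual algorithm; the template table, the regex, and the pass limit are all assumptions made for the demo:

```python
import re

# Hypothetical template table (illustrative only). With these bodies,
# "{{L}} Test {{R}}" expands to a wiki link: "[[ Test ]]".
templates = {"L": "[[", "R": "]]"}

# Match the simplest template call form, {{Name}}, with no arguments.
TEMPLATE_CALL = re.compile(r"\{\{\s*([^{}|]+?)\s*\}\}")

def expand(text, max_passes=10):
    """Repeatedly substitute {{Name}} with its template body until the
    text stops changing (or we give up after max_passes). Unknown
    template names are left in place."""
    for _ in range(max_passes):
        new = TEMPLATE_CALL.sub(
            lambda m: templates.get(m.group(1), m.group(0)), text)
        if new == text:  # fixpoint reached: nothing left to expand
            break
        text = new
    return text
```

Note that if Template:L were `{{` and Template:R were `}}`, the first pass would produce `{{ Test }}`, which the next pass would then try to expand as a template call named Test — exactly the token-meaning instability described above.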

If you, or anyone else concerned, were to write a wikitext-to-HTML parser, I would suggest you throw recursive descent parsing away and just do what MediaWiki does, i.e. apply regex substitutions over and over again. If you just want to show a preview for a wikitext snippet, use the MediaWiki render API and let it do the parsing job for you. As an alternative, you may search for HTML dumps of the WMF projects (though the dumps are rather dated now).


rraallvv commented Mar 6, 2018

@CXuesong , I've been trying to use the MediaWiki API to render a small definition page from WikiText, but just when I thought I had something working I decided to try a different language XD

...well, for all the reasons you explained, it didn't turn out too well.

Thanks for the explanations, they really helped me understand why everybody seems to agree on that point.

rraallvv closed this as completed Mar 6, 2018

CXuesong commented Mar 6, 2018

Well, in that case, I definitely suggest you use something like Markdown instead of wikitext…


rraallvv commented Mar 6, 2018

@CXuesong, the problem with rendering pages server-side with the MediaWiki API is that the output gets cluttered with unhelpful stuff, for instance the table of contents and section-edit links (the ones that look like Some section [Edit]). Retrieving the content as raw WikiText appeared more manageable, as I said, until I tried to use the WikiText-to-HTML converter that I had created for English on other languages. Since you suggested Markdown could be used instead, I tried looking for something in the API to get Markdown from the server side, but couldn't find anything related.


CXuesong commented Mar 6, 2018

Okay, I got this wrong; just ignore my last post. You are going to parse from wikitext anyway. In that case, you may just need to take a look at the disableeditsection and disabletoc parameters for the parse action.
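A minimal sketch of such a request, as a URL builder. The endpoint, page title, and helper name are placeholders for illustration; only disableeditsection and disabletoc come from the suggestion above:

```python
from urllib.parse import urlencode

def build_parse_url(page_title,
                    endpoint="https://en.wikipedia.org/w/api.php"):
    """Build an action=parse request URL asking a MediaWiki server to
    render a page to HTML without section-edit links or a TOC."""
    params = {
        "action": "parse",
        "format": "json",
        "page": page_title,
        "prop": "text",             # return only the rendered HTML
        "disableeditsection": 1,    # strip the "[edit]" section links
        "disabletoc": 1,            # strip the table of contents
    }
    return endpoint + "?" + urlencode(params)
```

Fetching the resulting URL returns JSON containing the rendered HTML under the parse response object.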


rraallvv commented Mar 6, 2018

@CXuesong , adding those parameters really does help strip the content I don't want, thanks.
