Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add support for HTML, Markdown and Typst files #127

Open
wants to merge 5 commits into
base: v3
Choose a base branch
from

Conversation

Rolv-Apneseth
Copy link
Collaborator

Continuing #117, as per our discussion.

This PR adds parsers for HTML, Markdown and Typst files, which are used in the CLI to support different file types. Auto detection per file is used by default, which will just choose a parser based on the file's extension.

I think the parsers are working OK, but wanted to open this PR as soon as possible so you can let me know what you think, and to see if you have some files that maybe this doesn't work great for.

I decided to take the same path that pyLanguagetool took for Markdown files, by first converting them to HTML, and then using a separate library for parsing that (html_parser in our case). I figured we'd probably want support for HTML eventually anyway, but you can let me know if you think I should take a similar approach to the HTML and Typst parsers.

As for the HTML and Typst parsers, I've inserted some "placeholders" in place of e.g. code blocks, links etc., which LanguageTool ignores, e.g. _code_, _math_, _link_. This seems to work well enough but if you have any other ideas on how to tackle these let me know and I can have a look.

Copy link

codspeed-hq bot commented Dec 22, 2024

CodSpeed Performance Report

Merging #127 will not alter performance

Comparing Rolv-Apneseth:detect_filetype (7d23240) with v3 (1665a9d)

Summary

✅ 6 untouched benchmarks

@Rolv-Apneseth
Copy link
Collaborator Author

Any idea what's going on with the tests? cargo nextest run --all-features --no-capture doesn't produce any failures for me locally

Copy link
Owner

@jeertmans jeertmans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Rolv-Apneseth, thanks for your PR!

See my comments and let me know what you think :-)

Comment on lines +135 to +139
"md" | "mkd" | "mdwn" | "mdown" | "mdtxt" | "mdtext" | "markdown" => {
FileType::Markdown
},

"html" | "htm" => FileType::Html,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any documentation or link that documents those many different extensions for the same filetype?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe I just copied this list from https://superuser.com/a/285878. Maybe I should also add mdx as mentioned in a comment, or do you want to cut it down to just md and markdown?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to either stick to simplicity, or to match the default behavior of some popular tool, like ripgrep's default list.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, well I'll copy that list then

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't be better to return a vector of data annotations here? So (1) we avoid memory allocation and (2) we can benefit from LT's support for annotated data. The latter also has the advantage that LT is going to compute the correct location of an error in the text, and we can then print annotated errors with respect to the raw content of the file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too familiar with how the data annotation side of things work, and I had issues trying to use it with the current setup that splits the requests, but I can explore it a bit more if you want.

Some notes / questions though:

  1. Should all requests sent by the CLI for files be data annotations, including text, or does that need to be handled separately to the other file types
  2. The requests will be a lot larger than the current plain text ones right? So only much smaller files would be supported
  3. I will also need to change the current approach for markdown files as that wouldn't make sense any more

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I think all special types should be checked using data annotation. That excludes raw text files.
  2. That can be an issue with very large files, but let's not bother too much on that. That can be solved by automatically splitting the request into many (which can itself be a problem if we send too many requests to the public API, but people should host then own server).
  3. Probably, yes. Let me know if I can help!

Comment on lines 76 to 86
#[default]
Auto,
/// Raw text.
Raw,
/// Markdown.
Markdown,
/// HTML.
Html,
/// Typst.
Typst,
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't have to be fixed / changed in this feature, but what do you think of the following:

  • add a Text variant that matches .txt files;
  • ignore Raw file types by defaults (because some tools would probably call ltrs will many files as input, like pre-commit, and it doesn't make sense to check for errors in files like .zip, code files, or else);
  • add CLI options --include-raw and --include=ext where the latter includes files with ext extension, using Raw logic.

?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds alright yeah. Maybe it shouldn't be "raw" then, more just "unknown" or "unsupported", what do you think? I was thinking "raw" just because it would be sent as-is, but if there's a separate "text" option I think that would be more clear

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think Unknown sounds perfect! I agree that Raw conveys the message greatly, my main point was about not checking arbitrary file types, at least by default.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good. I can do this in a separate PR after this one then.

Comment on lines +21 to +31
"code" => {
txt.push_str("_code_");
return;
},
"a" => {
txt.push_str("_link_");
return;
},
"pre" => {
txt.push_str("_pre_");
txt.push_str("\n\n");
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit linked to my general file-comment, but there you could simply return a Markup note, with no interpret_as, and LT should ignore it, I think.

@Rolv-Apneseth
Copy link
Collaborator Author

Hi @Rolv-Apneseth, thanks for your PR!

See my comments and let me know what you think :-)

Hey @jeertmans, I'll get back to this in a couple days, thanks for the review though

@jeertmans
Copy link
Owner

Hi @Rolv-Apneseth, thanks for your PR!
See my comments and let me know what you think :-)

Hey @jeertmans, I'll get back to this in a couple days, thanks for the review though

No issue, Merry Christmas and happy end of year!

@Rolv-Apneseth
Copy link
Collaborator Author

No issue, Merry Christmas and happy end of year!

Thanks, you too :)

@jeertmans
Copy link
Owner

Any idea what's going on with the tests? cargo nextest run --all-features --no-capture doesn't produce any failures for me locally

Looks strange! Sometimes, the online version of the LT API returns different results than the one you can host locally. Did you run tests on a self-hosted API?

@Rolv-Apneseth
Copy link
Collaborator Author

Looks strange! Sometimes, the online version of the LT API returns different results than the one you can host locally. Did you run tests on a self-hosted API?

Yes, it was with a self-hosted API. Looks like these errors are on the current v3 branch too though (#117), so I don't think these changes are related.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants