refactor: further separate CLI logic from the API related functionality (see #117) #124

Rolv-Apneseth · 2024-10-05T22:37:39Z

Continuation of #117.

As discussed, some refactors. You will probably want some more done here and that's OK, just let me know what to do.

One thing for sure is the String values in api/check.rs - which ones exactly would you like me to make &str or Cow<'_, str>?

Also, for cli/mod.rs, I could potentially remove the need for the manual call of cmd.execute for each subcommand using the enum_dispatch crate if you're OK with adding a dependency? Fine as it is anyway though in my opinion, as there aren't many subcommands.

codspeed-hq · 2024-10-05T22:43:55Z

CodSpeed Performance Report

Merging #124 will not alter performance

Comparing Rolv-Apneseth:refactor-v3 (c7d342c) with v3 (a7247e4)

Summary

✅ 6 untouched benchmarks

jeertmans · 2024-10-06T08:56:44Z

Hi @Rolv-Apneseth! I have quickly looked at your PR, it looks very good! I will give it a better look later, but let me answer your first questions.

One thing for sure is the String values in api/check.rs - which ones exactly would you like me to make &str or Cow<'_, str>?

Well, I think &str should be enough, because we will never need to mutate (i.e., overwrite) the text strings, so keeping a reference should be fine. However, if I remember correctly, Clap makes this hard, as you don't get a reference to the string read from the terminal, but an owned string. So parsing text (either pure text or data annotation) from the terminal will require being able to pass owned data, I fear. Using borrowed data is only useful when reading files, as we may want to avoid allocating the full content of the files, and read by chunks instead.

So in conclusion, Cow<'_, str> may be the only solution (apart from String).
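To make the trade-off concrete, here is a minimal sketch of a request type whose text field accepts both borrowed and owned input through `Cow`. The `Request` below is a hypothetical, trimmed-down stand-in with a single field, and `with_text` / `is_borrowed` are illustrative helpers, not the crate's actual API.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for the real `Request`; only the `text` field is shown.
#[derive(Debug, Clone, PartialEq)]
pub struct Request<'source> {
    pub text: Option<Cow<'source, str>>,
}

impl<'source> Request<'source> {
    /// Accepts `&str`, `String`, or `Cow<str>`; a `&str` is stored without copying.
    pub fn with_text<T: Into<Cow<'source, str>>>(text: T) -> Self {
        Self {
            text: Some(text.into()),
        }
    }

    /// True when the text is a borrowed slice, i.e. no allocation was made for it.
    pub fn is_borrowed(&self) -> bool {
        matches!(self.text, Some(Cow::Borrowed(_)))
    }
}
```

Text read from a file can then be passed as `file_content.as_str()` and stays borrowed, while an owned `String` coming from the command line is moved in as `Cow::Owned`.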

Also, for cli/mod.rs, I could potentially remove the need for the manual call of cmd.execute for each subcommand using the enum_dispatch crate if you're OK with adding a dependency? Fine as it is anyway though in my opinion, as there aren't many subcommands.

We can give it a try :-)
Dependencies should be fine, especially as we can keep it under the cli feature.
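For context, this is a plain-Rust sketch of the hand-written forwarding that a dispatch macro is meant to remove. The `Execute` trait, the `Check` and `Ping` subcommands, and the `Result<String, String>` return type are all hypothetical; the real subcommands and signatures differ.

```rust
/// Hypothetical trait implemented by every subcommand.
pub trait Execute {
    fn execute(&self) -> Result<String, String>;
}

/// Hypothetical subcommand carrying some text to check.
pub struct Check {
    pub text: String,
}

/// Hypothetical subcommand without arguments.
pub struct Ping;

impl Execute for Check {
    fn execute(&self) -> Result<String, String> {
        Ok(format!("checking {} bytes", self.text.len()))
    }
}

impl Execute for Ping {
    fn execute(&self) -> Result<String, String> {
        Ok("pong".to_string())
    }
}

/// One variant per subcommand.
pub enum Command {
    Check(Check),
    Ping(Ping),
}

// Manual forwarding: every new variant needs a new match arm here.
// This is the boilerplate a dispatch macro would generate instead.
impl Execute for Command {
    fn execute(&self) -> Result<String, String> {
        match self {
            Command::Check(cmd) => cmd.execute(),
            Command::Ping(cmd) => cmd.execute(),
        }
    }
}
```

With only a handful of subcommands the match stays short, which is why keeping it by hand is also a reasonable choice.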

Rolv-Apneseth · 2024-10-06T12:10:59Z

I have quickly looked at your PR, it looks very good!

Great!

Well, I think &str should be enough, because we will never need to mutate (i.e., overwrite) the text strings, so keeping a reference should be fine. However, if I remember correctly, Clap makes this hard, as you don't get a reference to the string read from the terminal, but an owned string. So parsing text (either pure text or data annotation) from the terminal will require being able to pass owned data, I fear. Using borrowed data is only useful when reading files, as we may want to avoid allocating the full content of the files, and read by chunks instead.

Sorry, I meant which things in check.rs you actually wanted converted, but from this I'm assuming it's the Request.text? But I am confused how this would be achieved. Do we not need to allocate a String no matter what the source is? As for reading the file by chunks, do you want me to just make this use a BufReader and create and await requests one at a time instead of the way it is currently?

            let text = std::fs::read_to_string(filename)?;
            let requests = request
                .clone()
                .with_text(text.clone())
                .split(self.max_length, self.split_pattern.as_str());
            let response = server_client.check_multiple_and_join(requests).await?;

And for the text.clone() caused by server_client.check_multiple_and_join, I could make that function just return a ResponseWithContext instead of having it convert to Response automatically like it's doing, so we can pass the owned string in and get it back out of the response.

We can give it a try :-)

I've added it in there - I think it fits nicely.

Rolv-Apneseth · 2024-10-06T12:27:05Z

I could make that function just return a ResponseWithContext instead of having it convert to Response automatically like it's doing, so we can pass the owned string in and get it back out of the response.

I've pushed some changes there for this, avoiding the need to clone the input text.

jeertmans · 2024-10-06T14:40:32Z

Sorry, I meant which things in check.rs you actually wanted converted, but from this I'm assuming it's the Request.text?

Yes, Request.text and Request.data's inner fields. Mainly because one could create a vector of data annotations from a stream of Markdown tokens, which are usually &str of the initial source.

But I am confused how this would be achieved.

I think using Cow<'_, str> should make the trick.

Do we not need to allocate a String no matter what the source is?

Yes and no. Yes because reqwest will have to allocate a string for the HTTP request, but no because we don't actually need the user to provide us with an owned String. We don't care because reqwest will allocate anyway. So it's good to provide flexibility.

As for reading the file by chunks, do you want me to just make this use a BufReader and create and await requests one at a time instead of the way it is currently?

            let text = std::fs::read_to_string(filename)?;
            let requests = request
                .clone()
                .with_text(text.clone())
                .split(self.max_length, self.split_pattern.as_str());
            let response = server_client.check_multiple_and_join(requests).await?;

Yes, this would be great. The split-text logic isn't really trivial, because we want to avoid splitting text where it would break the meaning of a sentence, but we also cannot send overly long text. I think working on a better split logic can be a next step. For the moment, we can just use BufReader and read the whole file, and feed the server with multiple requests (if the text is too large). We could also read the file by chunks, but I think it is fair to assume that any file we want to check fits completely in memory.
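As a rough illustration of the kind of splitting discussed here, the sketch below cuts a text after occurrences of a pattern and greedily merges the pieces into chunks of at most `n` bytes. The function name `split_at_pattern` is hypothetical, lengths are counted in bytes rather than characters, a non-empty `pat` is assumed, and a single piece longer than `n` is kept whole instead of being cut mid-sentence. It is not the crate's actual split logic.

```rust
/// Splits `text` into borrowed chunks of at most `n` bytes where possible,
/// only ever cutting right after an occurrence of `pat`.
pub fn split_at_pattern<'source>(text: &'source str, n: usize, pat: &str) -> Vec<&'source str> {
    let mut chunks = Vec::new();
    let mut start = 0; // byte offset where the current chunk begins
    let mut end = 0; // byte offset where the current chunk ends

    for piece in text.split_inclusive(pat) {
        // Close the current chunk if adding this piece would exceed `n`.
        if end > start && (end - start) + piece.len() > n {
            chunks.push(&text[start..end]);
            start = end;
        }
        end += piece.len();
    }

    // Flush whatever is left as the last chunk.
    if end > start {
        chunks.push(&text[start..end]);
    }
    chunks
}
```

Every chunk is a slice of the original text, so nothing is copied and the chunks concatenate back to the input.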

And for the text.clone() caused by server_client.check_multiple_and_join, I could make that function just return a ResponseWithContext instead of having it convert to Response automatically like it's doing, so we can pass the owned string in and get it back out of the response.

Yes, the end-goal would be to remove the complex logic from Server and perform multiple requests merging outside, so the server only has one method per HTTP request, nothing more.

Rolv-Apneseth · 2024-10-08T21:34:59Z

Hey again. Had a lot of trouble with lifetimes and I could only get 'static to work - not sure I love the whole solution, but if you have a look at the latest commit, is that what you had in mind?

jeertmans · 2024-10-09T07:25:48Z

Hello @Rolv-Apneseth, rapidly looking at your commit, I think you just forgot to add the lifetime parameter in Request -> Request<'source> (I would prefer using 'source over anonymous lifetime names like 'a). This should explain why you could only use 'static.

However, it makes little sense to have reference in responses, because reqwest::RequestBuilder::send returns an owned Response. We can keep owned String in that case :-)

This will replicate a similar behavior to that of reqwest. E.g., the builder takes a reference to be constructed pub fn json<T: Serialize + ?Sized>(self, json: &T) -> RequestBuilder, but the response returns an owned struct: pub async fn json<T: DeserializeOwned>(self) -> Result<T>.

jeertmans · 2024-10-09T07:30:05Z

src/api/check.rs

-            .as_ref()
-            .ok_or(Error::InvalidRequest("missing text field".to_string()))?;
+    pub fn try_split(mut self, n: usize, pat: &str) -> Result<Vec<Self>> {
+        let text = mem::take(&mut self.text)


Here, one goal would be for try_split to return requests that borrow from &self, so the signature should be as follows:

pub fn try_split<'source>(&'source self, n: usize, pat: &str) -> Result<Vec<Request<'source>>>

Note that we should not need to mutate the actual request, because we only take references.
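A minimal sketch of such a borrowing split, under several assumptions: `Request` is trimmed down to hypothetical `text` and `language` fields, the error is a plain `String` instead of the crate's `Error` type, and the length limit `n` is left out to keep the focus on lifetimes. The return type is spelled `Request<'source>` because `Self` cannot take lifetime arguments.

```rust
use std::borrow::Cow;

/// Hypothetical, trimmed-down request type.
#[derive(Debug, Clone, PartialEq)]
pub struct Request<'source> {
    pub text: Option<Cow<'source, str>>,
    pub language: String,
}

impl<'source> Request<'source> {
    /// Splits after each occurrence of `pat`; every returned request borrows
    /// its text from `self`, which is therefore not mutated.
    pub fn try_split(&'source self, pat: &str) -> Result<Vec<Request<'source>>, String> {
        let text = self
            .text
            .as_deref()
            .ok_or_else(|| "missing text field".to_string())?;

        Ok(text
            .split_inclusive(pat)
            .map(|piece| Request {
                text: Some(Cow::Borrowed(piece)),
                language: self.language.clone(),
            })
            .collect())
    }
}
```

Even when the original request owns its text, the split requests hold `Cow::Borrowed` slices, so only the small non-text fields are cloned.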

Rolv-Apneseth · 2024-10-09T09:31:19Z

I can have a go again but something was complaining about needing to live longer than 'static. As for Response, that makes sense.

jeertmans · 2024-10-09T09:45:41Z

I can have a go again but something was complaining about needing to live longer than 'static. As for Response, that makes sense.

No issue. If you can't make it compile, just push with 'source, and I will try to fix it myself :-)

Rolv-Apneseth · 2024-10-09T15:53:30Z

So this code:

#[cfg_attr(feature = "cli", derive(Args))]
#[derive(Clone, Debug, PartialEq, Eq, Serialize, Hash)]
#[serde(rename_all = "camelCase")]
#[non_exhaustive]
pub struct Request<'source> {
    /// The text to be checked. This or 'data' is required.
    #[cfg_attr(
        feature = "cli",
        clap(short = 't', long, conflicts_with = "data", allow_hyphen_values(true))
    )]
    #[serde(skip_serializing_if = "Option::is_none")]
    pub text: Option<Cow<'source, str>>,

Leads to this error:

error: lifetime may not live long enough
   --> src/api/check.rs:398:15
    |
391 | pub struct Request<'source> {
    |                    ------- lifetime `'source` defined here
...
398 |     pub text: Option<Cow<'source, str>>,
    |               ^^^^^^ requires that `'source` must outlive `'static`

error: lifetime may not live long enough
   --> src/api/check.rs:392:5
    |
391 | pub struct Request<'source> {
    |                    ------- lifetime `'source` defined here
392 |     /// The text to be checked. This or 'data' is required.
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ argument requires that `'source` must outlive `'static`
    |
    = note: this error originates in the macro `clap::value_parser` (in Nightly builds, run with -Z macro-backtrace for more info)

So I don't think it will work as it is. Should I take out clap from here and have a separate struct for the request arguments?

jeertmans · 2024-10-09T16:15:11Z

^ requires that 'source must outlive 'static

Oh yeah, I remember why I struggled with &str back then ^^'

Creating a separate struct would work, but that means duplicating a lot of the code. I wonder if there is still a way to allow 'source to take 'static in the case of App::parse, but not in general... This isn't as easy as I imagined

Rolv-Apneseth · 2024-10-09T16:53:29Z

I wonder if there is still a way to allow 'source to take 'static in the case of App::parse, but not in general

Not that I can think of anyway. I'd say the options are to either stick with String, use 'static, or a different struct. For the different struct, could an impl From keep fields in sync at least?

jeertmans · 2024-10-11T07:30:01Z

I wonder if there is still a way to allow 'source to take 'static in the case of App::parse, but not in general

Not that I can think of anyway. I'd say the options are to either stick with String, use 'static, or a different struct. For the different struct, could an impl From keep fields in sync at least?

Ok so some news: at the moment, it seems impossible to do that in one struct, see clap-rs/clap#5773 and the related discussion, so I suggest we duplicate the structure: the original Request, with Cow<'source, str>, and the clap-compatible RequestCli, with String. Maybe we can simply avoid duplicating code with a macro rule. If you don't see how to do that, I think I can find some time during the weekend for this.
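One way to keep the two structs from drifting apart, sketched here under assumptions: both types are trimmed to two hypothetical fields, and the conversion destructures `RequestCli` exhaustively so that a field added to either struct but forgotten in the conversion fails to compile.

```rust
use std::borrow::Cow;

/// Clap-facing struct holding owned data only (hypothetical, two fields).
#[derive(Debug, Clone, PartialEq)]
pub struct RequestCli {
    pub text: Option<String>,
    pub language: String,
}

/// API-facing struct that may borrow its text (hypothetical, two fields).
#[derive(Debug, Clone, PartialEq)]
pub struct Request<'source> {
    pub text: Option<Cow<'source, str>>,
    pub language: String,
}

impl<'source> From<RequestCli> for Request<'source> {
    fn from(cli: RequestCli) -> Self {
        // Exhaustive destructuring: a new field on `RequestCli` that is not
        // listed here, or a new field on `Request` that is not set below,
        // becomes a compile error.
        let RequestCli { text, language } = cli;
        Request {
            text: text.map(Cow::Owned),
            language,
        }
    }
}
```

This does not remove the duplicated field definitions and attributes, it only makes a mismatch between the two structs visible at compile time.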

Rolv-Apneseth · 2024-10-12T09:33:47Z

I think it would have to be a procedural macro right? Can't see how to do that with a declarative one.

I have a bit of time today so I'll have a look but feel free to do it if you have a good idea of how it should be done

jeertmans · 2024-10-12T09:56:18Z

I actually started some work on your branch today, I will hopefully have something working this afternoon :)

Rolv-Apneseth · 2024-10-13T22:32:22Z

Hey - been busy moving this weekend. If I have any time in the coming days I can have a look for any crates that could solve this problem - if not, yes, copy paste might have to do for now

jeertmans · 2024-10-14T07:06:36Z

Hey - been busy moving this weekend. If I have any time in the coming days I can have a look for any crates that could solve this problem - if not, yes, copy paste might have to do for now

No issue! Yeah, maybe let's copy and paste (keeping the 'source lifetime for api/check.rs and using String in cli/check.rs). The "easiest" solution would be to fix the upstream issue, i.e., do a PR to Clap's repository, but that I will only look at after this month.

Rolv-Apneseth · 2024-10-17T21:57:30Z

Yeah couldn't find any crates for this situation.

I do think just copying the struct is better but to provide another option, what do you think of something like this:

pub struct Request<'source> {
    // HACK: keep compiler happy when enabling the cli feature flag
    #[cfg(feature = "cli")]
    #[clap(skip)]
    _lifetime: PhantomData<&'source ()>,
    /// (...docs)
    #[cfg(feature = "cli")]
    #[clap(short = 't', long, conflicts_with = "data", allow_hyphen_values(true))]
    #[serde(skip_serializing_if = "Option::is_none")]
    pub text: Option<Cow<'static, str>>,
    /// (...docs)
    #[cfg(not(feature = "cli"))]
    #[serde(skip_serializing_if = "Option::is_none")]
    pub text: Option<Cow<'source, str>>,
    /// (...docs)
    #[cfg(feature = "cli")]
    #[clap(short = 'd', long, conflicts_with = "text")]
    #[serde(skip_serializing_if = "Option::is_none")]
    pub data: Option<Data<'static>>,
    /// (...docs)
    #[cfg(not(feature = "cli"))]
    #[serde(skip_serializing_if = "Option::is_none")]
    pub data: Option<Data<'source>>,

As for the upstream issue - I had a look but I believe it's beyond my skills. Needs quite a bit of familiarity with how their derive macros work, and the author seems to agree having assigned it that E-hard tag. If you want to hold this off until you have a look at making a PR yourself though, that would be fine by me.

jeertmans · 2024-10-18T07:15:33Z

I do think just copying the struct is better but to provide another option,

This is not a wrong idea, but the issue is that compiling with cli will then enforce 'static lifetime everywhere. Which is something we want to avoid even in the CLI, when we read text from files with ltrs check FILES...

This is an edge case, but we don't want to have to keep a static lifetime reference to the content of each file, where we don't need it. So copying the struct is the only way to both keep Clap happy and still be able to use a non-static lifetime when desired.

As for the upstream issue - I had a look but I believe it's beyond my skills. Needs quite a bit of familiarity with how their derive macros work, and the author seems to agree having assigned it that E-hard tag. If you want to hold this off until you have a look at making a PR yourself though, that would be fine by me.

I agree, this is not easy, especially as it involves proc macros!

Rolv-Apneseth · 2024-10-20T14:05:46Z

So, it's not ideal, and I had to use the lifetime crate, but what do you think of the latest commit?

Perhaps it's better to leave those fields as String until clap allows us to avoid all the duplication?

jeertmans

Hello @Rolv-Apneseth! It looks great, I just have a few comments that I left :-)

Please let me know what you think of it!

jeertmans · 2024-10-21T07:36:31Z

src/api/check.rs

@@ -963,7 +953,7 @@ impl Response {
 #[derive(Debug, Clone, PartialEq)]
 pub struct ResponseWithContext {
    /// Original text that was checked by LT.
-    pub text: Cow<'static, str>,
+    pub text: String,


Why does it have to be an owned string? If I am correct, the "context" is simply a source text, so should be &'source str, no?

I just had issues with lifetimes, in particular check_multiple_and_join in server.rs was a pain due to those tokio tasks that are spawned. However, I'm having a look at if it can maybe work with IntoStatic.
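The constraint behind this can be reproduced with the standard library alone: `std::thread::spawn`, like `tokio::spawn`, requires everything moved into the task to be `'static`. The sketch below shows a plain-std version of an "into static" conversion for `Cow<str>`; the function names are hypothetical and this is not the API of any particular crate.

```rust
use std::borrow::Cow;
use std::thread;

/// Upgrades any `Cow<str>` to `Cow<'static, str>` by taking ownership,
/// allocating only if the value was borrowed.
pub fn into_static(text: Cow<'_, str>) -> Cow<'static, str> {
    Cow::Owned(text.into_owned())
}

/// Counts words on a worker thread. The closure must be `'static`,
/// so borrowed text has to be converted before it is moved in.
pub fn count_words_in_background(text: Cow<'_, str>) -> usize {
    let owned = into_static(text);
    thread::spawn(move || owned.split_whitespace().count())
        .join()
        .expect("worker thread panicked")
}
```

The cost is one allocation per borrowed text handed to a task, which is the price of spawning tasks that may outlive the caller's borrow.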

What is the match positions section in this code block supposed to do? It's causing a lifetime issue and I don't see the values being returned / used for anything:

impl<'source> From<ResponseWithContext<'source>> for Response {
    #[allow(clippy::needless_borrow)]
    fn from(mut resp: ResponseWithContext<'source>) -> Self {
        let iter: MatchPositions<'_, std::slice::IterMut<'_, Match>> = (&mut resp).into();

        for (line_number, line_offset, m) in iter {
            m.more_context = Some(MoreContext {
                line_number,
                line_offset,
            });
        }

        resp.response
    }
}

Hum I see, probably using multiple tasks may conflict with lifetime.

The match iter is an iterator that will automatically count lines and line offsets, so we can then add this information to the Match object.
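For illustration, the line and offset bookkeeping can be sketched as a standalone function. The conventions here are assumptions, not taken from the crate: offsets count characters, lines are numbered from 1, and the line offset is the number of characters since the last newline. The real `MatchPositions` iterator does this incrementally while walking the matches.

```rust
/// Returns `(line_number, line_offset)` for a character offset into `text`.
/// Lines are numbered from 1; the offset counts characters since the last newline.
pub fn line_position(text: &str, offset: usize) -> (usize, usize) {
    let mut line_number = 1;
    let mut line_offset = 0;

    // Walk the characters that precede the match and track newlines.
    for c in text.chars().take(offset) {
        if c == '\n' {
            line_number += 1;
            line_offset = 0;
        } else {
            line_offset += 1;
        }
    }

    (line_number, line_offset)
}
```

For the text "first line\nsecond lnie\n", a match starting at character 18 (the word "lnie") is reported at line 2, offset 7.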

Thanks that helped, so that piece of code modifies the response with some additional context before returning it.

Have a look at the latest commit and see if you're happy with the changes as I needed some workarounds for more lifetime issues due to that block of code above.

I could improve it if we drop the implementation for Iter and just keep the one for IterMut, so that it doesn't need to be generic. Also, maybe the modification of response with that additional context could happen when ResponseWithContext is first created, rather than only when converting it to Response?

Hi @Rolv-Apneseth! Your work looks very promising, but I have a paper deadline (next Thursday) that is taking most of my time, so I will (hopefully, if the deadline is not postponed) have time to review your work on next Friday :-)

Of course, no worries, you can take your time. And good luck!

Were you alright with these changes then? If so I can move on to the de-duplication of the DataAnnotation and CliRequest methods

Also - hope the paper went well

jeertmans · 2024-10-21T07:40:58Z

src/cli/check.rs

+impl CliDataAnnotation {
+    /// Instantiate a new `CliDataAnnotation` with text only.
+    #[inline]
+    #[must_use]
+    pub fn new_text<T: Into<String>>(text: T) -> Self {
+        Self {
+            text: Some(text.into()),
+            markup: None,
+            interpret_as: None,
+        }
+    }
+
+    /// Instantiate a new `CliDataAnnotation` with markup only.
+    #[inline]
+    #[must_use]
+    pub fn new_markup<M: Into<String>>(markup: M) -> Self {
+        Self {
+            text: None,
+            markup: Some(markup.into()),
+            interpret_as: None,
+        }
+    }
+
+    /// Instantiate a new `CliDataAnnotation` with markup and its
+    /// interpretation.
+    #[inline]
+    #[must_use]
+    pub fn new_interpreted_markup<M: Into<String>, I: Into<String>>(
+        markup: M,
+        interpret_as: I,
+    ) -> Self {
+        Self {
+            interpret_as: Some(interpret_as.into()),
+            markup: Some(markup.into()),
+            text: None,
+        }
+    }
+
+    /// Return the text or markup within the data annotation.
+    ///
+    /// # Errors
+    ///
+    /// If this data annotation does not contain text or markup.
+    pub fn try_get_text(&self) -> Result<String> {
+        if let Some(ref text) = self.text {
+            Ok(text.clone())
+        } else if let Some(ref markup) = self.markup {
+            Ok(markup.clone())
+        } else {
+            Err(Error::InvalidDataAnnotation(format!(
+                "missing either text or markup field in {self:?}"
+            )))
+        }
+    }
+}


Same remark as for CliRequest

jeertmans

Hello @Rolv-Apneseth, I finally have some time to review this work properly :-)

I did a new pass, and it looks excellent! I suggested a few changes, especially for readability, and added two questions regarding two static lifetimes.

I see that most tests are passing, except a few ones, but didn't yet understand why those tests were failing.

README.md

benches/benchmarks/check_texts.rs

src/api/check.rs

src/api/server.rs

tests/match_positions.rs

jeertmans · 2024-11-16T10:55:54Z

Hello @Rolv-Apneseth, sorry for lagging a bit behind those last weeks. I looked at failing test a bit in depth and I don't mind merging this PR.

We will take care of making those tests pass in the main PR :-)

Rolv-Apneseth · 2024-11-16T10:58:42Z

Oh - I hadn't done the de-duplication yet - shall I do that and you can merge this branch again? I was waiting for a reply to this:

Were you alright with these changes then? If so I can move on to the de-duplication of the DataAnnotation and CliRequest methods

jeertmans · 2024-11-16T11:06:33Z

Oh, sorry! Well, given your large contribution to this repo, I consider adding you as a contributor so you can directly edit v3, etc.

Should I add you?

Rolv-Apneseth · 2024-11-16T11:13:30Z

I mean, as long as you're comfortable with it. Personally I wouldn't trust strangers on the internet so fast haha

jeertmans · 2024-11-16T11:17:14Z

I mean, as long as you're comfortable with it. Personally I wouldn't trust strangers on the internet so fast haha

I know, but branch protection rules are good to prevent issues :-)

Rolv-Apneseth · 2024-11-16T11:20:08Z

Fair enough yeah. Thanks for the invite, I'll push that deduplication to v3 (probably) later today

jeertmans · 2024-11-16T11:25:31Z

Thanks! I plan to be more active on this repo myself, as I plan on using this combined with Typst to write my thesis, but this is unfortunately not a priority at the moment, and this is why I prefer to give you more freedom to work on this project while you are motivated, rather than let it go stale forever haha

Rolv-Apneseth added 10 commits October 5, 2024 15:51

refactor: use std::ops::Not::not instead of custom is_false function

7281a5d

refactor: further separate API logic from CLI, and create submodules for each subcommand

56f0ab8

fix: fmt

3b6f9d7

fix: nightly rustfmt warning: the version option is deprecated. Use `style_edition` instead.

0342b89

fix: fmt

dbd4caa

refactor: ProcessCommand -> process::Command

12afefc

fix: remove unused imports

afe2b6e

refactor: make link more readable for CLI

00b9491

fix: clippy and doc warnings

fd17b9e

chore: bump minimum rust version

a6ea313

fix: misc CI issues

15d1454

feat: use enum_dispatch to avoid needing to manually call `cmd.execute` for each variant of `Command`

673ff5d

refactor: avoid cloning request input string

2575c87

refactor: use Cow<'static, str> instead of String for `check::Request.text`

b863f36

jeertmans reviewed Oct 9, 2024

refactor: use Cow<'source, str> when not compiling with the cli feature

a1b258d

Required cloning structs and methods from `api::check` to separate out the `clap` functionality, as `clap` wouldn't support the lifetime without it being `'static`

jeertmans reviewed Oct 21, 2024

Rolv-Apneseth added 2 commits October 23, 2024 14:29

fix: remove unnecessary clone of split_pattern

642c34f

refactor: use Cow<'source, str> for text referenced by `ResponseWithContext`

b6cd61b

jeertmans reviewed Nov 1, 2024

jeertmans and others added 11 commits November 1, 2024 10:56

Update CI.yml

20e84db

Update README.md

f091ffe

Co-authored-by: Jérome Eertmans <[email protected]>

Update benches/benchmarks/check_texts.rs

9273c68

Co-authored-by: Jérome Eertmans <[email protected]>

Update benches/benchmarks/check_texts.rs

b9c1169

Co-authored-by: Jérome Eertmans <[email protected]>

Update src/api/server.rs

47b07a4

Co-authored-by: Jérome Eertmans <[email protected]>

Update tests/match_positions.rs

ba691a4

Co-authored-by: Jérome Eertmans <[email protected]>

fix: formatting

5b2c235

fix: remove unused imports

c875a5f

fix: remove static lifetime from with_text and with_data

f162b17

fix: correct length addition and use Cow::to_mut

befbc52

fix: satisfy clippy pre-commit hook

c7d342c

jeertmans merged commit a067522 into jeertmans:v3 Nov 16, 2024
17 of 20 checks passed

Rolv-Apneseth deleted the refactor-v3 branch November 16, 2024 11:20
