Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: further separate CLI logic from the API related functionality (see #117) #124

Merged
merged 29 commits into from
Nov 16, 2024

Conversation

Rolv-Apneseth
Copy link
Collaborator

Continuation of #117.

As discussed, some refactors. You will probably want some more done here and that's OK, just let me know what to do.

One thing for sure is the String values in api/check.rs - which ones exactly would you like me to make &str or Cow<'_, str>?

Also, for cli/mod.rs, I could potentially remove the need for the manual call of cmd.execute for each subcommand using the enum_dispatch crate if you're OK with adding a dependency? Fine as it is anyway though in my opinion, as there aren't many subcommands.

Copy link

codspeed-hq bot commented Oct 5, 2024

CodSpeed Performance Report

Merging #124 will not alter performance

Comparing Rolv-Apneseth:refactor-v3 (c7d342c) with v3 (a7247e4)

Summary

✅ 6 untouched benchmarks

@jeertmans
Copy link
Owner

Hi @Rolv-Apneseth! I have quickly looked at your PR, it looks very good! I will give it a better look later, but let me answer your first questions.

One thing for sure is the String values in api/check.rs - which ones exactly would you like me to make &str or Cow<'_, str>?

Well, I think &str should be enough, because we will never need to mutate (i.e., overwrite) the text strings, so keeping a reference should be fine. However, If I remember well, Clap makes this hard as you don't get reference to string read from the terminal, but owned string. So parsing text (either pure text or data annotation) from the terminal will required to be able to pass owned data, I fear. Using borrowed data is only useful for when reading files, as we may want to avoid allocating the full content of the files, and read by chunks instead.

So in conclusion, Cow<'_, str> may be the only solution (apart from String).

Also, for cli/mod.rs, I could potentially remove the need for the manual call of cmd.execute for each subcommand using the enum_dispatch crate if you're OK with adding a dependency? Fine as it is anyway though in my opinion, as there aren't many subcommands.

We can give it a try :-)
Dependencies should be fine, especially as we can keep it under the cli feature.

@Rolv-Apneseth
Copy link
Collaborator Author

I have quickly looked at your PR, it looks very good! I

Great!

Well, I think &str should be enough, because we will never need to mutate (i.e., overwrite) the text strings, so keeping a reference should be fine. However, If I remember well, Clap makes this hard as you don't get reference to string read from the terminal, but owned string. So parsing text (either pure text or data annotation) from the terminal will required to be able to pass owned data, I fear. Using borrowed data is only useful for when reading files, as we may want to avoid allocating the full content of the files, and read by chunks instead.

Sorry, I meant which things in check.rs you actually wanted converted, but from this I'm assuming it's the Request.text? But I am confused how this would be achieved. Do we not need to allocate a String no matter what the source is? As for reading the file by chunks, do you want me to just make this use a BufReader and create and await requests one at a time instead of the way it is currently?

            let text = std::fs::read_to_string(filename)?;
            let requests = request
                .clone()
                .with_text(text.clone())
                .split(self.max_length, self.split_pattern.as_str());
            let response = server_client.check_multiple_and_join(requests).await?;

And for the text.clone() caused by server_client.check_multiple_and_join, I could make that function just return a ResponseWithContext instead of having it convert to Response automatically like it's doing, so we can pass the owned string in and get it back out of the response.

We can give it a try :-)

I've added it in there - I think it fits nicely.

@Rolv-Apneseth
Copy link
Collaborator Author

I could make that function just return a ResponseWithContext instead of having it convert to Response automatically like it's doing, so we can pass the owned string in and get it back out of the response.

I've pushed some changes there for this, avoiding the need to clone the input text.

@jeertmans
Copy link
Owner

Sorry, I meant which things in check.rs you actually wanted converted, but from this I'm assuming it's the Request.text?

Yes, Request.text and Request.data's inner fields. Mainly because one could create a vector of data annotations from a stream of Markdown tokens, which are usually &str of the initial source.

But I am confused how this would be achieved.

I think using Cow<'_, str> should make the trick.

Do we not need to allocate a String no matter what the source is?

Yes and no. Yes because reqwest will have to allocate a string for the HTML request, but no because we don't actually need the user to provide us an owned String. We don't care because reqwest will allocate anyway. So it's good to provide flexibility.

As for reading the file by chunks, do you want me to just make this use a BufReader and create and await requests one at a time instead of the way it is currently?

            let text = std::fs::read_to_string(filename)?;
            let requests = request
                .clone()
                .with_text(text.clone())
                .split(self.max_length, self.split_pattern.as_str());
            let response = server_client.check_multiple_and_join(requests).await?;

Yes, this would be great. The split-text logic isn't really trivial to solve, because we want to avoid splitting text where it would break the meaning of a sentence, but also we cannot have long-running text either. I think working on a better split-logic can be a next step. For the moment, we can just use BufReader and read the whole file, and feed the server with multiple requests (if text is too large). We could also read the file by chunks, but I think this is a fair assumption to assume that any file we want to check should fit completely in the memory.

And for the text.clone() caused by server_client.check_multiple_and_join, I could make that function just return a ResponseWithContext instead of having it convert to Response automatically like it's doing, so we can pass the owned string in and get it back out of the response.

Yes, the end-goal would be to remove the complex logic from Server and perform multiple requests merging outside, so the server only has one method per HTTP request, nothing more.

@Rolv-Apneseth
Copy link
Collaborator Author

Hey again. Had a lot of trouble with lifetimes and I could only get 'static to work - not sure I love the whole solution, but if you havea look at the latest commit, is that what you had in mind?

@jeertmans
Copy link
Owner

Hello @Rolv-Apneseth, rapidly looking at your commit, I think you just forgot to add the lifetime parameter in Request -> Request<'source> (I would prefer using 'source over anonymous lifetime names like 'a). This should explain why you could only use 'static.

However, it makes little sense to have reference in responses, because reqwest::RequestBuilder::send returns an owned Response. We can keep owned String in that case :-)

This will replicate a similar behavior to that of reqwest. E.g., the builder takes a reference to be constructed pub fn json<T: Serialize + ?Sized>(self, json: &T) -> RequestBuilder, but the response returns an owned struct: pub async fn json<T: DeserializeOwned>(self) -> Result<T>.

.as_ref()
.ok_or(Error::InvalidRequest("missing text field".to_string()))?;
pub fn try_split(mut self, n: usize, pat: &str) -> Result<Vec<Self>> {
let text = mem::take(&mut self.text)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, one goal would be that try_split returns a reference to &self, so the signature should be as follows:

pub fn try_split<'source>(&'source self, n: usize, pat: &str) -> Result<Vec<Self<'source>>>

Note that we should not need to mutate the actual request, because we only take references.

@Rolv-Apneseth
Copy link
Collaborator Author

I can have a go again but something was complaining about needing to live longer than 'static. As for Response, that makes sense.

@jeertmans
Copy link
Owner

I can have a go again but something was complaining about needing to live longer than 'static. As for Response, that makes sense.

No issue. If you can't make it compile, just push with 'source, and I will try to fix it myself :-)

@Rolv-Apneseth
Copy link
Collaborator Author

So this code:

#[cfg_attr(feature = "cli", derive(Args))]
#[derive(Clone, Debug, PartialEq, Eq, Serialize, Hash)]
#[serde(rename_all = "camelCase")]
#[non_exhaustive]
pub struct Request<'source> {
    /// The text to be checked. This or 'data' is required.
    #[cfg_attr(
        feature = "cli",
        clap(short = 't', long, conflicts_with = "data", allow_hyphen_values(true))
    )]
    #[serde(skip_serializing_if = "Option::is_none")]
    pub text: Option<Cow<'source, str>>,

Leads to this error:

error: lifetime may not live long enough
   --> src/api/check.rs:398:15
    |
391 | pub struct Request<'source> {
    |                    ------- lifetime `'source` defined here
...
398 |     pub text: Option<Cow<'source, str>>,
    |               ^^^^^^ requires that `'source` must outlive `'static`

error: lifetime may not live long enough
   --> src/api/check.rs:392:5
    |
391 | pub struct Request<'source> {
    |                    ------- lifetime `'source` defined here
392 |     /// The text to be checked. This or 'data' is required.
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ argument requires that `'source` must outlive `'static`
    |
    = note: this error originates in the macro `clap::value_parser` (in Nightly builds, run with -Z macro-backtrace for more info)

So I don't think it will work as it is. Should I take out clap from here and have a separate struct for the request arguments?

@jeertmans
Copy link
Owner

^ requires that 'source must outlive 'static

Oh yeah, I remember why I struggled with &str back then ^^'

Creating a separate struct would work, but that means duplicating a lot of the code. I wonder if there is still a way to allow 'source to take 'static in the case of App::parse, but not in general... This isn't as easy as I imagined

@Rolv-Apneseth
Copy link
Collaborator Author

I wonder if there is still a way to allow 'source to take 'static in the case of App::parse, but not in general

Not that I can think of anyway. I'd say the options are to either stick with String, use 'static, or a different struct. For the different struct, could an impl From keep fields in sync at least?

@jeertmans
Copy link
Owner

I wonder if there is still a way to allow 'source to take 'static in the case of App::parse, but not in general

Not that I can think of anyway. I'd say the options are to either stick with String, use 'static, or a different struct. For the different struct, could an impl From keep fields in sync at least?

Ok so some news: at the moment, it seems impossible to do that in one struct, see clap-rs/clap#5773 and the related discussion, so I suggest we duplicated the structure: to original Request, with Cow<'source, str>, and the clap-compatible RequestCli, with String. Maybe we can simply avoid duplicating code with a macro rule. If you don't see how to do that, I think I can put have some time during the weekend for this.

@Rolv-Apneseth
Copy link
Collaborator Author

I think it would have to be a procedural macro right? Can't see how to do that with a declarative one.

I have a bit of time today so I'll have a look but feel free to do it if you have a good idea of how it should be done

@jeertmans
Copy link
Owner

I actually started some work on your branch today, I will hopefully have something working this afternoon :)

@Rolv-Apneseth
Copy link
Collaborator Author

Hey - been busy moving this weekend. If I have any time in the coming days I can have a look for any crates that could solve this problem - if not, yes, copy paste might have to do for now

@jeertmans
Copy link
Owner

Hey - been busy moving this weekend. If I have any time in the coming days I can have a look for any crates that could solve this problem - if not, yes, copy paste might have to do for now

No issue! Yeah, maybe let's copy and paste (keeping the 'source lifetime for the api/click.rs and use String in cli/checks.rs). The "easiest" solution would be to fix the upstream issue, i.e., do a PR to Clap's repository, but that I will only look at after this month,

@Rolv-Apneseth
Copy link
Collaborator Author

Yeah couldn't find any crates for this situation.

I do think just copying the struct is better but to provide another option, what do you think of something like this:

pub struct Request<'source> {
    // HACK: keep compiler happy when enabling the cli feature flag
    #[cfg(feature = "cli")]
    #[clap(skip)]
    _lifetime: PhantomData<&'source ()>,
    /// (...docs)
    #[cfg(feature = "cli")]
    #[clap(short = 't', long, conflicts_with = "data", allow_hyphen_values(true))]
    #[serde(skip_serializing_if = "Option::is_none")]
    pub text: Option<Cow<'static, str>>,
    /// (...docs)
    #[cfg(not(feature = "cli"))]
    #[serde(skip_serializing_if = "Option::is_none")]
    pub text: Option<Cow<'source, str>>,
    /// (...docs)
    #[cfg(feature = "cli")]
    #[clap(short = 'd', long, conflicts_with = "text")]
    #[serde(skip_serializing_if = "Option::is_none")]
    pub data: Option<Data<'static>>,
    /// (...docs)
    #[cfg(not(feature = "cli"))]
    #[serde(skip_serializing_if = "Option::is_none")]
    pub data: Option<Data<'source>>,

As for the upstream issue - I had a look but I believe it's beyond my skills. Needs quite a bit of familiarity with how their derive macros work, and the author seems to agree having assigned it that E-hard tag. If you want to hold this off until you have a look at making a PR yourself though, that would be fine by me.

@jeertmans
Copy link
Owner

e struct is better but to provide another option,

This is not a wrong idea, but the issue is that compiling with cli will then enforce 'static lifetime everywhere. Which is something we want to avoid even in the CLI, when we read text from files with ltrs check FILES...

This is an edge case, but we don't want to have to keep a static lifetime reference to the content of each file, where we don't need it. So copying the struct is the only we to both have Clap happy and still be able to use non-static lifetime when desired.

As for the upstream issue - I had a look but I believe it's beyond my skills. Needs quite a bit of familiarity with how their derive macros work, and the author seems to agree having assigned it that E-hard tag. If you want to hold this off until you have a look at making a PR yourself though, that would be fine by me.

I agree, this is not easy, especially as it involves proc macros!

…eature

Required cloning structs and methods from `api::check` to separate out
the `clap` functionality, as `clap` wouldn't support the lifetime
without it being `'static`
@Rolv-Apneseth
Copy link
Collaborator Author

So, it's not ideal, and I had to use the lifetime crate, but what do you think of the latest commit?

Perhaps it's better to leave those fields as String until clap allows us to avoid all the duplication?

Copy link
Owner

@jeertmans jeertmans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @Rolv-Apneseth! It looks great, I just have a few comments that I left :-)

Please let me know what you think of it!

src/api/check.rs Outdated
@@ -963,7 +953,7 @@ impl Response {
#[derive(Debug, Clone, PartialEq)]
pub struct ResponseWithContext {
/// Original text that was checked by LT.
pub text: Cow<'static, str>,
pub text: String,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it have to be an owned string? If I am correct, the "context" is simply a source text, so should be &'source str, no?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just had issues with lifetimes, in particular check_multiple_and_join in server.rs was a pain due to those tokio tasks that are spawned. However, I'm having a look at if it can maybe work with IntoStatic.

What is the match positions section in this code block supposed to do? It's causing a lifetime issue and I don't see the values being returned / used for anything:

impl<'source> From<ResponseWithContext<'source>> for Response {
    #[allow(clippy::needless_borrow)]
    fn from(mut resp: ResponseWithContext<'source>) -> Self {
        let iter: MatchPositions<'_, std::slice::IterMut<'_, Match>> = (&mut resp).into();

        for (line_number, line_offset, m) in iter {
            m.more_context = Some(MoreContext {
                line_number,
                line_offset,
            });
        }
        resp.response
    }
}

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum I see, probably using multiple tasks may conflict with lifetime.

The match iter is an iterator that will automatically count lines and line offsets, so we can then add this information to the Match object.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks that helped, so that piece of code modifies the response with some additional context before returning it.

Have a look at the latest commit and see if you're happy with the changes as I needed some workarounds for more lifetime issues due to that block of code above.

I could improve it if we drop the implementation for Iter and just keep the one for IterMut, so that it doesn't need to be generic. Also, maybe the modification of response with that additional context could happen when ResponseWithContext is first created, rather than only when converting it to Response?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Rolv-Apneseth! Your work looks very promising, but I have a paper deadline (next Thursday) that is taking most of my time, so I will (hopefully, if the deadline is not postponed) have time to review your work on next Friday :-)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, no worries, you can take your time. And good luck!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were you alright with these changes then? If so I can move on to the de-duplication of the DataAnnotation and CliRequest methods

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also - hope the paper went well

src/cli/check.rs Outdated Show resolved Hide resolved
src/cli/check.rs Show resolved Hide resolved
Comment on lines +442 to +496
impl CliDataAnnotation {
/// Instantiate a new `CliDataAnnotation` with text only.
#[inline]
#[must_use]
pub fn new_text<T: Into<String>>(text: T) -> Self {
Self {
text: Some(text.into()),
markup: None,
interpret_as: None,
}
}

/// Instantiate a new `CliDataAnnotation` with markup only.
#[inline]
#[must_use]
pub fn new_markup<M: Into<String>>(markup: M) -> Self {
Self {
text: None,
markup: Some(markup.into()),
interpret_as: None,
}
}

/// Instantiate a new `CliDataAnnotation` with markup and its
/// interpretation.
#[inline]
#[must_use]
pub fn new_interpreted_markup<M: Into<String>, I: Into<String>>(
markup: M,
interpret_as: I,
) -> Self {
Self {
interpret_as: Some(interpret_as.into()),
markup: Some(markup.into()),
text: None,
}
}

/// Return the text or markup within the data annotation.
///
/// # Errors
///
/// If this data annotation does not contain text or markup.
pub fn try_get_text(&self) -> Result<String> {
if let Some(ref text) = self.text {
Ok(text.clone())
} else if let Some(ref markup) = self.markup {
Ok(markup.clone())
} else {
Err(Error::InvalidDataAnnotation(format!(
"missing either text or markup field in {self:?}"
)))
}
}
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same remark as for CliRequest

Copy link
Owner

@jeertmans jeertmans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @Rolv-Apneseth, I finally have some time to review this work properly :-)

I did a new pass, and it looks excellent! I suggested a few changes, especially for readability, and added two questions regarding two static lifetimes.

I see that most tests are passing, except a few ones, but didn't yet understand why those tests were failing.

README.md Outdated Show resolved Hide resolved
benches/benchmarks/check_texts.rs Outdated Show resolved Hide resolved
benches/benchmarks/check_texts.rs Outdated Show resolved Hide resolved
benches/benchmarks/check_texts.rs Outdated Show resolved Hide resolved
src/api/check.rs Outdated Show resolved Hide resolved
src/api/check.rs Outdated Show resolved Hide resolved
src/api/server.rs Outdated Show resolved Hide resolved
tests/match_positions.rs Outdated Show resolved Hide resolved
@jeertmans
Copy link
Owner

Hello @Rolv-Apneseth, sorry for lagging a bit behind those last weeks. I looked at failing test a bit in depth and I don't mind merging this PR.

We will take care of making those tests pass in the main PR :-)

@jeertmans jeertmans merged commit a067522 into jeertmans:v3 Nov 16, 2024
17 of 20 checks passed
@Rolv-Apneseth
Copy link
Collaborator Author

Oh - I hadn't done the de-duplication yet - shall I do that and you can merge this branch again? I was waiting for a reply to this:

Were you alright with these changes then? If so I can move on to the de-duplication of the DataAnnotation and CliRequest methods

@jeertmans
Copy link
Owner

Oh, sorry! Well, given your large contribution to this repo, I consider adding you as a contributor so you can directly edit v3, etc.

Should I add you?

@Rolv-Apneseth
Copy link
Collaborator Author

I mean, as long as you're comfortable with it. Personally I wouldn't trust strangers on the internet so fast haha

@jeertmans
Copy link
Owner

I mean, as long as you're comfortable with it. Personally I wouldn't trust strangers on the internet so fast haha

I know, but branch protection rules are good to prevent issues :-)

@Rolv-Apneseth
Copy link
Collaborator Author

Fair enough yeah. Thanks for the invite, I'll push that deduplication to v3 (probably) later today

@Rolv-Apneseth Rolv-Apneseth deleted the refactor-v3 branch November 16, 2024 11:20
@jeertmans
Copy link
Owner

Thanks! I plan to be more active on this repo myself, as I plan on using this combined with Typst to write my thesis, but this is unfortunately not a priority at the moment, and this is why I prefer to give you more freedom to work on this project while you are motivated, rather than to stale it forever haha

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants