Alternative formatting processing #42

cdesaintguilhem · 2024-10-14T16:12:30Z

Hi there, as I was tinkering to figure out how to wrap closer to the given wrapping parameter one thing led to another and I arrived at an alternative logic for some of the steps that the formatting loop makes. I thought I'd share the code here in case you would be interested; if there are particular changes you like, I would be happy to cherry-pick them into their own PR and close this one up for the archives.

All the changes in this PR are to the logic only and not to the formatting style, all tests still pass unchanged (save for the addition of a \begin{env1} inside a comment to check that it is properly ignored). However, with these changes, I think that wrapping tightly to the wrapping parameter can be achieved by changing the wrapping_boundary to use args.wrap instead of args.wrap_min. I'm still testing this on another branch, but this requires changing all the test file to account for the tighter expected wrap.

Summary of changes

The main formatting loop processes (splits and wraps) lines only by reference, allocating new memory only when applying the indentation, or when returning lines to the queue after wrapping them.
To avoid allocating when indenting only to return the line to the queue without adding it to the text, calculating the indent is separated from applying the indent. This allows moving the indent application until after the wrapping computation.
Since the indent is no longer applied to the line before wrapping it, the wrapping logic is expanded to instead take into account the calculated indent when deciding whether a line needs to be wrapped and, if so, where it should be wrapped.
Since some languages have 0-width chars (such as the French accents `,´,ˆ, etc.), I suspect that going with .chars() could cause inconsistent wrapping in non-English languages. This code introduces the unicode-width dependency to obtain the on-screen width of each character in the string and calculate a wrapping point that visually matches the wrapping parameter.
The splitting of lines is done with a new capturing regex from which the index of the matches can be extracted. This then allows returning two string slices: one containing everything before the split point, and the other containing the rest. This Regex could be easily expanded to capture other commands at which the line should be split (such as \section{...} or \[ for example).
The wrapping of lines is also done by reference by returning a string slice for the line before the wrap point, a slice containing the start of the next line (to add "% " in front of comments), and a slice containing the rest of the line.

Misc changes

Cleaning the text (removing tabs and trailing whitespace) is extracted into a function.
Deciding whether to ignore a line (based on verbatim environments or tex-fmt ignore commands) is extracted into a function, and checked before proceeding to the formatting logic.
Checking whether formatting returns to zero is done by checking the value of indent.actual after exiting the loop instead of accessing the lines of the text.
The test_source() function is modified slightly to panic on the first file that formats incorrectly and to print the file name. The computation of the diff was taking a very long time when many files were failing because of many diffs.

Performance Report

As show from the output below, these changes bring a (63.0 -> 57.1 = ) 9.3% improvement in performance on my machine.

Switched to branch 'main'
Your branch is up to date with 'origin/main'.
Building release binary
   Compiling tex-fmt v0.4.5 (/Users/cyprien/Code/tex-fmt)
    Finished `release` profile [optimized] target(s) in 11.74s
Running benchmark
Benchmark 1: tex-fmt-main
  Time (mean ± σ):      63.0 ms ±   0.7 ms    [User: 58.7 ms, System: 3.8 ms]
  Range (min … max):    62.4 ms …  72.6 ms    200 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
 
branch 'pr-refactor-logic' set up to track 'origin/pr-refactor-logic'.
Switched to a new branch 'pr-refactor-logic'
Building release binary
  Downloaded unicode-width v0.2.0
  Downloaded 1 crate (271.5 KB) in 0.21s
   Compiling unicode-width v0.2.0
   Compiling tex-fmt v0.4.5 (/Users/cyprien/Code/tex-fmt)
    Finished `release` profile [optimized] target(s) in 11.78s
Running benchmark
Benchmark 1: tex-fmt-branch
  Time (mean ± σ):      57.1 ms ±   0.5 ms    [User: 53.0 ms, System: 3.8 ms]
  Range (min … max):    56.5 ms …  61.6 ms    200 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs.
 
tex-fmt-main: 0.062972017995s
tex-fmt-branch: 0.05712572198499997s

WGUNDERWOOD · 2024-10-14T16:24:24Z

Thanks, this is very exciting! I will look through this carefully soon; your contributions are greatly appreciated.

src/format.rs

WGUNDERWOOD · 2024-10-15T10:06:11Z

Other than the comment above, this all looks good to me. Perhaps you can rebase it onto main, noting that I have split parse.rs into cli.rs and read.rs to separate the CLI logic from the file reading logic.

cdesaintguilhem · 2024-10-15T12:37:57Z

I've now merged your latest changes to main into this branch and addressed the typo.

I'm happy to squash all of these commits into one on my end, but I'm wondering whether this is something that you're also able to do on your side. Up to you if you want to decide what the merge commit should say or if you're happy for me come up with something.

WGUNDERWOOD · 2024-10-15T12:41:31Z

Yes I think it's best if you squash on your side before merging. Also if you ever want to dig deeper into performance issues, you can run the extra/prof.sh script to produce a flamegraph. Right now it seems that the .count() method in the get_diff function in indent.rs is the main candidate for improvement.

Squashed commit of the following: commit ccbc433 Author: Cyprien de Saint Guilhem <[email protected]> Date: Tue Oct 15 14:13:35 2024 +0200 Fix typo commit 0e4ec8c Merge: 52fe1f7 3f658fd Author: Cyprien de Saint Guilhem <[email protected]> Date: Tue Oct 15 14:07:14 2024 +0200 Merge remote-tracking branch 'upstream/main' into pr-refactor-logic commit 52fe1f7 Author: Cyprien de Saint Guilhem <[email protected]> Date: Mon Oct 14 22:19:16 2024 +0200 Document added regex commit 1418812 Author: Cyprien de Saint Guilhem <[email protected]> Date: Mon Oct 14 18:15:44 2024 +0200 Fix Clippy warnings commit 6c248db Author: Cyprien de Saint Guilhem <[email protected]> Date: Mon Oct 14 18:04:11 2024 +0200 Use `char::width()` for checking wrapping need commit 3f658fd Author: William G Underwood <[email protected]> Date: Mon Oct 14 15:28:00 2024 +0100 Clippy commit 0dcbc93 Author: William G Underwood <[email protected]> Date: Mon Oct 14 15:03:15 2024 +0100 Split cli and read files commit 3ca0fd2 Author: Cyprien de Saint Guilhem <[email protected]> Date: Mon Oct 14 14:09:18 2024 +0200 Remove unused function commit b4e2130 Author: Cyprien de Saint Guilhem <[email protected]> Date: Mon Oct 14 14:08:45 2024 +0200 Wrap lines based on indent length and only apply indentation after wrapping commit 9bc0a3b Author: Cyprien de Saint Guilhem <[email protected]> Date: Sun Oct 13 20:12:44 2024 +0200 Move allocation outside of wrap application commit 76d9ef0 Author: Cyprien de Saint Guilhem <[email protected]> Date: Sat Oct 12 11:17:43 2024 +0200 Separate indent calculation from application commit eb7971e Author: Cyprien de Saint Guilhem <[email protected]> Date: Sat Oct 12 10:46:56 2024 +0200 Extract indent calculation into function commit e41a66d Author: Cyprien de Saint Guilhem <[email protected]> Date: Sat Oct 12 10:31:58 2024 +0200 Apply clippy suggestions commit 21bb908 Author: Cyprien de Saint Guilhem <[email protected]> Date: Fri Oct 11 17:43:14 2024 +0200 Rename ignore check and document commit ac164ad Author: Cyprien de Saint Guilhem <[email protected]> Date: Fri Oct 11 17:13:31 2024 +0200 Simplify Option matching commit 8a6e868 Author: Cyprien de Saint Guilhem <[email protected]> Date: Fri Oct 11 12:58:12 2024 +0200 Remove unnecessary comment check and `Option` in return type when splitting lines commit 1d15b86 Author: Cyprien de Saint Guilhem <[email protected]> Date: Fri Oct 11 12:57:17 2024 +0200 Change comments around commit e77b629 Author: Cyprien de Saint Guilhem <[email protected]> Date: Fri Oct 11 12:56:46 2024 +0200 Refactor ignoring check as function commit b10b010 Author: Cyprien de Saint Guilhem <[email protected]> Date: Fri Oct 11 11:37:18 2024 +0200 Move environment line splitting before indenting This avoids re-adding the start of the split line to the queue. commit 0424d0a Author: Cyprien de Saint Guilhem <[email protected]> Date: Fri Oct 11 11:23:01 2024 +0200 Move check for ignore outside of `apply_indent()` commit 70ac614 Author: Cyprien de Saint Guilhem <[email protected]> Date: Fri Oct 11 09:29:55 2024 +0200 Use new return type for `put_env_new_line` commit fdea16b Author: Cyprien de Saint Guilhem <[email protected]> Date: Fri Oct 11 09:28:54 2024 +0200 Fix infinite loop issue when split points are in comments commit 791c7f1 Author: Cyprien de Saint Guilhem <[email protected]> Date: Fri Oct 11 09:27:23 2024 +0200 Correct to not expect `{` after `\item` but only after `\begin` and `\end` commit 6e5ce6c Author: Cyprien de Saint Guilhem <[email protected]> Date: Fri Oct 11 09:26:06 2024 +0200 Match target tex file to source file commit 9562af1 Author: Cyprien de Saint Guilhem <[email protected]> Date: Thu Oct 10 22:27:53 2024 +0200 Remove debug print statement commit 0f48547 Author: Cyprien de Saint Guilhem <[email protected]> Date: Thu Oct 10 18:10:00 2024 +0200 Split environment and item lines without allocating This does not compile because the `format` function cannot yet deal with the new return type. commit c49a08c Author: Cyprien de Saint Guilhem <[email protected]> Date: Thu Oct 10 13:27:48 2024 +0200 Check indentation return to 0 on `state` instead of `new_text` commit 1b90f21 Author: Cyprien de Saint Guilhem <[email protected]> Date: Thu Oct 10 13:11:57 2024 +0200 Split condition tree into distinct blocks with `continue` commit e974225 Author: Cyprien de Saint Guilhem <[email protected]> Date: Thu Oct 10 11:07:08 2024 +0200 Plan line formatting logic steps commit 4bf802a Author: Cyprien de Saint Guilhem <[email protected]> Date: Thu Oct 10 10:53:42 2024 +0200 Group lines to clean text into one function commit 380380d Author: William G Underwood <[email protected]> Date: Thu Oct 10 14:33:04 2024 +0100 Release process notes

cdesaintguilhem · 2024-10-15T13:03:00Z

Squash completed.

I've kept the messages in the squash commit's message; and I've kept a branch on my fork with the individual commits.

WGUNDERWOOD · 2024-10-15T14:12:24Z

Could you add an example test of where the old code would fail on zero-width chars? I tried an example with a French accent in tests/source/unicode.tex, but using .width() doesn't seem necessary for this one to pass.

cdesaintguilhem · 2024-10-15T17:21:25Z

As a first attempt: the following line (with two spaces before "Aujourd'hui" and where the 70th visual column is between "vérifier" and "is")

y̆es  Aujourd'hui je me suis baladé déhors. Hier je suis allé vérifier si l'été était arrivé.

is wrapped by tex-fmt 0.4.5 to

y̆es  Aujourd'hui je me suis baladé déhors. Hier je suis allé
vérifier si l'été était arrivé.

So it's wrapped at the 61-st visual place, instead of the expected 70th (because of wrap_min)

However, running the current main with cargo run wraps it to

y̆es  Aujourd'hui je me suis baladé déhors. Hier je suis allé vérifier
si l'été était arrivé.

The issue with what I said in the PR is that French accents are not 0-width characters; indeed letters like é are single characters in their own right, rather than being e + ´. However, introducing the (Hungarian?) letter y̆ causes the difference in behaviour since this is represented with two separate characters.

If there is a performance loss in using the function from unicode-width it might be worth removing it given that it might not affect as many languages as I thought.

WGUNDERWOOD · 2024-10-15T21:10:56Z

I do see a slight performance degradation using unicode-width. I'll look into it a little more to see how likely we are to encounter problems if we remove it, but my guess is that we don't need to worry about this too much, as LaTeX is designed so that most accents can be expressed using standard characters, such as {\'e} for example.

cdesaintguilhem · 2024-10-16T06:28:54Z

That's a good point about how LaTeX is meant to offer other ways for the user to express accents, without assuming that the source is UTF-8 encoded. Some of the scripts for which I suspect there could be more 0-width characters would include Mandarin, Hindi, Arabic or maybe Korean or Japanese but I don't know enough about these to check. (I also don't know how much LaTeX is used with these languages.)

One thought I had would be to extract the code that calculates a line's length into a function, and have a feature (something like true-unicode) which controls how that function is implemented (using unicode-width::width() when the feature is enabled, or just .char_indices() when not).

WGUNDERWOOD · 2024-10-16T10:24:32Z

A list of characters for which the width needs modifying is given in the unicode_width crate docs. My impression is that these will be rarely encountered, so I'll remove this dependency for now.

WGUNDERWOOD reviewed Oct 15, 2024

View reviewed changes

src/format.rs Outdated Show resolved Hide resolved

cdesaintguilhem changed the base branch from develop to main October 15, 2024 12:07

cdesaintguilhem force-pushed the pr-refactor-logic branch from ccbc433 to db634ec Compare October 15, 2024 13:01

WGUNDERWOOD merged commit f73426c into WGUNDERWOOD:main Oct 15, 2024
9 checks passed

cdesaintguilhem deleted the pr-refactor-logic branch October 15, 2024 13:46

WGUNDERWOOD mentioned this pull request Oct 15, 2024

Create test files for moving sectioning commands to their own line #36

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alternative formatting processing #42

Alternative formatting processing #42

cdesaintguilhem commented Oct 14, 2024

WGUNDERWOOD commented Oct 14, 2024 •

edited

Loading

WGUNDERWOOD commented Oct 15, 2024

cdesaintguilhem commented Oct 15, 2024

WGUNDERWOOD commented Oct 15, 2024

cdesaintguilhem commented Oct 15, 2024

WGUNDERWOOD commented Oct 15, 2024

cdesaintguilhem commented Oct 15, 2024

WGUNDERWOOD commented Oct 15, 2024

cdesaintguilhem commented Oct 16, 2024

WGUNDERWOOD commented Oct 16, 2024

Alternative formatting processing #42

Alternative formatting processing #42

Conversation

cdesaintguilhem commented Oct 14, 2024

Summary of changes

Misc changes

Performance Report

WGUNDERWOOD commented Oct 14, 2024 • edited Loading

WGUNDERWOOD commented Oct 15, 2024

cdesaintguilhem commented Oct 15, 2024

WGUNDERWOOD commented Oct 15, 2024

cdesaintguilhem commented Oct 15, 2024

WGUNDERWOOD commented Oct 15, 2024

cdesaintguilhem commented Oct 15, 2024

WGUNDERWOOD commented Oct 15, 2024

cdesaintguilhem commented Oct 16, 2024

WGUNDERWOOD commented Oct 16, 2024

WGUNDERWOOD commented Oct 14, 2024 •

edited

Loading