Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up Unicode Data lookups and remove reflection. #281

Merged
merged 8 commits into from
Jun 29, 2022

Conversation

JimBobSquarePants
Copy link
Member

@JimBobSquarePants JimBobSquarePants commented Jun 24, 2022

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following matches the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

This PR does a couple of things.

I chose to remove the compression from the trie packing for performance since the binaries do not grow out of hand and the benefit is great. @Gillibald maybe this is something to consider for Avalonia?

There's no before/after benchmarks unfortunately because I added them as an afterthought, however I took a sampling snapshot early in the process. As you can see, loading Unicode data is no longer a hotspot.

Before
image

After
image

I don't want to go crazy optimizing much further in this PR but will have a good scan through the sampling data to see if there's some low hanging fruit at another time.

CC @Sergio0694 @kant2002 @0xced

@codecov
Copy link

codecov bot commented Jun 24, 2022

Codecov Report

Merging #281 (7438b74) into main (1b80769) will increase coverage by 1%.
The diff coverage is 99%.

@@          Coverage Diff           @@
##            main    #281    +/-   ##
======================================
+ Coverage     81%     83%    +1%     
======================================
  Files        215     222     +7     
  Lines      11278   12200   +922     
  Branches    1757    1756     -1     
======================================
+ Hits        9244   10167   +923     
+ Misses      1607    1606     -1     
  Partials     427     427            
Flag Coverage Δ
unittests 83% <99%> (+1%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...bors.Fonts/Unicode/Resources/BidiTrie.Generated.cs 100% <ø> (ø)
....Fonts/Unicode/Resources/GraphemeTrie.Generated.cs 100% <ø> (ø)
...Fonts/Unicode/Resources/LineBreakTrie.Generated.cs 100% <ø> (ø)
...rs.Fonts/Unicode/Resources/ScriptTrie.Generated.cs 100% <ø> (ø)
...Unicode/Resources/UnicodeCategoryTrie.Generated.cs 100% <ø> (ø)
src/SixLabors.Fonts/Unicode/UnicodeTrieBuilder.cs 85% <50%> (ø)
src/SixLabors.Fonts/Unicode/BidiClass.cs 62% <100%> (ø)
src/SixLabors.Fonts/Unicode/CodePoint.cs 91% <100%> (+<1%) ⬆️
src/SixLabors.Fonts/Unicode/JoiningClass.cs 100% <100%> (ø)
...s/Unicode/Resources/ArabicShapingTrie.Generated.cs 100% <100%> (ø)
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1b80769...7438b74. Read the comment docs.

@Gillibald
Copy link

Nice work @JimBobSquarePants

@kant2002
Copy link
Contributor

I'm not sure how do you get trie files, but I think it's better to keep source trie files in source control and document generation process. I understand that nobody want run file generation process on the build, but that process easy to forget, and maybe somebody would like to improve it, so better to have source files around.
Or maybe I miss something in the changes ?

@JimBobSquarePants
Copy link
Member Author

Thanks @Gillibald your idea to create classes was a great one!

@kant2002 The trie files are created via the Generator project and simply contained a header and the parsed bytes. That project reads a bunch of Unicode spec data files and creates the output.

Here's the project in the main branch.
https://github.com/SixLabors/Fonts/tree/main/src/UnicodeTrieGenerator

If you look at Generator.cs in this PR we're simply switching out the file generation, instead opting to create classes containing the trie data. All the source files required to create those classes are in the solution.

We definitely do not need to build them every time, though the process is pretty quick since the spec files are stored locally. We actually won't need to build these again until Unicode 15 is released and that will require a dedicated PR to do so there's no opportunity to forget.

@Sergio0694
Copy link
Member

Very nice! I assume this will also be easier to maintain than a locally referenced source generator, so we can shelf that idea for now. I think this approach makes sense given you're now leveraging this to also do some more custom preprocessing when generating the blobs, and not just serializing an already existing file. Very nice work James! 😄

@kant2002
Copy link
Contributor

@JimBobSquarePants thanks for explanation. Yeah, now it make sense to me. Thanks you for caring about such niche workloads 😄

@Gillibald
Copy link

I will adapt your changes for Avalonia. Really appreciate your work.

@JimBobSquarePants
Copy link
Member Author

Imma gonna merge this.

@JimBobSquarePants JimBobSquarePants merged commit fc97cb1 into main Jun 29, 2022
@JimBobSquarePants JimBobSquarePants deleted the js/perf-tweaks branch June 29, 2022 09:13
@DaZombieKiller
Copy link
Contributor

I just thought it'd be worth pointing out that the UnicodeTrie-related code won't work on big endian systems (and will potentially fail or throw DataMisalignedException on systems where aligned access is required). Something more along the lines of this would solve both of those issues:

struct UnicodeTrieHeader
{
    public int HighStart;
    public uint ErrorValue;
    public int DataLength;
}

// ...

readonly uint[] data;
readonly int highStart;
readonly uint errorValue;

public UnicodeTrie(ReadOnlySpan<byte> rawData)
{
    // MemoryMarshal.Read instead of MemoryMarshal.Cast/Unsafe.As
    // because it performs an unaligned read and a bounds check.
    var header = MemoryMarshal.Read<UnicodeTrieHeader>(rawData);

    if (!BitConverter.IsLittleEndian)
    {
        header.HighStart  = BinaryPrimitives.ReverseEndianness(header.HighStart);
        header.ErrorValue = BinaryPrimitives.ReverseEndianness(header.ErrorValue);
        header.DataLength = BinaryPrimitives.ReverseEndianness(header.DataLength);
    }

    highStart  = header.HighStart;
    errorValue = header.ErrorValue;
    data       = new uint[header.DataLength / sizeof(uint)];

    // Avoid MemoryMarshal.Cast because we don't know if the rawData buffer is aligned
    // to uint, which would potentially throw DataMisalignedException if it weren't.
    rawData[Unsafe.SizeOf<UnicodeTrieHeader>()..].CopyTo(MemoryMarshal.AsBytes(data.AsSpan()));

    if (!BitConverter.IsLittleEndian)
    {
        for (int i = 0; i < data.Length; i++)
        {
            data[i] = BinaryPrimitives.ReverseEndianness(data[i]);
        }
    }
}

@JimBobSquarePants
Copy link
Member Author

JimBobSquarePants commented Jun 29, 2022

We know that the buffer is aligned and we also know the endianness though? We create the data.

@DaZombieKiller
Copy link
Contributor

I wasn't able to find any actual usages of the ReadOnlySpan<byte> constructor, so that wasn't immediately clear to me and it seemed like a potential oversight. If that's the case then you can disregard my points above.

@TechPizzaDev
Copy link
Contributor

The data is coming from the ReadOnlySpan<byte> constants isn't it? That means that casting the bytes (which are most likely little-endian) into integers will break on big-endian.

@JimBobSquarePants
Copy link
Member Author

Ooh. You're right! The endianness definitely matters since it's a compiled constant. I'll update unless you want to create a PR @DaZombieKiller ?

@DaZombieKiller
Copy link
Contributor

@JimBobSquarePants Sure! Opened one here: #284

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UnicodeData class is not working without reflection
7 participants