-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improvements : Speed #3
Comments
Thank you for issue.
|
I tried to replace |
In rust, hashmaps themselves can use an alternative hashing method for speed or other reasons.
This is what the documentation says :
Also sort on vec is Unstable sort is only relevant if there are other indexes to consider. Atleast for the sorted indexes it should yield the same result. In case different results are not desirable, I think it would be a good idea to have a unit or integration test to verify expected behavior. |
OK, I will push sort_unstable to repo. Concerning the hashing function - we need to try to change it. There are a lot of hash operations because of LRU cache, it can help! |
hm, it looks like I tested it wrongly and unstable sort speed up it for 10-20%. |
Some recommendations for hash functions :
If I had to choose, I would have chosen highwayhash. |
Actually it's good question. I tested standard hasher on big file and it was slow. And I added blake3 hasher (not sure if we need fingerprint, as it is in Python version, at all for comparison, maybe we can compare just decoded_payload) charset-normalizer-rs/src/entity.rs Line 178 in 4d9fd71
|
#14 implements aHash. For pure speed with little other considerations, it is good enough. As such for short term, further |
Hashing is used only for caching, so it doesn't matter that it's unstable. Also, we don't serialize hashes via serde. |
I did some experiments with Rayon. At this time it looks like it has high overhead. So I plan to try to rewrite mess measurement code in more efficient way (I have some ideas). |
I was looking into the Unfortunately, as they are it does not seem like they are easily written into functional Rayon code (if at all) I agree that some other parts of the code are not good candidates for Rayon due to overhead. Hope you find a good approach! |
I'm exploring integrating encoding_rs instead of rust-encoding. rust-encoding seems to be abandoned and unmaintained. encoding_rs is better supported. It would obviously take some time. |
Yes, it doesn't fit to the concept of this library. |
It looks like new mess detector approach is faster #31
However I broke a lot of idiomatic code in ms.rs. |
From the list, only If there are any ideas, let me know. In the meantime, feel free to close this comment. |
Yes, I have an idea, to refactor and speed up |
As per our discussion in #2 for speed improvements the following has been suggested
The paths I had in mind was these:
Rayon
DashMap
(Currentstd
HashMap implements rayon so not strictly necessary, but might be useful to look into regardless)Use replace hashing algorithm used in HashMapwithFxHash
,AHash
,HighwayHash
ReplaceSorting was changed to unstable #6sort()
withsort_unstable()
Identfiy preallocation opportunities. For instance, replaceVec::new()
withVec::with_capacity()
Many of these are low hanging fruit and related to refactoring the code to idiomatic Rust code.
For example, there are many for loops in this code. Iterator based code is more idiomatic, easier to improve with
rayon
, and interact better with allocation. (pushing items from within a for loop can cause multiple allocs and copies, while collecting an iterator can allow fewer allocations.)The text was updated successfully, but these errors were encountered: