Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

oxidize part 1 ... an Intro to an ambitious vNext ? #22

Draft
wants to merge 20 commits into
base: main
Choose a base branch
from

Conversation

jrouaix
Copy link
Contributor

@jrouaix jrouaix commented Aug 20, 2024

Hi,

I was looking to implement a "simple" LogDataIterator but I couldn't make sens about all the &mut structs that are passed around just to pick some value in it when the function exit. => shindan-io@2a6e543

And so while digging I found a lot to improve in the current code base and started some big rewrites.

  • avoid mut
  • better nom usage => a lot shorter code
  • avoid &ref on Copy types
  • const when consts

I really think the code could be way more maintainable, and it could help a lot to implement other features, or other apis/usages around this lib.

Would you have a look at this PR ?
If you like it, I'll have a lot more to provide.

I hope I didn't do any regression, but I don't have the same test data to ensure.

We (myself & @mrguiman) think we can really help improving this lib (memory consumption & perhaps some perf).
And we (Shindan) rely on it, so we have a green light from our employer to contribute.

To you have a plan to version 2 ?

Perhaps it could be a good moment to think about a roadmap, we could share the heavy lifting on adding features while also oxidize (aka : make the code Rusty) the rest of the existing code.

let us know @puffyCid

@jrouaix
Copy link
Contributor Author

jrouaix commented Aug 23, 2024

added some more : dsc & uuidtext

@puffyCid
Copy link
Collaborator

thanks @jrouaix for the PR.
I would like to merge #20 first before doing an in-depth review. #20 Adds some CI support which hopefully will catch any possible regressions (fingers crossed).

Couple comments/questions from brief look:

  1. [test_data] feature. Was this added due to the confusion on tests? As mentioned in BUILDING.md. The tests.zip file can be downloaded from the GitHub release page and extracted to your cloned repo. You should be able to then run all the tests. A macOS systems is required to run some. I think ideally all tests should be run by default. They should not take long.
    However, if you want, i think perhaps changing [test_data] to something like [test_live_system] and adding that to the tests:
test_collect_strings_system()
test_collect_timesync_system()
test_collect_shared_strings_system()

Could be useful. These test require a macOS system, hiding these behind a feature flag would be helpful for Linux and Windows users who try to run the tests
2. The conversion from let to const for some variables. Is this mainly, a change for readability (using uppercase variable names)?
3. Avoiding &ref on Copy types. I usually, pass by reference when possible. Just curious on the change to Copy. Is that again a readability change or just trying to make the code more Rusty?
4. Adding the anyhow dependency. I see this only for tests? Just curious why the additional dependency? I usually try to aim for minimal dependencies. Since this additional dev-dependency not a big issue. Just curious what is it helping with?

In regards to library Roadmap. The major things that come to mind:

  1. Support macOS Sequoia. Hopefully, Additional Item Types #20 completes this.

  2. Memory and Performance improvements. This was my first large/complex Rust project. Looking back definitely some things that could probably be done differently. If you see opportunities that could make the project more Rusty or improvement memory/performance. Definitely interested in changes or ideas

  3. Support more custom objects in logs. Currently the library supports a variety of objects/structs in the log data ([decoders])(https://github.com/mandiant/macos-UnifiedLogs/tree/main/src/decoders). However, not all are supported (ex: Support dnsinfo and nwi custom objects #10). I have been busy with other code projects and have not had time to tackle some of them

Are there any other things you think could be worth adding/changing?

@jrouaix
Copy link
Contributor Author

jrouaix commented Aug 26, 2024

Hey @puffyCid, thank you for the answer

Let's answer on my side:
1 =>

  • my bad, I didn't RTFM, so I though the test_data was kept private, and I was wrong, so i'll remove the test_data feature I used to filter out thoses tests.
  • for the mac os only tests, a simple #[cfg(target_os = "macos")] will do the trick
  • it seems my rewrite has a lot of regressions, I did not expect so much, will debug it before you have a look

2 =>

  • yes, const and SCREAMING_CASE are the idiomatic way to store constant values in rust, i'm not sure is has any perf improvement in this specific case, but at least you get to know it's a compare to constant (so kinda static behavior instead or more dynamic behavior) when reading

3 =>

4 =>

  • yes you right about avoiding dependencies, specially anyhow won't help in a library, it's more Result type used in end projects (or tests).
  • for libs it's recommended to implement a specifiq Error type (thiserror crate is super usefull for that, but can do it by hand)
  • My preference for using anyhow in tests is that I try to avoid having code that can panic, even in tests, so I avoid using .unwrap() or expect("...")
  • so anyhow in the tests allow me to simply use the nice ? operator.

Now about the Roadmap, I hope I can help a lot on memory & perfs. This is you first project ? You @puffyCid are Mr A. Hlcmb ?

And we had some ideas :

iter over files

having low level iterators could reduce memory consumption a lot

zero copy

We could (not sure it's possible) just have low level iterator of LogData<'lifetime_of_file_byte_array>.
Such a parse would be very lightweight, check constraints but be super fast.

lazy formatting

Assuming the previous one LogData could just embed refs to what is needed to format messages
and allocate copies of the data (when needed). We could allow formatting with a .message() function.
This would allow us to have some use case like "scan throught the unified log looking for some message format (&str) and extract values only when match"

Perhaps I should write and issue for each of thoses, in order to get comments

@jrouaix jrouaix marked this pull request as draft August 26, 2024 13:32
@jrouaix
Copy link
Contributor Author

jrouaix commented Nov 21, 2024

abandonned PR, too big, let's make it more digest for the maintainers : #33

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants