oxidize part 1 ... an Intro to an ambitious vNext ? #22

jrouaix · 2024-08-20T13:36:55Z

Hi,

I was looking to implement a "simple" LogDataIterator but I couldn't make sense of all the &mut structs that are passed around just to pick some value out of them when the function exits. => shindan-io@2a6e543

And so while digging I found a lot to improve in the current code base and started some big rewrites.

avoid mut
better nom usage => a lot shorter code
avoid &ref on Copy types
const when consts
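To make the "avoid mut" point concrete, here is a minimal sketch. `Preamble`, `read_u32_le`, `parse_preamble_mut` and `parse_preamble` are hypothetical names made up for illustration, not the library's actual types or functions:

```rust
/// Hypothetical two-field header, used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Preamble {
    pub chunk_tag: u32,
    pub chunk_sub_tag: u32,
}

/// Reads a little-endian u32 and returns it with the remaining bytes.
fn read_u32_le(data: &[u8]) -> Option<(u32, &[u8])> {
    if data.len() < 4 {
        return None;
    }
    let (head, rest) = data.split_at(4);
    Some((u32::from_le_bytes([head[0], head[1], head[2], head[3]]), rest))
}

/// Out-parameter style: the caller builds a struct, passes `&mut`, and reads
/// the fields back once the function returns.
pub fn parse_preamble_mut(data: &[u8], out: &mut Preamble) -> bool {
    match read_u32_le(data) {
        Some((tag, rest)) => match read_u32_le(rest) {
            Some((sub_tag, _)) => {
                out.chunk_tag = tag;
                out.chunk_sub_tag = sub_tag;
                true
            }
            None => false,
        },
        None => false,
    }
}

/// Value-returning style: no `mut` at the call site, and the remaining input
/// is handed back so parsers can be chained.
pub fn parse_preamble(data: &[u8]) -> Option<(Preamble, &[u8])> {
    let (chunk_tag, rest) = read_u32_le(data)?;
    let (chunk_sub_tag, rest) = read_u32_le(rest)?;
    Some((Preamble { chunk_tag, chunk_sub_tag }, rest))
}
```

Both functions read the same bytes; the second one simply makes the data flow visible in the signature.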

I really think the code could be way more maintainable, and it could help a lot to implement other features, or other apis/usages around this lib.

Would you have a look at this PR ?
If you like it, I'll have a lot more to provide.

I hope I didn't introduce any regressions, but I don't have the same test data to make sure.

We (myself & @mrguiman) think we can really help improve this lib (memory consumption & perhaps some perf).
And we (Shindan) rely on it, so we have a green light from our employer to contribute.

Do you have a plan for a version 2?

Perhaps it could be a good moment to think about a roadmap; we could share the heavy lifting on adding features while also oxidizing (aka: making the code Rusty) the rest of the existing code.

let us know @puffyCid


jrouaix · 2024-08-23T11:55:36Z

added some more : dsc & uuidtext

puffyCid · 2024-08-25T21:06:30Z

thanks @jrouaix for the PR.
I would like to merge #20 first before doing an in-depth review. #20 adds some CI support which hopefully will catch any possible regressions (fingers crossed).

A couple of comments/questions from a brief look:

1. [test_data] feature. Was this added due to confusion about the tests? As mentioned in BUILDING.md, the tests.zip file can be downloaded from the GitHub release page and extracted to your cloned repo. You should then be able to run all the tests. A macOS system is required to run some of them. I think ideally all tests should be run by default. They should not take long.
However, if you want, I think perhaps changing [test_data] to something like [test_live_system] and adding that to the tests:

test_collect_strings_system()
test_collect_timesync_system()
test_collect_shared_strings_system()

could be useful. These tests require a macOS system; hiding them behind a feature flag would be helpful for Linux and Windows users who try to run the tests.
2. The conversion from let to const for some variables. Is this mainly a change for readability (using uppercase variable names)?
3. Avoiding &ref on Copy types. I usually pass by reference when possible. Just curious about the change to Copy. Is that again a readability change or just trying to make the code more Rusty?
4. Adding the anyhow dependency. I see this only for tests? Just curious why the additional dependency? I usually try to aim for minimal dependencies. Since this is an additional dev-dependency, it is not a big issue. Just curious what it is helping with?

Regarding the library roadmap, the major things that come to mind:

Support macOS Sequoia. Hopefully, Additional Item Types #20 completes this.
Memory and performance improvements. This was my first large/complex Rust project. Looking back, there are definitely some things that could probably be done differently. If you see opportunities that could make the project more Rusty or improve memory/performance, I am definitely interested in changes or ideas.
Support more custom objects in logs. Currently the library supports a variety of objects/structs in the log data ([decoders](https://github.com/mandiant/macos-UnifiedLogs/tree/main/src/decoders)). However, not all are supported (ex: Support dnsinfo and nwi custom objects #10). I have been busy with other code projects and have not had time to tackle some of them.

Are there any other things you think could be worth adding/changing?

jrouaix · 2024-08-26T13:32:15Z

Hey @puffyCid, thank you for the answer

Let's answer on my side:
1 =>

my bad, I didn't RTFM, so I thought the test data was kept private, and I was wrong, so I'll remove the test_data feature I used to filter out those tests.
for the macOS-only tests, a simple #[cfg(target_os = "macos")] will do the trick
it seems my rewrite has a lot of regressions, more than I expected; I will debug it before you have a look
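A minimal sketch of how such a gate could look. `default_log_dir` and both test names are hypothetical, and the path is only an assumed location used for illustration:

```rust
/// Hypothetical helper: directory assumed to hold unified log data on macOS.
pub fn default_log_dir() -> &'static str {
    "/private/var/db/diagnostics"
}

#[cfg(test)]
mod tests {
    use super::*;

    // Compiled and run on every platform.
    #[test]
    fn log_dir_is_absolute() {
        assert!(default_log_dir().starts_with('/'));
    }

    // Only compiled on macOS; on Linux and Windows this test does not exist
    // at all, so `cargo test` neither fails nor reports it as ignored.
    #[test]
    #[cfg(target_os = "macos")]
    fn log_dir_exists_on_live_system() {
        assert!(std::path::Path::new(default_log_dir()).exists());
    }
}
```

Compared to a Cargo feature, the attribute needs no extra flag on the command line, at the cost of not being able to skip the live-system tests on a Mac.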

2 =>

yes, const and SCREAMING_CASE are the idiomatic way to store constant values in Rust. I'm not sure it brings any perf improvement in this specific case, but at least when reading you know it's a comparison against a constant (so static behavior rather than dynamic behavior)
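A small sketch of the difference, with made-up tag values (`HEADER_CHUNK_TAG`, `CATALOG_CHUNK_TAG` and both functions are illustrative names, not taken from the code base):

```rust
// Illustrative values only; the real tags live in the library.
const HEADER_CHUNK_TAG: u32 = 0x1000;
const CATALOG_CHUNK_TAG: u32 = 0x600b;

/// `let` version: the value is a local binding, re-declared on each call and
/// visually indistinguishable from any other variable.
pub fn is_header_chunk_let(tag: u32) -> bool {
    let header_chunk = 0x1000;
    tag == header_chunk
}

/// `const` version: a const can also be used as a `match` pattern, whereas a
/// lowercase `let` binding in pattern position would match everything.
pub fn chunk_name(tag: u32) -> &'static str {
    match tag {
        HEADER_CHUNK_TAG => "header",
        CATALOG_CHUNK_TAG => "catalog",
        _ => "unknown",
    }
}
```

The `match` usage is probably the most practical benefit beyond readability.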

3 =>

you're right, passing refs is a good way to share data with a function,
unless the data passed is so small that you don't really get any perf improvement by not copying it (you copy a ref, and have to deref to read it, which can be slower than copying a usize).
that's why when a struct is super small, we just mark it Copy and pass it around without bothering about refs anymore; the compiler then copies it implicitly instead of moving it: https://doc.rust-lang.org/stable/std/marker/trait.Copy.html / https://doc.rust-lang.org/book/appendix-03-derivable-traits.html?highlight=copy#clone-and-copy-for-duplicating-values
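As a rough illustration, assuming a hypothetical 8-byte `Offset` struct (not a type from the library):

```rust
/// Hypothetical small value: two u32 fields, 8 bytes in total, which is the
/// size of a reference on a 64-bit target.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Offset {
    pub start: u32,
    pub len: u32,
}

/// By reference: the callee receives a pointer and dereferences it.
pub fn end_by_ref(offset: &Offset) -> u32 {
    offset.start + offset.len
}

/// By value: because `Offset` is `Copy`, the caller keeps its own copy and
/// can keep using it after the call, with no `&` and no explicit `.clone()`.
pub fn end_by_value(offset: Offset) -> u32 {
    offset.start + offset.len
}
```

With `Copy`, calling `end_by_value(o)` twice on the same `o` compiles fine; without it, the second call would be a use-after-move error.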

4 =>

yes, you're right about avoiding dependencies; anyhow especially won't help in a library, it's more of a Result type used in end projects (or tests).
for libs it's recommended to implement a specific Error type (the thiserror crate is super useful for that, but it can be done by hand)
My preference for using anyhow in tests is that I try to avoid having code that can panic, even in tests, so I avoid using .unwrap() or expect("...")
so anyhow in the tests allows me to simply use the nice ? operator.
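Here is a sketch of both ideas: a hand-written library error type, and a test that returns a Result so it can use `?`. `ParseError`, `parse_signature` and `EXPECTED_SIGNATURE` are hypothetical. To stay dependency-free, the test uses `Box<dyn std::error::Error>` from std; `anyhow::Result<()>` would be used the same way:

```rust
use std::fmt;

/// Illustrative signature value, assumed for this example.
pub const EXPECTED_SIGNATURE: u32 = 0x6677_8899;

/// Hypothetical library error type, written by hand (thiserror could derive
/// the Display and Error impls instead).
#[derive(Debug, PartialEq)]
pub enum ParseError {
    TooShort { needed: usize, got: usize },
    BadSignature(u32),
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ParseError::TooShort { needed, got } => {
                write!(f, "input too short: needed {} bytes, got {}", needed, got)
            }
            ParseError::BadSignature(sig) => write!(f, "unexpected signature {:#010x}", sig),
        }
    }
}

impl std::error::Error for ParseError {}

/// Reads a little-endian u32 signature and checks it against the constant.
pub fn parse_signature(data: &[u8]) -> Result<u32, ParseError> {
    if data.len() < 4 {
        return Err(ParseError::TooShort { needed: 4, got: data.len() });
    }
    let sig = u32::from_le_bytes([data[0], data[1], data[2], data[3]]);
    if sig != EXPECTED_SIGNATURE {
        return Err(ParseError::BadSignature(sig));
    }
    Ok(sig)
}

#[cfg(test)]
mod tests {
    use super::*;

    // Returning a Result lets the test use `?` instead of `.unwrap()`.
    #[test]
    fn parses_signature() -> Result<(), Box<dyn std::error::Error>> {
        let sig = parse_signature(&[0x99, 0x88, 0x77, 0x66])?;
        assert_eq!(sig, EXPECTED_SIGNATURE);
        Ok(())
    }
}
```

A failing `?` in such a test is reported as a test failure with the error's Debug output rather than as a panic.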

Now about the roadmap, I hope I can help a lot on memory & perf. This is your first project? You @puffyCid are Mr A. Hlcmb ?

And we had some ideas :

iter over files

having low level iterators could reduce memory consumption a lot

zero copy

We could (not sure it's possible) just have a low-level iterator of LogData<'lifetime_of_file_byte_array>.
Such a parser would be very lightweight: it would check constraints but still be super fast.
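A minimal sketch of the idea on a toy format (one length byte followed by that many payload bytes). `RawEntry` and `EntryIter` are hypothetical stand-ins; the real log format is far more involved:

```rust
/// Hypothetical borrowed record: it only points into the file buffer.
#[derive(Debug, PartialEq)]
pub struct RawEntry<'a> {
    pub payload: &'a [u8],
}

/// Iterator that walks the buffer and yields borrowed entries one at a time,
/// without copying payloads and without building a Vec of all entries.
pub struct EntryIter<'a> {
    remaining: &'a [u8],
}

impl<'a> EntryIter<'a> {
    pub fn new(data: &'a [u8]) -> Self {
        Self { remaining: data }
    }
}

impl<'a> Iterator for EntryIter<'a> {
    type Item = RawEntry<'a>;

    fn next(&mut self) -> Option<Self::Item> {
        let (&len, rest) = self.remaining.split_first()?;
        let len = len as usize;
        // Constraint check: a truncated entry ends the iteration.
        if rest.len() < len {
            self.remaining = &[];
            return None;
        }
        let (payload, rest) = rest.split_at(len);
        self.remaining = rest;
        Some(RawEntry { payload })
    }
}
```

The `'a` lifetime ties every `RawEntry` to the byte array it was parsed from, so the compiler rejects any use of an entry after the buffer is dropped.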

lazy formatting

Assuming the previous one, LogData could just embed refs to what is needed to format messages
and allocate copies of the data only when needed. We could allow formatting with a .message() function.
This would allow us to have some use cases like "scan through the unified log looking for some message format (&str) and extract values only on a match"
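A rough sketch of what lazy formatting could look like. `LazyEntry`, `message` and `find_by_format` are hypothetical, and the `%s`-only substitution is a deliberate simplification of real log format strings:

```rust
use std::borrow::Cow;

/// Hypothetical entry that borrows its format string and arguments.
pub struct LazyEntry<'a> {
    pub format: &'a str,
    pub args: &'a [&'a str],
}

impl<'a> LazyEntry<'a> {
    /// Builds the message on demand. Formats without placeholders are
    /// returned borrowed, so nothing is allocated for them.
    pub fn message(&self) -> Cow<'a, str> {
        if !self.format.contains("%s") {
            return Cow::Borrowed(self.format);
        }
        let mut out = String::with_capacity(self.format.len());
        let mut args = self.args.iter();
        let mut pieces = self.format.split("%s");
        if let Some(first) = pieces.next() {
            out.push_str(first);
        }
        // Each remaining piece is preceded by one substituted argument.
        for piece in pieces {
            out.push_str(args.next().copied().unwrap_or("<missing>"));
            out.push_str(piece);
        }
        Cow::Owned(out)
    }
}

/// Scans entries by comparing format strings only, and formats just the
/// entries that match.
pub fn find_by_format(entries: &[LazyEntry<'_>], wanted: &str) -> Vec<String> {
    entries
        .iter()
        .filter(|entry| entry.format == wanted)
        .map(|entry| entry.message().into_owned())
        .collect()
}
```

In this sketch the scan costs one string comparison per entry, and the allocation happens only for the matches.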

Perhaps I should write an issue for each of those, in order to get comments

jrouaix · 2024-11-21T15:37:24Z

abandoned PR, too big; let's make it more digestible for the maintainers: #33

jrouaix added 13 commits August 20, 2024 10:07

LogPreamble parsing refacto

bceb643

HeaderChunk refacto

9e2a015

catalogchunk parse refacto

2908a57

test_data feature (to enable running only the test no needing test da…

d730e5b

…ta files)

there was no unit tests for this so I fear test data will fail

8d4f643

let consts be consts

3235e5a

do not &ref Copy types

d539ab0

Merge branch 'main' into some_rewrites

05759b5

debug after merge

5eb94a9

UUIDText parser rewrite

61631d5

back to ""

8efff8d

crafted a unit test for dsc

355fa04

dsc parser rewritten

ba9f5a2

this line really didn't fill any purpose

bd59297

jrouaix added 3 commits August 26, 2024 12:23

dsc tests are back

c58b8ee

#[cfg(target_os = "macos")] on 3 tests

235ab8e

test_data feature was a bad idea

b658bfa

jrouaix marked this pull request as draft August 26, 2024 13:32

jrouaix added 2 commits August 26, 2024 16:36

rolled back to parse all chunks with reparse preamble

f5d6586

debugged catalog

0e1dc28

puffyCid mentioned this pull request Sep 24, 2024

Attempt at UnifiedLog Iterator #26

Merged

a little const

5631042

jrouaix mentioned this pull request Nov 21, 2024

shorter logpreamble parse implementation by more nom combinators usage #33

Merged
