Report carve #891

AndrewFasano · 2024-07-02T16:40:48Z

Adds a new report type, CarveReport, and reports when files are carved. Fixes "Some directories are not reported" (#554).
Changes the directory suffix used by carve to _carve by default. Fixes "distinguish between _carve and _extract directory" (#326).

I really wanted to change carving to generate a subtask where there's a task for the original file being processed that has subtasks for each carved file. Then each carved file is another task that generates subtasks for the extraction. But I couldn't find a sane way to do this - would love any feedback or help with it if y'all think that's a better way to approach this.

e3krisztian · 2024-07-03T23:05:32Z

The addition of CarveReport looks clean enough (will have a second look to verify).

However the change from _extract to _carve directory suffix is backward incompatible, and while generally I like the idea (I have opened the linked issue it solves), I think it would be better for the default for ExtractionConfig.carve_suffix to be _extract (=no change to current output), and make it a command line option to override it.
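The backward-compatible default suggested here can be sketched roughly as follows. This is only an illustration, not unblob's actual code: `ExtractionConfigSketch`, its fields, and the `--carve-suffix` and `--extract-root` flags are hypothetical stand-ins, and `argparse` is used only to keep the example self-contained.

```python
import argparse
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class ExtractionConfigSketch:
    """Hypothetical stand-in for ExtractionConfig (only the relevant fields)."""

    extract_root: Path
    extract_suffix: str = "_extract"
    # Defaulting to the extraction suffix keeps the current output layout.
    carve_suffix: str = "_extract"

    def carve_dir_for(self, path: Path) -> Path:
        # Carve directory = extract root + file name + carve suffix.
        return self.extract_root / (path.name + self.carve_suffix)


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical flag names; the real CLI option is not defined in this thread.
    parser = argparse.ArgumentParser(prog="carve-suffix-sketch")
    parser.add_argument("--carve-suffix", default="_extract")
    parser.add_argument("--extract-root", type=Path, default=Path("."))
    return parser


def config_from_args(argv: List[str]) -> ExtractionConfigSketch:
    args = build_parser().parse_args(argv)
    return ExtractionConfigSketch(
        extract_root=args.extract_root,
        carve_suffix=args.carve_suffix,
    )
```

With no arguments the carve directory keeps the `_extract` suffix, so existing output is unchanged; passing `--carve-suffix _carve` opts into the new layout.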

I had about the same idea of simplifying the double step extraction (carve then extract chunks). I thought, if a file is not fully recognized by any handler (=has multiple chunks), it should be categorized as "unknown" (or "composite") and handled by a "default" handler, which would recognize and extract (carve) chunks, and could also assign handlers to them. This would be exactly what you wanted - one task for each file. I went so far as to make an experimental refactor to work like this 2 years ago, but it became too big of a change with untidy commits to review, and also probably with some bad decisions, and thus was abandoned without much consideration. One of the problems to solve is how to pass the handler between processes to avoid duplicating the expensive handler selection. With the current solution it is not needed: handler selection and extraction happen in the same process.

I do think understanding and reasoning about a flattened extraction process would be easier, but it would be a lot of work to rewrite the code now.

Could you explain why you would like to handle chunk-files by separate (sub-)tasks? Maybe we can come up with a solution for that problem.

AndrewFasano · 2024-07-08T13:30:41Z

Thanks for the feedback. I updated the PR to set the default carve_suffix to be _extract. I also fixed a type issue pyright caught, hopefully it will pass the CI checks now.

Thanks for pointing me at #464 - I can now see how complex that change would be and will avoid going down that path! What I'm trying to accomplish here is described in #878, but the short version is that I want to collect a clean version of each extraction produced by unblob. Specifically I'm trying to recover multiple partitions from firmware and package them up into archives for subsequent analysis. I don't want to have any *_extract directories or the ####-####.<type> files created by unblob in the packaged archives.

I've managed to do this using a terrible incantation of find to identify *_extract directories and then run tar on each with various --exclude flags to filter out unblob artifacts within the directory, but I figure there must be a better way either using the unblob API or by parsing the extraction report. If you have any tips, please let me know.
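The find/tar approach described above could be sketched in Python roughly like this. It is a minimal sketch under stated assumptions, not an unblob API: the `_extract` suffix and the chunk file pattern are inferred from the `*_extract` and `####-####.<type>` descriptions in this comment, and `package_extractions` is a hypothetical helper name.

```python
import re
import tarfile
from pathlib import Path
from typing import List, Optional

EXTRACT_SUFFIX = "_extract"
# Assumed shape of carved chunk files: <start>-<end>.<type>
CHUNK_FILE_RE = re.compile(r"^\d+-\d+\.[^.]+$")


def _skip_artifacts(info: tarfile.TarInfo) -> Optional[tarfile.TarInfo]:
    """Drop nested extraction directories and carved chunk files."""
    name = Path(info.name).name
    if info.isdir() and name.endswith(EXTRACT_SUFFIX):
        return None  # returning None also skips everything below the directory
    if info.isfile() and CHUNK_FILE_RE.match(name):
        return None
    return info


def package_extractions(root: Path, dest: Path) -> List[Path]:
    """Write one .tar.gz per *_extract directory found under root."""
    dest.mkdir(parents=True, exist_ok=True)
    extract_dirs = sorted(
        p for p in root.rglob("*" + EXTRACT_SUFFIX) if p.is_dir()
    )
    archives = []
    for index, extract_dir in enumerate(extract_dirs):
        # Strip the suffix so the archive's top-level name is the original file name.
        clean_name = extract_dir.name[: -len(EXTRACT_SUFFIX)] or "extraction"
        archive = dest / f"{index:04d}_{clean_name}.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(extract_dir, arcname=clean_name, filter=_skip_artifacts)
        archives.append(archive)
    return archives
```

Each nested `*_extract` directory ends up in its own archive, which mirrors running tar once per directory with `--exclude` filters. `dest` should live outside `root` so the archives are not picked up by the directory walk.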

qkaiser · 2024-09-24T07:38:41Z

I'm back ! Did anything happen with this @e3krisztian ?

e3krisztian · 2024-09-24T08:27:31Z

> I'm back ! Did anything happen with this @e3krisztian ?

Welcome back @qkaiser !
There was no progress on it, unfortunately.

I think this carve suffix needs to be configurable from the CLI, and I did not get around to it.

e3krisztian · 2024-10-26T17:18:28Z

There are 2 features added in this PR (which if I understand properly could each solve the problem of some directories not being reported):

- report carving
- make carve suffix configurable

I think it would be cleaner as 2 independent PRs, or choosing only one of them.

I have made an attempt at adding the command line option, rebasing, and doing some cleanups, then looked at it again (see https://github.com/onekey-sec/unblob/tree/report_carve) and did some local experiments.

Based on the experiment results, I would go with reporting the carves (as the title says), as it is almost there, only 2 tests need to be modified to know about the new report type.

Unfortunately the suffix change has bigger problems:

- The extract root is by default the current directory, and it is the carve/extract directory (so far one and the same) whose existence is checked (do we need --force to delete it?). This somewhat worked so far: when extracting firmware.bin, we could find a firmware.bin_extract directory in the current directory, even if that directory was non-empty. Now this would change, as we would find either firmware.bin_extract or firmware.bin_carves (with _carves as the suffix), depending on whether unblob recognizes the file format and whether there is extra padding. I think this is confusing. It would be cleaner to require a non-existing root directory, but that is not backward compatible.
- Currently extraction and carving share the same _extract suffix, which makes things somewhat easy, but when the carve and extract directories are derived differently, the internal variable naming becomes a total mess (one might argue that the mess exists now). E.g. this line needs some work to get right: self.carve_dir = config.get_extract_dir_for(self.task.path)

(the above is not strictly what I saw, but it is probably because of some further mixing up of carve and extraction suffix)

qkaiser · 2024-10-29T07:36:23Z

> There are 2 features added in this PR (which if I understand properly could each solve the problem of some directories not being reported):
>
> report carving
>
> make carve suffix configurable
>
> I think it would be cleaner as 2 independent PR-s, or choosing only one of them.

Agree. Let's make 2 independent merge requests.

> I have made an attempt at adding the command line, rebase, do some cleanups, then to look at it again (see https://github.com/onekey-sec/unblob/tree/report_carve), and do some local experiments.
>
> Based on the experiment results, I would go with reporting the carves (as the title says), as it is almost there, only 2 tests need to be modified to know about the new report type.
>
> Unfortunately the suffix change has bigger problems:
>
> extract root is by default the current directory, and it is the carve/extract directory (which was the same) that is checked if exists (do we need --force to delete it?). This somewhat worked so far, and we could find a firmware.bin_extract directory at the current directory when extracting firmware.bin even in a non-empty current directory. Now this would change, as we would either find firmware.bin_extract or firmware.bin_carves (_carves used for suffix) depending on unblob's recognition of the file format, and existence of extra padding. This I think is confusing. It would be cleaner to require a non-existing root directory, but that is not backward compatible.

It's confusing if we don't explain anything to the end user. I would be okay with this state of things if the non-verbose output would display an "output directory" line with the absolute path to where we put the extracted/carved content. We can also provide context in the interactive help and the online documentation.

> currently extraction and carving shares the same _extract suffix, which somewhat makes things easy, but when these carve and extract directories are derived differently, the internal variable naming becomes a total mess (one might argue, that the mess exists now) E.g. this line needs some work to get right: self.carve_dir = config.get_extract_dir_for(self.task.path)

I would definitely argue that the mess exists now. We could move to a more generic naming like get_output_dir_for if it helps.
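A generically named helper along these lines might look like the sketch below. It is written as a free function with an explicit `suffix` parameter purely for illustration; the signature and the relative-path handling are assumptions, not the existing `get_extract_dir_for` implementation.

```python
from pathlib import Path


def get_output_dir_for(extract_root: Path, path: Path, suffix: str) -> Path:
    """Return the output directory for `path` under `extract_root`.

    The same helper serves both carving and extraction; only the suffix
    differs, so callers no longer need carve- or extract-specific names.
    """
    try:
        # Files already inside the root keep their relative location.
        relative = path.relative_to(extract_root)
    except ValueError:
        # Anything else (e.g. the original input file) lands directly in the root.
        relative = Path(path.name)
    return extract_root / relative.with_name(path.name + suffix)
```

A call site could then read `get_output_dir_for(root, task_path, carve_suffix)` for carving and pass the extract suffix for extraction, which keeps the variable names honest about which directory they hold.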

e3krisztian · 2024-11-22T15:43:06Z

unblob/report.py

+            carved_to=carved_to,
+            start_offset=chunk.start_offset,
+            end_offset=chunk.end_offset,
+            handler_name=chunk.handler.NAME,


I had another look at this implementation and its report-output, and I do not see how duplicating the chunk attributes is useful.

This is also storing the path to the carved file, while what was missing is some record for the parent directory, which is created to hold the carved files and their extractions.

Maybe I am missing some use case for the stored attributes, but I am inclined to drop this CarveReport and instead introduce a CarveDirectoryReport that holds the missing carve directory path.
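To make the contrast concrete, here is a rough sketch of the two report shapes. The class names carry a `Sketch` suffix because they are illustrative only: plain dataclasses stand in for unblob's real report classes, and the `type` key and `to_report_entry` helper are hypothetical.

```python
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class CarveReportSketch:
    # Fields mirror the diff above; all but carved_to repeat chunk attributes.
    carved_to: Path
    start_offset: int
    end_offset: int
    handler_name: str


@dataclass(frozen=True)
class CarveDirectoryReportSketch:
    # Records only the directory created to hold carved files and their extractions.
    carve_dir: Path


def to_report_entry(report) -> dict:
    """Flatten a report into a JSON-friendly dict (paths become strings)."""
    entry = {"type": type(report).__name__}
    for key, value in asdict(report).items():
        entry[key] = str(value) if isinstance(value, Path) else value
    return entry
```

One directory-level entry per carved file set would be enough to locate every directory unblob created, without repeating offsets and handler names that the chunk reports already carry.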

e3krisztian · 2024-11-26T21:58:26Z

@AndrewFasano thank you for the initial implementation.
Closing in favor of #1017, which still has one of your commits unchanged (8024aed).
I hope that implementation also works for you; please comment there if not.

AndrewFasano mentioned this pull request Jul 2, 2024

Some directories are not reported #554

Open

e3krisztian self-requested a review July 3, 2024 21:08

AndrewFasano force-pushed the report_carve branch from cdcbef8 to 8c88d16 Compare July 8, 2024 13:18

Andrew Fasano added 2 commits July 8, 2024 15:26

feat(report): Report carved files

81cfd17

feat(processing): Support configurable carve suffix

da9138a

AndrewFasano force-pushed the report_carve branch from 8c88d16 to da9138a Compare July 8, 2024 19:27

e3krisztian reviewed Nov 22, 2024

e3krisztian mentioned this pull request Nov 26, 2024

Report carve dir #1017

Open

e3krisztian closed this Nov 26, 2024
