Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dotnet: add support for basic blocks #1326

Closed
wants to merge 5 commits into from
Closed

Conversation

mike-hunhoff
Copy link
Collaborator

Add support to calculate IL basic blocks, enabling basic block scope and loop detection for .NET files.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed

@github-actions github-actions bot dismissed their stale review February 24, 2023 21:48

CHANGELOG updated or no update needed, thanks! 😄

@mike-hunhoff
Copy link
Collaborator Author

@mr-tz / @williballenthin before I dig in here any thoughts as to why tests would only be failing on Python 3.7?

@mr-tz
Copy link
Collaborator

mr-tz commented Feb 27, 2023

Nope, but looks like something with pydantic...

@ghost
Copy link

ghost commented Feb 27, 2023

https://github.com/mandiant/capa/blob/master/capa/features/freeze/__init__.py#L49

this line recently changed but i have no idea why it would fail on 3.7. i'll try to reproduce locally in a 3.7 virtualenv and see whats going on.

@mike-hunhoff
Copy link
Collaborator Author

Most recent push improved how we emit basic block matches. Below we see basic block matches where the method token + offset is used for both the matched basic block and matched instruction in the basic block. This allows a user to quickly identify the exact basic block and instruction matched in a given method.
Screen Shot 2023-02-27 at 12 08 13 PM

@mike-hunhoff
Copy link
Collaborator Author

@williballenthin looks like tests are passing after my latest commit.....very weird the tests were only failing on Python 3.7. I noticed there was a bug manually which led to the fix but I'd rather this have been picked up by our testing suite.

@ghost
Copy link

ghost commented Feb 27, 2023 via email

@ghost
Copy link

ghost commented Feb 27, 2023 via email

Copy link
Collaborator

@mr-tz mr-tz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. A few comments and maybe a larger question if this should maybe life in dncil or dnfile?

capa/features/extractors/dnfile/extractor.py Show resolved Hide resolved
Comment on lines +204 to +205
bb_next: DnBasicBlock = fh.ctx["blocks"][idx + 1]
bb.succs.append(bb_next)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this always hold just based on the next index?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is my thought process (please check my sanity 🙃):

  • We process instructions sequentially resulting in leaders getting processed sequentially
  • Each time we encounter a leader we create a new basic block and add the block to our list resulting in basic blocks getting processed sequentially
  • Therefore, if a basic block has a fallthrough successor we can assume the successor is the next basic block in our list

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense and should go in there as a comment.

What about BBs that do not return (IDA calls them fcb_noret and fcb_enoret)? They occur for example on call abort, RtlFailFast, or potentially on exception handling.

Another corner case from IDA at the call 0x100304ae:
2023-03-01_09-42-28_ida

ID: 8, Start: 0x1003045e, Last instruction: 0x10030465, Size: 12, Type: fcb_normal
  -> ID: 16, Start: 0x100304ae, Last instruction: 0x100304b7, Size: 10, Type: fcb_normal

So we should handle call and jmp at the end of BBs. And maybe exception handling could throw us of?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree the above should be in a comment, just for clarity.

i'm not sure whether or not .NET supports non-returning functions. its possibly that the runtime still requires the ret even if its dead code. we should try to figure this out and definitely document the results here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other assumptions we should probably document (and please verify):

  • .NET doesn't support tail calls, i.e. jmp to other routine
  • .NET doesn't support shared function chunks, i.e. two different entry points to the same function body

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and, right now i see that we're not doing anything with exceptions. i'm not sure how we'd want to represent this in the CFG. there would be a couple things to consider:

  • is there a separate graph for each exception handler? how to refer to them?
  • how to link the try/catch region with the exception handler graph?
  • the end of a try/catch region probably splits a basic block

we may want to figure out how we need to operate on exception handlers before finalizing the decisions here. so maybe this is not a blocker, but something we add once we know we need it.

capa/features/extractors/dnfile/extractor.py Outdated Show resolved Hide resolved
@mike-hunhoff
Copy link
Collaborator Author

I like the rendering in the screenshot you shared - lots of detail and the important stuff is highlighted. Do you think we should also try to include the function name, when possible? Using tokens is very precise and I don't think this should go away; however, I wonder if including the function name might give more context (especially when the names aren't obfuscated).

I'm honestly torn here. I think displaying un-obfuscated method names is valuable, however, my work has seen few samples where this is the case. More frequently, I see large, obfuscated names that would be a pain to handle (display) correctly.

We could take a route simliar to de4dot by attempting to identify and rename obfuscated method names but I see that potentially confusing users who pivot from capa to a tool like dnSpy.

Reading your original message I see you say "when possible" which could be a valid path forward, e.g., we only display method names when we can identify the name has not been obfuscated at all (or past an established threshold).

@mike-hunhoff
Copy link
Collaborator Author

mike-hunhoff commented Feb 28, 2023

Nice work. A few comments and maybe a larger question if this should maybe life in dncil or dnfile?

@mr-tz This came to mind about halfway through writing the code haha. I think the basic block logic likely belongs in dncil. Once we've addressed any issues here I'll move the basic block logic to dncil and update this PR to use the updated dncil APIs.

@mike-hunhoff
Copy link
Collaborator Author

Nice work. A few comments and maybe a larger question if this should maybe life in dncil or dnfile?

@mr-tz This came to mind about halfway through writing the code haha. I think the basic block logic likely belongs in dncil. Once we've addressed any issues here I'll move the basic block logic to dncil and update this PR to use the updated dncil APIs.

@mr-tz @williballenthin see mandiant/dncil#55

Comment on lines +148 to +150
if idx == 0:
# add #1
leaders.add(insn.offset)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could pull this out of the loop so its not re-checked on each iteration

Comment on lines +147 to +150
for idx, insn in enumerate(fh.inner.instructions):
if idx == 0:
# add #1
leaders.add(insn.offset)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for idx, insn in enumerate(fh.inner.instructions):
if idx == 0:
# add #1
leaders.add(insn.offset)
leaders.add(fh.inner.instructions[0])
for idx, insn in enumerate(fh.inner.instructions):

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assuming there's at least one instruction for each function

leaders.add(insn.offset)

if any((insn.is_br(), insn.is_cond_br(), insn.is_leave())):
# add #2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# add #2
# add #2: targets of branches are leaders

if any((insn.is_br(), insn.is_cond_br(), insn.is_leave())):
# add #2
leaders.add(insn.operand)
# add #3
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# add #3
# add #3: fallthrough instructions after conditional branches are leaders

# calculate basic block leaders where,
# 1. The first instruction of the intermediate code is a leader
# 2. Instructions that are targets of unconditional or conditional jump/goto statements are leaders
# 3. Instructions that immediately follow unconditional or conditional jump/goto statements are considered leaders
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#3: ... immediately follow unconditional ...

can you convince me of this? i think the instruction must be the target of a branch in order to be a leader. otherwise, its unreferenced code.

Comment on lines +168 to +172
fh.ctx["blocks"].append(bb_curr)
continue

assert bb_curr is not None
bb_curr.instructions.append(insn)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fh.ctx["blocks"].append(bb_curr)
continue
assert bb_curr is not None
bb_curr.instructions.append(insn)
fh.ctx["blocks"].append(bb_curr)
else:
assert bb_curr is not None
bb_curr.instructions.append(insn)

optional style to clarify one of two cases happens here

Comment on lines +177 to +179
if len(bb.instructions) == 0:
# TODO: consider error?
continue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if len(bb.instructions) == 0:
# TODO: consider error?
continue
assert len(bb.instructions) > 0

i dont see how this can ever not be the case, given that the bb is initialized with at least one element above. so, we could keep this assertion around, or also remove it.

Comment on lines +184 to +186
if len(bb.instructions) == 0:
# TODO: consider error?
continue
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if len(bb.instructions) == 0:
# TODO: consider error?
continue
assert len(bb.instructions) > 0

ditto

Comment on lines +203 to +208
try:
bb_next: DnBasicBlock = fh.ctx["blocks"][idx + 1]
bb.succs.append(bb_next)
bb_next.preds.append(bb)
except IndexError:
continue
Copy link
Collaborator

@williballenthin williballenthin Mar 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
try:
bb_next: DnBasicBlock = fh.ctx["blocks"][idx + 1]
bb.succs.append(bb_next)
bb_next.preds.append(bb)
except IndexError:
continue
try:
bb_next: DnBasicBlock = fh.ctx["blocks"][idx + 1]
except IndexError:
continue
else:
bb.succs.append(bb_next)
bb_next.preds.append(bb)

make region that can throw and handle the exception as small as possible for clarity

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the IndexError a programming error? i suppose we'd hit that case at ret and/or any other instruction at end of function body. can we codify this in any way? not a priority.

@williballenthin
Copy link
Collaborator

I'm honestly torn here. I think displaying un-obfuscated method names is valuable, however, my work has seen few samples where this is the case. More frequently, I see large, obfuscated names that would be a pain to handle (display) correctly.

We could take a route simliar to de4dot by attempting to identify and rename obfuscated method names but I see that potentially confusing users who pivot from capa to a tool like dnSpy.

Reading your original message I see you say "when possible" which could be a valid path forward, e.g., we only display method names when we can identify the name has not been obfuscated at all (or past an established threshold).

I agree with everything you say here. I'd only recommend including the name when its useful and helpful. I do not think we should do any renaming or show any obfuscated names. So, if its reasonably easy to determine if a name is non-obfuscated, then I think we should maybe do that.

How to determine if non-obfuscated? I guess ASCII only, something around entropy, not mixing casing or numbers too weirdly, ...?

We should probably move this into a separate feature request.

@mike-hunhoff mike-hunhoff marked this pull request as draft March 3, 2023 19:26
@mr-tz mr-tz added the dont merge Indicate a PR that is still being worked on label Oct 12, 2023
@mr-tz
Copy link
Collaborator

mr-tz commented Mar 22, 2024

@mike-hunhoff, do we want to keep this around here or close it?

@mike-hunhoff
Copy link
Collaborator Author

@mike-hunhoff, do we want to keep this around here or close it?

We can close - I'll link the corresponding dncil PR back to this for documentation purposes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dont merge Indicate a PR that is still being worked on
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants