-
Notifications
You must be signed in to change notification settings - Fork 225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SharedCache] Use m_exportInfos
as an export list cache
#6197
base: dev
Are you sure you want to change the base?
Conversation
`SharedCache::ParseExportTrie` is getting called a lot during DSC library loading and analysis. In large part due to the hot path `SharedCache::FindSymbolAtAddrAndApplyToAddr`. Its unnecessary for it to be being called more than once per DSC header as the export list symbol information is stored in `SharedCache::m_exportInfos`. This commit adds the function `SharedCache::GetExportListForHeader`, which will either return the header's list of symbol information cached in `SharedCache::m_exportInfos` or call `SharedCache::ParseExportTrie` and cache the results in `SharedCache::m_exportInfos`. This should also improve the execution time of `SharedCache::LoadAllSymbolsAndWait`. Further improvement here would be to add locking to `SharedCache::GetExportListForHeader` so that races don't result in redundant parsing of the export trie for the same header if multiple threads call `SharedCache::GetExportListForHeader` at the same time for the same header. This only really matters during initial loading because from what I can tell that parses all the export trie's anyway.
return std::vector<std::pair<uint64_t, std::pair<BNSymbolType, std::string>>>(); | ||
|
||
auto exportList = SharedCache::ParseExportTrie(linkeditFile, header); | ||
std::vector<std::pair<uint64_t, std::pair<BNSymbolType, std::string>>> exportMapping(exportList.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is now the only caller of ParseExportTrie
, it'd be beneficial to have it produce data in the format that this expects. This will eliminate the need for ParseExportTrie
/ ReadExportNode
to allocate a bunch of Symbol
instances only for this code to copy data out of them and then discard them. All those memory allocations and copying all of the names around is surprisingly expensive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that was lazy of me, I'll add another commit to do that
@@ -2779,6 +2779,31 @@ std::vector<Ref<Symbol>> SharedCache::ParseExportTrie(std::shared_ptr<MMappedFil | |||
return symbols; | |||
} | |||
|
|||
|
|||
std::vector<std::pair<uint64_t, std::pair<BNSymbolType, std::string>>> SharedCache::GetExportListForHeader(SharedCacheMachOHeader header, std::function<std::shared_ptr<MMappedFileAccessor>()> provideLinkeditFile) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This returns a copy of this chonky vector each time it is called. If the thread-safety story permits it, it would be better to return a const reference instead.
It might be worth replacing this whacky nested std::pair
with a struct with named fields that describe what they contain. That'd clear up a lot of the confusing pair.first
/ pair.second.first
stuff. There's an ExportNode
structure declared in this file that looks to be intended for this purpose, but is currently never used for anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be possible as once a vector is created it shouldn't be reallocated by m_exportInfos
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thread safety problems arise when m_exportInfos
is modified. If the hash table needs to grow it will reallocate and moves its contents. This would invalidate any existing references to data it stores.
One solution that doesn't require any additional locking would be to add a layer of indirection: have m_exportInfos
store a std::unique_ptr
to the vector. Then the vector itself will never move, only the pointer that owns it. You could then safely return a reference to the pointed-at vector from this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure using std::unique_ptr
is pratical here because m_exportInfos
in ViewStateCacheStore
causes compiler errors due to copying mechanics when m_exportInfos
has a unique_ptr
type in its template. shared_ptr
works fine and I don't see much of an issue in using that instead.
auto exportList = GetExportListForHeader(*header, [&]() { | ||
try { | ||
auto mapping = MMappedFileAccessor::Open(m_dscView, m_dscView->GetFile()->GetSessionId(), header->exportTriePath)->lock(); | ||
doSave = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the only difference between the various "open a file" lambdas passed to GetExportListForHeader
? If so you could simplify these call sites by having GetExportListForHeader
take bool* didModifyExportList = nullptr
. Callers that are interested in knowing that m_exportInfos
was modified could pass a pointer to bool in. Others would omit it. This would let you move the file opening logic into GetExportListForHeader
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have done this but the call in SharedCache::InitializeHeader
has a different way of loading a mapped file.
bool* didModifyExportList = nullptr
- I didn't do this just cause it seemed this is the only instance where it mattered so it seemed 50/50 whether the extra parameter made sense or what I did. For future stuff though it might make more sense to have the parameter so its clear to a caller that they need to do a save if they didn't know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I missed that SharedCache::InitializeHeader
is doing something different. Bummer!
…ck if `m_exportInfos` was modified This probably makes more sense than the current solution of using execution of the callback parameter to determine if `m_exportInfos` was modified.
This commit changes 2 things; 1. `m_exportInfos` is now a map where its values are also a map rather than a vector of pairs. The reason for this is that `SharedCache::FindSymbolAtAddrAndApplyToAddr` is a hot path which does by far the most accesses to `m_exportInfos`. In that function it must find the correct symbol for a given address so a map lookup will be much quicker than iterating a vector. The other use cases of `m_exportInfos` would prefer a vector but they are executed very infrequently. 2. The symbols are stored in `m_exportInfos` as references to the `Symbol` type. This makes more sense because otherwise there is a lot of time spent converting to and from a `Symbol` type and a pair of `BNSymbolType` + a `std::string`.
…::FindSymbolAtAddrAndApplyToAddr` if a symbol is found
This avoids expensive copying when returning a value from the map in `SharedCache::GetExportListForHeader`. Additionally it ensures that the value stays alive and at the same location in memory if `m_exportInfos` is modified and requires its storage to be re-allocated. I was unable to use a `unique_ptr` instead of a `shared_ptr` because of copy semantics with `m_exportInfos` in `ViewStateCacheStore`. I don't see things being any worse using `shared_ptr` instead of `unique_ptr` anyway and it means less code changes.
I think these latest commits fix all the issues @bdash mentioned in the comments. Also there maybe some race issues here with accessing |
@@ -2990,7 +2998,7 @@ bool SharedCache::SaveToDSCView() | |||
c.m_dyldDataRegions = m_dyldDataRegions; | |||
c.m_nonImageRegions = m_nonImageRegions; | |||
c.m_baseFilePath = m_baseFilePath; | |||
c.m_exportInfos = m_exportInfos; | |||
//c.m_exportInfos = m_exportInfos; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you commented this out while testing the std::unique_ptr
approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup 🤦
A significant number of symbols are not being defined because there is currently no support for parsing the .symbols cache file. This commit adds that support. `m_symbolInfos` has been modified to be a vector of references to `Symbol`s which is similar to [this PR](Vector35#6197). It makes more sense for the use case where `m_symbolInfos` is used as a symbol cache, otherwise a bunch of time is spent transforming between the old style of `m_symbolInfos` entries to Binary Ninja `Symbol`s. This commit does require a metadata version bump. I felt this was necessary to determine which symbols to load from the symbols cache. The problem is that the `m_images` container does not store the images in the order they are found in the DSC. The index they are at determines the location of their symbols in the symbols cache file. Rather than converting `m_images` to a vector and relying on its ordering being correct, it seemed more prudent to store the index of the image in the `CacheImage` structure. As this is serialized, the metadata version has to be bumped to accomodate the change.
Thank you so much for this PR, this was the intended original functionality of the export trie list in memory! Glad to merge this sometime over the next couple of weeks, may need to be looked at in conjunction with the things listed on #6210. |
Hi! Would you be able to rebase this on the latest commit on dev? |
Yes, I'll sort that out now |
SharedCache::ParseExportTrie
is getting called a lot during DSC library loading and analysis. In large part due to the hot pathSharedCache::FindSymbolAtAddrAndApplyToAddr
. Its unnecessary for it to be being called more than once per DSC header as the export list symbol information is stored inSharedCache::m_exportInfos
.This commit adds the function
SharedCache::GetExportListForHeader
, which will either return the header's list of symbol information cached inSharedCache::m_exportInfos
or callSharedCache::ParseExportTrie
and cache the results inSharedCache::m_exportInfos
.This should also improve the execution time of
SharedCache::LoadAllSymbolsAndWait
.Further improvement here would be to add locking to
SharedCache::GetExportListForHeader
so that races don't result in redundant parsing of the export trie for the same header if multiple threads callSharedCache::GetExportListForHeader
at the same time for the same header. This only really matters during initial loading of the first image because from what I can tell that results in parsing all the export trie's anyway.