TypeLoadException or BadImageFormatException under heavy multi-threaded proxy generation #193

Open
bluerobotch opened this issue Jul 6, 2016 · 153 comments

Comments

@bluerobotch

We use Moq 4.5.10 with Castle.Core 3.3.3 and xUnit 2.1.0 for our unit tests. For xUnit we have the option activated to only have one app domain for all tests. We also enabled parallelization of test execution in xUnit.

Sporadically we get TypeLoadExceptions like the following example:
System.TypeLoadException : Could not load type 'Castle.Proxies.Invocations.IExecutorAccessor_Prepare' from assembly 'DynamicProxyGenAssembly2, Version=0.0.0.0, Culture=neutral, PublicKeyToken=a621a9e7e5c32e69'.
at Castle.Proxies.IExecutorAccessorProxy.Prepare[TExecutionUnit](TExecutionUnit executionUnit) ....

We reviewed the source code of Moq to ensure there is no threading issue with the ProxyGenerator caching. But Moq looks fine.

When we disable xUnit test parallelization, we don't get TypeLoadExceptions anymore. For this reason we think it is a race condition in Castle.Core or even in the .NET Framework.

As far as we see, the missing type is the type that is created for the invocation by InterfaceProxyWithoutTargetContributor.GetInvocationType. The TypeLoadException seems to be raised in the code that is emitted for the invocation in MethodWithInvocationGenerator.BuildProxiedMethodBody. The invocation type is built before it is used in the emitted code. Therefore we do not see a reason why the type cannot be loaded.

@jonorossi
Member

@bluerobotch since you are the only one with a repro, and DynamicProxy has been used in a lot of multi-threaded web apps for many years, this might not be too easy to track down. Could you pull out fuslogvw to see if you can get any more info on the fusion failure? Thanks.

@bluerobotch
Author

@jonorossi thanks for the quick reply. We already tried to get more information by enabling the fusion log, but unfortunately it does not contain any records concerning the above problem. We also enabled the Castle Core log, but it did not provide further information. It looks like a type loading issue and not a problem with locating the assembly. During the investigation we had the TypeLoadException in one of two identical tests (an xUnit theory), where the first one failed and the second one succeeded. The first test (failing) created the proxy type and the invocation type. The second one (succeeding) took both types from the cache.
We tried to reproduce this behavior by generating 1000 unit tests, each containing a theory and using one of 1000 interfaces to be proxied by Moq. We then ran these 1000 tests in parallel, but we have not gotten a TypeLoadException so far.
Do you have any idea in which direction we should investigate further? Is there a possibility to get (paid) help from the Castle project team, for example by reviewing our code during a remote session?

@jonorossi
Member

@bluerobotch could you include the TypeLoadException's full stack trace? Looking through the issue trackers of the 3 main mocking libraries I found devlooped/moq#246. Does that issue sound about the same?

@Schaeri

Schaeri commented Jul 12, 2016

Hello @jonorossi

Thanks for the fast answer and support. I created the issue on Moq. Yes, it is the same problem as described in devlooped/moq#246. @bluerobotch and I work on the same project. Together we have been trying to find the root cause of the issue since February, looking into the code of Castle and Moq, without any luck.

You can use the stack trace from issue devlooped/moq#246. It's never the same test that fails, but it should work as an example trace for the TypeLoadException.

@jonorossi
Member

jonorossi commented Jul 13, 2016

@bluerobotch @Schaeri to narrow things down, could you provide the exact version of Moq you are using so we are looking at a single version of Castle Core? Also, have you got any other use of Castle DynamicProxy via other mocking libraries or something else?

@jonorossi
Member

jonorossi commented Jul 13, 2016

@bluerobotch @Schaeri scratch most of that, I see you've included the Moq version in this issue. The question about other usage is still relevant.

@Schaeri

Schaeri commented Jul 13, 2016

Yes, we have other usage of Castle Core inside our project. We need Castle Core for our WCF infrastructure and to extend our dependency injection framework. This infrastructure is also covered by some unit tests.
The other frameworks and libraries we use don't depend on Castle Core. Moq is the only one.

As a test, I eliminated the Castle Core dependency in our project and ran our unit tests again. So far the error has not occurred, but I have to run the tests overnight to see whether this is really the issue.

What can cause this error when we also use Castle Core inside our project? Or what are we doing wrong when using Castle Core?

@jonorossi
Member

@Schaeri at the moment I have no idea, just trying to guess things that might be different with your project which is somehow unique with no other reports of this problem.

There might be a defect in DynamicProxy when two ProxyGenerators are used at the same time and this is only surfacing because of how they are set up during unit tests.

Let me know the results of that overnight unit test run, and anything else your project is doing that you think might be out of the ordinary (e.g. mocking COM interfaces).

@Schaeri

Schaeri commented Jul 13, 2016

Unfortunately the error still occurs. Currently I have no clue what we are doing that is special enough to provoke such an error. @bluerobotch and I will sit together and rethink the situation. Maybe we will have another idea about what could be special about our project.

@jonorossi
Member

@Schaeri no worries, at least it narrows it down by exclusion. Unfortunately I'm pretty swamped at the moment otherwise I'd have accepted your request for a remote session. I assume since you have been looking at this since February you have a workaround, maybe running your tests without parallelisation?

@Schaeri

Schaeri commented Jul 14, 2016

No, at the moment we don't have a workaround. When we switched from xUnit 1.x to 2.x we did a lot of performance optimizations to run our tests as fast as possible. Executing our tests sequentially would mean adapting our whole project structure to reach the same build and integration times.

But yesterday I was able to reproduce one of the issues we have with Castle in a very simple example. The attached example contains 1001 interfaces and 1001 test classes. Each test class mocks a single interface and sets up the DoSomething method on it. Then it calls the method and makes sure it returns the value that was set up. This is done in a theory with the values true and false. The Run.bat configures the xUnit runner with the parameters we use in our project (run test classes in parallel) and repeats the execution of the tests until they fail.

They will fail with a BadImageFormatException thrown inside Castle.Core (see screenshot). To reproduce the error, the tests must run for 1 to 8 hours, but it happens on every attempt. Run.bat can be started more than once to increase the chance of getting the error faster.

In most cases we have to deal with the TypeLoadException, but the BadImageFormatException also occurs sometimes. I think both exceptions have the same origin.
Maybe, @jonorossi, you can find time to run the example and see the error occur. And thanks a lot for the support, we really appreciate it.

CastleProblem.zip
badimageformatexception

@jonorossi
Member

@Schaeri great to hear you've been able to put together a repro. I've had it running now on my Windows VM for just over 8 hours and it hasn't failed. I'll run another few copies to try to push it to fail. What is your machine and build server setup, i.e. how many CPUs do they have, and are they virtualised?

@jonorossi
Member

@Schaeri could you also let me know what version of the .NET Framework you are running. Follow this page to find out exactly. My version is 4.6.1 for non-Win10 (394271).

@jonorossi
Member

@Schaeri just got a BadImageFormatException! Different message which definitely indicates something screwy.

System.BadImageFormatException : Could not load file or assembly 'DynamicProxyGenAssembly2, Version=0.0.0.0, Culture=neutral, PublicKeyToken=a621a9e7e5c32e69' or one of its dependencies. Index not found. (Exception from HRESULT: 0x80131124)

@Schaeri

Schaeri commented Jul 15, 2016

Currently we run our tests in a loop on our developer notebooks to analyze the issue and to increase the chance of getting the error. They have an Intel i7 with 4 cores and hyper-threading. The OS is 64-bit Windows 7 and the .NET Framework is 4.6.1 (394271). The errors on our developer notebooks and build servers seem to be the same, although our build agents run on VMware vSphere 5.5 on Intel Xeon CPUs. The framework version on our build agents is the same.

I have run the sample many times (at least 10) and every time I get the exception shown in the screenshot. But on our build server I have also seen the error System.BadImageFormatException : Could not load file or assembly. The problem seems to show up in different forms.

@jonorossi changed the title from "TypeLoadException when intercepting a member with InterfaceProxyWithoutTarget" to "TypeLoadException or BadImageFormatException under heavy multi-threaded proxy generation" on Jul 22, 2016

@Schaeri

Schaeri commented Jul 29, 2016

@jonorossi we saw that you have labeled our issue as a bug. Can we somehow contribute to help resolve the bug? And again, we appreciate your help and fast response.

@jonorossi
Member

@Schaeri I haven't actually confirmed it is a bug in DP rather than the .NET runtime (i.e. I still don't know what causes it); however, I could reproduce it. The next week is going to be busy for me, so I won't be able to look into it. We obviously need to get to the bottom of the problem before we can fix it.

@ghost

ghost commented Mar 29, 2017

Guys, just an update on what I found when I tried to parallelise the tests for Castle Core on win10-x64 (host install) with multi-framework targeting.

The ModuleScope class was kicking out errors because of I/O collisions between threads when saving strong-named assemblies. The reason is that it uses fixed file names.

I hacked this locally to use dynamic assembly naming. After that, multi-threaded proxy generation worked and the tests started passing.

The trade-off, unfortunately, is that you cannot easily leverage strong-named friend assemblies via InternalsVisibleTo attributes.

Unless this has been solved, you might want to create a test that batters ModuleScope in a multi-threaded context by calling SaveAssembly(true). My 2 cents anyway...

@Schaeri

Schaeri commented Apr 4, 2017

Hello @Fir3pho3nixx. Can you attach a patch for your ModuleScope class fix? I would like to check whether it solves the problem we have. Thanks for the help.

@ghost

ghost commented Apr 6, 2017

Here is my monkey patch for this:

*** Edited/Removed for TL;DR ***

@Schaeri

Schaeri commented Apr 14, 2017

Thanks for the patch. I applied your changes and built Castle.Core and Moq with the modifications. I included the updated package in my previously posted CastleProblem project and have now tested the change for over a week. Sadly, I still get the TypeLoadExceptions from time to time. The problem seems to occur just as often with the modification as without it. Any other idea what could cause the problem? Thanks for the support.

@ghost

ghost commented Apr 15, 2017

Will download Castle Problem zip file and start digging around to see if I can replicate this.

@Schaeri

Schaeri commented Apr 15, 2017

Thanks for the support. The error is very rare. The tests must run for several hours before it occurs. If you need any help, let me know.

@ghost

ghost commented May 9, 2017

Just an update on this issue, I raised a new one here: #253

They could be related or not. Would like to check this out first before I come back to this. Will let you know what the outcome is.

@Schaeri

Schaeri commented May 11, 2017

Thanks for the support. Yes, let me know what the outcome is, or when you have a patch/pre-release for further testing.

@ghost

ghost commented Jun 29, 2017

@Schaeri - We just fixed #277, are you guys using any modopt/modreq modifiers?

@BrunoJuchli
Contributor

https://github.com/BrunoJuchli/Core/tree/ModulePerProxy has run for 96 hours (4 days) straight without a single failure.
I had to switch off the machine for the weekend.


@stakx
If at any time you feel the urge to kill boredom: feel free to add generation of something like IInvocation and the newobj IL to https://github.com/BrunoJuchli/CastleCore193Repro/tree/DropDynamicProxy ;-)
(I didn't have time to do so, so far :-( )

@stakx
Member

stakx commented Jun 29, 2018

@BrunoJuchli - Interesting news! It appears that after many months and some wrong turns, we're getting somewhere now, even though most of that time is now spent waiting for several days. 😄

I might be able to whip up a simple implementation of IInvocation & IInterceptor. I'd however put a few constraints on it to keep it simple:

  • Only a single interceptor allowed.
  • No by-ref or by-pointer parameters allowed.

I'll send a PR your way if I get around to it.

@stakx
Member

stakx commented Jul 2, 2018

@BrunoJuchli - didn't send a PR, but I wrote this program (Gist) which mimics DynamicProxy's interceptors and invocations. I've run 4 instances of that program for 6 hours, but that wasn't enough to repro the issue. Maybe you can still use parts of it.

@TimLovellSmith
Contributor

@stakx I looked at your program and, noticing that it does some parallelism, I somehow still ended up thinking 'how is it going to repro the issue if it isn't doing locking?'

The reason it's not doing locking, AFAICS, is that it assumes it is safe not to, based on a) the order in which the types are generated and b) the fact that there aren't going to be multiple threads which can try to write to the same type.
We don't make such assumptions in Castle because in general it's hard to know whether there will be multiple threads that write to the same type...

So the net result: I haven't logically figured out why I am thinking this thought, but it still nags at me. Could it fail to repro because what it is doing is somehow too linearized compared to the other repro we have?

@TimLovellSmith
Contributor

Or is it just because they are intercepts of interface calls..?

@TimLovellSmith
Contributor

Ideas for mixing it up:

  • Do the Activator.CreateInstance call on each type 'as soon as it is generated'?
  • Try a virtual method of a base class?
  • Do generation of types from multiple threads, but with locking, as the current way is still single-threaded, in case there is e.g. some thread-local state which causes the bug?

@stakx
Member

stakx commented Jul 5, 2018

@TimLovellSmith: That Gist was simply trying to repro the issue under the assumption that neither Activator.CreateInstance nor the IL newobj instruction are as thread-safe as they should be. Hence no locks for the read parties, and the single write party doesn't need locking because it's already generating types serially.

Your first post seems somewhat contradictory: "too linearized" is what you'd get with locking, yet there's none of that, as you noticed, so "too linearized" doesn't appear to be a likely reason for why the repro fails to fail. (Unless I've made a mistake with the use of the TPL.)

It was a quick try, but since it wasn't successful, I'm not going to invest much more time there (but feel free to run further experiments, we need all the help we can get here!). Chances of a successful minimal repro are perhaps higher if we go back to DynamicProxy and start taking it apart as much as we can... DynamicProxy does a lot more stuff than my repro, so who knows which of these many differences is responsible for the errors!?

Btw., if you want to run further experiments based on my Gist, I suspect it would be important to let it involve types that force Ref.Emit to rewrite e.g. method signatures (IL metadata tokens referring to types are always module-specific so they cannot be reused in a different module), my current suspicion is that this is one place where things might be going wrong, and my repro likely doesn't cause any signature rewriting. A start would be to have the 1,000 interface types in the repro code from which the generated proxy types in the dynamic module inherit.
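
For readers who do not have the Gist at hand, the shape described above (one thread generating types serially, several threads instantiating them without locks) can be sketched roughly as follows. This is not the Gist itself. `ReproSketch`, `IMarker`, and the assembly and type names are hypothetical, and the sketch uses a single interface where the comment above suggests 1,000 of them. It only illustrates the structure of such an experiment and is not known to reproduce the exceptions.

```csharp
using System;
using System.Collections.Concurrent;
using System.Reflection;
using System.Reflection.Emit;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical interface living outside the dynamic module, so that the
// generated types have to reference a type from another module.
public interface IMarker { }

public static class ReproSketch
{
    // Generates typeCount types on one writer thread and instantiates them
    // concurrently on reader threads. Returns the number of instances created.
    public static int Run(int typeCount)
    {
        var assembly = AssemblyBuilder.DefineDynamicAssembly(
            new AssemblyName("ReproSketchAssembly"), AssemblyBuilderAccess.Run);
        var module = assembly.DefineDynamicModule("ReproSketchModule");
        var generated = new BlockingCollection<Type>();

        // Single writer: types are defined and baked strictly one after another,
        // so the writer itself needs no lock.
        var writer = Task.Run(() =>
        {
            for (int i = 0; i < typeCount; i++)
            {
                var builder = module.DefineType(
                    "Generated" + i, TypeAttributes.Public | TypeAttributes.Class);
                builder.AddInterfaceImplementation(typeof(IMarker));
                builder.DefineDefaultConstructor(MethodAttributes.Public);
                generated.Add(builder.CreateTypeInfo().AsType());
            }
            generated.CompleteAdding();
        });

        // Multiple readers: instantiate finished types while the writer is still
        // emitting new ones into the same module, without any locking.
        int created = 0;
        Parallel.ForEach(generated.GetConsumingEnumerable(), type =>
        {
            var instance = (IMarker)Activator.CreateInstance(type);
            if (instance != null) Interlocked.Increment(ref created);
        });

        writer.Wait();
        return created;
    }
}
```

Calling `ReproSketch.Run(1000)` in a loop from several processes would mirror how the repro programs in this thread were run. Going by the reports above, it may well take many hours to fail, or never fail at all.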

@iberodev

iberodev commented May 31, 2019

It seems the issue I am experiencing with integration tests running in parallel is related to this
https://stackoverflow.com/questions/56254916/could-not-load-type-castle-proxies-ireadinessproxy-when-running-xunit-integratio

@BrunoJuchli
Contributor

@iberodev Are you experiencing it sporadically, or can it be reproduced every time when running tests in parallel?
Which .NET Framework / .NET Core version are you on?

@iberodev

iberodev commented May 31, 2019

@BrunoJuchli I experience this sporadically, unfortunately. I suspect it has something to do with the assembly loading and Autofac. Funnily enough, when I experience it, if I restart my PC often the problem goes away (like right now.. of course! :/ )
I use the dotnet framework 2.2.101
My class libraries are netstandard2.0 and the xUnit projects are netcore 2.2
I have added to the stackoverflow question an UPDATE 3 with a link to the repository where sometimes I reproduce this error.

@BrunoJuchli
Contributor

I ran https://github.com/stakx/CastleCore193Repro/tree/master/NetCore/NetCoreRepro again yesterday, after switching to .NET Core runtime 2.2.5. After 6 hours, 2 out of 32 processes had failed with a TypeLoadException for type '' (i.e. the name was an empty string).
I have wanted to test this with .NET Framework 4.8 for quite some time now, but we haven't updated yet and I haven't gotten around to setting up a VM to test it.

@BrunoJuchli
Contributor

BrunoJuchli commented Jun 18, 2019

So, I've tested .NET 4.8 with Castle.Core 4.4: 32 processes, on my machine with 32 logical cores, for 14 hours. 5 of them failed.

I've got the following exceptions:

BadImageFormatException: Index not found. (Exception from HRESULT: 0x80131124)
at Castle.Proxies.Derived195Proxy..ctor(IInterceptor[] )
--- End of inner exception stack trace ---
at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor)
at System.Reflection.RuntimeConstructorInfo.Invoke(BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
at System.RuntimeType.CreateInstanceImpl(BindingFlags bindingAttr, Binder binder, Object[] args, CultureInfo culture, Object[] activationAttributes, StackCrawlMark& stackMark)
at System.Activator.CreateInstance(Type type, BindingFlags bindingAttr, Binder binder, Object[] args, CultureInfo culture, Object[] activationAttributes)
at System.Activator.CreateInstance(Type type, Object[] args)
at Castle.DynamicProxy.ProxyGenerator.CreateClassProxyInstance(Type proxyType, List`1 proxyArguments, Type classToProxy, Object[] constructorArguments)
at Castle.DynamicProxy.ProxyGenerator.CreateClassProxy(Type classToProxy, IInterceptor[] interceptors)


BadImageFormatException: The signature is incorrect.
at Castle.Proxies.Derived265Proxy..ctor(IInterceptor[] )
--- End of inner exception stack trace ---
at System.RuntimeMethodHandle.InvokeMethod(Object target, Object[] arguments, Signature sig, Boolean constructor)
at System.Reflection.RuntimeConstructorInfo.Invoke(BindingFlags invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
at System.RuntimeType.CreateInstanceImpl(BindingFlags bindingAttr, Binder binder, Object[] args, CultureInfo culture, Object[] activationAttributes, StackCrawlMark& stackMark)
at System.Activator.CreateInstance(Type type, BindingFlags bindingAttr, Binder binder, Object[] args, CultureInfo culture, Object[] activationAttributes)
at System.Activator.CreateInstance(Type type, Object[] args)
at Castle.DynamicProxy.ProxyGenerator.CreateClassProxyInstance(Type proxyType, List`1 proxyArguments, Type classToProxy, Object[] constructorArguments)
at Castle.DynamicProxy.ProxyGenerator.CreateClassProxy(Type classToProxy, IInterceptor[] interceptors)
at Program.<>c__DisplayClass0_0.b__1(Type typeToProxy) in C:\work\CastleCore193Repro\Net48\NetCoreRepro\Program.cs:line 20


I couldn't debug them with VS2019; it hung while loading...

@jcageman

jcageman commented Dec 30, 2022

I have a very similar issue when calling the following code in one of the tests:

AppDomain.CurrentDomain.GetAssemblies()
    .SelectMany(assembly => assembly.GetTypes())
    .ToList();

The error I am getting is always about a mocked interface that was used in one of the tests run before the above code:
Could not load type 'Castle.Proxies.IMyClientProxy' from assembly 'DynamicProxyGenAssembly2, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null'.

It happens somewhat randomly as well, and in my case it also happens when running non-parallel. As mentioned, it seems that some "mocked" tests in the same project must run before this code is executed to make it happen. My guess is that a previous test created some "proxy" types and as soon as that test class is finished they are "unloaded" again, which makes them unavailable for a later test.

I am using xunit 2.4.1 and Moq 4.17.2 (which uses castle 5.0.0).
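
If the scan above does not actually need the generated proxy types, one possible way to sidestep the error (it does not fix the underlying problem) is to skip dynamic assemblies and tolerate partially loadable ones. The following is a minimal sketch under that assumption; `TypeScan` and `GetLoadableTypes` are hypothetical names. Since DynamicProxy emits its proxies into an in-memory dynamic assembly, the `IsDynamic` filter should exclude `DynamicProxyGenAssembly2`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

public static class TypeScan
{
    // Enumerates types from all statically loaded assemblies in the current
    // AppDomain, skipping assemblies created via Reflection.Emit.
    public static List<Type> GetLoadableTypes()
    {
        return AppDomain.CurrentDomain.GetAssemblies()
            .Where(assembly => !assembly.IsDynamic)
            .SelectMany(assembly => GetTypesSafely(assembly))
            .ToList();
    }

    private static IEnumerable<Type> GetTypesSafely(Assembly assembly)
    {
        try
        {
            return assembly.GetTypes();
        }
        catch (ReflectionTypeLoadException ex)
        {
            // Types that failed to load are reported as null entries;
            // keep only the ones that did load.
            return ex.Types.Where(type => type != null);
        }
    }
}
```

Whether this is acceptable depends on what the scan is for. If the test is meant to inspect every type, including proxies, this filter would hide exactly the failure discussed in this thread.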

@stakx
Member

stakx commented Dec 30, 2022

Thanks for chiming in!

> My guess is that a previous test created some "proxy" types and as soon as that test class is finished they are "unloaded" again, which makes them unavailable for a later test.

This seems unlikely. The .NET runtime cannot unload single types. The .NET Framework can unload AppDomains, and .NET Core recently gained the ability to unload collectible assemblies. But IIRC, neither AppDomains nor collectible assemblies are involved in the original scenarios discussed here.

@maxcherednik

Having the same issue here: FakeItEasy/FakeItEasy#1910 with FakeItEasy 7.3.1 and Castle.Core 4.3.1.

@maxcherednik

I was exploring the codebases of FakeItEasy and Castle DynamicProxy.
I was searching for some kind of shared state where the type could get confused somehow, but did not find anything specific.

I have it failing a couple of times per day.
The set of failing tests is roughly the same each time, which is good: I can add some extra logging to those tests.
I was thinking of reflecting a bit on the generated type to see if there is anything strange about it.

Any ideas what I should add to those logs?

@stakx
Member

stakx commented Jan 31, 2023

@maxcherednik, if you read the whole thread above, you'll find that we last suspected a bug in the runtime. To diagnose this problem further, one would presumably have to set up the runtime for debugging (being able to get full stack traces and step into its source, etc.). I didn't manage to get a stable, reliable CoreCLR dev environment at the time; perhaps someone else is more lucky. I can't really give any more precise advice without resuming that work myself, unfortunately.

@maxcherednik

maxcherednik commented Feb 1, 2023

@stakx sorry for the stupid questions. I have read the whole thread - I see you guys having fun here and I am totally late to the party.

From the history I see:

  1. That there was no reliable reproduction
  2. We already suspect the CLR itself - but this is just a theory and no one managed to confirm it or narrow down the search area.

Since I am new to the party, I am trying to double-check and validate certain ideas.
I am investigating it from the FakeItEasy side, so that I can work out whether the bug is in FakeItEasy, in Castle, or in the CLR, or whether it is a completely different issue.

I have added some logs to the failing tests so that we could inspect the generated types. I see some strange behavior which might give us a lead.

FakeItEasy/FakeItEasy#1910 (comment)

@stakx
Member

stakx commented Feb 1, 2023

@maxcherednik, a few points in random order:

  • I'm noticing that the issue you're referring to also mentions generic type arguments, one having a name that appears to come out of nowhere (TEvent). This reminds me of another recent issue filed here, Random VerificationException/TypeLoadException #648.

  • You mentioned that your VerificationException started appearing only after you upgraded from .NET 4 to 6. Two arguments could be made that are consistent with what we've established above (IIRC): (a) The update could mean that the error indeed lies in the runtime, because (if) everything else stayed the same. (DynamicProxy doesn't have a lot of conditional compilation left in its source code, it's mostly the same code regardless of the targeted platform.) (b) The error started surfacing because .NET 6 is a lot faster than .NET 4, and the error typically takes a while to surface.

  • The problem is much more likely to happen on a multi-core machine, when tests / proxy generation happens in multiple concurrent threads. Therefore I am assuming this is a multi-threading related issue.

  • I strongly suspect the error lies in the framework because (a) DynamicProxy is basically single-threaded, and makes sure via a coarse-grained lock that only one type gets "baked" by Reflection.Emit at a time. And (b) DynamicProxy's test suite includes a step that PEVerify-ies generated code. The test suite should cover the kind of simple proxy generation code that we've seen triggering this problem, so it seems unlikely that DynamicProxy generates invalid code. We likely would have noticed and caught that a long time ago.

  • At this time I'm still not knowledgeable enough to do CoreCLR debugging, and back then we didn't really know a good way to report the problem to the CoreCLR team that would've gotten them interested, since the exceptions couldn't be easily reproduced.
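
To make the "coarse-grained lock" argument in the list above concrete, here is a deliberately simplified sketch of what serialising type generation behind a single lock looks like. It is an illustration only and not DynamicProxy's actual implementation; `SerializedTypeCache`, `GetOrCreate`, and the `generate` callback are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public sealed class SerializedTypeCache
{
    private readonly object gate = new object();
    private readonly Dictionary<string, Type> cache = new Dictionary<string, Type>();

    // Returns the cached type for the key, or generates it while holding the
    // lock, so that at most one type is being generated at any given time.
    public Type GetOrCreate(string key, Func<Type> generate)
    {
        lock (gate)
        {
            if (!cache.TryGetValue(key, out var type))
            {
                type = generate(); // e.g. emit and "bake" a proxy type
                cache.Add(key, type);
            }
            return type;
        }
    }
}
```

With this shape, concurrent callers can never run two generations at the same moment. That is why errors that only show up under concurrency point away from the generation step itself and towards what happens when the finished types are loaded and used.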

@maxcherednik

maxcherednik commented Feb 1, 2023

> I'm noticing that the issue you're referring to also mentions generic type arguments, one having a name that appears to come out of nowhere (TEvent). This reminds me of another recent issue filed here, #648.

Indeed seems the same.

> I strongly suspect the error lies in the framework because ...

Might be. I am still trying to leave as much of a trace as possible for whoever is going to fix it on the CLR side. For example, the bug might occur during type generation or during invocation. I am currently leaning towards type generation. From the logs it is clear that the generic method of the generated type differs from the generic method of the interface.
