Transactional Installation - Improve concurrent operations (pending) #943
Comments
When running two choco.exe processes at once, it would be best to hold a lock on the pending file for the currently running process and detect if the lock exists for another process. Then skip the removal of a package that has a lock on the file. This is done for 0.10.1.
Add method for opening a file exclusively.
Open and hold the pending file exclusively open until install is finished. Then remove the file lock. This allows for better concurrent operations when running multiple choco processes at the same time (which isn't necessarily recommended).
When removing packages in a pending state, attempt to open the pending file first. If it fails, log a message about skipping and move on to the next pending file.
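The locking behavior described in these commit messages can be sketched roughly as follows. This is a minimal illustration, not Chocolatey's implementation (which is .NET on Windows and is not shown in this thread). It assumes a POSIX system and uses `fcntl.flock` as a stand-in for opening a file exclusively. The names `hold_pending_lock`, `is_locked_elsewhere`, and `removable_pending_files` are hypothetical.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def hold_pending_lock(path):
    """Hold an exclusive lock on the pending file for the duration of an install."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes flock raise BlockingIOError instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield fd
    finally:
        # Closing the descriptor drops the lock whether the install
        # succeeded or failed, so later cleanup is never blocked.
        os.close(fd)


def is_locked_elsewhere(path):
    """Return True if another holder currently has the pending file locked."""
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return False
    except BlockingIOError:
        return True
    finally:
        os.close(fd)


def removable_pending_files(paths, log=print):
    """Filter pending files down to those that are safe to clean up."""
    removable = []
    for path in paths:
        if is_locked_elsewhere(path):
            log(f"Skipping {path}: pending file is locked by another process")
            continue
        removable.append(path)
    return removable
```

An install would wrap its work in `with hold_pending_lock(pending_path): ...`, and a cleanup pass in a second process would call `removable_pending_files` and only remove what it returns.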
I would like to suggest that this: "the new invocation of choco install should proceed without trying to delete the in-progress package. If either fails due to MSI reservation conflicts so be it" needs better handling than "so be it" - you can ask any admin who has had to troubleshoot 3% failures across 10,000 nodes.

Apologies for taking a hard line on "so be it" below - no ill intent toward the poster of that sentiment - but it comes from hard experience as the automation engineer who has to account for failure percentages on large scale automated software deployments, using software deployment technologies that take a "so be it" approach - including a ton of extra work diagnosing a bunch of the failures to learn "why".

I would say that within Chocolatey, if the MSI "InProgress" flag is set, Chocolatey should have a retry cycle and then eventually fail (all with lots of logging). This would make Chocolatey more tolerant of both:

1) itself installing a package in another instance of choco,
2) something else currently installing software using MSI - like a concurrent automated software distribution that does not use Chocolatey, or Windows Updates, or the end user manually running an install.

Keep in mind that when pushing one Chocolatey install job to thousands of machines using automated software distribution (especially desktops), the odds of conflicting with another MSI install across all those install instances are much higher. Many manual resolutions can be avoided by simply standing down and polling again.

In addition, proper logging and a dedicated exit code (rather than "so be it") would help quickly diagnose that Chocolatey could not get MSI services for the package. It would be great to have a dedicated exit code for this condition, as systems like SCCM can pick up that exit code and give statistics reports that reveal a meaningful reason why certain machines failed. SCCM and many other software distribution systems can then be told to re-target these failures (automatically and/or human re-scheduled).

If Chocolatey does not already do the above natively (something tells me it might), I could file a separate issue, since I believe this support would make Chocolatey more enterprise ready regardless of its ability to support running more than one Chocolatey install at a time. Also, if the dedicated exit code is not part of the current support, I could file an issue for that.
I think the ticket you are looking for is #484. Chocolatey does provide package exit codes back up the chain so they can be reported appropriately - that is #512. So you can get that information about why an MSI failed now in current versions of choco. Some enhancements we are planning will provide better information around failures, along with things like detecting and waiting. I can't say they will all land in FOSS, as some of these improvements seem like an organization's use case and not an individual's use case.
This is not how most installers work in practice. The retry and clean up logic required for each installer varies on a case-by-case basis, and simply retrying an installer over and over again is not likely to result in success. If you anticipate flaky installs and require logging and a cleanup-retry cycle, this should be part of your chocolatey install scripts and the orchestration that kicks off the chocolatey install. You already have the ability to inspect the installer exit code and the process output.
I'm going to preemptively de-escalate this - it's known that @DarwinJS has years of experience with Windows installers, typically MSI (this can be seen with some quick research into his GitHub and website). Typically Darwin is speaking from the point of view of Windows Installer technology (MSI), which does have built-in facilities to let you know other installs are occurring. What Darwin is asking for is for Chocolatey to see those particular exit codes and retry.

@masaeedu In the right situations, almost all software has flaky installs. Unfortunately Asad, I don't know your background, but I definitely agree that once you step outside of MSI, it varies wildly. Sometimes it varies within MSI too, but for the most part the error codes and checking are all pretty consistent.

So what I am saying is that no one is technically wrong here in what they wrote, given the different contexts and perspectives.
@ferventcoder I certainly don't have any experience with the internals of MSI installers, and defer to his experience in this matter. Nevertheless, MSI installers are not the only installers that rely on checking Windows in-progress install conflicts. Since neither MSI installers nor others that are capable of detecting install conflicts (from my use case, InstallShield) are completely idempotent, the fact remains that chocolatey should not be repeatedly attempting the install in the background.

I don't have a background as such with any of this, but my use case is deploying and provisioning virtual machines for automated integration testing. Since our environment setup involves network activity and we are heavily loading the hypervisor, invariably some installations fail due to timeouts or newly introduced bugs. When this happens, we don't want or need chocolatey to retry the install behind the scenes, we just need the exit code (which chocolatey already provides), and the orchestration scripts that are invoking chocolatey can figure out whether to dump and rebuild the machine, revert to snapshot, retry after a delay, cleanup and reattempt with older bits, or just report failure to our CI. Many of these scenarios would become more complicated if chocolatey started doing stuff in the background that was not explicitly requested.

There are scenarios where MSI and non-MSI installers alike can crash without releasing the installer reservation key, and a reattempt cycle here would not be helpful. For these reasons I stand by my original comment, in that chocolatey should not be in the business of trying to schedule or abort installs in the background. If an install fails due to a conflict, so be it. Whatever agent is invoking chocolatey (Puppet, DSC, Ansible, human being sitting at a terminal, whatever) will be responsible for deciding when and whether to retry, or to change the installation process to avoid timing conflicts.

At the very least the proposed automatic retry functionality should be hidden behind a new flag. As an aside, I don't think there is any need to "de-escalate" anything here, I think it is fine for different people to have different opinions on what features they want.
Fair statements. I think what we all understand is that different folks have different perspectives and sometimes hope for knobs they can turn to make software work better for them. This is one of those instances where Chocolatey would allow more knobs for other folks to change the default behavior. That default behavior is possibly the status quo currently and it may continue to be the default behavior.
Whether or not the package is successful, remove the lock on the pending file. Otherwise the failed install cleanup will not work properly.
* stable:
  (version) 0.10.1
  (GH-943) Remove Transaction Lock Even on Failure
  (doc) update CHANGELOG/nuspec
  (doc) add CHANGELOG title/summary
  (doc) update licensed changelog
  (GH-458) Warn To Verbose Log For Now
  (doc) add licensed changelog
  (maint) formatting
  (doc) Note Runtime Options For Checksums In Error
My experience is with large scale automated software distribution - it happens MSI was/is very mature in this area due to:
However, I definitely feel that software distribution is the same story over and over again - so why not steal from the man-decades of sunk cost that went into engineering the problems out of at-scale automated software distribution? It's free for the taking - no need to take 50 iterations to work around the same problems again. Solid return codes would work fine - I can do a retry loop easily enough. For use cases which aren't just immutable infrastructure - like long term instances, real servers and especially desktops - any resilience around crusty end-nodes, and not assuming that a given framework is the only software distribution maintaining the node, is helpful for adoption.
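For readers who want the caller-side retry loop mentioned above, here is a minimal sketch. It assumes the invoked command surfaces the Windows Installer exit code 1618 (`ERROR_INSTALL_ALREADY_RUNNING`) when another MSI install is in progress. The function name `run_with_retry` and its parameters are hypothetical and are not part of Chocolatey.

```python
import subprocess
import time

# Windows Installer: "Another installation is already in progress."
MSI_INSTALL_ALREADY_RUNNING = 1618


def run_with_retry(command, attempts=5, delay_seconds=60.0, run=None, sleep=time.sleep):
    """Run `command`, retrying only while it exits with the MSI busy code.

    `run` and `sleep` are injectable so the loop can be exercised
    without a real installer (an assumption made for this sketch).
    """
    if run is None:
        run = lambda cmd: subprocess.run(cmd).returncode
    code = None
    for attempt in range(1, attempts + 1):
        code = run(command)
        if code != MSI_INSTALL_ALREADY_RUNNING:
            # Success or an unrelated failure: hand the code straight back.
            return code
        print(f"attempt {attempt}/{attempts}: installer busy (exit {code})")
        if attempt < attempts:
            sleep(delay_seconds)
    # Still busy after every attempt: report the busy code to the caller.
    return code
```

A caller might use it as `run_with_retry(["choco", "install", "somepackage", "-y"])`, where `somepackage` is a placeholder. Keeping the loop in the orchestration layer leaves choco's own behavior unchanged, which is the arrangement argued for elsewhere in this thread.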
I am pretty sure Rob knows this, but you could optimize Chocolatey a little around underlying concurrent MSIs, because you can find the "InProgress" flag for MSI in the registry, and if you know the package you are about to install is an MSI, you could just send the "busy" return code right away rather than execute MSI and wait like 5+ minutes for it to give you the "Another install is in progress" return code. Also, you could take a hard lesson from MSI's simplistic "InProgress" flag, i.e. it would be nice to know the package name and the date the flag was thrown, because if the flag is the same in 3-5 days, I bet the client is in an unhealthy state - and even us pure DevOps guys like it when technology can self-report its health status :)
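The richer flag suggested above (package name plus the date the flag was set) could look something like this. It is purely illustrative: `InProgressMarker` and `is_stale` are hypothetical names, and the 3-day default comes from the lower end of the "3-5 days" mentioned in the comment. It is not a real MSI or Chocolatey setting.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class InProgressMarker:
    """Hypothetical in-progress flag that records who set it and when."""
    package_name: str
    set_at: datetime

    def is_stale(self, now, max_age=timedelta(days=3)):
        # A marker that has not changed for this long suggests the node
        # is unhealthy rather than genuinely mid-install.
        return now - self.set_at >= max_age
```

A health report could then list every node whose marker satisfies `marker.is_stale(datetime.now())`, together with `marker.package_name`.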
Related to #198 and #822.
From @masaeedu - discussion starting at #822 (comment)