[POWER9] OCC locks up over time if power saving modes enabled #26
Comments
Were any eSELs generated to the BMC? There should be HTMGT traces in /var/log/syslog or /var/log/messages on the host (collect SOS data). You should also confirm that the opal-prd service is running on the host (service opal-prd status).
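(For reference, a hedged sketch of those checks; the service name matches what op-build ships, and the log location varies by distro:)
systemctl status opal-prd          # confirm the opal-prd service is running
grep HTMGT /var/log/messages       # or /var/log/syslog, to find the HTMGT traces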
The only thing in /var/log/messages is:
opal-prd did, however, record the error and a large amount of debugging information in /var/log/opal-prd.log. See attached.
Looks like OCC generated a log:
This indicates that there was a VFRT timeout (WOF_RC_VFRT_REQ_TIMEOUT 0x00000040) on the second processor, and the OCC requested a reset to try to recover. There were some SCOM errors during the reset, which is not expected. After the reset, the OCC in the first processor was not responsive:
After this next reset, the OCCs did seem to recover:
At this point both OCCs are active and running, and I do not see any further errors. From this trace, it appears the frequency should only have been impacted for a few minutes (while the OCCs were in recovery). I believe there should have been eSELs generated. You can confirm the current state of the OCCs with:
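(The command block was lost from this comment. A plausible reconstruction, assuming opal-prd's expert-mode HTMGT passthru interface; verify against your opal-prd level:)
opal-prd --expert-mode htmgt-passthru 1    # dump internal HTMGT/OCC state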
If states are active you can send poll commands to the OCCs:
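(Again, the commands were lost in extraction; a hedged reconstruction, assuming passthru subcommand 4 sends a raw command to an OCC and 0x00 is a poll:)
opal-prd --expert-mode htmgt-passthru 4 0 0x00    # poll OCC0
opal-prd --expert-mode htmgt-passthru 4 1 0x00    # poll OCC1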
So you are still seeing performance impacts? What is the governor set to?
@cjcain The system in question was a critical production system. It has since been restarted with the stop modes disabled, so I can't gather current diagnostic information about the fault. That being said, our rack management systems confirm that the OCC did /not/ fully recover. At minimum the BMC was unable to reconnect to the OCC for fan management, and our experience is that once the server enters this state it tends to fail with a hard CPU lockup within a week.
Was BMC data collected after the failure? (That should have the eSEL data.)
No, we tend not to gather eSEL data since we have no way of decoding / parsing it with the open source tools. My understanding was that opal-prd gathers basically the same diagnostic information, is this not the case?
opal-prd gathers some error data and sends it to the BMC. The system should be handling the recovery attempts, but if there is a persistent error or a need for service, the data will be sent to the BMC in the form of an eSEL. There is also other data on the BMC that is not known by the host, and the BMC maintains some data across power cycles which the host does not always do. You can run dreport on the BMC to collect data.
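(A hedged usage sketch, run on the BMC itself; with no arguments dreport collects a full debug dump archive, though exact flags and output location vary by OpenBMC level:)
dreport    # writes a debug data tarball, typically under /tmp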
@cjcain Thanks for the info. I'm of the opinion that the OCC shouldn't be crashing at all when stop states are enabled -- we've gone well over a year without any OCC problems with the stop states disabled, but as soon as they are enabled the OCC goes down within a few days. The eSEL data may not be too useful (see the following event list). You can see that the failure isn't isolated to a single event; just trying to run the server with stop states enabled generated two faults in four days:
What level/commit are you using for hostboot and hostboot-binaries?
Hostboot 1e1ea69f, binaries (ring files) 2b4eedbd, HCODE 003286ea4
@madscientist159 There was a PGPE HCODE commit merged back in Feb/Mar that exacerbated a long-standing bug. The fail signature of this issue matches it. The problem was discovered at the end of May and a fix was put out in early June. Specifically, this commit fixes that issue and was pulled into OpenPOWER on Jun 6th. Please pull that in and retry. If the problem persists, let me know.
@rbatraAustinIBM Unfortunately the problem persists -- in fact, it's gotten worse: when it disabled the BMC fan controls it locked one of the CPU fans to low speed, so we had a box with a significantly overheating CPU until it was caught. Notably, we have two machines that seem far more susceptible to this than the others; I'm not sure if this is because of workload or potentially marginal / faulty CPUs. The SEL recorded a single instance of the same failure code as before on CPU1. opal-prd shows the crash / recovery (attached), but the BMC again was unable to reestablish sensor reads, so it locked the fan to a setting that could damage hardware. BMC dmesg:
Digging a little further, the BMC was eventually able to restore communication, but it doesn't look like the OpenBMC userspace is written to handle the OCC dropping off and reappearing. Our watchdog (added because the IBM thermal monitoring would lock up and drop off for a variety of reasons) looks for occ_error to be non-zero; during and after this event occ_error remained zero, so the userspace processes that read CPU1 temperatures on the BMC remained hung. Until the BMC software stack can reliably handle the OCC disappearing and reappearing, I'm considering this a serious bug -- one of the main concerns is who has liability right now for CPUs that are destroyed by overheating once the OCC crashes. If there's anything I can do to assist with debugging this, please let me know ASAP. Thanks!
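(A minimal sketch of the watchdog check described above, run on the BMC; the sysfs path is an assumption based on the standard occ-hwmon binding and varies by kernel level:)
for occ in /sys/bus/platform/devices/occ-hwmon.*; do
    err=$(cat "$occ/occ_error" 2>/dev/null || echo unreadable)
    [ "$err" != "0" ] && echo "$occ reports error state: $err"
done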
There are two different issues here. One is the OCC going through a reset, which the BMC must handle; this should not be causing fans to go low -- fans should go high when the BMC detects that it can no longer get temperatures from the OCC. The BMC team should be looking at the fan issue. Did you gather the eSELs from this last occurrence so we can see why the OCC was reset?
@marthabroyles What's happening is that the BMC userspace isn't recovering from the OCC dropping off and reappearing. The working theory right now is that the hwmon DBUS proxy isn't detecting the device node going away, so it keeps trying to read from a stale handle, since the process is present and the BMC kernel is able to read the temps after OCC recovery has finished. This means the fan controls are using stale data from when the CPU was in throttled mode, allowing the reactivated CPU to overheat. What I still don't understand is why the OCC resets are hitting only two machines out of a large number. There was only one eSEL generated, for the OCC read failure BMC-side. Since the OCC recovered this time and the BMC fan controls could be manually reset, the machine is still online, so I can gather additional info if needed.
@madscientist159 asked me to comment here, because I've also hit the same messages. It happened twice in roughly one year on my Talos with their latest stable firmware. In both cases it was when I started a KVM-based VM. I don't think I can provide more info except the kernel versions for host & guest.
@madscientist159 Chris Cain pointed out the following information from the HBRT log about the image build level you are running with:
Jul 14 13:00:49 host opal-prd: HBRT: ISTEPS_TRACE:pullTraceBuildInfo: image build date = 20190713, build time = 0011, build tag = hw042519a.940
As you can see, the build tag is hw042519a.940, which is April 25th. However, the fix that I mentioned earlier went in on Jun 6th, so you will need to run with an HW image tag that is at least the Jun 5th level or higher. It should be named hw060519a.940.
@rbatraAustinIBM We build our own HCODE images; the linked patch was applied to our existing tree and the HCODE was rebuilt. That is why the tag is still the older string value. Was there an additional patch that fixed this problem as well? We don't really want to go through the entire months-long re-qualification process starting with beta firmware again, which would be needed if we pulled the newest HCODE at this point.
FYI, the symptoms in that last opal-prd-fail.log (from yesterday) were the same as the original failure, which is why I had @rbatraAustinIBM confirm the code level.
@madscientist159 You mentioned you are seeing this problem on only a couple of machines. Have you ever seen the problem on these machines before, even with older code levels? Also, what was the previous working code level in terms of HW image tag?
@cjcain Understood, and that's why I'm wondering if there is a different patch that is also needed. @rbatraAustinIBM Yes, we saw this problem basically from day 1. It has persisted across parts swaps (the first thought was a defective CPU / mainboard) and multiple firmware levels, including revisions from back in 2018. That particular machine is a heavily loaded production VM host, which is also why it is not a trivial matter to get approval to test a firmware level against that particular system. The other system still needs a routine parts swap to rule out hardware problems, but may be a better test bed if that pans out, as it does not handle the same level of critical workloads.
The handling of OCC resets should be looked at from the BMC side. The BMC is told whenever the OCC is being reset, so the BMC will stop trying to communicate with the OCC and adjust the fans accordingly (higher) for not having temperatures from the OCC. Once the OCC reset is complete and the OCCs are running again, the BMC is informed again so it can resume communicating with the OCCs. We use an OCC reset as a recovery method which must work; it is designed to allow 3 OCC resets before we consider there to be a real problem. In this case the OCC reset did recover the OCC, but it sounds like the BMC did not recover, which is a big problem. OCC resets are a known, architected recovery mechanism that the BMC should be handling.
To handle the OCC reset (active status), there is an OpenPOWER OCC control application (https://github.com/openbmc/openpower-occ-control) that was created to set an associated OCC active property on D-Bus and bind/unbind the driver per OCC. The active property controlled by this application is the signal that can be configured within fan control to set fans to full speed and unload any events that use this active property as a precondition. Is this application being used, and is fan control configured with a precondition against all OCC active properties that must pass to allow fan speed changes, otherwise running the fans at a set speed, i.e. full speed? Once all OCCs become active again, the fan control's configured precondition would pass, allowing the configured events to be loaded, subscribing to the OCC sensors and enabling fan speed changes. (https://github.com/openbmc/phosphor-fan-presence/blob/master/control/preconditions.cpp#L17-L76)
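(A hedged way to inspect that active property by hand on the BMC; the service, object path, and interface names are assumptions based on current openpower-occ-control and should be verified with introspection on your BMC image:)
busctl get-property org.open_power.OCC.Control \
    /org/open_power/control/occ0 \
    org.open_power.OCC.Status OccActive    # expect boolean true when OCC0 is active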
@msbarth Yes, it is being used, and it did work as intended in every OCC failure up until the last one. The BMC revision level did not change; for some reason this latest OCC failure was not properly detected by the BMC, and therefore the fans did not ramp up as they normally do (the ramped-up fans cause a massive temperature drop, which shows on our rack management systems as an alert; this is normally the first indication we get that the OCC has failed).
@madscientist159 Ok, I've mentioned this issue to the maintainer of the OCC control application to take a look into that OCC-to-BMC communication path. |
@madscientist159 I see from the BMC dmesg that there are errors communicating with the OCC. Can you tell if this is before, during, or after the OCC was reset? What kernel is on your BMC? You also said "the BMC kernel is able to read the temps after OCC recovery has finished." Does this mean the BMC recovered fine after the OCC reset? I don't think it's possible for stale temperature data to come from the BMC OCC driver: if it can't get the poll response, it will return an error when the hwmon files are read.
@eddiejames The dmesg failure entries on the host and the BMC start at exactly the same time, down to the second. I'm not comfortable trying to infer ordering below 1-second resolution since the BMC and host dmesg stamps can drift slightly relative to one another. Yes, the BMC kernel recovered after the OCC reset this time; the BMC userspace for whatever reason did not. Normally the OCC reset knocks the BMC kernel out too, requiring a reboot of both host and BMC. Kernel 5.0.7 is on the BMC.
Normally the OCC reset breaks the BMC? That seems odd. But there were a couple of fixes put into 5.1 for the OCC driver that could be helpful here; without them, it was possible to get stale data.
@madscientist159 OK, so we still need to understand why the failure occurred in the first place. If you have the commit, then we need to debug this further. Can you please gather the BMC FFDC at the point of failure?
Let us first confirm that the PM Complex can be reset and still maintain communication with the BMC.
Then you can manually issue a reset (it takes ~15 seconds), after which you should see the OCCs go active again and the reset count increment:
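(The command itself was dropped from this comment; per the reset command referenced in the issue description below, it is presumably:)
opal-prd occ reset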
If you continue to issue reset commands (total of 4), you can verify the system goes to safe mode:
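(A hedged sketch of that test; the counts and timing come from the comments above: each reset takes ~15 seconds, 3 resets are allowed, and a 4th should leave the PM Complex held in reset:)
for i in 1 2 3 4; do
    opal-prd occ reset
    sleep 20    # allow each ~15-second reset to complete before the next
done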
The OCCs/PM Complex should be held in reset and the system running in safe mode. At this point, an eSEL should have been generated and sent to the BMC. This test should verify the full reset/recovery process as well as the sending of eSELs to the BMC. You can send me the BMC FFDC and opal_prd.log if you need help debugging. To recover, you can just reboot the system.
In an email from @madscientist159 (regarding testing the manual OCC resets; see my last comment):
First I want to confirm:
Regarding the traces from this issue: the initial failure you hit did have a fix. If you have applied Rahul's patch to your driver (open-power/hcode@cade833) and continue to see the system enter safe mode, then we will need to see additional data to debug the new failure. If you are seeing the host hang, you can still reboot the host and then collect the data (since it should not be lost across reboots). There should always be an eSEL generated and sent to the BMC any time the system is left in safe mode; this is what we use to determine what triggered safe mode. I did not see any eSELs listed in this issue. Since the build being used on your systems is missing many fixes/commits, we need additional data to try to narrow down what you are seeing and which additional commits you need to pick up.
@madscientist159 Let's make sure you have all the known fixes for the reason the reset is being requested in the first place (PGPE VFRT timeout); there is another PGPE fix for this issue since the previous level @rbatraAustinIBM pointed you to on July 11th. @rbatraAustinIBM, is the latest VFRT timeout fix available? Please update here with the fix they need to pull in.
We recently fixed a problem with a similar fail signature. However, that commit is not yet available on the OpenPOWER GitHub. Once it's available, I will notify you. Also, hold off on pulling any more FFDC until then.
@rbatraAustinIBM Is the commit available yet? We're seeing this failure at a customer site after a DD2.3 upgrade; it's knocking the fan controls offline and providing a poor customer experience.
@madscientist159 Yes, the commit should be available now. Can you tell me what OP driver level this machine is on?
We build the firmware from source, so our driver level usually doesn't match the IBM driver levels. Can you point to the commit so that we can pull that revision and test?
@rbatraAustinIBM Were you able to locate this information? We have successfully replicated the customer site failure in our lab on DD2.3 CPUs; it appears to be triggered by compilation-heavy workloads.
Hi, sorry for the delayed response; I didn't see your previous reply until now. Here are the commits: https://rchgit01.rchland.ibm.com/gerrit1/#/c/84277/ (PGPE: Multiple Droop Handling Fix)
Also, note that these commits may or may not fix your problem. If not, then we have a different problem.
After a random amount of time, the OCC locks up and downclocks the host if the default Linux power saving modes are enabled. When this happens, not only is the machine slowed down, but the fan controls are completely disabled and a full host power-off reboot is required to restore functionality.
opal-prd occ reset
is sometimes able to restore proper clocks without a reboot, but the fan controls remain offline until a full reboot.
Host dmesg:
BMC dmesg:
The following command, run at host start, completely prevents the issue from occurring, but is undesirable as the machines use significantly more power at idle:
echo 1 | tee /sys/devices/system/cpu/cpu*/cpuidle/state?/disable
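(To undo the workaround later, the inverse write re-enables the idle states; same sysfs interface, standard Linux cpuidle semantics:)
echo 0 | tee /sys/devices/system/cpu/cpu*/cpuidle/state?/disable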
Host Linux version: 4.19.0-5-powerpc64le
OCC GIT hash: 58e422d