e2e browser crashes #579
With current Fedora 39, Chromium often runs into these crashes:

```
AVC avc: denied { execheap } for pid=3725873 comm="ThreadPoolForeg" scontext=system_u:system_r:container_t:s0:c439,c758 tcontext=system_u:system_r:container_t:s0:c439,c758 tclass=process permissive=0
systemd-coredump[3725903]: Process 3725873 (headless_shell) of user 1111 terminated abnormally without generating a coredump.
```

This causes the browser to hang, and the test gets force-killed after the timeout. The general tone in past bug reports is that this is intended -- heap shouldn't be writable and executable at the same time. So Chromium is doing something dubious here. This isn't something we can easily influence, so just poke a hole into the policy.

The policy module was generated by collecting the full denial with `ausearch -c 'ThreadPoolForeg' --raw`, and feeding the result into `audit2allow`:

```
echo 'type=AVC msg=audit(1709124430.347:10808): avc: denied { execheap } for pid=150495 comm="ThreadPoolForeg" scontext=system_u:system_r:container_t:s0:c114,c423 tcontext=system_u:system_r:container_t:s0:c114,c423 tclass=process permissive=1' | audit2allow -M cockpituous
```

Fixes #579
One day after the upgrade, we now see the same violation on all our machines, including rhos:
This seems to be a regression between selinux-policy-targeted-38.28-1.fc39.noarch and selinux-policy-targeted-39.3-1.fc39.noarch, then. There are some similar reports, and the general tone seems to be that this is intended -- heap shouldn't be writable and executable at the same time. So Chromium is doing something dubious here. The other dubious thing is this "c114,c423" type range, but I don't know enough about SELinux to know if that's intended. Combining
This can be built and activated with
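The exact build/activation commands aren't preserved in this thread, but the usual sequence for an `audit2allow`-generated module would look like this (module name taken from the quoted `audit2allow -M cockpituous` command above):

```shell
# Regenerate the module from the collected denial (writes cockpituous.te,
# the policy source, and cockpituous.pp, the compiled package).
ausearch -c 'ThreadPoolForeg' --raw | audit2allow -M cockpituous

# Load and activate the module; this persists across reboots.
semodule -i cockpituous.pp

# Confirm that it is installed.
semodule -l | grep cockpituous
```

This requires root and the `policycoreutils`/`audit` tooling on the host.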
I sent PR #580 to ansiblify this and rolled it out. Let's check tomorrow if it worked.
We changed another variable: cockpit-project/cockpit#20107 (comment). The tasks container now has Chromium 122, which fixed at least one "NOTREACHED" assertion. If there are no violations tomorrow, try to undo the SELinux policy and see if it happens again.
I ran

So I ran
The JavaScript JIT generates machine code at run time and thus needs executable memory. But the devil is probably in the details.
After I've done that, there are plenty of rejections again:
So, (1) this was unrelated to the Chromium update, and (2) let's land #580 to fix this.
We have a lot of random test crashes like this or this, where Chromium goes bananas and gets stuck, and the test eventually times out. I've never seen this with Firefox, nor on RHOS. Sometimes it's so bad that ¾ of the tests fail that way.
I suspect RAM over-commitment, so possibly we need to dial down the parallelism.
example Feb 20
example Feb 26
example Feb 27
[mvo] I can have a look, to learn more about our testing infrastructure.
Reducing `TEST_JOBS` seems like a good idea; it's currently at 8.
This has happened on rhos, but doesn't have the browser OOM message. Other tests include this:
Correct theory: SELinux denials
On the machines where this happens, there are thousands of SELinux rejections:
and some
Checking the PSI cluster (rhos01):
→ not a single SELinux denial. Since January, there are 11 instances of
and that's a sign that we stress the rhos01 instances quite a bit. But it happens rarely enough that it's not an immediate concern, IMHO.
On Feb 27 15:45 CET, @martinpitt ran `ansible -f20 -i inventory -m shell -a 'setenforce 0' e2e`. Let's watch out for failures in the next few days, and see if it still happens. If not, and that's it, then we can see if we can fix this in a finer-grained way.

A big difference between rhos and e2e is that the former runs "classic" Fedora 39 Server, while the latter runs Fedora IoT (also based on 39). But that shouldn't matter much for the on-disk runtime bits. However, all our machines are quite a bit behind in terms of upgrades (they don't matter much; these aren't internet-facing machines):
On rhos (classic rpm):
on e2e (IoT):
On Feb 28 08:15 CET, all our machines were dnf/ostree upgraded to latest package versions and rebooted, so that they are now very similar. They both have selinux-policy-targeted-39.4-1.fc39.noarch and podman-4.9.3-1.fc39.x86_64. e2e has kernel-core-6.7.5-200.fc39.x86_64, while rhos has kernel-core-6.7.6-200.fc39.x86_64.
Also, I ran `setenforce 0` everywhere right away: we can still see the rejections with `permissive=1` without having to actually fail our tests. If we still get them, let's investigate/report and allow `execheap` in a more targeted fashion.

So tomorrow, let's run
and on openstack_tasks too.
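To check the next morning whether denials still show up in permissive mode, something along these lines works (these commands are my assumption, not quoted from the thread):

```shell
# AVC events since midnight; after `setenforce 0` they are logged
# with permissive=1 instead of killing the process.
ausearch -m avc -ts today

# Aggregate view: counts of AVC events by type and context.
aureport -a --summary
```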
Theory: /dev/shm
Current theory: our `--shm-size=1024M` is not enough for `TEST_JOBS=8`, and recent Chromium versions just need moar of it. Let's try to provoke that error in a local podman container with smaller SHM and/or `--memory`, and see if it's shm or heap.

Scaled down to ¼, the parameters as in production are:
which works fine locally. However, with `--shm-size=80m` the test fails very quickly, and produces some more noise.
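A scaled-down local reproduction could look roughly like this (the image name and test invocation are placeholders for illustration, not the real production values):

```shell
# Hypothetical: run the tests container with a deliberately small /dev/shm
# to provoke the Chromium failure mode locally.
podman run --rm -it --shm-size=80m --memory=3g \
    quay.io/example/tasks:latest \
    sh -c 'df -h /dev/shm && exec ./run-local-tests'
```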
`podman exec -itl watch df -h /dev/shm` also shows, at the start of the test (when it didn't fail yet), that it's quickly eating away space.

This already happens during `image-customize`, so I take it that QEMU itself also started to use /dev/shm. With `--shm-size=180m` it works well enough, with Use% going up to ~72%. But it changes quickly, so I may have missed a few peaks.

A better command is
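One candidate for such a command (my guess at the kind of thing meant here, not the command actually used) polls every second instead of relying on `watch`'s default interval:

```shell
# Sample /dev/shm usage once per second inside the latest container
# (-l targets the most recently created container).
podman exec -l sh -c 'while sleep 1; do df -h --output=pcent,used /dev/shm | tail -n1; done'
```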
Similar with `--shm-size 140m` -- tests start to fail with "cockpit is not defined" (like above), but this time /dev/shm actually hit the roof (97%).

With `--shm-size=200m` the error message looks different, which is interesting as I think I've seen this somewhere. It could be a bug in our page of course, and totally unrelated. And it's not the OOM hang.
I just thought about a counter-argument: rhos-01 seems much less affected (if at all), but it uses the same SHM config. It only runs one container per host, as opposed to 4 on e2e, so RAM/heap over-commitment matters, but not /dev/shm.
Theory: general RAM/heap
I ran the test with `--memory=3g`, which should be too little for two parallel tests (it's half the size of production) -- our nond machines alone already take > 2 GiB. Interestingly, nothing inside of the container actually seems to care -- `free`, `top` etc. all show 16 GiB (my laptop's RAM size). Looking at the container cgroup: it does hit the roof, but nothing consequential -- the tests run happily.

Then again, with `--memory=1g` the rpmbuild inside of the VM fails, so this option does something. With 2g it successfully builds, but both qemu and the browser (`headless_shell`) get hit over the head by the OOM killer.

So, conclusions: (1) the memory limit option does work, (2) even half the allocated size is enough, and (3) it doesn't fail in the observed mode.
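Reading the limit and the OOM counter straight from the container's cgroup can be sketched like this (paths assume cgroups v2; the lookup by container ID is an approximation, not the thread's actual commands):

```shell
# Locate the cgroup of the most recent container and dump its memory accounting.
CID=$(podman ps -lq --no-trunc)
CG=$(find /sys/fs/cgroup -type d -name "*${CID}*" 2>/dev/null | head -n1)

cat "$CG/memory.max"                # the --memory limit in bytes ("max" if unlimited)
cat "$CG/memory.current"            # current memory usage
grep oom_kill "$CG/memory.events"   # how many times the OOM killer struck
```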