10-localtime hook sometimes fails to copy the file #218

Open
krono opened this issue Nov 25, 2024 · 1 comment

krono commented Nov 25, 2024

With #184, enroot gained a workaround for Ubuntu images doing post-install shenanigans with timezone files.

We (@j-hellenberg, me) found that some other Ubuntu images seem to exhibit a race condition in the new code, causing enroot/pyxis container startup to fail in the hook:

slurmstepd-cXXX: error: pyxis: container start failed with error code: 1
slurmstepd-cXXX: error: pyxis: printing enroot log file:
slurmstepd-cXXX: error: pyxis:     cp: cannot create symbolic link '/<network_share>/.local/share/enroot/pyxis_2310/etc/localtime': File exists
slurmstepd-cXXX: error: pyxis:     [ERROR] /etc/enroot/hooks.d/10-localtime.sh exited with return code 1
slurmstepd-cXXX: error: pyxis: couldn't start container

A workaround seems to be adding --force to both cp calls in the hook:

cp --no-dereference --preserve=links /etc/localtime "${ENROOT_ROOTFS}/etc/localtime"
and
cp "${target}" "${ENROOT_ROOTFS}${target}"
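
For illustration, the effect of --force on an already existing symlink destination can be checked in isolation. This is only a minimal sketch, assuming GNU cp; the helper name copy_link_forced, the temporary directory, and the /nonexistent/... targets are made up for the example and are not taken from the hook. It does not reproduce the race, it only shows the forced copy replacing an existing link.

```shell
#!/bin/bash
# Sketch only: copy a symlink over an existing symlink with --force.
# All names and paths below are hypothetical; GNU cp is assumed.
set -eu

# copy_link_forced SRC DST: copy the symlink itself (not its target),
# allowing cp to replace DST if it is already there.
copy_link_forced() {
  cp --force --no-dereference --preserve=links "$1" "$2"
}

demo_dir=$(mktemp -d)
ln -s /nonexistent/new-target "${demo_dir}/SRC"
ln -s /nonexistent/old-target "${demo_dir}/DST"   # destination already exists

copy_link_forced "${demo_dir}/SRC" "${demo_dir}/DST"
readlink "${demo_dir}/DST"   # shows where DST points after the copy
```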

@3XX0 suggested investigating the shared/network filesystem, because the hooks are actually running in a flocked scope:

flock -w 30 "${_lock}" > /dev/null 2>&1 || common::err "Could not acquire rootfs lock"
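
To make the "flocked scope" concrete, here is a minimal sketch of the same pattern: a subshell that opens the lock file on a file descriptor, takes an exclusive flock on it, and releases it when the subshell exits. The helper name with_lock and the lock path are hypothetical and exist only for this example; flock from util-linux is assumed.

```shell
#!/bin/bash
# Sketch only: run a command while holding an exclusive flock on a lock file.
# with_lock and the lock path are hypothetical names for this example.
set -eu

with_lock() {
  local lock_file=$1; shift
  (
    # Wait up to 30 seconds for the lock on the descriptor opened below.
    flock -w 30 "${lock_fd}" || { echo "Could not acquire lock" >&2; exit 1; }
    "$@"
    # The lock is dropped when the subshell exits and the descriptor closes.
  ) {lock_fd}>"${lock_file}"
}

lock_dir=$(mktemp -d)
with_lock "${lock_dir}/LOCK" echo "inside the lock"
```

Whether a lock taken this way is honoured across nodes depends on the filesystem, which is what the suggestion above is about.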


I investigated the flocking behaviour and found:

  1. Flocks are supported in our environment (GPFS)
  2. I managed to force the race with the "flocker.sh" below, but ONLY after I introduced the sleep 0.05 AND the 10x cp
  3. Other uses of cp do NOT complain, even if the destination exists.

Maybe noteworthy: lslocks shows the lock as a UNIX lock, as opposed to other locks that show up as POSIX or FLOCK.

I am at a loss as to what "cannot create symbolic link '...': File exists" should actually mean, since at other times cp happily overwrites the destination, even without --force.

`flocker.sh`
#!/bin/bash
set -ex
set -o pipefail
PARENT=${1-$PWD}
echo "$PARENT"
FLOCKER="${PARENT}/flock"
LOCK="${FLOCKER}/LOCK"


err() {
  echo "$1"
  exit 1
}


leflock() {
(

  flock -w 30 "${_lock}" ||  err "no lock can do"
  #sleep 1
  sleep 0.05
  stat "${PARENT}/SRC" | grep -v id || err "no SRC yet?"
  stat "${FLOCKER}/DST" | grep -v id || echo "no DST yet,OK"
  Z=10
  while  (( Z-- )); do
    cp --no-dereference --preserve=links "${PARENT}/SRC" "${FLOCKER}/DST"
  done
  #sleep 1

) {_lock}>"${LOCK}"

}

echo == prepare ==
mkdir -p "${FLOCKER}"
ln -sf /tmp/definitiely/no/extisting/file/or/directory/cannot/make/that/up "$PARENT/SRC"


echo == go ==
X=120
while (( X-- )); do
  leflock
done
echo == done ==

krono commented Nov 25, 2024

Maybe this behavior is related: https://github.com/coreutils/coreutils/blob/cb2774501d7a292900a8a547d6c7cdc62f90c7fb/src/copy.c#L3116-L3119

So maybe it works most of the time because the copy succeeds due to the "special case", but sometimes, in the race, it does not, and we end up in the regular error case: https://github.com/coreutils/coreutils/blob/cb2774501d7a292900a8a547d6c7cdc62f90c7fb/src/copy.c#L3130
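
In shell terms, the idea could be sketched as follows. This rests on an assumption, namely that the "special case" amounts to accepting an existing symlink that already points at the same target; it is not a transcription of the coreutils code. The helper name ensure_symlink and all paths are hypothetical, and GNU ln (for -T) is assumed.

```shell
#!/bin/bash
# Sketch only: create a symlink, treating "already exists" as success when the
# existing link has the same target. Assumption-based shell analogue, not the
# coreutils logic itself.
set -eu

# ensure_symlink TARGET LINK (hypothetical helper)
ensure_symlink() {
  local target=$1 link=$2
  if ln -s -T "${target}" "${link}" 2>/dev/null; then
    return 0
  fi
  # Creation failed: accept it only if LINK is a symlink with the same target.
  [ -L "${link}" ] && [ "$(readlink "${link}")" = "${target}" ]
}

demo_dir=$(mktemp -d)
ensure_symlink /nonexistent/zone "${demo_dir}/localtime"   # creates the link
ensure_symlink /nonexistent/zone "${demo_dir}/localtime"   # exists, same target
ensure_symlink /nonexistent/other "${demo_dir}/localtime" || echo "different target"
```

The last call fails because the existing link points elsewhere, which corresponds to falling through to the regular error case.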
