
flake: failing during image pull when building podinfo-flux package in test-external #3194

Open
AustinAbro321 opened this issue Nov 6, 2024 · 7 comments

Describe what should be investigated or refactored

Seeing a flake in the test-external workflow where images intermittently fail to be saved.

Workflow run
Relevant logs:

  •  Fetching info for 9 images. This step may take several seconds to complete.
  •  Fetched info for 9 images
  •  Pulling 9 images (0.00 Byte of 243.74 MBs)

 WARNING  Failed to save images in parallel, falling back to sequential save: All attempts fail:
          #1: error writing layer: expected blob size 3419706, but only wrote 3207362
          #2: error writing layer: expected blob size 3419706, but only wrote 3207362
     ERROR:  failed to create package: All attempts fail:
             #1: error writing layer: expected blob size 3419706, but only wrote 3207362
             #2: error writing layer: expected blob size 3419706, but only wrote 3207362
    common.go:33: 
        	Error Trace:	/home/runner/work/zarf/zarf/src/test/external/common.go:33
        	            				/home/runner/work/zarf/zarf/src/test/external/ext_in_cluster_test.go:165
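
For context on that number: the expected blob size in the error is the layer size declared in the image manifest, so it can be cross-checked against the registry. A minimal sketch, assuming crane and jq are installed (the tag is one of the flux images in the package):

```bash
# Print declared layer digests/sizes for one of the flux images.
# If the tag points at a multi-arch index this prints the index entries instead,
# so resolve a platform-specific digest first if you need the actual layers.
crane manifest ghcr.io/fluxcd/helm-controller:v1.1.0 \
  | jq '(.layers // .manifests)[] | {digest, size}'
```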
@AustinAbro321

I've validated that this is not caused by disk space, as the error in that case looks different:

failed to create package: All attempts fail:
             #1: error writing layer: write
             /tmp/zarf-2081439845/images/blobs/sha256/000f791482e95f5e804ace91e5d39e0d48723c758a6adc740738cc1f9cd296153189335322:
             no space left on device
             #2: error writing layer: write
             /tmp/zarf-2081439845/images/blobs/sha256/000f791482e95f5e804ace91e5d39e0d48723c758a6adc740738cc1f9cd296152478328656:
              no space left on device
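
For anyone wanting to rule this out themselves, free space and the size of Zarf's temp workspace can be watched while the build runs; a minimal sketch (the /tmp/zarf-* path is whatever temp directory the failing run prints):

```bash
# Check free space on the filesystem Zarf writes to, and the size of its temp workspace
df -h /tmp
du -sh /tmp/zarf-*/images 2>/dev/null
```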

@AustinAbro321

I ran a script to build the podinfo-flux package (what the test flakes on) 100 times in two different terminals in parallel. I was not able to reproduce the error.
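
Roughly the shape of that repro loop, as a sketch assuming it is run from a checkout of the zarf repo (the example path and flags may differ from the script actually used):

```bash
#!/usr/bin/env bash
# Build the podinfo-flux example repeatedly; run in two terminals at the same time.
set -euo pipefail
for i in $(seq 1 100); do
  echo "=== attempt ${i} ==="
  zarf package create examples/podinfo-flux --confirm -o /tmp/flake-repro
done
```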

@RothAndrew reported that a similar error happens to him in his day-to-day work with a separate private package. It does not happen for him when the images are not already in the Zarf cache. That tracks: in our usual e2e tests we delete the Zarf cache right away to save storage, which is likely why the flake only shows up in the test-external workflow.
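
The cache deletion the e2e tests do can be reproduced manually before a build; a minimal sketch, assuming the default cache location under the home directory:

```bash
# Clear Zarf's cache the same way the e2e tests effectively do before building
zarf tools clear-cache
# or remove the default cache directory directly (assumed default location)
rm -rf ~/.zarf-cache
```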

@RothAndrew

It happens so persistently for me that I ended up doing this pretty much anywhere I'm making Zarf packages now: https://github.com/defenseunicorns-partnerships/wfapi/blob/main/scripts/build_zarf_package.sh
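
Not reproducing the linked script here, but the general shape of that kind of workaround is a small wrapper that always builds from a cold cache; a hedged sketch, not the actual script:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: clear the cache so every build starts without cached layers
set -euo pipefail
PACKAGE_DIR="${1:-.}"   # hypothetical argument: directory containing zarf.yaml
zarf tools clear-cache
zarf package create "${PACKAGE_DIR}" --confirm
```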

@RothAndrew

> I ran a script to build the podinfo-flux package (what the test flakes on) 100 times in two different terminals in parallel. I was not able to reproduce the error.

I wonder if the image size or number of layers makes a difference when trying to recreate it. Podinfo is much smaller than most of the images I work with.

@AustinAbro321 commented Dec 20, 2024

@RothAndrew Has it ever happened with only one image?

Every failure I looked at in test-external (https://github.com/zarf-dev/zarf/actions/workflows/test-external.yml?query=is%3Afailure) fails with expected blob size 3419706. Decompressing the package and running find . -type f -size 3419706c on the layers returns ./94c7366c1c3058fbc60a5ea04b6d13199a592a67939a043c41c051c4bfcd117a (see the sketch after the list). This is the base layer for the following six images in the package, so having multiple images grab the same layer possibly makes this flake more likely:

  - ghcr.io/fluxcd/helm-controller:v1.1.0
  - ghcr.io/fluxcd/image-automation-controller:v0.39.0
  - ghcr.io/fluxcd/image-reflector-controller:v0.33.0
  - ghcr.io/fluxcd/kustomize-controller:v1.4.0
  - ghcr.io/fluxcd/notification-controller:v1.4.0
  - ghcr.io/fluxcd/source-controller:v1.4.1
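
For reference, the inspection above looks roughly like this (a sketch; the package file name assumes the default naming for an amd64 build of the podinfo-flux example):

```bash
# Unpack the built package and look for layer files matching the failing blob size
zarf tools archiver decompress zarf-package-podinfo-flux-amd64.tar.zst ./unpacked
find ./unpacked -type f -size 3419706c
```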

@RothAndrew

> Has it ever happened with only one image?

I'm not sure. I feel like it definitely happens more when there are multiple images, or the images are large, or the registry it is pulling from is slow.

@AustinAbro321

Pretty sure I found the issue: Zarf was not properly deleting invalid layers from the cache when they occurred. @RothAndrew feel free to test out #3358, though either way the team will see in time whether the flake disappears.
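
Until that lands, corrupt cached layers can be detected by hand, since each blob is stored under its digest; a minimal sketch assuming the default ~/.zarf-cache/images OCI layout (not necessarily how #3358 implements the fix):

```bash
#!/usr/bin/env bash
# Remove cached image layers whose contents no longer match the digest in their filename.
# Assumes the default cache layout at ~/.zarf-cache/images/blobs/sha256.
set -euo pipefail
cache="${HOME}/.zarf-cache/images/blobs/sha256"
for blob in "${cache}"/*; do
  [ -f "${blob}" ] || continue
  expected="$(basename "${blob}")"
  actual="$(sha256sum "${blob}" | awk '{print $1}')"
  if [ "${expected}" != "${actual}" ]; then
    echo "corrupt cached layer, removing: ${expected}"
    rm -f "${blob}"
  fi
done
```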
