
flake: failing during image pull when building podinfo-flux package in test-external #3194

Open
AustinAbro321 opened this issue Nov 6, 2024 · 7 comments

Describe what should be investigated or refactored

Seeing a flake in the test-external workflow where images intermittently fail to be saved.

Workflow run
Relevant logs:

  •  Fetching info for 9 images. This step may take several seconds to complete.
  •  Fetched info for 9 images
  •  Pulling 9 images (0.00 Byte of 243.74 MBs)

 WARNING  Failed to save images in parallel, falling back to sequential save: All attempts fail:
          #1: error writing layer: expected blob size 3419706, but only wrote 3207362
          #2: error writing layer: expected blob size 3419706, but only wrote 3207362
     ERROR:  failed to create package: All attempts fail:
             #1: error writing layer: expected blob size 3419706, but only wrote 3207362
             #2: error writing layer: expected blob size 3419706, but only wrote 3207362
    common.go:33: 
        	Error Trace:	/home/runner/work/zarf/zarf/src/test/external/common.go:33
        	            				/home/runner/work/zarf/zarf/src/test/external/ext_in_cluster_test.go:165
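
For context on that number: the expected blob size in the error is the layer size declared in the image manifest, so it can be cross-checked against the registry. A minimal sketch, assuming crane and jq are installed (the tag is one of the flux images in the package):

```bash
# Print declared layer digests/sizes for one of the flux images.
# If the tag points at a multi-arch index this prints the index entries instead,
# so resolve a platform-specific digest first if you need the actual layers.
crane manifest ghcr.io/fluxcd/helm-controller:v1.1.0 \
  | jq '(.layers // .manifests)[] | {digest, size}'
```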
@AustinAbro321

I've validated that this is not caused by disk space, as the error in that case looks different:

failed to create package: All attempts fail:
             #1: error writing layer: write
             /tmp/zarf-2081439845/images/blobs/sha256/000f791482e95f5e804ace91e5d39e0d48723c758a6adc740738cc1f9cd296153189335322:
             no space left on device
             #2: error writing layer: write
             /tmp/zarf-2081439845/images/blobs/sha256/000f791482e95f5e804ace91e5d39e0d48723c758a6adc740738cc1f9cd296152478328656:
              no space left on device
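
For anyone wanting to rule this out themselves, free space and the size of Zarf's temp workspace can be watched while the build runs; a minimal sketch (the /tmp/zarf-* path is whatever temp directory the failing run prints):

```bash
# Check free space on the filesystem Zarf writes to, and the size of its temp workspace
df -h /tmp
du -sh /tmp/zarf-*/images 2>/dev/null
```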

@AustinAbro321

I ran a script to build the podinfo-flux package (what the test flakes on) 100 times in two different terminals in parallel. I was not able to reproduce the error.
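
Roughly the shape of that repro loop, as a sketch assuming it is run from a checkout of the zarf repo (the example path and flags may differ from the script actually used):

```bash
#!/usr/bin/env bash
# Build the podinfo-flux example repeatedly; run in two terminals at the same time.
set -euo pipefail
for i in $(seq 1 100); do
  echo "=== attempt ${i} ==="
  zarf package create examples/podinfo-flux --confirm -o /tmp/flake-repro
done
```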

@RothAndrew reported that a similar error happens to him in his day-to-day work with a separate private package. It does not happen for him when the images are not already in the Zarf cache. That tracks: in our usual e2e tests we delete the Zarf cache right away to save storage, which is likely why the flake only shows up in the test-external workflow.
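
The cache deletion the e2e tests do can be reproduced manually before a build; a minimal sketch, assuming the default cache location under the home directory:

```bash
# Clear Zarf's cache the same way the e2e tests effectively do before building
zarf tools clear-cache
# or remove the default cache directory directly (assumed default location)
rm -rf ~/.zarf-cache
```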

@RothAndrew

It happens so persistently for me that I ended up doing this pretty much anywhere I'm making Zarf packages now: https://github.com/defenseunicorns-partnerships/wfapi/blob/main/scripts/build_zarf_package.sh
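
Not reproducing the linked script here, but the general shape of that kind of workaround is a small wrapper that always builds from a cold cache; a hedged sketch, not the actual script:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: clear the cache so every build starts without cached layers
set -euo pipefail
PACKAGE_DIR="${1:-.}"   # hypothetical argument: directory containing zarf.yaml
zarf tools clear-cache
zarf package create "${PACKAGE_DIR}" --confirm
```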

@RothAndrew

> I ran a script to build the podinfo-flux package (what the test flakes on) 100 times in two different terminals in parallel. I was not able to reproduce the error.

I wonder if the image size or number of layers makes a difference when trying to recreate it. Podinfo is much smaller than most of the images I work with.

@AustinAbro321 commented Dec 20, 2024

@RothAndrew Has it ever happened with only one image?

Every failure I looked at in test-external (https://github.com/zarf-dev/zarf/actions/workflows/test-external.yml?query=is%3Afailure) fails with expected blob size 3419706. Decompressing the package and running find . -type f -size 3419706c on the layers returns ./94c7366c1c3058fbc60a5ea04b6d13199a592a67939a043c41c051c4bfcd117a (see the sketch after the list). This is the base layer for the following six images in the package, so having multiple images grab the same layer possibly makes this flake more likely:

  - ghcr.io/fluxcd/helm-controller:v1.1.0
  - ghcr.io/fluxcd/image-automation-controller:v0.39.0
  - ghcr.io/fluxcd/image-reflector-controller:v0.33.0
  - ghcr.io/fluxcd/kustomize-controller:v1.4.0
  - ghcr.io/fluxcd/notification-controller:v1.4.0
  - ghcr.io/fluxcd/source-controller:v1.4.1
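
For reference, the inspection above looks roughly like this (a sketch; the package file name assumes the default naming for an amd64 build of the podinfo-flux example):

```bash
# Unpack the built package and look for layer files matching the failing blob size
zarf tools archiver decompress zarf-package-podinfo-flux-amd64.tar.zst ./unpacked
find ./unpacked -type f -size 3419706c
```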

@RothAndrew

> Has it ever happened with only one image?

I'm not sure. I feel like it definitely happens more when there are multiple images, or the images are large, or the registry it is pulling from is slow.

@AustinAbro321

Pretty sure I found the issue: Zarf was not properly deleting invalid layers from the cache when they occurred. @RothAndrew feel free to test out #3358, though either way the team will see in time whether the flake disappears.
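
Until that lands, corrupt cached layers can be detected by hand, since each blob is stored under its digest; a minimal sketch assuming the default ~/.zarf-cache/images OCI layout (not necessarily how #3358 implements the fix):

```bash
#!/usr/bin/env bash
# Remove cached image layers whose contents no longer match the digest in their filename.
# Assumes the default cache layout at ~/.zarf-cache/images/blobs/sha256.
set -euo pipefail
cache="${HOME}/.zarf-cache/images/blobs/sha256"
for blob in "${cache}"/*; do
  [ -f "${blob}" ] || continue
  expected="$(basename "${blob}")"
  actual="$(sha256sum "${blob}" | awk '{print $1}')"
  if [ "${expected}" != "${actual}" ]; then
    echo "corrupt cached layer, removing: ${expected}"
    rm -f "${blob}"
  fi
done
```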
