Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve CI pipeline and integration tests #911

Open
dklimpel opened this issue May 12, 2024 · 43 comments
Open

Improve CI pipeline and integration tests #911

dklimpel opened this issue May 12, 2024 · 43 comments
Labels

Comments

@dklimpel
Copy link
Contributor

dklimpel commented May 12, 2024

I have opened a few PR to move from Travis CI to Github and update integration tests.

The following proposal for processing the PR.

Independent of this, new feature:

Independent of this, fix docs:

  1. Fix building mkdocs #900
  2. Update docs to fix linting issues #904
  3. fix path pattern in documentation pipeline #919

bring back linter:

  1. Add github golang lint CI pipeline #899
  2. later some PR fix linting issues
  3. fix some linting failures and configure golangci-lint #922
  4. bring back linter for windows

update integration test docker images:

  1. bump alpine linux to 3.19 #907
  2. Add Debian Bookworm to integration-tests #902
  3. Replace centos7 with rockylinux9 in integration-tests #906
  4. Add Ubuntu Noble to integration-tests #908
  5. new PR to remove old images
  6. Build own docker images for integration tests #901
  7. new PR to uses new docker images from ghcr (incl. new MD5 checksums to bypass build in CI)
  8. decide when to build new docker test images

After that it is possible to move from Travis CI to Github Actions.

It may be useful to provide all CIs with a uniform wording and an optimized sequence at the end.

@aelsabbahy
Copy link
Member

aelsabbahy commented May 16, 2024

You've been busy, love it!

I'll go through these PRs and start reviewing/merging the ones with least dependencies.

Two questions regarding integration-tests:

  1. Did you ever find a fix for the systemd in docker thing?
  2. I haven't looked at the PRs yet, do they leverage the make targets?
    • If not, I want to make sure it's reproducible locally outside of CI if possible.
    • There's also act - written by a former colleague of mine, one of the best engineers I had the pleasure of working with. I've never personally used it.

@dklimpel
Copy link
Contributor Author

Did you ever find a fix for the systemd in docker thing?

The topic is very annoying. Docker is simply not made for running multiple processes and monitoring them from a separate process (systemd)- there are solutions to start multiple processes in a container. But that won't help, because the container is used to simulate an operating system. The whole thing can only be solved by calling docker run with the correct parameters.

I haven't looked at the PRs yet, do they leverage the make targets?

All these changes run with Travis CI, Github workflow and locally on my ubuntu machine (make test-all)

One issue remains open from my point of view. The test in which runlevel a service runs is not compatible with current linux versions. My knowledge of go is not great enough to investigate this.

@aelsabbahy
Copy link
Member

@dklimpel random question, are you on gophers slack by any chance?

@aelsabbahy
Copy link
Member

Between #900 and #904 is there a dependency or order to merge them. Both have failing builds currently.

After those are merged, then the long awaited journey to move off of Travis begins.

@dklimpel
Copy link
Contributor Author

Both have different failing jobs. And both fix different jobs. I would start with #900 and then merge main to #904. Then should #904 not failing anymore.

@aelsabbahy
Copy link
Member

#900 merged, #904 updated

@aelsabbahy
Copy link
Member

One issue remains open from my point of view. The test in which runlevel a service runs is not compatible with current linux versions. My knowledge of go is not great enough to investigate this.

Which Linux (branch) is failing, I can checkout that brance and investigate next week. Wonder if it's a Goss bug, or working as intended.

@dklimpel
Copy link
Contributor Author

dklimpel commented May 18, 2024

Which Linux (branch) is failing, I can checkout that brance and investigate next week. Wonder if it's a Goss bug, or working as intended.

It is a problem with Debian and Ubuntu:

Also on my local Ubuntu machine.

@aelsabbahy
Copy link
Member

Okay, thanks. I'll check them both out.

Which PRs are next for documentation?

#919?

@dklimpel
Copy link
Contributor Author

I have tried to put it into a sorted list above.

#904 fix the linting issues. After that there should be a valid documentation. #919 fixing for upcoming changes that the documentation pipeline becomes triggered.

@dklimpel
Copy link
Contributor Author

GitHub creates a workflow when push to master automatically.
https://github.com/goss-org/goss/actions/workflows/pages/pages-build-deployment

IMHO this is failing because this is a default job with Jekyll theme. Probably this can be disabled in project settings. https://github.com/goos-org/goss/settings/pages - Change to source "GitHub Action". Readthedocs should not need this, because it is working by triggers, I think. GitHub pages is not used here.

@aelsabbahy
Copy link
Member

aelsabbahy commented May 19, 2024

Changed, I guess next PR to be merged will validate this?

Also, please continue to use this issue to let me know the next PR in the chain. I find it a lot easier to track on here.

This is an amazing level of work by the way, much appreciated. It's something I've wanted for a long time. Unfortunately, due to limited time I never got around to it, my focus tends to be:

  1. Bugs/security findings
  2. Features
  3. Everything else (refactor, CI, etc..)

@dklimpel
Copy link
Contributor Author

Changed, I guess next PR to be merged will validate this?

Yes, it is.

My suggestion for the next steps.

@dklimpel
Copy link
Contributor Author

@dklimpel random question, are you on gophers slack by any chance?

Unfortunately not.

@aelsabbahy
Copy link
Member

aelsabbahy commented May 25, 2024

Hey @dklimpel , if you don't mind. Let me know on here the next PR that's ready and I'll review.

This is an awesome amount of work you put it, it's greatly appreciated!

@dklimpel
Copy link
Contributor Author

To improve the code:

@aelsabbahy
Copy link
Member

Holding off on merging more PRs until Travis oss credits are replenished.

Don't want CI to get in a broken state with unclear traceability on what caused it.

This is an awesome amount of work. Can't thank you enough for taking the time to do this.

@dklimpel
Copy link
Contributor Author

Holding off on merging more PRs until Travis oss credits are replenished.

Ok.

@dklimpel
Copy link
Contributor Author

I think #928 can help. This enables unit tests with GitHub Pipeline.

@aelsabbahy
Copy link
Member

Using this as a central coordination point. What PRs are ready for merge, so I can start going through them.

@dklimpel
Copy link
Contributor Author

I would recommend finalizing a few topics before we take a look at the docker images.

@aelsabbahy
Copy link
Member

aelsabbahy commented Jul 8, 2024

We are unable to start your build at this time. You exceeded the number of users allowed for your plan. Please review your plan details and follow the steps to resolution.

Dealing with more travis-ci issues, waiting on support. Figured it's relevant to this issue given the work-effort here =)

@dklimpel
Copy link
Contributor Author

dklimpel commented Jul 8, 2024

If you would like to run the integration tests with GHA:

@dklimpel
Copy link
Contributor Author

dklimpel commented Jul 9, 2024

I have added a replacement of CentOS in:

The creation of binarries for tags / releases is included there:

IMHO this would be the last step to replace the functionalities of travis CI.

After that, there will certainly be some clean-up work to do, such as renaming variables like "$TRAVIS_TAG".

@aelsabbahy
Copy link
Member

Travis-ci seems to be failing on rockylinux9 build, but I can't figure out the reason to be honest.

Last good build: https://app.travis-ci.com/github/goss-org/goss/builds/271468694?serverType=git

failing build: https://app.travis-ci.com/github/goss-org/goss/builds/271478309?serverType=git

The git commit sha is the same, do you know of anything else that could have changed on the CI side between a regular build vs a release build?

Also, I noticed this issue with GHA on releases: https://github.com/goss-org/goss/actions/runs/9982852891

Just putting all issues I've encountered so far in one comment.

@dklimpel
Copy link
Contributor Author

Travis-ci seems to be failing on rockylinux9 build, but I can't figure out the reason to be honest.

There is something going wrong with the user agent.

HTTP: http://httpbin/headers: Body:
Expected
    "object: *http.cancelTimerBody"
to have patterns
    ["\"Foo\": \"bar\"","\"User-Agent\": \"goss/0.0.0\""]
the missing elements were
    ["\"User-Agent\": \"goss/0.0.0\""]

I can take a look deeper the next days.

@dklimpel
Copy link
Contributor Author

failing build: https://app.travis-ci.com/github/goss-org/goss/builds/271478309?serverType=git

It is flaky or a problem with Travis.
I had run a current build in #950 and it was successfully: https://app.travis-ci.com/github/goss-org/goss/jobs/624212980

@aelsabbahy
Copy link
Member

Odd, I reran the failed test like 5 times and the same error every time. Let me try pushing a new tag.. I wonder if travis has some logic to re-use the same flaky instance on retries.

@aelsabbahy
Copy link
Member

Oh, I see what the issue is! it's not travis. The test is written in a way that assumes that goss is always a dev build aka version 0.0.0, yet when a real build happens it fails due to it having a real version.

@dklimpel
Copy link
Contributor Author

The test is written in a way that assumes that goss is always a dev build aka version 0.0.0, yet when a real build happens it fails due to it having a real version.

Good catch!

@aelsabbahy
Copy link
Member

Ok, so I'm slightly confused on this one.. since it was goss v0.4.7 that had the user-agent change.

I'm guessing something between the two versions changed where we no longer test against a CI build but rather the release build. Overall, this is a great change since we should be testing what we're releasing, just wondering if you know where this change was made.. didn't think we made much changes to travis-ci 🤷

@dklimpel
Copy link
Contributor Author

just wondering if you know where this change was made.. didn't think we made much changes to travis-ci 🤷

I was also confused and do not know what the change was. There was no change at the code and no change to Travis.

@dklimpel
Copy link
Contributor Author

Possible next steps:

@dklimpel
Copy link
Contributor Author

@aelsabbahy

What is the best way to proceed with the new rockylinux image? Updating the apache version every week is certainly not a good option.

  • Deactivate the rockylinux test until the image is uploaded static in a repo
  • Check the version number of apache with regex, if possible
  • Deactivate the version check temporarily for the image
  • Other solutions?

@aelsabbahy
Copy link
Member

Other solutions?

I upload the latest version. Does the script for building images work for rocky Linux out the box or do I need to tweak it.

Also, does GHA support manual actions that are only triggered by a user, or do they always start from a git event?

@dklimpel
Copy link
Contributor Author

I upload the latest version. Does the script for building images work for rocky Linux out the box or do I need to tweak it.

This scripts work out of the box:

build-images:
	$(info INFO: Starting build $@)
	development/build_images.sh

push-images:
	$(info INFO: Starting build $@)
	development/push_images.sh

But there is no condition for a specific image. All or nothing. And not all images are ready for building. See there: https://github.com/goss-org/goss/pull/901/files

Also, does GHA support manual actions that are only triggered by a user, or do they always start from a git event?
I have updated #901 so that it only runs manually and not automatically.

@aelsabbahy
Copy link
Member

Looping back here to coordinate outstanding PRs.

Any particular order they need to be reviewed in. Feels like we're almost at the cut-over point.

@aelsabbahy
Copy link
Member

Also, FYI: https://github.com/goss-org/goss/actions/runs/10620900695

Seems I'm missing some permissions somewhere. Did I miss a step you asked me to do?

@dklimpel
Copy link
Contributor Author

Also, FYI: https://github.com/goss-org/goss/actions/runs/10620900695

Seems I'm missing some permissions somewhere. Did I miss a step you asked me to do?

I guess that's the difference between a personal GH account and an organisation. Similar to: #909 (comment)

The package was created. I suspect that the repo has no permissions there: https://github.com/goss-org/goss/pkgs/container/trusty

@dklimpel
Copy link
Contributor Author

Any particular order they need to be reviewed in. Feels like we're almost at the cut-over point.

I think from a pure CI point of view, the results of Travis and GH are consistent. With the exception of the failures of Travis in the Windows environment.

From this point of view, I think Travis should be disabled and a release with GH should be realised.

If this works, it can be finish the topic use docker images from ghcr.

@aelsabbahy
Copy link
Member

I think we're really close, I see three succeeding and three failed here: https://github.com/goss-org/goss/actions/runs/10816944669

@aelsabbahy
Copy link
Member

Hello, this has been a major and great contribution to goss, may thanks for doing this. CI story is so much better now.

Following up on this issue, aside from the https://github.com/goss-org/goss/actions/workflows/docker-integration-tests.yaml job not working is there anything remaining here?

Also, any thoughts on how to fix that job?

@dklimpel
Copy link
Contributor Author

Unfortunately i don't have that much time at the moment

Short view:

  • CentOS 7 is EoL. The repositories no longer exist. Therefore this runs on a error. It will no longer be possible to build the image locally. It would have to be replaced. The idea was to replace this with rockylinux and remove centos.
    • The solution for here: Remove CentOS from repo or add an ignore to this pipeline for this image.
  • For Alpine3.19, I think the problem was that the version of apache2 2.4.59 was no longer available. It is pinned in the image, because goss checks the version in the test. I don't think any regex can do the version check. So the apache version must be available at the time of the first creation of the image for the repo. Currently 2.4.62-r0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants