Secure the public Jenkins server #2108

Closed
smlambert opened this issue Apr 6, 2021 · 56 comments
Labels: Machine Request, secure-dev (Issues specific to SSDF/SLSA compliance work), security

Comments

@smlambert
Contributor

smlambert commented Apr 6, 2021

This issue is to encapsulate the requirements and work needed to replace our existing Jenkins server with a new one:

The existing Jenkins server is suffering from some instability. Recent difficult updates have shown that we need a better disaster recovery plan (this is also related to: #1295). It is also an Ubuntu 16.04 machine. For all of these reasons, and likely reasons not yet listed, we should begin the work to replace the existing Jenkins server with a new one.

Requirements:

  • Similar requirements to existing server
  • Ansible playbook for creation/spinup
  • Staging server to try out upgrades and major features or changes
  • Clearly documented process for deploying updates
  • Back up / disaster recovery process (in place and tested) / training for multiple people in multiple timezones to be able to assist
  • Policy & documentation for how many jobs to keep in history, how many days to keep jobs
  • How many of them are required: 1

Please explain what this machine is needed for:

  • nightly/weekly/release builds and testing
  • Grinders for debugging and triage
  • building tools and dependencies
  • building Docker images
  • building installers

Considerations:

  • Given that this server is for our production builds, should we limit freeform builds, personal duplicates of existing pipelines, etc.?
  • Should we have a separate sandbox for 'experimental' jobs and work?
  • We should have a script that checks for jobs that have not been run/used in X months to remove old/stale jobs (may not need if all jobs have policy for how many days to keep jobs)
  • Consider pushing artifacts to artifactory or other storage so that they are available for longer than
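
The stale-job check suggested in the considerations above could be sketched roughly as follows. This is only an illustration: `find_stale_jobs` is a hypothetical helper, "X months" is approximated as a number of days, and the input is assumed to be job data in the shape returned by the Jenkins JSON API (for example `/api/json?tree=jobs[name,lastBuild[timestamp]]`), fetched separately.

```python
from datetime import datetime, timedelta, timezone


def find_stale_jobs(jobs, now, max_age_days):
    """Return the names of jobs with no build newer than max_age_days.

    `jobs` is assumed to be a list of dicts such as
    {"name": "...", "lastBuild": {"timestamp": 1617235200000}} or
    {"name": "...", "lastBuild": None} for a job that has never run.
    """
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for job in jobs:
        last_build = job.get("lastBuild")
        if last_build is None:
            # Never built: treat as stale so it gets reviewed.
            stale.append(job["name"])
            continue
        # Jenkins reports build timestamps in milliseconds since the epoch.
        built_at = datetime.fromtimestamp(
            last_build["timestamp"] / 1000, tz=timezone.utc
        )
        if built_at < cutoff:
            stale.append(job["name"])
    return stale
```

The function only reports candidates; whether to delete, disable or archive them would still be a policy decision.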
@karianna
Contributor

karianna commented Apr 6, 2021

I'd personally like to see this bootstrapped via ansible and that we also build a staging server to test out any major pipeline changes, jenkins upgrades, and so forth.

@smlambert
Contributor Author

re: #2108 (comment) - added to the requirements section

@sxa
Member

sxa commented Apr 7, 2021

@karianna Since you've been pushing for this and suggesting requirements, are you able to take ownership of this item and progress it?

@karianna karianna self-assigned this Apr 7, 2021
@karianna karianna added this to the April 2021 milestone Apr 7, 2021
@karianna
Contributor

karianna commented Apr 7, 2021

> @karianna Since you've been pushing for this and suggesting requirements, are you able to take ownership of this item and progress it?

I can own it, but I'll be pulling in folks who actually know how to ansible :-). Typical Engineering manager eh ;-)

@karianna
Contributor

karianna commented Apr 19, 2021

https://github.com/jenkins-x/jx is something to explore

@sxa
Member

sxa commented Apr 23, 2021

I also think that for security we should disallow jobs from running on the master node unless there's a clear and explicit need for it.

@karianna
Contributor

In addition to Shelley's requirements, I'm also going to add VPN security - this Jenkins should no longer be accessible on the public internet.

@sxa
Member

sxa commented Apr 23, 2021

> In addition to Shelley's requirements, I'm also going to add VPN security - this Jenkins should no longer be accessible on the public internet.

To be absolutely clear here - that statement is about non-HTTPS ports, right?
(I agree and was thinking the same, however we also need incoming ports enabled for several of our machines to connect)

@karianna
Contributor

After further investigation, jenkins-x is not a great option, as its pipeline implementation and syntax are not compatible with regular Jenkins, forcing us to have a massive rewrite (and forcing jenkins-x on other vendors using our scripts).

@karianna
Contributor

By using Ansible we can be fairly hosting provider agnostic. But we do need a provider that hosts a VPN easily and has sufficient disk storage (3-4TB) at a decent price point.

@karianna
Contributor

karianna commented Apr 28, 2021

We'll prototype:

  • Ansible Playbook for Jenkins with a Jenkins LTS Docker Image as a base starting point.
  • Deploying to Hetzner Cloud in a Docker Container
  • Apply Networking, Firewalling and OpenVPN for security

@sxa
Member

sxa commented May 14, 2021

As discussed elsewhere, while the full set of items described in the prototype comment above has now been put on pause, we should as a priority look at upgrading the OS underneath the jenkins server (ideally to 20.04), after taking a Hetzner snapshot to avoid any problems with the OS upgrade.

@Haroon-Khel Haroon-Khel modified the milestones: April 2021, May 2021 May 18, 2021
@sxa
Member

sxa commented May 20, 2021

Eclipse WorkGroup finance approval acquired for the snapshot costs

@sxa
Member

sxa commented Mar 30, 2023

Jenkins now upgraded to the latest LTS version (2.387.1) and all plugins updated to latest other than SAML and SSH Build Agents which have warning messages associated with them so may require remedial action.

@sxa
Member

sxa commented Apr 3, 2023

Will investigate https://plugins.jenkins.io/job-restrictions/ as a way of locking down where build jobs can be executed.

This has now been installed alongside last week's jenkins upgrade.

@sxa
Member

sxa commented Apr 3, 2023

Now all removed other than bethgriggs on the release notes job, and an explicit entry for anonymous on build-pipeline-generator.

@sxa sxa pinned this issue Apr 4, 2023
@sxa
Member

sxa commented Apr 5, 2023

@karianna Are you still planning to own / interested in pursuing moving the jenkins server setup/config to configuration-as-code with ansible as per the earlier comment?

@sxa
Member

sxa commented Apr 5, 2023

@karianna I'm also going to suggest we schedule a maintenance window on the first Tuesday of each month to perform the jenkins plugin updates and Jenkins LTS updates that are required (and also use that day for patching other infrastructure servers, although jenkins should always be the priority). This should generally give us two clear weeks after the upstream openjdk releases. If more time is required due to release issues we can move out to the following Tuesday. Jenkins patch levels for LTS releases will come out every four weeks so we may have a delay in picking up new ones. We should monitor the types of fixes going in and see if delays caused by doing it on the first Tuesday are likely to be problematic from a security perspective.

We currently have quarterly listed in the infra readme (although we haven't been adhering to that) but I think that doc should be revised to be monthly. How does that sound?
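
For reference, the proposed window (first Tuesday of the month, slipping by a week if a release runs long) can be computed as below. This is a small sketch only; `first_tuesday` and `maintenance_day` are hypothetical names, not anything that exists in the infra repository.

```python
import calendar
from datetime import date, timedelta


def first_tuesday(year, month):
    """Return the date of the first Tuesday in the given month."""
    first = date(year, month, 1)
    # Days to add to reach the next Tuesday (0 if the 1st is a Tuesday).
    offset = (calendar.TUESDAY - first.weekday()) % 7
    return first + timedelta(days=offset)


def maintenance_day(year, month, delay_weeks=0):
    """First Tuesday, optionally pushed out by whole weeks for release issues."""
    return first_tuesday(year, month) + timedelta(weeks=delay_weeks)
```

For example, `maintenance_day(2023, 4)` gives 4 April 2023, and `maintenance_day(2023, 4, delay_weeks=1)` gives the following Tuesday, 11 April 2023.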

Security updates to the OS are performed via a cron job on Sunday at 5am and recorded in /var/log/apt-security-updates. We should look at storing the apt-security.sh script that is run from cron in github and deploy it onto other critical infrastructure servers. I will also note that at the moment we do not have a regular policy for restarting machines to pick up kernel updates.
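
On the restart question, a check along these lines could feed a report or alert. It is a sketch under an assumption: that the servers are Debian/Ubuntu systems where the package tooling typically creates `/var/run/reboot-required` (and `/var/run/reboot-required.pkgs`) when an installed update needs a reboot. The function names are hypothetical.

```python
from pathlib import Path

# Assumed Debian/Ubuntu convention; other distributions differ.
REBOOT_FLAG = Path("/var/run/reboot-required")
REBOOT_PKGS = Path("/var/run/reboot-required.pkgs")


def reboot_required(flag_file=REBOOT_FLAG):
    """Return True if the reboot-required flag file exists."""
    return Path(flag_file).exists()


def packages_requiring_reboot(pkgs_file=REBOOT_PKGS):
    """Return the package names listed as needing a reboot, if any."""
    path = Path(pkgs_file)
    if not path.exists():
        return []
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]
```

This only detects that a restart is pending; when and how to restart would still need to be agreed as policy.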

@sxa sxa modified the milestones: 2023-03 (March), 2023-04 (April) Apr 5, 2023
@karianna
Contributor

karianna commented Apr 6, 2023

> @karianna Are you still planning to own / interested in pursuing moving the jenkins server setup/config to configuration-as-code with ansible as per the earlier comment?

I ungracefully remove myself from owning that - I'll never get to it I'm afraid.

@sxa
Member

sxa commented Apr 13, 2023

Thinbackup plugin has been adjusted to do a full backup on the first of the month and incrementals for the rest of the days as per #1295 (comment)
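
With a monthly full plus daily incrementals, a restore point is the newest full backup together with whatever was taken after it. The sketch below picks those out of a list of backup directory names. It assumes names of the form `FULL-YYYY-MM-DD_HH-MM` (as in the backup path quoted later in this thread) and, as a further assumption, a `DIFF-` prefix for the incrementals; `parse_backup_name` and `restore_candidates` are hypothetical helpers.

```python
from datetime import datetime


def parse_backup_name(name):
    """Return (kind, timestamp) for a backup directory name, else None."""
    kind, _, stamp = name.partition("-")
    if kind not in ("FULL", "DIFF"):
        return None
    try:
        return kind, datetime.strptime(stamp, "%Y-%m-%d_%H-%M")
    except ValueError:
        return None


def restore_candidates(names):
    """Return (latest_full, later_diffs) from a list of directory names."""
    fulls, diffs = [], []
    for name in names:
        parsed = parse_backup_name(name)
        if parsed is None:
            continue  # not a backup directory
        kind, taken_at = parsed
        (fulls if kind == "FULL" else diffs).append((taken_at, name))
    if not fulls:
        return None, []
    full_time, full_name = max(fulls)
    # Only incrementals taken after the chosen full backup are relevant.
    later = sorted(item for item in diffs if item[0] > full_time)
    return full_name, [name for _, name in later]
```

In practice the names would come from listing the configured backup directory.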

@sxa
Member

sxa commented Apr 13, 2023

On the basis that the original concerns about stability appear to no longer be valid, and this issue is titled as a security one, I'm going to revisit the criteria for closing this issue. A backup server would be desirable (or even a way to spin up a docker container with a comparable configuration for local testing), although we would generally not want it to be identical, as we wouldn't want it to interfere with all of the production machines. I will create a separate issue from this one to determine feasibility and implementation decisions relating to that for the future.

@sxa
Member

sxa commented Apr 14, 2023

Backup verification is going to take a little longer. There are some issues with the restore not quite recovering everything - possibly because the thinbackups are aborting part way through. Will continue to progress it, but the thinbackup does contain enough to recover (re-download) the plugins, although when recovering to another server the GitHub authentication plugin won't work (and actively prevents the server from restarting so you have to remove it from the config.xml)

@sxa
Member

sxa commented Apr 17, 2023

With a successful backup taken (/home/jenkins/.jenkins/backups/FULL-2023-04-14_19-17) I was able to restore to a new machine and start up the server successfully. Before restarting I removed the nodes directory, to avoid it connecting to the same slaves as the real server, and also removed the <securityRealm class="org.jenkinsci.plugins.GithubSecurityRealm"> section from the config.xml (which prevents GitHub logins). Without that removal it'll go through the authentication in a way that logs you into the production jenkins server at ci.adoptium.net.

Alternatively:

  1. Start up jenkins. Don't install any plugins or add any extra users
  2. Once it's up, install the thinbackup plugin (jenkins won't give you the option for thinbackup when it initially prompts for a plugin list)
  3. Restore master.key and hudson.util.Secret under ~/.jenkins/secrets after first startup before you restore anything. (Note: secret.key is almost certainly not required, but could be restored too)
  4. Restart jenkins (or click the restart checkbox on the plugin upgrade screen). If you've forgotten the password for the admin user, it's in ~jenkins/secrets/initialAdminPassword
  5. Go to Manage Jenkins -> ThinBackup -> Settings to set the backup dir, then restore the backup you want, including the options to restore plugins and last build numbers [*]
  6. (If restoring to alternate server) Get your own application token from https://github.com/settings/applications/new
  7. Put the application ID and secret into the <clientId> and <clientSecret> sections of the <securityRealm> section (Secret can be put in unencrypted - jenkins will encrypt it and rewrite config.xml on next startup - or shortly after? - I didn't check...). If you don't do this jenkins won't start up properly, failing with com.thoughtworks.xstream.mapper.CannotResolveClassException: hudson.security.ProjectMatrixAuthorizationStrategy
  8. (If restoring to alternate server) REMOVE THE ~/.jenkins/nodes DIRECTORY to stop the
  9. Restart the jenkins server (Reload configuration from disk probably won't be adequate) and you should be ready to go! Don't worry about any "Failed to load Owner" messages - they come from referenced job runs that aren't in the backup.
  10. If you want to verify that the secrets are working correctly:
  • Go to /computer and add the worker label to the built-in node, set it to have at least one executor, and set it to "Use this node as much as possible" under the Usage drop down (See screenshot below)
  • Try running one of the nightlyBuildAndTestStats jobs pointing to a different slack channel. #infrastructure-bot is a good choice, as that will validate that the encrypted slack token from the backup works.

[image: screenshot of the built-in node settings referenced in step 10]

[*] - For step 5, note: while you can click the "restore plugins" option, that will download them all again from the internet, which isn't needed if they are included in your backup. If you do make it download and you have them in your backup, then the log will get NoSuchFileException messages from hpi plugin backups, although it will likely still work ok.

Note that during the final startup I got a lot of WARNING o.j.p.w.flow.FlowExecutionList$1#computeNext: Failed to load Owner messages in the console log but the jobs still seem present and correct.
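
Two of the manual steps above (checking that the secrets from step 3 are in place, and removing the nodes directory from step 8) lend themselves to a quick pre-start check. The following is a sketch only, with hypothetical function names. It assumes the Jenkins home layout described in this comment and deliberately leaves config.xml alone, since editing the `<securityRealm>` section is easier to get right by hand.

```python
import shutil
from pathlib import Path

# Files from step 3 that must be restored under <jenkins home>/secrets.
REQUIRED_SECRETS = ("master.key", "hudson.util.Secret")


def missing_secrets(jenkins_home):
    """Return the required secret files that are absent from the restore."""
    secrets = Path(jenkins_home) / "secrets"
    return [name for name in REQUIRED_SECRETS if not (secrets / name).is_file()]


def remove_nodes_directory(jenkins_home):
    """Delete <jenkins home>/nodes (step 8); return True if it was present."""
    nodes = Path(jenkins_home) / "nodes"
    if nodes.is_dir():
        shutil.rmtree(nodes)
        return True
    return False
```

This should only ever be pointed at the restored copy (for example a test machine's `~/.jenkins`), never at the production Jenkins home.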

@sxa
Member

sxa commented Apr 17, 2023

The creation of the backups appears to be failing when using the (NFS-mounted) backup file system on the server. After about 8 minutes it gives this message:

SEVERE h.i.i.InstallUncaughtExceptionHandler$DefaultUncaughtExceptionHandler#uncaughtException: A thread (ThinBackup Worker Thread thread/1018519) died unexpectedly due to an uncaught exception, this may leave your Jenkins in a bad way and is usually indicative of a bug in the code

In the failing case it only gives the first four "found x jobs in /home/... to back up" messages, of about 60 or so. If I set thinbackup to use a local directory instead, it seems to complete in around seven minutes (backups are about 17GB, compressing to just under 2GB).

So in order to fix this and produce useful backups we'll need to resolve the performance issue on the backup drive, or backup locally and have something else move them onto the backup drive separately.
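
The second option (back up locally, then have something else move the result) might look roughly like the sketch below. It packs a finished local backup directory into a single compressed archive and moves that one file to the backup drive, so the slow file system sees one large sequential write instead of many small ones. `archive_and_ship` is a hypothetical helper and all three paths are placeholders to be supplied by whoever runs it.

```python
import shutil
import tarfile
from pathlib import Path


def archive_and_ship(backup_dir, staging_dir, destination_dir):
    """Compress backup_dir locally, then move the archive to destination_dir.

    Returns the final path of the archive on the destination.
    """
    backup_dir = Path(backup_dir)
    staging_dir = Path(staging_dir)
    destination_dir = Path(destination_dir)
    staging_dir.mkdir(parents=True, exist_ok=True)
    destination_dir.mkdir(parents=True, exist_ok=True)

    # Build the archive on local disk first.
    archive = staging_dir / f"{backup_dir.name}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(backup_dir, arcname=backup_dir.name)

    # shutil.move copies across file systems and then removes the source.
    final = destination_dir / archive.name
    shutil.move(str(archive), str(final))
    return final
```

Something like this could run from cron after the ThinBackup window, but it would need its own error handling and retention before being relied on.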

FYI @karianna

@karianna
Contributor

Hmm, annoying that we don't see the true underlying cause. I wonder if we can test writing a large file from the O/S.
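
A large-file write test of that kind could be as simple as the sketch below, which writes a temporary file into a given directory, forces it to disk, and reports throughput. The function name, file name and default sizes are arbitrary choices for illustration; running it against both the local disk and the backup mount would give a rough comparison.

```python
import os
import time
from pathlib import Path


def measure_write_throughput(directory, size_mb=1024, chunk_mb=4):
    """Write about size_mb of data into directory and return MB per second."""
    target = Path(directory) / "write-test.tmp"
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    written = 0
    start = time.perf_counter()
    try:
        with open(target, "wb") as fh:
            while written < size_mb:
                fh.write(chunk)
                written += chunk_mb
            # Make sure the data has actually reached the file system.
            fh.flush()
            os.fsync(fh.fileno())
        elapsed = time.perf_counter() - start
    finally:
        # Always clean up the temporary file.
        target.unlink(missing_ok=True)
    return written / elapsed
```

Note that this measures one large sequential write, so it would not by itself reproduce a slowdown caused by creating many small files.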

@sxa
Member

sxa commented Apr 25, 2023

> Hmm, annoying that we don't see the true underlying cause. I wonder if we can test writing a large file from the O/S

The (NFS IIRC) filesystem is noticeably slow so that can definitely be reproduced outside Jenkins. It's writing the individual files directly to the filesystem as opposed to a zip file, so there are a lot of individual random writes to open lots of files, which won't be helping.

@sxa
Member

sxa commented Apr 27, 2023

I'm going to close this now. Remaining action items that are explicitly planned can be done in the separate items mentioned elsewhere in this issue but we now have policies in place for keeping the server up to date.

7 participants