feat: add filesystem scrubbing controller #9848
base: main
Conversation
DefaultScrubPeriod = 24 * 7 * time.Hour
)

// FilesystemScrubV1Alpha1 is a filesystem scrubbing config document.
The docs must mention that this feature is experimental, and perhaps link to the kernel documentation about it.
ctrl.schedule[mountpoint].timer.Reset(ctrl.schedule[mountpoint].period)
}

ctrl.schedule[mountpoint] = scrubSchedule{
I guess we need to somehow protect against the user creating multiple documents that reference the same mountpoint.
runner.WithSchedulingPolicy(runner.SchedulingPolicyIdle),
)

return r.Run(func(s events.ServiceState, msg string, args ...any) {})
I wonder if we should have a way to cancel a running scrub process?
We can add a channel to signal this once the document is deleted. However, I haven't studied whether it's okay to abort the process, or whether it terminates safely.
As xfs_scrub is experimental, I'd say let's make it the first thing we merge in 1.10.
Didn't we consider it in 1.9 planning? Or was it not expected to use an experimental kernel feature?
Signed-off-by: Dmitry Sharshakov <[email protected]>
work around `"FilesystemScrubConfig" "v1alpha1": not registered`
Signed-off-by: Dmitry Sharshakov <[email protected]>
case <-ctx.Done():
	return nil
case mountpoint := <-ctrl.c:
	if err := ctrl.runScrub(mountpoint, []string{}); err != nil {
Idea: run the scrub in a goroutine (still single-threaded, so two scrub tasks never run in parallel) and report when it has started, so the status shows it's running right now. In the current status example there is no way to tell whether the scrub for /var is running or not, since the status is updated only on completion:
Other things to consider:
@@ -1277,6 +1277,38 @@ var DefaultDroppedCapabilities = map[string]struct{}{
	"cap_sys_module": {},
}

// XFSScrubDroppedCapabilities is the set of capabilities to drop for xfs_scrub.
I wonder if we could refactor this to first list all capabilities (via libcap) and then remove those we want to drop, so that if new capabilities are introduced, they are dropped as well (probably moving this out of machinery and back into the controller).
Yes, I also want to do something like that at the level of our interface to libcap. systemd actually manages this already: its xfs_scrub service unit lists the capabilities to grant, not the ones to take away.
)

// FSScrubConfigType is the type of the FSScrubConfig resource.
const FSScrubConfigType = resource.Type("FSScrubConfigs.runtime.talos.dev")
I wonder if we should move this under block (both the resources and the config).
ctrl.status = make(map[string]scrubStatus)
ctrl.c = make(chan string, 5)

for {
I almost wonder if we should split this controller into two controllers:
- one which reads FSScrubConfig and outputs FSScrubSchedule resources (builds a schedule, handles jitter, updates the schedule, etc.); this controller can be fully unit-tested
- another which reads FSScrubSchedule and, based on the schedule, runs the actual xfs_scrub tasks (or cancels them)

We can test time-based stuff using fake clocks, for example:
func TestCRIImageGC(t *testing.T) {
	mockImageService := &mockImageService{}
	fakeClock := clock.NewMock()

	suite.Run(t, &CRIImageGCSuite{
		mockImageService: mockImageService,
		fakeClock:        fakeClock,

		DefaultSuite: ctest.DefaultSuite{
			AfterSetup: func(suite *ctest.DefaultSuite) {
				suite.Require().NoError(suite.Runtime().RegisterController(&runtimectrl.CRIImageGCController{
					ImageServiceProvider: func() (runtimectrl.ImageServiceProvider, error) {
						return mockImageService, nil
					},
					Clock: fakeClock,
				}))
			},
		},
	})
}
Yes, that's sane. We already have the config controller, which populates structures mirroring the YAML source, so why not handle scheduling there.
No, no... the config controller is good, don't touch it ;)
The schedule controller will be fully testable without mocks.
The scrub controller will need a clock mock and an actual scrub mock, and it can likewise be fully tested by manipulating schedules.
TODO
Fixes #9545