Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add filesystem scrubbing controller #9848

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

dsseng
Copy link
Member

@dsseng dsseng commented Nov 29, 2024

TODO

  • tests for scrub controller (needs ideas how do we avoid waiting a lot)
  • report status for monitoring of scheduled scrubs, most recent scrubs date, duration and result.

Fixes #9545

DefaultScrubPeriod = 24 * 7 * time.Hour
)

// FilesystemScrubV1Alpha1 is a filesystem scrubbing config document.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs must mention this feature is experimental and perhaps link to kernel docs about this

ctrl.schedule[mountpoint].timer.Reset(ctrl.schedule[mountpoint].period)
}

ctrl.schedule[mountpoint] = scrubSchedule{
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we need to somehow protect from user creating multiple documents referencing the same mountpoint.

runner.WithSchedulingPolicy(runner.SchedulingPolicyIdle),
)

return r.Run(func(s events.ServiceState, msg string, args ...any) {})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should have a way to cancel running scrub process?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add a channel to signal this once document is deleted. However I haven't studied whether or not it's okay to abort the process and does it terminate safely

Copy link
Member

@smira smira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as xfs_scrub is experimental, I'd say let's do it the first thing we merge in 1.10

@dsseng
Copy link
Member Author

dsseng commented Dec 5, 2024

as xfs_scrub is experimental, I'd say let's do it the first thing we merge in 1.10

Didn't we consider it in 1.9 planning? Or it wasn't expected to use experimental kernel feature?

dsseng added 11 commits December 5, 2024 18:24
Signed-off-by: Dmitry Sharshakov <[email protected]>
Signed-off-by: Dmitry Sharshakov <[email protected]>
Signed-off-by: Dmitry Sharshakov <[email protected]>
Signed-off-by: Dmitry Sharshakov <[email protected]>
Signed-off-by: Dmitry Sharshakov <[email protected]>
work around `"FilesystemScrubConfig" "v1alpha1": not registered`
Signed-off-by: Dmitry Sharshakov <[email protected]>
case <-ctx.Done():
return nil
case mountpoint := <-ctrl.c:
if err := ctrl.runScrub(mountpoint, []string{}); err != nil {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Idea: run scrub in a goroutine (still single-threaded to not run two scrub tasks in parallel) and report when it's started so we can see it's running right now from the status. Current status example (and there's no way to tell whether one for /var is running or not yet, as status is updated on completion only):
image

@dsseng
Copy link
Member Author

dsseng commented Dec 16, 2024

Other things to consider:

  • add an option to scrub on boot
  • check whether or not scrub may be aborted. If not ensure it's not and we must delay reboot/poweroff to ensure this

@@ -1277,6 +1277,38 @@ var DefaultDroppedCapabilities = map[string]struct{}{
"cap_sys_module": {},
}

// XFSScrubDroppedCapabilities is the set of capabilities to drop for xfs_scrub.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could refactor this to first list all capabilities (via libcap), and remove those we want to drop, so that if we new capabilities are introduced, they are dropped as well. (probably taking it out of machinery back to the controller)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I also want to do such a thing on the level of our interface to libcap. Actually systemd does manage such a thing, since in xfs scrubbing service there is a list of capabilities to give, not take

)

// FSScrubConfigType is type of FSScrubConfig resource.
const FSScrubConfigType = resource.Type("FSScrubConfigs.runtime.talos.dev")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should move this under block. (both resources and config)

ctrl.status = make(map[string]scrubStatus)
ctrl.c = make(chan string, 5)

for {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I almost wonder if we should split up this controller into two controllers:

  • one which reads FSScrubConfig and outputs FSScrubSchedule resources (builds a schedule, handles jitter, updates schedule, etc.) - this controller can be fully unit-tested
  • another which reads FSScrubSchedule, and based on the schedule runs the actual xfs_scrub tasks (or cancels them)

We can test time-based stuff using fake clocks, example

func TestCRIImageGC(t *testing.T) {
mockImageService := &mockImageService{}
fakeClock := clock.NewMock()
suite.Run(t, &CRIImageGCSuite{
mockImageService: mockImageService,
fakeClock: fakeClock,
DefaultSuite: ctest.DefaultSuite{
AfterSetup: func(suite *ctest.DefaultSuite) {
suite.Require().NoError(suite.Runtime().RegisterController(&runtimectrl.CRIImageGCController{
ImageServiceProvider: func() (runtimectrl.ImageServiceProvider, error) {
return mockImageService, nil
},
Clock: fakeClock,
}))
},
},
})
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sane. We already have the config controller which populates the structures looking the same as the yaml source, why not handle scheduling there, yes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, no... config controller is good, don't touch it ;)

Schedule controller will be fully testable without mocks.

Scrub controller will need a clock mock and an actual scrub mock, and it can be fully tested once again by manipulating schedules.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

xfs_scrub controller
3 participants