feat: add filesystem scrubbing controller #9848

dsseng · 2024-11-29T20:05:19Z

TODO

tests for scrub controller (need ideas on how to avoid long waits)
report status for monitoring scheduled scrubs: date, duration, and result of the most recent scrub.

dsseng · 2024-12-05T09:04:15Z

pkg/machinery/config/types/runtime/fs_scrub.go

+	DefaultScrubPeriod = 24 * 7 * time.Hour
+)
+
+// FilesystemScrubV1Alpha1 is a filesystem scrubbing config document.


Docs must mention that this feature is experimental, and perhaps link to the kernel docs about it.

dsseng · 2024-12-05T16:08:41Z

internal/app/machined/pkg/controllers/runtime/fs_scrub.go

+				ctrl.schedule[mountpoint].timer.Reset(ctrl.schedule[mountpoint].period)
+			}
+
+			ctrl.schedule[mountpoint] = scrubSchedule{


I guess we need to somehow protect against a user creating multiple documents that reference the same mountpoint.
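One way to guard against this is a validation pass that rejects the configuration when two documents name the same mountpoint. The following is only a sketch: `scrubDoc` and `duplicateMountpoints` are hypothetical names, not types from this PR, and the check assumes that comparing cleaned paths is sufficient.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
)

// scrubDoc is a hypothetical, simplified stand-in for a scrub config document.
type scrubDoc struct {
	Name       string
	Mountpoint string
}

// duplicateMountpoints returns the mountpoints referenced by more than one
// document, sorted so that the result is deterministic.
func duplicateMountpoints(docs []scrubDoc) []string {
	seen := make(map[string]int, len(docs))

	for _, doc := range docs {
		// Clean the path so that "/var" and "/var/" count as the same mountpoint.
		seen[filepath.Clean(doc.Mountpoint)]++
	}

	var dups []string

	for mountpoint, count := range seen {
		if count > 1 {
			dups = append(dups, mountpoint)
		}
	}

	sort.Strings(dups)

	return dups
}

func main() {
	docs := []scrubDoc{
		{Name: "var", Mountpoint: "/var"},
		{Name: "var-again", Mountpoint: "/var/"},
		{Name: "data", Mountpoint: "/data"},
	}

	if dups := duplicateMountpoints(docs); len(dups) > 0 {
		fmt.Println("rejecting config, duplicate mountpoints:", dups)
	}
}
```

Rejecting at validation time keeps the controller's `schedule` map free of conflicting entries, instead of letting the last document silently win.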

smira · 2024-12-05T16:12:11Z

internal/app/machined/pkg/controllers/runtime/fs_scrub.go

+		runner.WithSchedulingPolicy(runner.SchedulingPolicyIdle),
+	)
+
+	return r.Run(func(s events.ServiceState, msg string, args ...any) {})


I wonder if we should have a way to cancel running scrub process?

dsseng

We can add a channel to signal this once the document is deleted. However, I haven't studied whether it's okay to abort the process or whether it terminates safely.

smira

as xfs_scrub is experimental, I'd say let's do it the first thing we merge in 1.10

dsseng · 2024-12-05T16:20:43Z

as xfs_scrub is experimental, I'd say let's do it the first thing we merge in 1.10

Didn't we consider it during 1.9 planning? Or was it not expected to rely on an experimental kernel feature?

dsseng · 2024-12-05T17:27:27Z

internal/app/machined/pkg/controllers/runtime/fs_scrub.go

+		case <-ctx.Done():
+			return nil
+		case mountpoint := <-ctrl.c:
+			if err := ctrl.runScrub(mountpoint, []string{}); err != nil {


Idea: run the scrub in a goroutine (still single-threaded, so that two scrub tasks never run in parallel) and report when it starts, so the status shows that it is running right now. Current status example (there's no way to tell whether the scrub for /var is running yet, as the status is only updated on completion):
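The idea could be sketched roughly as follows: a single worker goroutine drains the queue, so scrubs stay sequential, and it records a running state before the scrub starts. All names here (`statusTracker`, `scrubState`, `worker`) are hypothetical; in the controller the tracker would presumably be replaced by updates to the status resource.

```go
package main

import (
	"fmt"
	"sync"
)

type scrubState string

const (
	stateRunning scrubState = "running"
	stateDone    scrubState = "done"
	stateFailed  scrubState = "failed"
)

// statusTracker holds the per-mountpoint state; a hypothetical stand-in
// for the status resource the controller would write.
type statusTracker struct {
	mu     sync.Mutex
	states map[string]scrubState
}

func (t *statusTracker) set(mountpoint string, state scrubState) {
	t.mu.Lock()
	defer t.mu.Unlock()

	t.states[mountpoint] = state
}

func (t *statusTracker) get(mountpoint string) scrubState {
	t.mu.Lock()
	defer t.mu.Unlock()

	return t.states[mountpoint]
}

// worker drains the queue sequentially, so at most one scrub runs at a time.
// The state is set to running before the scrub starts, not only on completion.
func worker(queue <-chan string, tracker *statusTracker, scrub func(string) error) {
	for mountpoint := range queue {
		tracker.set(mountpoint, stateRunning)

		if err := scrub(mountpoint); err != nil {
			tracker.set(mountpoint, stateFailed)

			continue
		}

		tracker.set(mountpoint, stateDone)
	}
}

func main() {
	tracker := &statusTracker{states: map[string]scrubState{}}
	queue := make(chan string, 5)
	started := make(chan struct{})
	release := make(chan struct{})
	finished := make(chan struct{})

	go func() {
		defer close(finished)

		// The fake scrub blocks until released, so the running state can be observed.
		worker(queue, tracker, func(string) error {
			close(started)
			<-release

			return nil
		})
	}()

	queue <- "/var"
	<-started
	fmt.Println("while scrubbing:", tracker.get("/var"))

	close(release)
	close(queue)
	<-finished
	fmt.Println("after scrubbing:", tracker.get("/var"))
}
```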

dsseng · 2024-12-16T20:52:04Z

Other things to consider:

add an option to scrub on boot
check whether a scrub may be safely aborted. If not, we must make sure it is never aborted, delaying reboot/poweroff until it completes

smira · 2024-12-19T11:32:20Z

pkg/machinery/constants/constants.go

@@ -1277,6 +1277,38 @@ var DefaultDroppedCapabilities = map[string]struct{}{
 	"cap_sys_module": {},
 }

+// XFSScrubDroppedCapabilities is the set of capabilities to drop for xfs_scrub.


I wonder if we could refactor this to first list all capabilities (via libcap), and remove those we want to drop, so that if new capabilities are introduced, they are dropped as well. (probably taking it out of machinery back to the controller)

dsseng

Yes, I also want to do this at the level of our libcap interface. systemd actually manages this already: its xfs scrubbing service lists the capabilities to grant, not the ones to take away.
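The allowlist approach discussed here can be sketched as a set difference: start from every known capability and drop everything that is not explicitly kept. This is a minimal illustration, not the PR's code. `capabilitiesToDrop` is a hypothetical helper, the full list is passed in rather than obtained from libcap, the keep list is illustrative and not the real set xfs_scrub needs, and `cap_brand_new` is a made-up name standing for a capability added by a future kernel.

```go
package main

import (
	"fmt"
	"sort"
)

// capabilitiesToDrop returns every capability in all that is not in keep,
// sorted for deterministic output.
func capabilitiesToDrop(all []string, keep map[string]struct{}) []string {
	drop := make([]string, 0, len(all))

	for _, capability := range all {
		if _, ok := keep[capability]; !ok {
			drop = append(drop, capability)
		}
	}

	sort.Strings(drop)

	return drop
}

func main() {
	// Assumption: in practice this list would come from enumerating the
	// capabilities the running kernel supports (e.g. via libcap).
	all := []string{
		"cap_chown",
		"cap_sys_admin",
		"cap_sys_rawio",
		"cap_sys_module",
		"cap_brand_new", // made-up name for a capability added later
	}

	// Illustrative keep list only.
	keep := map[string]struct{}{
		"cap_sys_admin": {},
		"cap_sys_rawio": {},
	}

	fmt.Println(capabilitiesToDrop(all, keep))
}
```

Because the drop set is derived, a newly introduced capability lands in it automatically, which a hand-maintained drop list cannot guarantee.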

smira · 2024-12-19T11:34:27Z

pkg/machinery/resources/runtime/fs_scrub_config.go

+)
+
+// FSScrubConfigType is type of FSScrubConfig resource.
+const FSScrubConfigType = resource.Type("FSScrubConfigs.runtime.talos.dev")


I wonder if we should move this under block (both resources and config).

smira · 2024-12-19T11:37:03Z

internal/app/machined/pkg/controllers/runtime/fs_scrub.go

+	ctrl.status = make(map[string]scrubStatus)
+	ctrl.c = make(chan string, 5)
+
+	for {


I almost wonder if we should split up this controller into two controllers:

one which reads FSScrubConfig and outputs FSScrubSchedule resources (builds a schedule, handles jitter, updates schedule, etc.) - this controller can be fully unit-tested

another which reads FSScrubSchedule, and based on the schedule runs the actual xfs_scrub tasks (or cancels them)

We can test time-based stuff using fake clocks, for example:

talos/internal/app/machined/pkg/controllers/runtime/cri_image_gc_test.go

Lines 33 to 52 in f756043

	func TestCRIImageGC(t *testing.T) {
		mockImageService := &mockImageService{}
		fakeClock := clock.NewMock()

		suite.Run(t, &CRIImageGCSuite{
			mockImageService: mockImageService,
			fakeClock:        fakeClock,
			DefaultSuite: ctest.DefaultSuite{
				AfterSetup: func(suite *ctest.DefaultSuite) {
					suite.Require().NoError(suite.Runtime().RegisterController(&runtimectrl.CRIImageGCController{
						ImageServiceProvider: func() (runtimectrl.ImageServiceProvider, error) {
							return mockImageService, nil
						},
						Clock: fakeClock,
					}))
				},
			},
		})
	}

dsseng

Yes, sounds sane. We already have the config controller, which populates structures that mirror the YAML source, so why not handle scheduling there?

smira

No, no... config controller is good, don't touch it ;)

Schedule controller will be fully testable without mocks.

Scrub controller will need a clock mock and an actual scrub mock, and it can be fully tested once again by manipulating schedules.

dsseng self-assigned this Nov 29, 2024

dsseng force-pushed the scrub branch from 22f95d1 to e178d4d Compare December 1, 2024 09:59

dsseng mentioned this pull request Dec 2, 2024

feat: add more scheduling options for process runner #9862

Merged

dsseng force-pushed the scrub branch from e2a97f6 to 2fc88fc Compare December 4, 2024 19:48

dsseng marked this pull request as ready for review December 5, 2024 08:59

talos-bot added the status/ok-to-test label Dec 5, 2024

dsseng commented Dec 5, 2024

smira reviewed Dec 5, 2024

dsseng added 11 commits December 5, 2024 18:24

WIP: fs_scrub controller

a095a5b

Signed-off-by: Dmitry Sharshakov <[email protected]>

schedule scrub

95069dd

Signed-off-by: Dmitry Sharshakov <[email protected]>

WIP: config

7657625

Signed-off-by: Dmitry Sharshakov <[email protected]>

generate

b5fa387

Signed-off-by: Dmitry Sharshakov <[email protected]>

wip

4f67d52

Signed-off-by: Dmitry Sharshakov <[email protected]>

HACK set config without machine config for testing

98e005b

work around `"FilesystemScrubConfig" "v1alpha1": not registered`

fix: use buffered channel

b28b41c

test: add test for fs_scrub_config controller

84a5800

.

d88e782

report status

3e7becd

Signed-off-by: Dmitry Sharshakov <[email protected]>

fixes

1df8e37

dsseng force-pushed the scrub branch from 6aed532 to 1df8e37 Compare December 5, 2024 17:24

dsseng commented Dec 5, 2024

onedr0p mentioned this pull request Dec 7, 2024

fstrim and monitoring for disks #8314

Open

smira reviewed Dec 19, 2024

smira mentioned this pull request Dec 26, 2024

xfs corruption, but no xfs_repair #8292

Closed
