sled-agent performs archival of rotated logs for all zones onto U.2 debug dataset #3713

Merged 4 commits on Jul 24, 2023

@lifning lifning commented Jul 19, 2023

(for #2478, depends on #3677 (lifning/omicron@coreadm...log-rotate))

This periodically moves logs rotated by logadm (run from cron, see oxidecomputer/helios#107) into the crypt/debug ZFS dataset on the U.2 chosen by the logic in #3677. It replaces the rotation number (*.log.0, *.log.1) with the Unix epoch timestamp of the rotated log's modification time, so that archived files don't collide when logs are collected repeatedly (logadm resets its numbering once the previous files have been moved away).
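The renaming described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the function name `archived_name` is hypothetical, and it assumes the rotated file name ends in `.log.` followed by a decimal rotation number.

```rust
use std::path::Path;
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical helper (not from the PR): maps a rotated log such as
/// "oxide-nexus:default.log.0" to "oxide-nexus:default.log.<mtime secs>".
/// Returns None if the name does not end in ".log.<digits>" or if the
/// modification time is before the Unix epoch.
fn archived_name(rotated: &Path, mtime: SystemTime) -> Option<String> {
    let name = rotated.file_name()?.to_str()?;
    // Split off the rotation number that logadm appended.
    let (stem, suffix) = name.rsplit_once('.')?;
    if !stem.ends_with(".log") {
        return None;
    }
    if suffix.is_empty() || !suffix.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let secs = mtime.duration_since(UNIX_EPOCH).ok()?.as_secs();
    Some(format!("{}.{}", stem, secs))
}
```

In practice the `mtime` argument would presumably come from `std::fs::metadata(path)?.modified()?` on the rotated file.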


Resulting state after putting kernel dumps on both M.2 dump slices, starting sled-agent, forcing logadm -p now smf_logs_daily in every zone, and then running int main() { return *(int*)0; } to generate a core in an oxz_nexus_ zone and in the global zone:

EVT22200004 # ls /pool/ext/*/crypt/debug/*
/pool/ext/5439cec8-dd17-46e8-ae32-8b694639d3b5/crypt/debug/bounds
/pool/ext/5439cec8-dd17-46e8-ae32-8b694639d3b5/crypt/debug/core.global.a.out.9465.1689907180
/pool/ext/5439cec8-dd17-46e8-ae32-8b694639d3b5/crypt/debug/core.oxz_nexus_57947b25-e3c3-4109-8676-2abae5096d1c.a.out.29236.1689908216
/pool/ext/5439cec8-dd17-46e8-ae32-8b694639d3b5/crypt/debug/vmdump.0
/pool/ext/5439cec8-dd17-46e8-ae32-8b694639d3b5/crypt/debug/vmdump.1

/pool/ext/5439cec8-dd17-46e8-ae32-8b694639d3b5/crypt/debug/global:
milestone-devices:default.log.1689900201
milestone-devices:default.log.1689907251
milestone-multi-user-server:default.log.1689900201
milestone-multi-user-server:default.log.1689907250
[...]
system-zones-monitoring:default.log.1689900201
system-zones-monitoring:default.log.1689907250

/pool/ext/5439cec8-dd17-46e8-ae32-8b694639d3b5/crypt/debug/oxz_nexus_57947b25-e3c3-4109-8676-2abae5096d1c:
oxide-nexus:default.log.1689907587

/pool/ext/5439cec8-dd17-46e8-ae32-8b694639d3b5/crypt/debug/oxz_switch:
system-illumos-mg-ddm:default.log.1689907595

@lifning lifning changed the title from "(WIP) Log rotate" to "(WIP) Log rotation onto U.2 debug dataset" Jul 19, 2023
@lifning lifning changed the title from "(WIP) Log rotation onto U.2 debug dataset" to "(WIP) sled-agent performs log rotation for all zones onto U.2 debug dataset" Jul 19, 2023
Comment on lines 512 to 513
// as we rotate them out, logadm will keep resetting to .log.0,
// so we need to maintain our own numbering in the dest dataset
Contributor

I'd suggest using .<epoch seconds> instead of .N here (and adding .N in the unlikely event there is a conflict).
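A minimal sketch of this suggestion, assuming a single archiver process writes into the destination directory (the existence check below is not atomic). The name `unique_dest` and its parameters are hypothetical and are not taken from the PR.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: picks a destination path that does not exist yet.
/// Tries "<base>.<epoch_secs>" first, then "<base>.<epoch_secs>.1",
/// "<base>.<epoch_secs>.2", and so on.
fn unique_dest(dest_dir: &Path, base: &str, epoch_secs: u64) -> PathBuf {
    let first = dest_dir.join(format!("{}.{}", base, epoch_secs));
    if !first.exists() {
        return first;
    }
    // Unlikely case: another archived log already has the same timestamp.
    let mut n: u32 = 1;
    loop {
        let candidate = dest_dir.join(format!("{}.{}.{}", base, epoch_secs, n));
        if !candidate.exists() {
            return candidate;
        }
        n += 1;
    }
}
```

Because the check and the later write are separate steps, this only avoids collisions when nothing else creates files in `dest_dir` concurrently.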

Collaborator

I think we should reconfigure logadm in the zone to rotate the files using a date stamp pattern, like 2023-07-19T10:00:00Z (%FT%TZ maybe?) rather than try to work around the integer suffixes.

) -> Result<(), RotateLogsError> {
    // pattern matching rotated logs, e.g. foo.log.3
    let pattern = logdir
        .join("*.log.*")
Contributor

It is probably safer to use *.log.? and *.log.?? as patterns and combine the glob matches, just in case there is a file that has .log. in the middle of its base name (shouldn't happen, but...)
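To illustrate what the narrower patterns would accept, here is a dependency-free sketch that filters directory entries by name instead of calling a glob library. It is a stand-in written for this comment, not the PR's code; `is_rotated_log` and `rotated_logs` are hypothetical names.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// True if `name` would match `*.log.?` or `*.log.??`, i.e. it ends with
/// ".log." followed by exactly one or two characters.
fn is_rotated_log(name: &str) -> bool {
    match name.rsplit_once(".log.") {
        Some((_, tail)) => {
            let n = tail.chars().count();
            n == 1 || n == 2
        }
        None => false,
    }
}

/// Lists the regular files in `logdir` whose names pass `is_rotated_log`,
/// sorted by path.
fn rotated_logs(logdir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(logdir)? {
        let path = entry?.path();
        let matches = path
            .file_name()
            .and_then(|n| n.to_str())
            .map_or(false, is_rotated_log);
        if matches && path.is_file() {
            found.push(path);
        }
    }
    found.sort();
    Ok(found)
}
```

Like `?` in a glob, this accepts any character in the suffix, not only digits, so a name such as foo.log.gz would also pass; a stricter check would require ASCII digits.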

Collaborator

Rather than using a glob or a regex, perhaps we should look at creating a separate directory in the zone to hold the rotated files and have logadm move them there as part of rotation?

jclulow commented Jul 19, 2023

Just as a small terminological suggestion: I would draw a distinction between the act of rotation, which I expect we're still having logadm perform inside the zone, and the act of archival, which is lifting them out once they've been rotated and which it seems we're looking to do from outside the zone here.

@lifning lifning changed the title from "(WIP) sled-agent performs log rotation for all zones onto U.2 debug dataset" to "(WIP) sled-agent performs log archival for all zones onto U.2 debug dataset" Jul 20, 2023
@lifning lifning changed the title from "(WIP) sled-agent performs log archival for all zones onto U.2 debug dataset" to "(WIP) sled-agent performs archival of rotated logs for all zones onto U.2 debug dataset" Jul 20, 2023
@lifning lifning changed the title from "(WIP) sled-agent performs archival of rotated logs for all zones onto U.2 debug dataset" to "sled-agent performs archival of rotated logs for all zones onto U.2 debug dataset" Jul 21, 2023
@lifning lifning marked this pull request as ready for review July 21, 2023 03:23
@citrus-it citrus-it self-requested a review July 24, 2023 18:00
lifning added a commit that referenced this pull request Jul 24, 2023
(Part of #2478, continued in #3713)

This configures coreadm to put all core dumps onto the M.2 'crash'
dataset, and creates a thread that moves them all onto a U.2 'debug'
dataset every 5 minutes.

This also refactors the dumpadm/savecore code to be less redundant and
more flexible, and adds an amount of arbitrary logic for e.g. picking
the U.2 onto which to save cores.
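The periodic move described in this commit message might look roughly like the sketch below. It is an assumption-laden illustration, not the sled-agent implementation: `archive_pass` and `spawn_archiver` are hypothetical names, and the copy-then-remove approach is chosen because `std::fs::rename` does not work across filesystems, which the M.2 and U.2 datasets would be.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::thread;
use std::time::Duration;

/// Hypothetical single pass: moves every regular file in `src` into `dst`
/// and returns how many files were moved. Note that fs::copy overwrites an
/// existing destination file of the same name.
fn archive_pass(src: &Path, dst: &Path) -> io::Result<usize> {
    let mut moved = 0;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        if !entry.file_type()?.is_file() {
            continue;
        }
        let target = dst.join(entry.file_name());
        // Copy first, then delete the source only if the copy succeeded.
        fs::copy(entry.path(), &target)?;
        fs::remove_file(entry.path())?;
        moved += 1;
    }
    Ok(moved)
}

/// Hypothetical background thread that repeats the pass forever,
/// sleeping `interval` between passes (e.g. 5 minutes).
fn spawn_archiver(src: PathBuf, dst: PathBuf, interval: Duration) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        if let Err(e) = archive_pass(&src, &dst) {
            eprintln!("archive pass failed: {}", e);
        }
        thread::sleep(interval);
    })
}
```

A caller would start it with something like `spawn_archiver(crash_dir, debug_dir, Duration::from_secs(300))`, where both directory variables are placeholders.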
@lifning lifning enabled auto-merge (squash) July 24, 2023 18:04
@lifning lifning merged commit 9fd3f58 into oxidecomputer:main Jul 24, 2023
@lifning lifning deleted the log-rotate branch July 24, 2023 19:40
leftwo pushed a commit that referenced this pull request Jul 24, 2023
(Part of #2478, continued in #3713)

leftwo pushed a commit that referenced this pull request Jul 24, 2023
…ebug dataset (#3713)

This periodically moves logs rotated by logadm in cron
(oxidecomputer/helios#107) into the crypt/debug
zfs dataset on the U.2 chosen by the logic in #3677. It replaces the
rotated number (*.log.0, *.log.1) with the unix epoch timestamp of the
rotated log's modification time such that they don't collide when
collected repeatedly (logadm will reset numbering when the previous ones
are moved away).

(for #2478)