Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SUPPRESS_LOG env to control the log outputs #332

Closed
wants to merge 1 commit into from

Conversation

liutgnu
Copy link
Contributor

@liutgnu liutgnu commented Nov 5, 2024

When an IRQ fail to set affinity, irqbalance will output logs as: "Cannot change IRQ x affinity: ..." and "IRQ x affinity is now unmanaged". This usually happens at the start of irqbalance service, when irqbalance trying to detect if a specific IRQ is migratable. And in most cases, hardware/IRQs stay unchanged thus the logs only output once and acceptable.

However there are cases where hardware constantly goes online/offline for maintenance or other user requested actions, expecially the hardware holds lots of IRQs, irqbalance will output these logs lots of time thus over flood the journals and annoying.

This patch will introduce a SUPPRESS_LOG env variable, to help users suppress these logs when don't need.

When an IRQ fail to set affinity, irqbalance will output logs as:
"Cannot change IRQ x affinity: ..." and "IRQ x affinity is now unmanaged".
This usually happens at the start of irqbalance service, when irqbalance
trying to detect if a specific IRQ is migratable. And in most cases,
hardware/IRQs stay unchanged thus the logs only output once and
acceptable.

However there are cases where hardware constantly goes online/offline for
maintenance or other user requested actions, expecially the hardware
holds lots of IRQs, irqbalance will output these logs lots of time thus
over flood the journals and annoying.

This patch will introduce a SUPPRESS_LOG env variable, to help users
suppress these logs when don't need.

Signed-off-by: Tao Liu <[email protected]>
@nhorman
Copy link
Member

nhorman commented Nov 5, 2024

That isn't going to do anything for you. If you set an environment variable in a shell, it won't be picked up by irqbalance (or any process) unless you export it and restart the affected process. If you're going to restart it, you may as well just stop the process entirely until your maintenance is done, and then restart it.

@liutgnu
Copy link
Contributor Author

liutgnu commented Nov 5, 2024

Hi @nhorman ,

Thanks for your comments!

With the patch, the logs can be suppressed by editing the misc/irqbalance.env file, setting the SUPPRESS_LOG env variable and restarting the irqbalance service, then irqbalance can pick up the SUPPRESS_LOG env variable successfully.

I also tried to add a new parameter for irqbalance, e.g. "-u", so users can enable it by setting misc/irqbalance.env file, setting IRQBALANCE_ARGS="-u", and restart the irqbalance service. But I think the former approach is more simpler.

Anyway, both approaches require a restart of irqbalance service, in order to enable the log suppression. I cannot think of an approach without restart irqbalance service. Could you please tell me your preferred approach for this?

Thanks,
Tao Liu

@nhorman
Copy link
Member

nhorman commented Nov 5, 2024

there isn't an approach that won't require a restart of the irqbalance service, thats why I'm saying, if you're doing maintenance and want to suppress logging while you do so, just turn the service off during maintenance, i.e. instead of doing:

serivce irqbalance stop
edit env file
service irqbalance start
do maintenence
service irqbalance stop
edit env file
service irqbalance start

instead just do:

service irqbalance stop
do maintenence
service irqbalance start

If your maintenance is going to cause lots of churn in your irq registration, balancing won't do much good during that time anyway, just disable the service while you're doing it entirely, its simpler and doesn't require any code changes.

@liutgnu
Copy link
Contributor Author

liutgnu commented Nov 5, 2024

Hi @nhorman,

Oh I see your point, thanks again for your explaination! But there might be tasks other than just "maintenence", which requires irqbalance running and log suppression at the same time.

Though I'm not a hardware guy, but I can provide some info from our bug reporter:

"for example, when we have storage bays of 6 NVMe drives each that can disappear at any time (for maintenance or other user requested actions), and each one of those NVMe adapters has over 90 IRQs each, hundreds of these affinity messages go to the log. Some of our customers are quite squeamish with some logged messaged in general and I can imagine that hundreds of "No space left on device" messages will be somewhat troubling..."

Though this is a rare case to me, but I agree with him that, giving users an option to turn off the unwanted logs is good to have... Could you please provide more info for this feature? @[email protected]

Thanks,
Tao Liu

@nhorman
Copy link
Member

nhorman commented Nov 5, 2024

but...its not an ENOSPC error, its a error about not being able to change affinity. Surely your customers are able to tell the difference and understand that a daemon cant change the affinity of an irq that asynchronously disappeared?

If they can't you still have a bunch of options without having to resort to interrogating an environment variable for a message you consider special.

It sounds like you're using systemd. You can visually hide these messages by interrogating the log with a level filter. The message you're interested in is WARNING level, so you can just run journalctl -u irqbalance -p err to see err messages and above. You can also set MaxLevelStore in journald.conf to prevent warnings from getting recorded to the journal at all.

If you just want to see fewer of the messages you are concerned about, you can set the LogRateLimitIntervalSec and LogRateLimitBurst settings in the irqbalance unit file, so that multiple messages in quick succession get dropped.

You can also use LogFilterPatterns in the unit file to exclude these messages entirely.

if you want to hide the log entirely unless you look for it specifically you can set the LogNamespace setting in the unit file for irqbalance to mask it out of the general log readout.

If you're insistent on a code change, I'd be supportive of a new flag as you described (-q) that suppresses all log messages, but given the above mechanisms, that seems unnecessary.

But the bottom line for me, is that an irq that disappears unexpectedly is just that, unexpected, and deserves a warning to the user so they can investigate. Suppressing only that message is just sweeping valid customer questions under the rug.

@bkstratus
Copy link

Hello @nhorman,

I see your point about the ability to filter out these messages, but our customers can be very picky with what ends up in logs. When there are over 500 messages that belch out "Cannot change IRQ X affinity: No space left on device" when an I/O bay gets inserted, I can imagine that a customer could get very concerned about these occurrences.

There are other use cases in our systems where thousands of these messages are output for each break cycle.

Unfortunately, stopping and starting the irqbalance service is not an option for us here as actions that produce these messages may be spontaneous.

If you're insistent on a code change, I'd be supportive of a new flag as you described (-q) that suppresses all log messages, but given the above mechanisms, that seems unnecessary.

We are asking is for a way to silence these messages, a simple flag to suppress them. It can/would be ignored by 99.9% of users, but it would be nice to have the option to not see these warnings.

I thank you for your time and I appreciate your consideration here,

Regards,

Bill Kuzeja
Stratus Technologies

@nhorman
Copy link
Member

nhorman commented Nov 5, 2024

I see your point about the ability to filter out these messages, but our customers can be very picky with what ends up in logs. When there are over 500 messages that belch out "Cannot change IRQ X affinity: No space left on device" when an I/O bay gets inserted, I can imagine that a customer could get very concerned about these occurrences.

Ok, perhaps, but you disconnected devices with hundreds of irqs attached to them, while the irqbalance process was using them. How would you not expect to get warnings about that?

That warning was added in response to https://bugzilla.redhat.com/show_bug.cgi?id=2184735 because a user attempted to affine irqs to a cpu that was blacklisted and received no warning. Its a valid warning to produce. The fact that you get it alot is because you have alot of irqs, not because its a bad warning.

If you can propose a way in which irqbalance can differentiate between unexpected irq losses and valid losses, I'm happy to listen, but I'm not just going to add a knob that lets you squash a valid warning just because your customers might ask you about them.

There are other use cases in our systems where thousands of these messages are output for each break cycle.

Unfortunately, stopping and starting the irqbalance service is not an option for us here as actions that produce these messages may be spontaneous.

Then why are you pushing for this change? This change requires not only starting and stopping irqbalance twice, but editing an environment variable in between each to enable the suppression.

As I mentioned above, you have lots of tools to make this better for you, without any code change. As you are using systemd (or so it would seem), I would suggest adding LogRateLimitIntervalSec and LogRateLimitBurst to your irqbalance unit file. Then when you get hundreds of "cant change irq affinity messages" sytemd-journald will print the first one, and suppress the rest.

If you just want to be rid of them entirely, then add LogFilterPatterns="~Cannot change IRQ" to your irqbalance unit file, and the log messages you don't want to see will be removed entirely. I don't recommend this, and its good to know when this happens during normal operation, but if your goal is to make customers not bother you about this, it sounds like exactly what you want.

If you're insistent on a code change, I'd be supportive of a new flag as you described (-q) that suppresses all log messages, but given the above mechanisms, that seems unnecessary.

We are asking is for a way to silence these messages, a simple flag to suppress them. It can/would be ignored by 99.9% of users, but it would be nice to have the option to not see these warnings.

You have several methods to address this that don't require code changes, and can target exactly what you are trying to do. Is there some reason those aren't sufficient for you?

Either way, simple or not, whats proposed in this PR is really just a hack around a valid warning message. I'm sorry I'm not going to include it. If you want to eliminate the message, you have several existing tools to allow you to do that already, as described above. If those methods aren't sufficient for you, we can certainly talk about other solutions, but in its current form, this isn't going in.

@bkstratus
Copy link

Thanks Neil for your detailed answer!

Given the options you've provided, rate limiting the service should be a good compromise that will most likely fit our needs. It sounds like disabling these messages entirely may no be a good idea.

@nhorman
Copy link
Member

nhorman commented Nov 5, 2024

In all honesty, its really your choice, though the benefit of using the systemd facilities is that what you decide to do is both admin configurable, and modify-able in the field, if you have a need to change the setting. I would certainly encourage the rate limiting approach so that you can see at least one of the messages when it pops up, but given that this moves the decision of what to do into the system integrator space (i.e. you), you can make the call at will

@nhorman nhorman closed this Nov 5, 2024
@liutgnu
Copy link
Contributor Author

liutgnu commented Nov 5, 2024

Hi @nhorman,

Thanks for your suggestions, I agree that suppressing the logs from systemd side is a better approach, i.e LogRateLimitIntervalSec etc, unlickly I was not familiar with systemd and didn't know about the avaliable options :(, thanks again for your help!

Thanks,
Tao Liu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants