Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Last Vacuum Always Errors on Hot Standby Cluster #359

Open
mbisme opened this issue Nov 3, 2023 · 4 comments
Open

Last Vacuum Always Errors on Hot Standby Cluster #359

mbisme opened this issue Nov 3, 2023 · 4 comments

Comments

@mbisme
Copy link

mbisme commented Nov 3, 2023

When using the last_vacuum check on a hot standby cluster a critical alert is always raised. This appears to be caused by the fact that a hot standby does not maintain metadata for when a vacuum job is replicated from the primary to the standby. When running the same check on the primary cluster there is no alert generated.

In attempting to fix this internally I have modified sub check_last_vacuum to borrow the code from sub check_is_hot_standby and only running the check if the cluster isn't a hot standby cluster. What would be the best way to address this issue, and would it be possible to get a more appropriate fix added directly to check_pgactivity?

Below is what I did to sub check_last_vacuum which appears to work for my needs, but I am not a programmer so forgive any stupidity that you find.

sub check_last_vacuum {
    my @rs;
    my @hosts;
    my %args          = %{ $_[0] };
    my $me            = 'POSTGRES_CHECK_LAST_VACUUM';
    my %queries       = (
        $PG_VERSION_90 => q{SELECT pg_is_in_recovery()}
    );

    @hosts = @{ parse_hosts %args };

    pod2usage(
        -message => 'FATAL: you must give only one host with service "check_last_vacuum".',
        -exitval => 127
    ) if @hosts != 1;

    is_compat $hosts[0], 'is_hot_standby', $PG_VERSION_90 or exit 1;
    @rs = @{ query_ver( $hosts[0], %queries )->[0] };

    return check_last_maintenance( 'vacuum', @_ ) if $rs[0] eq "f";
    return status_ok( $me, [ "Cluster is hot standby, skipping check" ] );
}
@Krysztophe
Copy link
Collaborator

I have not given much thought to this problem (yet), but I would have put the pg_is_in_recovery() in the main query and added a test.

There is note in the doc:

B: this service does not raise alerts if the database had strictly
no writes since last call. In consequence, a read-only database can have
its oldest analyze reported in perfdata way after your thresholds, but not
raise any alerts.

It does not seem to cover the standbys, or the test is broken.

I put this on our TODO list this week.

@Krysztophe
Copy link
Collaborator

Krysztophe commented Nov 29, 2023

The service definitely assumes it will be executed on primary servers only.

We usually do not deploy the service on standbys. Can't you avoid it? Can you add a dependance on the is_master service ?

Anyway, a CRITICAL is a bit strong. After discussion with @ioguix, maybe we could fetch the role status while fetching the backend version. Then we can decide for each service, what to do.

On standbys:

  • it should not return OK as it would give a false sense of security
  • what about returning UNKNOWN ?
  • maybe add an option --ok-on-standby to return OK if you really must deploy on standbys (may apply to other services) ? Other services may use this option.

What do you think?

@mbisme
Copy link
Author

mbisme commented Nov 30, 2023

Hi @Krysztophe,

Thanks for the feedback. It would be difficult to not have it deploy on standbys in our environment as we have many and use an automated process to failover to them whenever something happens to the primary, and typically won't auto-failback unless there is a reason to. So a standby can be the primary for several months, and with the size of our organization it isn't feasible to try to deploy the service on Nagios when a failover happens and remove it when it changes back to the original primary.

I can understand not wanting to return OK by default, as you are correct that it would potentially give a false sense of security if you are not expecting that behavior. So I think having it default as UNKNOWN would be a better default value.

If adding an --ok-on-standby option can be implemented that is what my organization would like to be able to use as that will meet our needs and could also offer benefits for other service checks.

Thanks for all your help!

@Krysztophe
Copy link
Collaborator

Personnaly, I would prefer UNKNOWN and a --ok-on-standby parameter.

A few services already test pg_is_in_recovery, I suppose it could be added to this service too.

@ioguix?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants