Last Vacuum Always Errors on Hot Standby Cluster #359

mbisme · 2023-11-03T20:30:43Z

When using the last_vacuum check on a hot standby cluster a critical alert is always raised. This appears to be caused by the fact that a hot standby does not maintain metadata for when a vacuum job is replicated from the primary to the standby. When running the same check on the primary cluster there is no alert generated.

In attempting to fix this internally I have modified sub check_last_vacuum to borrow the code from sub check_is_hot_standby and only running the check if the cluster isn't a hot standby cluster. What would be the best way to address this issue, and would it be possible to get a more appropriate fix added directly to check_pgactivity?

Below is what I did to sub check_last_vacuum which appears to work for my needs, but I am not a programmer so forgive any stupidity that you find.

sub check_last_vacuum {
    my @rs;
    my @hosts;
    my %args          = %{ $_[0] };
    my $me            = 'POSTGRES_CHECK_LAST_VACUUM';
    my %queries       = (
        $PG_VERSION_90 => q{SELECT pg_is_in_recovery()}
    );

    @hosts = @{ parse_hosts %args };

    pod2usage(
        -message => 'FATAL: you must give only one host with service "check_last_vacuum".',
        -exitval => 127
    ) if @hosts != 1;

    is_compat $hosts[0], 'is_hot_standby', $PG_VERSION_90 or exit 1;
    @rs = @{ query_ver( $hosts[0], %queries )->[0] };

    return check_last_maintenance( 'vacuum', @_ ) if $rs[0] eq "f";
    return status_ok( $me, [ "Cluster is hot standby, skipping check" ] );
}

The text was updated successfully, but these errors were encountered:

Krysztophe · 2023-11-27T17:10:53Z

I have not given much thought to this problem (yet), but I would have put the pg_is_in_recovery() in the main query and added a test.

There is note in the doc:

B: this service does not raise alerts if the database had strictly
no writes since last call. In consequence, a read-only database can have
its oldest analyze reported in perfdata way after your thresholds, but not
raise any alerts.

It does not seem to cover the standbys, or the test is broken.

I put this on our TODO list this week.

Krysztophe · 2023-11-29T16:00:12Z

The service definitely assumes it will be executed on primary servers only.

We usually do not deploy the service on standbys. Can't you avoid it? Can you add a dependance on the is_master service ?

Anyway, a CRITICAL is a bit strong. After discussion with @ioguix, maybe we could fetch the role status while fetching the backend version. Then we can decide for each service, what to do.

On standbys:

it should not return OK as it would give a false sense of security
what about returning UNKNOWN ?
maybe add an option --ok-on-standby to return OK if you really must deploy on standbys (may apply to other services) ? Other services may use this option.

What do you think?

mbisme · 2023-11-30T16:12:00Z

Hi @Krysztophe,

Thanks for the feedback. It would be difficult to not have it deploy on standbys in our environment as we have many and use an automated process to failover to them whenever something happens to the primary, and typically won't auto-failback unless there is a reason to. So a standby can be the primary for several months, and with the size of our organization it isn't feasible to try to deploy the service on Nagios when a failover happens and remove it when it changes back to the original primary.

I can understand not wanting to return OK by default, as you are correct that it would potentially give a false sense of security if you are not expecting that behavior. So I think having it default as UNKNOWN would be a better default value.

If adding an --ok-on-standby option can be implemented that is what my organization would like to be able to use as that will meet our needs and could also offer benefits for other service checks.

Thanks for all your help!

Krysztophe · 2023-11-30T16:26:33Z

Personnaly, I would prefer UNKNOWN and a --ok-on-standby parameter.

A few services already test pg_is_in_recovery, I suppose it could be added to this service too.

@ioguix?

mbisme mentioned this issue Nov 27, 2023

Fix for #359 (Last Vacuum Always Errors on Hot Standby Cluster) #360

Open

Krysztophe added the discussion label Nov 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Last Vacuum Always Errors on Hot Standby Cluster #359

Last Vacuum Always Errors on Hot Standby Cluster #359

mbisme commented Nov 3, 2023

Krysztophe commented Nov 27, 2023

Krysztophe commented Nov 29, 2023 •

edited

Loading

mbisme commented Nov 30, 2023

Krysztophe commented Nov 30, 2023

Last Vacuum Always Errors on Hot Standby Cluster #359

Last Vacuum Always Errors on Hot Standby Cluster #359

Comments

mbisme commented Nov 3, 2023

Krysztophe commented Nov 27, 2023

Krysztophe commented Nov 29, 2023 • edited Loading

mbisme commented Nov 30, 2023

Krysztophe commented Nov 30, 2023

Krysztophe commented Nov 29, 2023 •

edited

Loading