
Principal=>subordinate coordination for logrotation (grafana-agent causing a lot of unreleased files and huge syslog) #153

Open
taurus-forever opened this issue Jul 16, 2024 · 8 comments

Comments

@taurus-forever

Bug Description

Hi,

Please check the complete issue description in PostgreSQL repo:
canonical/postgresql-operator#524

TL;DR: the PostgreSQL charm rotates logs but doesn't send any signal to the subordinate grafana-agent, which leaves a lot of unreleased files and a huge syslog => downtime.

It is a cross-team ticket to build a solution here.

To Reproduce

See steps to reproduce in canonical/postgresql-operator#524

Environment

See Versions in canonical/postgresql-operator#524

Relevant log output

No relevant log output.

Additional context

Proposals:

  • (preferred) the subordinate charm(s) should ship/install their own logrotate config, so the principal charm executes all of them.
  • the principal charm should (somehow) detect subordinates and notify them (somehow) using a signal (SIGHUP?); a sketch of this idea follows below.
    Better ideas are welcome!
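
A rough sketch of the second proposal (illustrative only: the helper name and the pgrep-based lookup are made up, and whether grafana-agent/promtail actually reopens files on SIGHUP would still need to be confirmed):

    # Hypothetical helper for the principal charm: after logrotate runs, send
    # SIGHUP to the subordinate's log reader so it reopens the rotated files.
    # The function name and the pgrep-based lookup are illustrative only.
    import os
    import signal
    import subprocess


    def sighup_log_readers(process_pattern: str = "grafana-agent") -> None:
        """Send SIGHUP to every process whose command line matches process_pattern."""
        result = subprocess.run(
            ["pgrep", "-f", process_pattern],
            capture_output=True,
            text=True,
            check=False,
        )
        for pid in result.stdout.split():
            try:
                os.kill(int(pid), signal.SIGHUP)
            except (ValueError, ProcessLookupError, PermissionError):
                # Process may have exited, or we may lack privileges; skip it.
                continue


    if __name__ == "__main__":
        sighup_log_readers()
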
@taurus-forever
Author

@simskij do you have good ideas on how to proceed here?
The DB charm would somehow need to be aware of the COS promtail so it can SIGHUP it properly.

@lucabello
Contributor

lucabello commented Aug 30, 2024

@taurus-forever Do you think this would be solved by not getting all the files from /var/log/** and instead only getting a few that you specify?

@taurus-forever
Author

taurus-forever commented Sep 6, 2024

@lucabello AFAIK, no. The grafana-agent charm / promtail binary reads (read?) some log files from disk (to send them to Loki). The PostgreSQL charm rotates logs, but doesn't send any signal to COS to close the current descriptor and reopen the file (since the old one is moved to an archive folder).

IMHO, we have two options:

  • send all logs through Pebble and never read logs from disk. AFAIK this is the current COS plan. Cons: 1) VMs have no Pebble, and journald has performance issues, etc.; 2) some services (e.g. pgbackrest) require a complex file structure and do not support a syslog approach.
  • teach the principal/PostgreSQL charm to send a signal on logrotate to the subordinate/grafana-agent.
  • ...
    Better ideas are welcome!

@sed-i
Contributor

sed-i commented Sep 6, 2024

  • Grafana folks recommend using the "rename and create" rotation strategy over "copy and truncate". Do you know what rotation strategy is set up for postgresql? Is that configurable on your side?
  • Are you saying that slurping the archive patroni/patroni.log.* is incorrect, and we should only be reading the one file patroni/patroni.log? That could be related to the glob "/var/log/**/*log" we have in place. But I'm surprised that filenames such as patroni.log.6545 or patroni.log.10080 would match that glob. I confirmed that Python's glob meets my expectation ([print(p.name) for p in Path(".").glob("**/*log")]), but I'm not sure how it's implemented in grafana-agent (see the quick check at the end of this comment).

I'd appreciate your input on the above @taurus-forever.
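
For reference, a quick stdlib check of the glob question above (the directory and filenames are created just for the test; how grafana-agent/promtail expands ** in its own config may differ from Python's pathlib glob):

    # Quick check: which patroni filenames match "**/*log" in Python's glob?
    # The directory and filenames are created just for this test.
    import tempfile
    from pathlib import Path

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "patroni"
        root.mkdir()
        for name in ("patroni.log", "patroni.log.1", "patroni.log.6545", "patroni.log.10080"):
            (root / name).touch()

        # Only names ending in "log" match "*log"; the rotated .N archives do not.
        print(sorted(p.name for p in Path(tmp).glob("**/*log")))  # ['patroni.log']

        # A directory-wide pattern picks up every archive as well.
        print(sorted(p.name for p in root.glob("*")))
        # ['patroni.log', 'patroni.log.1', 'patroni.log.10080', 'patroni.log.6545']
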

@dragomirp
Contributor

  • Grafana folks recommend using the "rename and create" rotation strategy over "copy and truncate". Do you know what rotation strategy is set up for postgresql? Is that configurable on your side?

Patroni should be using an extended version of Python's RotatingFileHandler. Python's docs indicate “rename and create”. We can only configure the size and the number of files to keep.

Postgresql is configured to keep a week's worth of per minute logs that are truncated each minute. This is behaviour configured by us, but it is spec behaviour so discussions would be necessary to change it.

Both Patroni and Postgresql try to keep about a week's worth of per minute logs, so that should be about 10k files for each.
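
For reference, a minimal stdlib demonstration of the "rename and create" rollover that RotatingFileHandler performs (the maxBytes/backupCount values are illustrative, not Patroni's actual settings):

    # Minimal demo of RotatingFileHandler's "rename and create" rollover: on
    # rotation the base file is renamed to .1 (cascading to .2, ...) and a fresh
    # base file is created, so a reader holding the old descriptor keeps pointing
    # at the renamed archive until it reopens the path.
    import logging
    import os
    import tempfile
    from logging.handlers import RotatingFileHandler

    tmpdir = tempfile.mkdtemp()
    log_path = os.path.join(tmpdir, "patroni.log")

    handler = RotatingFileHandler(log_path, maxBytes=200, backupCount=3)  # illustrative sizes
    logger = logging.getLogger("rotation-demo")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    for i in range(20):
        logger.info("no action. I am (postgresql-1), the leader with the lock %d", i)

    print(sorted(os.listdir(tmpdir)))
    # e.g. ['patroni.log', 'patroni.log.1', 'patroni.log.2', 'patroni.log.3']
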

  • Are you saying that slurping the archive patroni/patroni.log.* is incorrect, and we should only be reading the one file patroni/patroni.log? That could be related to the glob "/var/log/**/*log" we have in place. But I'm surprised that filenames such as patroni.log.6545 or patroni.log.10080 would match the glob. I confirmed that python's glob meets my expectation ([print(p.name) for p in Path(".").glob("**/*log")]) but not sure how it's impl'd in gagent.

I don't know how the agent detects log changes, but for Patroni only the last few logs should be relevant; the deeper backlog was likely already synced and shouldn't be changing. For Postgresql things are trickier, since the filenames are the same (postgresql-%w_%H%M.log) but get overwritten when the same time comes around again (illustrated below).
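
To illustrate the overwrite behaviour (timestamps below are arbitrary examples): the %w_%H%M pattern only encodes weekday, hour and minute, so the same filename comes around again one week later:

    # The log_filename pattern "postgresql-%w_%H%M.log" encodes only weekday,
    # hour and minute, so the same name recurs one week later and the existing
    # file is reused/overwritten instead of a new one accumulating.
    # 7 days x 24 hours x 60 minutes = 10,080 distinct filenames in the cycle.
    from datetime import datetime, timedelta

    t = datetime(2024, 12, 8, 13, 25)            # arbitrary example timestamp (a Sunday)
    next_week = t + timedelta(days=7)

    print(t.strftime("postgresql-%w_%H%M.log"))          # postgresql-0_1325.log
    print(next_week.strftime("postgresql-%w_%H%M.log"))  # postgresql-0_1325.log (same name)
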

@Vultaire

I'm hoping this is still on people's radar: this interaction between postgresql and grafana-agent is creating truly excessive logging in some environments. I'm seeing roughly 8 gigs of log lines a day just for events related to the patroni log files, and it is quickly filling the disks of the hosts in one of the clouds I am working on. (10,080 log lines × roughly 300 characters per line × 2,880 times a day (twice a minute) ≈ 8,709,120,000 bytes as a rough estimate.)

@Vultaire

Vultaire commented Dec 13, 2024

To add more context, per a discussion with the obs team: here is the related job in /etc/grafana-agent.yaml:

    - job_name: charmed-postgresql-common-var-log-patroni
      pipeline_stages:
      - drop:
          expression: .*file is a directory.*
      relabel_configs:
      - replacement: /patroni
        source_labels:
        - __path__
        target_label: path
      static_configs:
      - labels:
          __path__: /snap/grafana-agent/51/shared-logs/patroni/**
          job: charmed-postgresql-common-var-log-patroni
          juju_application: postgresql
          juju_charm: grafana-agent
          juju_model: landscape-dev
          juju_model_uuid: $REDACTED
          juju_unit: postgresql/1
          snap_name: charmed-postgresql
        targets:
        - localhost

@sed-i
Contributor

sed-i commented Dec 13, 2024

Summary of our call:

  • patroni.log.N files are rotated, and there are a lot of them, for some reason - from patroni.log.1 to patroni.log.10080.
  • grafana agent periodically logs to syslog a CREATE event for all of the patroni log files, including the old ones that were already rotated.
  • patroni rotates files every minute, so the stat of all files changes every minute.
  • Even when grafana-agent is down, the patroni files are still rotated every minute, so grafana-agent isn't "locking" anything.

Per the comment by @dragomirp, this is by design:

Postgresql is configured to keep a week's worth of per minute logs that are truncated each minute. This is behaviour configured by us, but it is spec behaviour so discussions would be necessary to change it.

We could see how ergonomic it would be to make the ** glob configurable so that grafana agent tracks only a subset of patroni files:

job = self._snap_plug_job(
    endpoint.owner,
    f"{path}/**",
    topology.application,
    str(topology.unit),
    self._path_label(path),
)
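
One possible shape for that, as a non-runnable fragment mirroring the snippet above (the log_path_pattern option name and the config plumbing are hypothetical, not an existing grafana-agent charm option):

    # Hypothetical sketch only, continuing the snippet above: read the glob
    # suffix from charm config instead of hard-coding "**", so operators could
    # narrow tracking to e.g. "patroni.log". The option name is made up.
    pattern = self.config.get("log_path_pattern", "**")

    job = self._snap_plug_job(
        endpoint.owner,
        f"{path}/{pattern}",
        topology.application,
        str(topology.unit),
        self._path_label(path),
    )
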

That being said, the patroni log files seem to have little value:

root@dev:/var/snap/charmed-postgresql/common/var/log/patroni# cat patroni.log.7404
2024-12-08 13:25:52 UTC [9279]: INFO: no action. I am (postgresql-1), the leader with the lock 
2024-12-08 13:26:02 UTC [9279]: INFO: no action. I am (postgresql-1), the leader with the lock 
2024-12-08 13:26:12 UTC [9279]: INFO: no action. I am (postgresql-1), the leader with the lock 
2024-12-08 13:26:22 UTC [9279]: INFO: no action. I am (postgresql-1), the leader with the lock
2024-12-08 13:26:32 UTC [9279]: INFO: no action. I am (postgresql-1), the leader with the lock 
2024-12-08 13:26:42 UTC [9279]: INFO: no action. I am (postgresql-1), the leader with the lock
root@dev:/var/snap/charmed-postgresql/common/var/log/patroni# cat patroni.log.282
2024-12-13 12:07:03 UTC [9279]: ERROR: Error communicating with DCS 
2024-12-13 12:07:04 UTC [9279]: INFO: demoting self because DCS is not accessible and I was a leader 
2024-12-13 12:07:04 UTC [9279]: INFO: Demoting self (offline) 
2024-12-13 12:07:04 UTC [9279]: WARNING: Loop time exceeded, rescheduling immediately. 
2024-12-13 12:07:04 UTC [9279]: INFO: Lock owner: postgresql-1; I am postgresql-1 
2024-12-13 12:07:04 UTC [9279]: INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader

I wonder if:

  • the logging policy for patroni can be decoupled from postgres, i.e. have different retention and rotation settings.
  • the rotation period can be reduced to e.g. 1 day, so we have far fewer files to track.

Another option: see if we can have grafana agent not log CREATE events.
