
Principal=>subordinate coordination for logrotation (grafana-agent causing a lot of unreleased files and huge syslog) #153

Open
taurus-forever opened this issue Jul 16, 2024 · 8 comments

Comments

@taurus-forever

Bug Description

Hi,

Please check the complete issue description in PostgreSQL repo:
canonical/postgresql-operator#524

TL;DR: the PostgreSQL charm rotates logs but doesn't send any signal to the subordinate grafana-agent, which leaves a lot of unreleased files and a huge syslog => downtime.

It is a cross-team ticket to build a solution here.

To Reproduce

See steps to reproduce in canonical/postgresql-operator#524

Environment

See Versions in canonical/postgresql-operator#524

Relevant log output

No relevant log output.

Additional context

Proposals:

  • (preferred) the subordinate charm(s) should ship/install their own logrotate config, so the principal charm executes all of them.
  • the principal charm should (somehow) detect subordinates and notify them (somehow) using a signal (SIGHUP?); a sketch of this idea follows below.
    Better ideas are welcome!
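
A rough sketch of the second proposal (illustrative only: the helper name and the pgrep-based lookup are made up, and whether grafana-agent/promtail actually reopens files on SIGHUP would still need to be confirmed):

    # Hypothetical helper for the principal charm: after logrotate runs, send
    # SIGHUP to the subordinate's log reader so it reopens the rotated files.
    # The function name and the pgrep-based lookup are illustrative only.
    import os
    import signal
    import subprocess


    def sighup_log_readers(process_pattern: str = "grafana-agent") -> None:
        """Send SIGHUP to every process whose command line matches process_pattern."""
        result = subprocess.run(
            ["pgrep", "-f", process_pattern],
            capture_output=True,
            text=True,
            check=False,
        )
        for pid in result.stdout.split():
            try:
                os.kill(int(pid), signal.SIGHUP)
            except (ValueError, ProcessLookupError, PermissionError):
                # Process may have exited, or we may lack privileges; skip it.
                continue


    if __name__ == "__main__":
        sighup_log_readers()
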
@taurus-forever
Author

@simskij do you have good ideas on how to proceed here?
The DB charm would somehow need to be aware of the COS promtail so it can SIGHUP it properly.

@lucabello
Contributor

lucabello commented Aug 30, 2024

@taurus-forever Do you think this would be solved by not getting all the files from /var/log/** and instead only getting a few that you specify?

@taurus-forever
Author

taurus-forever commented Sep 6, 2024

@lucabello AFAIK, no. The grafana-agent charm / promtail binary reads (read?) some log files from disk (to send them to Loki). The PostgreSQL charm rotates logs, but doesn't send any signal to COS to close the current descriptor and reopen the file (since the old one is moved to an archive folder).

IMHO, we have two options:

  • send all logs through Pebble and never read logs from disk. AFAIK this is the current COS plan. Cons: 1) VMs have no Pebble, and journald has performance issues, etc.; 2) some services (e.g. pgbackrest) require a complex file structure and do not support a syslog approach.
  • teach the principal/PostgreSQL charm to send a signal on logrotate to the subordinate/grafana-agent.
  • ...
    Better ideas are welcome!

@sed-i
Contributor

sed-i commented Sep 6, 2024

  • Grafana folks recommend using the "rename and create" rotation strategy over "copy and truncate". Do you know what rotation strategy is set up for postgresql? Is that configurable on your side?
  • Are you saying that slurping the archive patroni/patroni.log.* is incorrect, and we should only be reading the one file patroni/patroni.log? That could be related to the glob "/var/log/**/*log" we have in place. But I'm surprised that filenames such as patroni.log.6545 or patroni.log.10080 would match that glob. I confirmed that Python's glob meets my expectation ([print(p.name) for p in Path(".").glob("**/*log")]), but I'm not sure how it's implemented in grafana-agent (see the quick check at the end of this comment).

I'd appreciate your input on the above @taurus-forever.
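
For reference, a quick stdlib check of the glob question above (the directory and filenames are created just for the test; how grafana-agent/promtail expands ** in its own config may differ from Python's pathlib glob):

    # Quick check: which patroni filenames match "**/*log" in Python's glob?
    # The directory and filenames are created just for this test.
    import tempfile
    from pathlib import Path

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "patroni"
        root.mkdir()
        for name in ("patroni.log", "patroni.log.1", "patroni.log.6545", "patroni.log.10080"):
            (root / name).touch()

        # Only names ending in "log" match "*log"; the rotated .N archives do not.
        print(sorted(p.name for p in Path(tmp).glob("**/*log")))  # ['patroni.log']

        # A directory-wide pattern picks up every archive as well.
        print(sorted(p.name for p in root.glob("*")))
        # ['patroni.log', 'patroni.log.1', 'patroni.log.10080', 'patroni.log.6545']
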

@dragomirp
Contributor

  • Grafana folks recommend using the "rename and create" rotation strategy over "copy and truncate". Do you know what rotation strategy is set up for postgresql? Is that configurable on your side?

Patroni should be using an extended version of Python's RotatingFileHandler. Python's docs indicate “rename and create”. We can only configure the size and the number of files to keep.

Postgresql is configured to keep a week's worth of per minute logs that are truncated each minute. This is behaviour configured by us, but it is spec behaviour so discussions would be necessary to change it.

Both Patroni and Postgresql try to keep about a week's worth of per minute logs, so that should be about 10k files for each.
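
For reference, a minimal stdlib demonstration of the "rename and create" rollover that RotatingFileHandler performs (the maxBytes/backupCount values are illustrative, not Patroni's actual settings):

    # Minimal demo of RotatingFileHandler's "rename and create" rollover: on
    # rotation the base file is renamed to .1 (cascading to .2, ...) and a fresh
    # base file is created, so a reader holding the old descriptor keeps pointing
    # at the renamed archive until it reopens the path.
    import logging
    import os
    import tempfile
    from logging.handlers import RotatingFileHandler

    tmpdir = tempfile.mkdtemp()
    log_path = os.path.join(tmpdir, "patroni.log")

    handler = RotatingFileHandler(log_path, maxBytes=200, backupCount=3)  # illustrative sizes
    logger = logging.getLogger("rotation-demo")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    for i in range(20):
        logger.info("no action. I am (postgresql-1), the leader with the lock %d", i)

    print(sorted(os.listdir(tmpdir)))
    # e.g. ['patroni.log', 'patroni.log.1', 'patroni.log.2', 'patroni.log.3']
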

  • Are you saying that slurping the archive patroni/patroni.log.* is incorrect, and we should only be reading the one file patroni/patroni.log? That could be related to the glob "/var/log/**/*log" we have in place. But I'm surprised that filenames such as patroni.log.6545 or patroni.log.10080 would match the glob. I confirmed that python's glob meets my expectation ([print(p.name) for p in Path(".").glob("**/*log")]) but not sure how it's impl'd in gagent.

I don't know how the agent detects log changes, but for Patroni only the last few logs should be relevant; the deeper backlog was likely already synced and shouldn't be changing. For Postgresql things are trickier, since the filenames are the same (postgresql-%w_%H%M.log) but get overwritten when the same time comes around again (illustrated below).
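
To illustrate the overwrite behaviour (timestamps below are arbitrary examples): the %w_%H%M pattern only encodes weekday, hour and minute, so the same filename comes around again one week later:

    # The log_filename pattern "postgresql-%w_%H%M.log" encodes only weekday,
    # hour and minute, so the same name recurs one week later and the existing
    # file is reused/overwritten instead of a new one accumulating.
    # 7 days x 24 hours x 60 minutes = 10,080 distinct filenames in the cycle.
    from datetime import datetime, timedelta

    t = datetime(2024, 12, 8, 13, 25)            # arbitrary example timestamp (a Sunday)
    next_week = t + timedelta(days=7)

    print(t.strftime("postgresql-%w_%H%M.log"))          # postgresql-0_1325.log
    print(next_week.strftime("postgresql-%w_%H%M.log"))  # postgresql-0_1325.log (same name)
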

@Vultaire

I'm hoping this is still on people's radar: this interaction between postgresql and grafana-agent is creating truly excessive logging in some environments. I'm seeing roughly 8 gigs of log lines a day just for events related to the patroni log files, and it is quickly filling the disks of the hosts in one of the clouds I am working on. (10,080 log lines × roughly 300 characters per line × 2,880 times a day (twice a minute) ≈ 8,709,120,000 bytes as a rough estimate.)

@Vultaire

Vultaire commented Dec 13, 2024

To add more context, per a discussion with the obs team: here is the related job in /etc/grafana-agent.yaml:

    - job_name: charmed-postgresql-common-var-log-patroni
      pipeline_stages:
      - drop:
          expression: .*file is a directory.*
      relabel_configs:
      - replacement: /patroni
        source_labels:
        - __path__
        target_label: path
      static_configs:
      - labels:
          __path__: /snap/grafana-agent/51/shared-logs/patroni/**
          job: charmed-postgresql-common-var-log-patroni
          juju_application: postgresql
          juju_charm: grafana-agent
          juju_model: landscape-dev
          juju_model_uuid: $REDACTED
          juju_unit: postgresql/1
          snap_name: charmed-postgresql
        targets:
        - localhost

@sed-i
Contributor

sed-i commented Dec 13, 2024

Summary of our call:

  • patroni.log.N files are rotated, and there are a lot of them, for some reason - from patroni.log.1 to patroni.log.10080.
  • grafana agent periodically logs to syslog a CREATE event for all of the patroni log files, including the old ones that were already rotated.
  • patroni rotates files every minute, so the stat of all files changes every minute.
  • Even when grafana-agent is down, the patroni files are still rotated every minute, so grafana-agent isn't "locking" anything.

Per the comment by @dragomirp, this is by design:

Postgresql is configured to keep a week's worth of per minute logs that are truncated each minute. This is behaviour configured by us, but it is spec behaviour so discussions would be necessary to change it.

We could see how ergonomic it would be to make the ** glob configurable so that grafana agent tracks only a subset of patroni files:

job = self._snap_plug_job(
    endpoint.owner,
    f"{path}/**",
    topology.application,
    str(topology.unit),
    self._path_label(path),
)
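
One possible shape for that, as a non-runnable fragment mirroring the snippet above (the log_path_pattern option name and the config plumbing are hypothetical, not an existing grafana-agent charm option):

    # Hypothetical sketch only, continuing the snippet above: read the glob
    # suffix from charm config instead of hard-coding "**", so operators could
    # narrow tracking to e.g. "patroni.log". The option name is made up.
    pattern = self.config.get("log_path_pattern", "**")

    job = self._snap_plug_job(
        endpoint.owner,
        f"{path}/{pattern}",
        topology.application,
        str(topology.unit),
        self._path_label(path),
    )
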

That being said, the patroni log files seem to have little value:

root@dev:/var/snap/charmed-postgresql/common/var/log/patroni# cat patroni.log.7404
2024-12-08 13:25:52 UTC [9279]: INFO: no action. I am (postgresql-1), the leader with the lock 
2024-12-08 13:26:02 UTC [9279]: INFO: no action. I am (postgresql-1), the leader with the lock 
2024-12-08 13:26:12 UTC [9279]: INFO: no action. I am (postgresql-1), the leader with the lock 
2024-12-08 13:26:22 UTC [9279]: INFO: no action. I am (postgresql-1), the leader with the lock
2024-12-08 13:26:32 UTC [9279]: INFO: no action. I am (postgresql-1), the leader with the lock 
2024-12-08 13:26:42 UTC [9279]: INFO: no action. I am (postgresql-1), the leader with the lock
root@dev:/var/snap/charmed-postgresql/common/var/log/patroni# cat patroni.log.282
2024-12-13 12:07:03 UTC [9279]: ERROR: Error communicating with DCS 
2024-12-13 12:07:04 UTC [9279]: INFO: demoting self because DCS is not accessible and I was a leader 
2024-12-13 12:07:04 UTC [9279]: INFO: Demoting self (offline) 
2024-12-13 12:07:04 UTC [9279]: WARNING: Loop time exceeded, rescheduling immediately. 
2024-12-13 12:07:04 UTC [9279]: INFO: Lock owner: postgresql-1; I am postgresql-1 
2024-12-13 12:07:04 UTC [9279]: INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader

I wonder if:

  • the logging policy for patroni can be decoupled from postgres, i.e. have different retention and rotation settings.
  • the rotation period can be reduced to e.g. 1 day, so we have far fewer files to track.

Another option: see if we can have grafana agent not log CREATE events.
