[Q] Slow performance and OOM restarts #504

Open
lupuletic opened this issue Nov 18, 2022 · 1 comment

@lupuletic

Problem Description

We are running go-carbon 0.15.6 as a backend for go-carbonapi 0.14.0, and as the number of metrics has kept increasing (~1.6 TB of *.wsp files), performance has kept degrading. We're getting OOM restarts of the Docker containers, as well as errors in the logs saying "Could not Expand Globs - Context Cancelled".

For the go-carbon servers, the hardware is 8 CPU cores, 48 GB RAM, and very fast storage/disks.
The current config:

[common]
user = "root"
graph-prefix = "carbon.agents.{host}"
metric-endpoint = "tcp://%%METRIC_DESTINATION%%:2003"
max-cpu = 6
metric-interval = "1m0s"

[whisper]
data-dir = "/opt/go-carbon/whisper"
schemas-file = "/etc/go-carbon/storage-schemas.conf"
aggregation-file = ""
workers = 8
max-updates-per-second = 400
sparse-create = true
flock = true
enabled = true
hash-filenames = true
remove-empty-file = true

[cache]
max-size = 400000000
write-strategy = "noop"

[udp]
enabled = false

[tcp]
listen = ":2003"
enabled = true
buffer-size = 2000000
compression = ""

[pickle]
enabled = false

[carbonlink]
enabled = false

[grpc]
enabled = false

[tags]
enabled = false

[carbonserver]
listen = ":80"
enabled = true
query-cache-enabled = true
query-cache-size-mb = 4096
find-cache-enabled = true
buckets = 10
max-globs = 10000
fail-on-max-globs = false
metrics-as-counters = false
trie-index = true
concurrent-index = true
realtime-index = 400000000
trigram-index = false
cache-scan = false
graphite-web-10-strict-mode = true
internal-stats-dir = "/opt/go-carbon/carbonserver"
read-timeout = "1m0s"
idle-timeout = "1m0s"
write-timeout = "1m0s"
scan-frequency = "10m0s"
stats-percentiles = [99, 95, 75, 50]

[dump]
enabled = false

[pprof]
enabled = false

[[logging]]
logger = ""
file = "stdout"
level = "error"
encoding = "mixed"
encoding-time = "iso8601"
encoding-duration = "seconds"

For the go-carbonapi servers, the hardware is 14 CPU cores and 16 GB RAM, and the config file is:

listen: "0.0.0.0:80"
concurency: 1000
cache:
   type: "mem"
   size_mb: 1024
   defaultTimeoutSec: 5
cpus: 0
tz: ""
sendGlobsAsIs: true
maxBatchSize: 5000
idleConnections: 100
pidFile: ""
expireDelaySec: 10
logger:
    - logger: ""
      file: "stdout"
      level: "error"
      encoding: "console"
      encodingTime: "iso8601"
      encodingDuration: "seconds"
upstreams:
    tldCacheDisabled: true
    doMultipleRequestsIfSplit: true
    buckets: 10
    timeouts:
        global: "12s"
        afterStarted: "10s"
        connect: "200ms"
    concurrencyLimit: 0
    keepAliveInterval: "10s"
    maxIdleConnsPerHost: 100
    maxGlobs: 5000
    maxBatchSize: 5000
    backends:
      - "example1"
      - "example2"

One problem we are aware of is metrics from K8s applications, which create a lot of "dead" folders every time a pod is re-spun and therefore renamed. We're trying to move the K8s apps to a different metrics solution, but in the meantime we have set up a cronjob to clean up stale data / folders (roughly along the lines of the sketch below).
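
For illustration only, here is a minimal sketch of the kind of clean-up the cronjob does (not our exact script; the path matches the data-dir above, and the 30-day threshold is an assumption that should be tuned to the retention policy):

#!/usr/bin/env python3
# Illustrative stale-metric cleanup (sketch, not the exact cronjob script).
# Assumption: whisper files not written to for RETENTION_DAYS can be deleted.
import os
import time

WHISPER_ROOT = "/opt/go-carbon/whisper"   # same as data-dir in the config above
RETENTION_DAYS = 30                       # assumption, tune to retention policy
cutoff = time.time() - RETENTION_DAYS * 86400

# Walk bottom-up so directories emptied by deletions can be removed afterwards.
for dirpath, dirnames, filenames in os.walk(WHISPER_ROOT, topdown=False):
    for name in filenames:
        path = os.path.join(dirpath, name)
        if name.endswith(".wsp") and os.path.getmtime(path) < cutoff:
            os.remove(path)
    if dirpath != WHISPER_ROOT and not os.listdir(dirpath):
        os.rmdir(dirpath)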

Can you please give us a hand by pointing out anything that looks out of the ordinary in our config? We're also considering increasing the hardware spec of the go-carbon storage nodes to 12 CPU cores and 64 GB RAM, but we believe some bits of our configuration could be improved as well.

Many, many thanks!


deniszh commented Dec 2, 2022

Hi @lupuletic

Sorry for the late reply, but from our experience your hardware is too limited for such a load. Data size is not that important, but the number of metrics is. The clean-up cronjob is a good thing, but it's not a miracle. In our clusters we have millions of metrics (i.e. files) per node, but for that you need memory for 1) the go-carbon cache, 2) the go-carbon index, 3) the file system cache, and 4) the file system itself. So 64 GB of RAM doesn't sound like a bad idea.
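
As a very rough illustration of that memory budgeting (the per-point and per-metric byte costs below are assumptions made up for the example, not measured go-carbon figures), the configured cache alone can account for a large share of the RAM:

# Back-of-envelope RAM estimate; every constant here is an assumption.
CACHE_MAX_POINTS = 400_000_000      # [cache] max-size from the config above
BYTES_PER_CACHED_POINT = 100        # assumed overhead per in-flight point
METRICS_ON_DISK = 5_000_000         # assumed number of .wsp files per node
INDEX_BYTES_PER_METRIC = 200        # assumed cost of the trie/concurrent index

cache_gib = CACHE_MAX_POINTS * BYTES_PER_CACHED_POINT / 1024**3
index_gib = METRICS_ON_DISK * INDEX_BYTES_PER_METRIC / 1024**3
print(f"cache ~{cache_gib:.0f} GiB, index ~{index_gib:.0f} GiB, "
      "plus whatever is left over for the page cache")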
