
[BUG] Incorrectly reported used_memory_dataset and used_memory_dataset_perc #1373

Open
martinrvisser opened this issue Nov 28, 2024 · 6 comments


@martinrvisser

Describe the bug

Incorrect values are reported by INFO for used_memory_dataset and used_memory_dataset_perc, on the replica only.

Two instances, valkey_version:8.0.0

Primary, port 8001, role:master:

valkey-cli -p 8001 dbsize
(integer) 1

valkey-cli -p 8001 info memory
# Memory
used_memory:12214912
used_memory_human:11.65M
used_memory_rss:15548416
used_memory_rss_human:14.83M
used_memory_peak:81638856
used_memory_peak_human:77.86M
used_memory_peak_perc:14.96%
used_memory_overhead:11623176
used_memory_startup:1009280
used_memory_dataset:591736
used_memory_dataset_perc:5.28%
allocator_allocated:12371384
allocator_active:12693504
allocator_resident:19349504
allocator_muzzy:0
total_system_memory:24964886528
total_system_memory_human:23.25G
used_memory_lua:31744
used_memory_vm_eval:31744
used_memory_lua_human:31.00K
used_memory_scripts_eval:0
number_of_cached_scripts:0
number_of_functions:0
number_of_libraries:0
used_memory_vm_functions:33792
used_memory_vm_total:65536
used_memory_vm_total_human:64.00K
used_memory_functions:184
used_memory_scripts:184
used_memory_scripts_human:184B
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:1.03
allocator_frag_bytes:322120
allocator_rss_ratio:1.52
allocator_rss_bytes:6656000
rss_overhead_ratio:0.80
rss_overhead_bytes:-3801088
mem_fragmentation_ratio:1.27
mem_fragmentation_bytes:3351432
mem_not_counted_for_evict:12280
mem_replication_backlog:10485904
mem_total_replication_buffers:10539056
mem_clients_slaves:53296
mem_clients_normal:74152
mem_cluster_links:0
mem_aof_buffer:8
mem_allocator:jemalloc-5.3.0
mem_overhead_db_hashtable_rehashing:0
active_defrag_running:0
lazyfree_pending_objects:0
lazyfreed_objects:0

Replica, port 8010:

valkey-cli -p 8010 info memory
# Memory
used_memory:12191720
used_memory_human:11.63M
used_memory_rss:16850944
used_memory_rss_human:16.07M
used_memory_peak:39387208
used_memory_peak_human:37.56M
used_memory_peak_perc:30.95%
used_memory_overhead:12268439
used_memory_startup:1009256
used_memory_dataset:18446744073709474897
used_memory_dataset_perc:164961355563008.00%

allocator_allocated:12369896
allocator_active:12759040
allocator_resident:17674240
allocator_muzzy:0
total_system_memory:24964886528
total_system_memory_human:23.25G
used_memory_lua:31744
used_memory_vm_eval:31744
used_memory_lua_human:31.00K
used_memory_scripts_eval:0
number_of_cached_scripts:0
number_of_functions:0
number_of_libraries:0
used_memory_vm_functions:33792
used_memory_vm_total:65536
used_memory_vm_total_human:64.00K
used_memory_functions:184
used_memory_scripts:184
used_memory_scripts_human:184B
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:1.03
allocator_frag_bytes:389144
allocator_rss_ratio:1.39
allocator_rss_bytes:4915200
rss_overhead_ratio:0.95
rss_overhead_bytes:-823296
mem_fragmentation_ratio:1.38
mem_fragmentation_bytes:4677128
mem_not_counted_for_evict:12280
mem_replication_backlog:10539200
mem_total_replication_buffers:10539056
mem_clients_slaves:0
mem_clients_normal:719439
mem_cluster_links:0
mem_aof_buffer:8
mem_allocator:jemalloc-5.3.0
mem_overhead_db_hashtable_rehashing:0
active_defrag_running:0
lazyfree_pending_objects:0
lazyfreed_objects:1375772

To reproduce
Unknown

Expected behavior

Correct values reported for used_memory_dataset and used_memory_dataset_perc.


@zvi-code
Contributor

My guess is that this is due to the non-atomic nature of memory sampling. zmalloc_used_memory walks a per-thread counter array and sums the values. When an allocation and its matching free happen on different threads, one thread's counter ends up positive and the other's negative. Depending on the order in which the counters are read, the overall returned value can be lower or higher than the actual usage (it can even be negative).

Specifically here: we see there were lazyfreed_objects, so those objects were freed in the context of the bio lazy-free thread (not the main thread).
We also see that used_memory_overhead:12268439 is higher than used_memory:12191720, which indicates this is what happened here (the overhead cannot actually be higher than the total).
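As a minimal sketch of that pattern (hypothetical code, not the actual zmalloc implementation): the reader walks the per-thread counters one by one, so an alloc/free pair split across threads can be only half-visible in the sum, making it transiently too high, too low, or even negative.

```c
#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS 16

/* One signed counter per thread: allocations add, frees subtract,
 * possibly on a different thread than the one that allocated. */
static _Atomic long long thread_used_memory[MAX_THREADS];

/* Each load is atomic, but the loop as a whole is not a consistent
 * snapshot: counters already read can change while later ones are
 * still being read, so the sum may not match any real point in time. */
static long long used_memory_snapshot(void) {
    long long sum = 0;
    for (size_t i = 0; i < MAX_THREADS; i++)
        sum += atomic_load_explicit(&thread_used_memory[i], memory_order_relaxed);
    return sum;
}
```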

This is also the reason for the value you get, used_memory_dataset:18446744073709474897, because

used_memory - used_memory_overhead = dataset
=>
12191720 - 12268439 = -76719, which in unsigned 64-bit arithmetic wraps around to 2^64 - 76719 = 18446744073709474897
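For illustration, the same wraparound in a few lines of standard C (just the arithmetic, not Valkey code):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t used_memory = 12191720;   /* replica's used_memory */
    uint64_t overhead    = 12268439;   /* transiently larger than used_memory */
    /* Unsigned subtraction wraps modulo 2^64 instead of going negative. */
    uint64_t dataset = used_memory - overhead;
    printf("%llu\n", (unsigned long long)dataset);  /* prints 18446744073709474897 */
    return 0;
}
```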

I would expect this to be a transient issue; can you confirm?

Regarding a fix: we could possibly retry internally if the result is clearly incorrect.
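For instance, a minimal sketch of such a guard (hypothetical helper, not an actual Valkey patch) could clamp the subtraction so it never wraps, and let the caller re-sample when the snapshot is obviously inconsistent:

```c
#include <stddef.h>

/* Hypothetical helper: compute the dataset size from one sample of
 * total used memory and the computed overhead. If the snapshot is
 * inconsistent (overhead > used), return 0 instead of letting the
 * unsigned subtraction wrap; the caller could then retry the sample. */
static size_t dataset_bytes(size_t used_memory, size_t overhead) {
    return overhead > used_memory ? 0 : used_memory - overhead;
}
```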

@martinrvisser
Author

When you say transient, do you mean you would expect this to resolve itself? It is also showing a similarly odd number for bytes-per-key.

127.0.0.1:8010> MEMORY STATS
 1) "peak.allocated"
 2) (integer) 39387208
 3) "total.allocated"
 4) (integer) 12224920
 5) "startup.allocated"
 6) (integer) 1009256
 7) "replication.backlog"
 8) (integer) 10539216
 9) "clients.slaves"
10) (integer) 0
11) "clients.normal"
12) (integer) 720335
13) "cluster.links"
14) (integer) 0
15) "aof.buffer"
16) (integer) 8
17) "lua.caches"
18) (integer) 0
19) "functions.caches"
20) (integer) 184
21) "db.0"
22) 1) "overhead.hashtable.main"
    2) (integer) 248
    3) "overhead.hashtable.expires"
    4) (integer) 104
23) "overhead.db.hashtable.lut"
24) (integer) 32
25) "overhead.db.hashtable.rehashing"
26) (integer) 0
27) "overhead.total"
28) (integer) 12269351
29) "db.dict.rehashing.count"
30) (integer) 0
31) "keys.count"
32) (integer) 1
33) "keys.bytes-per-key"
34) (integer) -44431
35) "dataset.bytes"
36) (integer) -44431
37) "dataset.percentage"
38) "164473037914112"
39) "peak.percentage"
40) "31.03779411315918"
41) "allocator.allocated"
42) (integer) 12426040
43) "allocator.active"
44) (integer) 12812288
45) "allocator.resident"
46) (integer) 17993728
47) "allocator.muzzy"
48) (integer) 0
49) "allocator-fragmentation.ratio"
50) "1.0310837030410767"
51) "allocator-fragmentation.bytes"
52) (integer) 386248
53) "allocator-rss.ratio"
54) "1.404411792755127"
55) "allocator-rss.bytes"
56) (integer) 5181440
57) "rss-overhead.ratio"
58) "0.9410425424575806"
59) "rss-overhead.bytes"
60) (integer) -1060864
61) "fragmentation"
62) "1.3851194381713867"
63) "fragmentation.bytes"
64) (integer) 4708024

@zvi-code
Contributor

Yes, I expect a repeated call to show correct numbers. Does it?

@martinrvisser
Author

Confirmed, eventually it came back down:

used_memory_dataset:146145
used_memory_dataset_perc:1.28%

@zvi-code
Contributor

zvi-code commented Dec 3, 2024

The transient state I was referring to should resolve itself immediately, usually on the next invocation; was that the case? If so, it's by design and the bug can be closed, IMO.

It's a valid ask, though, to propose/request a feature that would handle (retry?) the case of a temporarily unexpected state.

@hpatro
Contributor

hpatro commented Dec 3, 2024

> The transient state I was referring to should resolve itself immediately, usually on the next invocation; was that the case? If so, it's by design and the bug can be closed, IMO.
>
> It's a valid ask, though, to propose/request a feature that would handle (retry?) the case of a temporarily unexpected state.

@zvi-code Shouldn't we try to fix it rather than letting it underflow?
