Introduce `commandlog heavytraffic` to record big response packet. #336

CharlesChen888 · 2024-04-19T09:48:32Z

Why fat log?

We already have slowlog, which keeps a record of recent queries that take relatively long time to execute. However sometimes it is not enough for debugging.

For example, we often receive questions from users like: "Why is my QPS not high, but the network flow is substantial?" or "Why did the network flow suddenly increase?" The situation is network flow suddenly increases without significant higher QPS.

The most common reason is that users are accessing big keys, like getting a long string, or using commands like keys, scan, hget, zrange to request too many elements at once. This is common among our users.

Here a new log is introduced to keep a record when a query triggers a large response (and we may call this log "fatlog"), in which the command, response size, time stamp, client name and client network address are logged. This can help to monitor traffic flow spikes and big keys. And it is a quick way to pinpoint the problem.

Alternative solution?

Slowlog: the command with big response packet may not be slow enough to trigger slowlog.
Big key: we can detect big keys, but we are not sure whether they are accessed at a certain time, and accessing a big key does not necessarily generate large responses.
Client traffic statistics: we may need to locate the exact commands.

Detail introduction

Just like slowlog, fatlog is not actually logged in a file, and is accessible through a series of commands:

FATLOG GET [count]
FATLOG LEN
FATLOG RESET

127.0.0.1:6379> config set fatlog-log-bigger-than 10
OK
127.0.0.1:6379> set a 1234567890
OK
127.0.0.1:6379> get a
"1234567890"
127.0.0.1:6379> fatlog get
1) 1) (integer) 3
   2) (integer) 1713516667
   3) (integer) 17
   4) 1) "get"
      2) "a"
   5) "127.0.0.1:62969"
   6) ""

also a help command is provided:

FATLOG HELP

And just like slowlog, two config items are provided to set how many log items are preserved and the response size threshold of being logged.

fatlog-max-len (default value: 128)
fatlog-log-bigger-than (default value: 1MB)

About implementation

You will find that a lot changes have been appllied to slowlog.c. That is because the implementation of fatlog is basically reusing the code of slowlog, since they have similar format and are controled by similar commands. So what used to be slowlog.c now contains code of both slowlog and fatlog.

madolson · 2024-04-21T18:31:14Z

I'm not sure about this specific feature. At least from what I've seen, it's less common for a single command to produce an unexpectedly large result without it also showing up in the slowlog.

CharlesChen888 · 2024-04-22T05:52:10Z

At least from what I've seen, it's less common for a single command to produce an unexpectedly large result without it also showing up in the slowlog.

This does happen. Some of our users are very sensitive to network flow spikes, and they couldn't find any clue in slowlog, so thay asked us about it.

Slowlog only records time of command executions ss. But a hash lookup and a memory copy (simple get command, even with a big response packet of 1MB) could be really quick, (compared to the default slowlog threshold, 10ms).

hwware · 2024-04-24T14:44:10Z

@CharlesChen888 Before I review your PR, could you please provide some background information about why you want to update the slowlogCommand to heavyLoadLogCommand? Because in the top comment, you mention you want to add some new commands for fatlog, but in your pr, you make some changes for slowlog files. I am confused it.

CharlesChen888 · 2024-04-25T00:49:43Z

@hwware This is because fatlog shares a lot code of slowlog. So the function used to be called slowlogCommand now contains code of both slowlog command and fatlog command. I tried to think of a name meaning "both slow and fat", "heavy load" seems to cover both situation.

madolson · 2024-05-06T23:04:43Z

This does happen. Some of our users are very sensitive to network flow spikes, and they couldn't find any clue in slowlog, so they asked us about it.

Can you be more precise than this? People ask us at AWS too about issues, but we've always been able to debug this type of problem without what you proposed. What do you mean they are sensitive to "network flow spikes". We've seen a lot more issues with a sudden burst of smaller commands than this type of "single large" packet issue.

CharlesChen888 · 2024-05-10T06:12:32Z

@madolson

We've seen a lot more issues with a sudden burst of smaller commands than this type of "single large" packet issue.

Yes, this is a common case and we've also seen a lot.

But there is another situation in which network flow suddenly increases without significant higher QPS, the most common reason is that users are accessing big keys, like getting a long string, or using commands like keys, scan, hget, zrange to request too many elements at once. This is common among our users. And fatlog is mainly helpful in this situation, it is a quick way to pinpoint the cause.

codecov · 2024-05-10T06:35:54Z

Codecov Report

Attention: Patch coverage is 97.91667% with 1 line in your changes missing coverage. Please review.

Project coverage is 70.16%. Comparing base (8faf278) to head (55f1a52).
Report is 301 commits behind head on unstable.

Files with missing lines	Patch %	Lines
src/slowlog.c	96.55%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable     #336      +/-   ##
============================================
+ Coverage     70.12%   70.16%   +0.04%     
============================================
  Files           111      111              
  Lines         60305    60339      +34     
============================================
+ Hits          42289    42338      +49     
+ Misses        18016    18001      -15

Files with missing lines	Coverage Δ
src/commands.def	`100.00% <ø> (ø)`
src/config.c	`78.39% <ø> (ø)`
src/networking.c	`85.51% <100.00%> (+0.05%)`	⬆️
src/server.c	`88.61% <100.00%> (+0.03%)`	⬆️
src/server.h	`100.00% <ø> (ø)`
src/slowlog.c	`97.89% <96.55%> (+0.42%)`	⬆️

... and 10 files with indirect coverage changes

---- 🚨 Try these New Features:

Flaky Tests Detection - Detect and resolve failed and flaky tests
JS Bundle Analysis - Avoid shipping oversized bundles

soloestoy · 2024-05-11T03:44:35Z

I think the scenario in the top comment is very clear: "slow" and "fat" are not same. I'd love to explain it further in my understanding.

We have different types of resources like CPU, memory, network, etc. The slowlog can only record commands eat lots of CPU during the processing phase (doesn't include read/write network time), but can not record commands eat too many memory and network. For example:

run "SET key value(10 megabytes)" command would not be recored in slowlog, since when processing it the SET command only insert the value's pointer into db dict. But that command eats huge memory in query buffer and bandwidth from network. In this case, just 1000 tps can cause 10GB/s network flow.
run "GET key" command and the key's value length is 10 megabytes. The get command can eat huge memory in output buffer and bandwidth to network.

So, the fatlog can help users to locate these special cases.

BTW, the PR only record the fat output buffer, we should also record the fat input buffer (by argv_len_sum).

madolson · 2024-05-13T22:51:27Z

BTW, the PR only record the fat output buffer, we should also record the fat input buffer (by argv_len_sum).

+1, although I would imagine unexpected large outputs are a lot more common than unexpected large inputs. There are many surprising ways to see large output, like running KEYS *.

I guess I'm convinced about the value of this feature, however I don't believe that this should be solving the case where there is uniform data with large keys, such as only being able to hit 1k tps with 1mb items. That is trivial to debug by just doing a scan + memory usage + command stat review. That's an average throughput concern. I don't think we need a command for identifying obvious issues that could be root caused with a bit of effort or a little bit of tooling. If there are a few, commands generating most of the traffic.

Another thought, should we also accumulate this information on cmd_stats? That would maybe more useful for seeing changes in command output trends compared to a fat log?

madolson · 2024-04-19T22:40:17Z

src/Makefile

@@ -383,7 +383,7 @@ endif
 ENGINE_NAME=valkey
 SERVER_NAME=$(ENGINE_NAME)-server$(PROG_SUFFIX)
 ENGINE_SENTINEL_NAME=$(ENGINE_NAME)-sentinel$(PROG_SUFFIX)
-ENGINE_SERVER_OBJ=threads_mngr.o adlist.o quicklist.o ae.o anet.o dict.o kvstore.o server.o sds.o zmalloc.o lzf_c.o lzf_d.o pqsort.o zipmap.o sha1.o ziplist.o release.o networking.o util.o object.o db.o replication.o rdb.o t_string.o t_list.o t_set.o t_zset.o t_hash.o config.o aof.o pubsub.o multi.o debug.o sort.o intset.o syncio.o cluster.o cluster_legacy.o crc16.o endianconv.o slowlog.o eval.o bio.o rio.o rand.o memtest.o syscheck.o crcspeed.o crc64.o bitops.o sentinel.o notify.o setproctitle.o blocked.o hyperloglog.o latency.o sparkline.o valkey-check-rdb.o valkey-check-aof.o geo.o lazyfree.o module.o evict.o expire.o geohash.o geohash_helper.o childinfo.o defrag.o siphash.o rax.o t_stream.o listpack.o localtime.o lolwut.o lolwut5.o lolwut6.o acl.o tracking.o socket.o tls.o sha256.o timeout.o setcpuaffinity.o monotonic.o mt19937-64.o resp_parser.o call_reply.o script_lua.o script.o functions.o function_lua.o commands.o strl.o connection.o unix.o logreqres.o
+ENGINE_SERVER_OBJ=threads_mngr.o adlist.o quicklist.o ae.o anet.o dict.o kvstore.o server.o sds.o zmalloc.o lzf_c.o lzf_d.o pqsort.o zipmap.o sha1.o ziplist.o release.o networking.o util.o object.o db.o replication.o rdb.o t_string.o t_list.o t_set.o t_zset.o t_hash.o config.o aof.o pubsub.o multi.o debug.o sort.o intset.o syncio.o cluster.o cluster_legacy.o crc16.o endianconv.o heavyloadlog.o eval.o bio.o rio.o rand.o memtest.o syscheck.o crcspeed.o crc64.o bitops.o sentinel.o notify.o setproctitle.o blocked.o hyperloglog.o latency.o sparkline.o valkey-check-rdb.o valkey-check-aof.o geo.o lazyfree.o module.o evict.o expire.o geohash.o geohash_helper.o childinfo.o defrag.o siphash.o rax.o t_stream.o listpack.o localtime.o lolwut.o lolwut5.o lolwut6.o acl.o tracking.o socket.o tls.o sha256.o timeout.o setcpuaffinity.o monotonic.o mt19937-64.o resp_parser.o call_reply.o script_lua.o script.o functions.o function_lua.o commands.o strl.o connection.o unix.o logreqres.o


Even assuming we accept this, I don't really like the idea of renaming the slowlog. I think the current naming of "slow" is probably sufficient.

madolson · 2024-05-13T22:33:26Z

src/commands/slowlog-get.json

@@ -6,7 +6,7 @@
        "since": "2.2.12",
        "arity": -2,
        "container": "SLOWLOG",
-        "function": "slowlogCommand",
+        "function": "heavyLoadLogCommand",


My preference was to avoid large container commands. I would rather have slowlogCommand call some shared function then have this call a container function.

madolson · 2024-05-13T22:34:14Z

src/heavyloadlog.c

+ * Similarly, fatlog remembers the latest N queries that has a response
+ * larger than K bytes.
+ *
+ * The size of the response to reach to be logged in the fat lof is set


Suggested change

* The size of the response to reach to be logged in the fat lof is set

* The size of the response to reach to be logged in the fat log is set

madolson · 2024-05-13T22:37:17Z

src/module.c

@@ -1156,6 +1156,7 @@ int64_t commandFlagsFromString(char *s) {
        else if (!strcasecmp(t,"allow-stale")) flags |= CMD_STALE;
        else if (!strcasecmp(t,"no-monitor")) flags |= CMD_SKIP_MONITOR;
        else if (!strcasecmp(t,"no-slowlog")) flags |= CMD_SKIP_SLOWLOG;
+        else if (!strcasecmp(t,"no-fatlog")) flags |= CMD_SKIP_FATLOG;


What would the use case be for logging to the slowlog but not the fatlog (or vice versa)? Today it's only set in exec, since we want the underlying container command to show up instead of the exec, which I imagine is also true here. I almost think we have one flag, no-slowlog that implies skipping both.

madolson · 2024-05-13T22:43:56Z

src/config.c

@@ -3230,6 +3231,7 @@ standardConfig static_configs[] = {
    createLongLongConfig("proto-max-bulk-len", NULL, DEBUG_CONFIG | MODIFIABLE_CONFIG, 1024*1024, LONG_MAX, server.proto_max_bulk_len, 512ll*1024*1024, MEMORY_CONFIG, NULL, NULL), /* Bulk request max size */
    createLongLongConfig("stream-node-max-entries", NULL, MODIFIABLE_CONFIG, 0, LLONG_MAX, server.stream_node_max_entries, 100, INTEGER_CONFIG, NULL, NULL),
    createLongLongConfig("repl-backlog-size", NULL, MODIFIABLE_CONFIG, 1, LLONG_MAX, server.repl_backlog_size, 1024*1024, MEMORY_CONFIG, NULL, updateReplBacklogSize), /* Default: 1mb */
+    createLongLongConfig("fatlog-log-bigger-than", NULL, MODIFIABLE_CONFIG, -1, LLONG_MAX, server.fatlog_log_bigger_than, 16*1024, INTEGER_CONFIG, NULL, NULL),


Suggested change

createLongLongConfig("fatlog-log-bigger-than", NULL, MODIFIABLE_CONFIG, -1, LLONG_MAX, server.fatlog_log_bigger_than, 16*1024, INTEGER_CONFIG, NULL, NULL),

createLongLongConfig("fatlog-log-bigger-than", NULL, MODIFIABLE_CONFIG, -1, LLONG_MAX, server.fatlog_log_bigger_than, 16*1024, MEMORY_CONFIG, NULL, NULL),

Any reason we wouldn't let this be a memory config, so you could set it to 1mb or something?

16kb also seems way too small, some historical data was have shows that isn't even in the P99 of data size. I would say at least 1mb.

madolson · 2024-05-13T22:57:25Z

src/commands/fatlog-get.json

@@ -0,0 +1,69 @@
+{


Although I understand the rational behind a new command, a fat command is logically different from a slow command, I would also like us to more seriously consider just adding one new field to the slowlog, the output/input bytes, and then having both get logged to the same place. It's not a perfect fit, but then users can just look in "one" place for bad behavior. It also allows the two logs to perhaps explain each other, since a large output can cause a slow command. We could extend SLOWLOG GET [count] [TYPE FAT|SLOW] to only show entries of the specific type.

I'm thinking about it more from a simpler way to explain it to end users.

This does make things easier than adding a new series of commands, if we can ignore that changes in format of command reply may cause backwards compatibility problem.
But a big packet in query or response does not necessarily mean it will be slow. I'm not sure if we should do this.

madolson · 2024-05-27T14:51:24Z

Discussed this in a separate meeting. Everyone is generally aligned about the idea, but we need to refine the name and details.

CharlesChen888 · 2024-05-28T08:02:19Z

but we need to refine the name and details.

I will try to make some time for it in this week.

hwware · 2024-05-30T18:02:30Z

I propose one command name: BigValueLog.

From above comments, this command usually is used to record the big value for one key. It is different from the slowlog, which goal is to record the entry exceeded a specified execution time. Sometimes due to cpu or disk higher usage, one entry could be recorded in slowlog, but not in fatlog (BigValueLog).

Thus, I do not think adding one more argument in slowlog as SLOWLOG GET [count] [TYPE FAT|SLOW] is a good idea. I still prefer to add a new command in Valkey.

Or if we want to extend current slowlog function, I prefer the command format as: SLOWLOG GET [count] [FAT|SLOW|ALL]
Then it will cover the case where a key response with slow and fat condition.

CharlesChen888 · 2024-06-05T01:45:23Z

From above comments, this command usually is used to record the big value for one key. It is different from the slowlog, which goal is to record the entry exceeded a specified execution time. Sometimes due to cpu or disk higher usage, one entry could be recorded in slowlog, but not in fatlog (BigValueLog).

Yes, this is why I prefer to create a new command, rather than adding new field to SLOWLOG.

I propose one command name: BigValueLog.

Or BigPacketLog? I am fine with all three names.

hwware · 2024-06-26T14:49:13Z

src/commands/fatlog-help.json

@@ -0,0 +1,22 @@
+{
+    "HELP": {
+        "summary": "Show helpful text about the different subcommands",


Suggested change

"summary": "Show helpful text about the different subcommands",

"summary": "Returns helpful text about the different subcommands",

hwware · 2024-06-26T14:58:29Z

src/config.c

@@ -3156,6 +3156,7 @@ standardConfig static_configs[] = {
    createULongConfig("active-defrag-max-scan-fields", NULL, MODIFIABLE_CONFIG, 1, LONG_MAX, server.active_defrag_max_scan_fields, 1000, INTEGER_CONFIG, NULL, NULL), /* Default: keys with more than 1000 fields will be processed separately */
    createULongConfig("slowlog-max-len", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.slowlog_max_len, 128, INTEGER_CONFIG, NULL, NULL),
    createULongConfig("acllog-max-len", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.acllog_max_len, 128, INTEGER_CONFIG, NULL, NULL),
+    createULongConfig("fatlog-max-len", NULL, MODIFIABLE_CONFIG, 0, LONG_MAX, server.fatlog_max_len, 128, INTEGER_CONFIG, NULL, NULL),


We should add these 2 new config parameter in valkey.conf and describe details, pls reference the Slow Log section.

hwware · 2024-06-26T15:11:38Z

I go through the PR, most part it looks good to me, But the biggest concern for me is hard to maintenance in the future. Slowlog and Fatlog are 2 different concepts, but in this PR, they share some variables and functions, but name is still only point to slowlog (for example: slowlogCreateEntry, slowlogCommand, slowlogEntry). I worry about it takes longer time to understand the codes several year later.

How do you think @valkey-io/core-team

zuiderkwast

LGTM. Just some nits.

Actually I'm OK with the name "FATLOG". So far I didn't hear a better suggestion.

In Wikipedia Fat (disambiguation) I find the following Fat abstract concepts:

Fat address
Fat binary, a multi-platform (executable) file in computing
Fat hex
Fat link, a multi-tailed hyperlink
Fat node
Fat object (disambiguation)
Fat pointer, a special type of pointer in some programming languages
Fat shellcode

Thus, using the word "fat" for abstract concepts in programming is nothing new. I think it is the right name for this concept.

zuiderkwast · 2024-06-26T17:22:21Z

src/server.c

+/* Log the last command a client executed into the fatlog. */
+void fatlogPushCurrentCommand(client *c, struct serverCommand *cmd, size_t size) {
+    /* Some commands may contain sensitive data that should not be available in the fatlog. */
+    if (cmd->flags & CMD_SKIP_SLOWLOG) return;


Nit about the comment: The only command marked as SKIP_SLOWLOG is EXEC (if I grep correctly), so don't think this is about sensitive data. it can be that EXEC does all the work of the commands in the transaction, so EXEC takes the time and space of these commands. It would be misleading to just report "EXEC" in the slowlog or fatlog. To make it useful, we would need to store the whole transaction in the slowlog/fatlog entry. Maybe just don't mention "sensitive data"?

Suggestion:

/* Some commands should not be available in the slowlog and the fatlog. */

You are right. Maybe this is originally meant for ACL commands, but they already redacted the "sensitive data".

src/slowlog.c

PingXie · 2024-06-27T04:52:54Z

This new metric makes sense to me on a high level but I am very much not a fan of this new command name. I also wonder if it makes sense to fold this into the slowlog command:

SLOWLOG GET gives us the default behavior
SLOWLOG GET BY TIME == SLOWLOG GET
SLOWLOG GET BY SIZE == this PR
SLOWLOG GET BY XXX for future proofing

Thoughts?

soloestoy · 2024-06-27T06:22:45Z

Adding type to SLOWLOG is a bit wired to me. One is a time dimension and the other is a spatial dimension, they focus on different areas, so I don't think it's a good approach.

I would not recommend using BigValueLog. Here we are concerned with network traffic, such as an MSET composed of multiple small key-value pairs creating a large input, or an MGET that has many small values making up a large output.

I think FATLOG is the most suitable name. It conveys the meaning accurately and is also concise enough. Moreover, as we can see from the examples provided by @zuiderkwast , it is also applied in other fields and carries a similar meaning to what is intended in this PR.

However, if everyone feels that this name is inappropriate, BigPacketLog would be an acceptable alternative to me.

CharlesChen888 · 2024-06-27T08:59:40Z

Slowlog and Fatlog are 2 different concepts, but in this PR, they share some variables and functions, but name is still only point to slowlog

We can just rename the common code to make it look like a common lib, then make slowlog and fatlog to become two interfaces of this lib. (This is actually my original idea.)

CharlesChen888 · 2024-06-27T11:24:14Z

I just realized, and maybe this should be noted ... when logging fat input queries, the packet length is c->argv_len_sum, which does not include the length of RESP protocol characters, but when logging fat output reponses, the length used is c->cmd_reply_length, which is the total length of response content and protocol characters.

hwware · 2024-06-27T14:30:09Z

Adding type to SLOWLOG is a bit wired to me. One is a time dimension and the other is a spatial dimension, they focus on different areas, so I don't think it's a good approach.

I would not recommend using BigValueLog. Here we are concerned with network traffic, such as an MSET composed of multiple small key-value pairs creating a large input, or an MGET that has many small values making up a large output.

I think FATLOG is the most suitable name. It conveys the meaning accurately and is also concise enough. Moreover, as we can see from the examples provided by @zuiderkwast , it is also applied in other fields and carries a similar meaning to what is intended in this PR.

However, if everyone feels that this name is inappropriate, BigPacketLog would be an acceptable alternative to me.

Agree with you on not adding type to SLOWLOG. Since Fat abstract concepts are already applied in other fields, I think
FatLog is good enough for our Valkey. I vote keeping it.

hwware · 2024-06-27T14:31:38Z

Slowlog and Fatlog are 2 different concepts, but in this PR, they share some variables and functions, but name is still only point to slowlog

We can just rename the common code to make it look like a common lib, then make slowlog and fatlog to become two interfaces of this lib. (This is actually my original idea.)

Thanks, pls update to a common interface and make the Slowlog and Fatlog implementation separated. I think it is easier to maintain later.

Signed-off-by: Chen Tianjie <[email protected]>

zuiderkwast · 2024-07-02T17:28:57Z

We discussed this in the core team meeting. The only remaining issue to agree about is the name. Some people don't like FAT because it has a negative connotation.

@valkey-io/core-team Please comment your favorite name and some alternatives you can accept. Then we'll see if there is a winner.

Remember: The log is not about the size of keys. It's about the command and response sizes, the client traffic.

Suggestions:

FATLOG
LARGELOG
BIGLOG
BIGPACKETLOG
HUGELOG
THICKLOG
HEAVYLOG (heavy traffic?)
~~HYPERLOGLOG~~
Other suggestions?

hwware · 2024-07-02T17:33:36Z

We discussed this in the core team meeting. The only remaining issue to agree about is the name. Some people don't like FAT because it has a negative connotation.

@valkey-io/core-team Please comment you're favorite name and some alternatives you can accept. Then we'll see if there is a winner.

Remember: The log is not about the size of keys. It's about the command and response sizes, the client traffic.

Suggestions:

FATLOG

LARGELOG

BIGLOG

BIGPACKETLOG

HUGELOG

THICKLOG

HEAVYLOG (heavy traffic?)

~~HYPERLOGLOG~~

Other suggestions?

I vote No1. LargeLog No2. BigLog No No 3.

PingXie · 2024-07-02T18:10:09Z

+1 on "LARGELOG"

Signed-off-by: Chen Tianjie <[email protected]>

enjoy-binbin · 2024-07-03T11:16:58Z

+1 on "LARGELOG"

soloestoy · 2024-07-03T11:29:25Z

Adjectives like "big," "huge," or "large" are all acceptable, but we need to clarify what exactly they are describing. Is it a big key, a huge value, or a large command? Otherwise, it can easily be misconstrued. We need to describe clearly what it is recording.

Terms like "bigPacketLog" or "heavyTrafficLog" would be more appropriate, if these terms are too long, then "heavylog" could also work. "Heavy" can be used to describe traffic, no one would say "big traffic," "large traffic," or "huge traffic," right?

zuiderkwast · 2024-07-03T13:13:08Z

I prefer that we use different words for the size of a key and the size of a command/response.

Another area of possible confusion could have been keys on the top-level vs keys in hashes, but here we have the terminology "key" for the top-level and "field" for hash fields. This terminology helps when we talk about these, so we don't have to explain every time what we're talking about.

I think we can reserve the word "big" for keys and use another word for commands and responses. I'm leaning towards "heavy" now, since it's about traffic. Let's consider the terminology "heavy command" and "heavy response". 🤔 .... I like it.

I did a little grep in docs and code and I find that "heavy" is almost exclusively for traffic and IO ("a write-heavy application", "heavy disk I/O"), while "large" is used for many different things ("large databases", "a very large COUNT option", etc.).

+1 on HEAVYLOG

kyle-yh-kim · 2024-07-04T03:23:27Z

src/networking.c

+    /* Record query length for fatlog. */
+    c->cmd_query_length += c->qb_pos - qb_pos_start;


Thanks for sharing this commit. The c->qb_pos tracking is definitely a feasible option.
I left some edge-case considerations on the other thread, and will leave the link here for reference; #720 (comment).

We've internally considered using c->qb_pos as well. While tracking c->qb_pos also works, it comes with a few edge cases, namely;

c->qb_pos will not start from 0 for some pipeline cases. This means, we must also track the starting position of the qb_pos, call it qb_pos_start. I see that this case is already covered by your commit, which is good.

Optimization 1. Large object creation. For the case where a non-primary client's buffer contains just the bulk element, the SDS ownership is simply moved from c->querybuf to robj, rather than copying the entire SDS. In such case, c->qb_pos do not step forward, even though the bytes have actually been read. We must manually account for this, otherwise this ingress bytes is forever lost.

Optimization 2. Early c->qb_pos reset. c->qb_pos could be prematurely reset to 0, such that its processing could benefit from the above optimization (creating large object from the 0th index of c->querybuf). In conjunction with the first edge case (where we start from a non-zero qb_pos_start), simply incrementing c->qb_pos - qb_pos_start can yield a negative ingress bytes result.

When comparing both implementations side-by-side, it became clear that c->qb_pos including all its edge case handlings - was harder to comprehend when compared to the c->argv_len_sum variant. Not to mention, c->qb_pos accounting will need to change upon any newly introduced query processing optimizations, and/or changes to the existing optimizations mentioned above.

Ultimately, I find c->argv_len_sum to be a stronger contract. Either subroutines (processInlineBuffer() or processMultibulkBuffer()) hold a guarantee of fully populated c->argc/argv/argv_len_sum before command execution. Any existing or future c->qb_pos optimizations aren't concerned, as long as c->argc/argv/argv_len_sum are correctly filled. I also find the implementation more intuitive and easier to maintain.

But no strong opinions on this, it's good that we are having this discussion, and I'm fine to swing both ways. Curious to hear the core-team's thoughts as well.

During the latest core meeting (July 15th 2024), we concluded to move forward with the currently upheld c->argv_len_sum approach. I will now resolve this thread.

@CharlesChen888
Given this decision, I would first wait for the merger of per-slot network-bytes-in PR, which introduces c->net_input_bytes_curr_cmd based on c->argv_len_sum calculation. After its merger, your PR may refer to c->net_input_bytes_curr_cmd as a heuristic for assessing fatlogs. With that said, let me know if you need any help from my end. If there's a better way to collaborate, I'm open for it.

madolson · 2024-07-08T14:48:39Z

+1 to Largelog.

Alternative being considered is HEAVYTRAFFICLOG.

soloestoy · 2024-07-24T03:49:24Z

After a core team meeting, we decided adding a new command COMMANDLOG with subcommands HEAVYTRAFFIC and SLOW, and then slowlog.c can be renamed to a common commandlog.c.

CharlesChen888 · 2024-11-23T15:33:28Z

I must apologize for forgetting about this PR. Many things have happened in the past few months and now I finally have the time and energy to return to the Valkey community. I was thinking about finishing this new feature this morning, but it turns out that @soloestoy has taken over the work. So I think I will just close this PR and we continue to finish this feature in the new one.

soloestoy · 2024-11-25T01:45:45Z

I must apologize for forgetting about this PR. Many things have happened in the past few months and now I finally have the time and energy to return to the Valkey community. I was thinking about finishing this new feature this morning, but it turns out that @soloestoy has taken over the work. So I think I will just close this PR and we continue to finish this feature in the new one.

welcome back @CharlesChen888 : )

CharlesChen888 force-pushed the add-fat-log branch from 4fd206a to c3727a7 Compare April 19, 2024 09:54

madolson added the major-decision-pending Major decision pending by TSC team label Apr 21, 2024

CharlesChen888 force-pushed the add-fat-log branch from 428f06b to f894a52 Compare May 10, 2024 09:48

madolson reviewed May 13, 2024

View reviewed changes

CharlesChen888 force-pushed the add-fat-log branch from 2a80b69 to 9e7f529 Compare May 29, 2024 10:01

CharlesChen888 force-pushed the add-fat-log branch from 1aa1d56 to a0cf493 Compare June 5, 2024 01:52

CharlesChen888 requested review from zuiderkwast, madolson and hwware June 25, 2024 05:57

hwware reviewed Jun 26, 2024

View reviewed changes

zuiderkwast approved these changes Jun 26, 2024

View reviewed changes

CharlesChen888 added 6 commits July 1, 2024 16:21

Implement fat log.

6548769

Signed-off-by: Chen Tianjie <[email protected]>

Rename file and correct typo.

e4eb38b

Signed-off-by: Chen Tianjie <[email protected]>

Rename. Fix details.

b273214

Signed-off-by: Chen Tianjie <[email protected]>

Run clang format.

a3ee175

Signed-off-by: Chen Tianjie <[email protected]>

Fatlog count input big packet as well.

e1ab7d9

Signed-off-by: Chen Tianjie <[email protected]>

Minor fix in comments.

e928f27

Signed-off-by: Chen Tianjie <[email protected]>

CharlesChen888 force-pushed the add-fat-log branch from d004306 to e928f27 Compare July 1, 2024 08:29

Run clang format.

a791ac0

Signed-off-by: Chen Tianjie <[email protected]>

zuiderkwast mentioned this pull request Jul 2, 2024

Add network-bytes-in and network-bytes-out metric support under CLUSTER SLOT-STATS command (#20) #720

Merged

CharlesChen888 added 2 commits July 3, 2024 09:57

Record whole query packer instead of just arguments.

204dfba

Signed-off-by: Chen Tianjie <[email protected]>

Merge branch 'unstable' into add-fat-log.

55f1a52

Signed-off-by: Chen Tianjie <[email protected]>

kyle-yh-kim reviewed Jul 4, 2024

View reviewed changes

madolson changed the title ~~Introduce fatlog to record big response packet.~~ Introduce commandlog heavytraffi to record big response packet. Jul 29, 2024

madolson changed the title ~~Introduce commandlog heavytraffi to record big response packet.~~ Introduce commandlog heavytraffic to record big response packet. Jul 29, 2024

soloestoy mentioned this pull request Nov 12, 2024

Feature COMMANDLOG to record slow execution and large request/reply #1294

Open

CharlesChen888 closed this Nov 23, 2024

	* The size of the response to reach to be logged in the fat lof is set
	* The size of the response to reach to be logged in the fat log is set

	createLongLongConfig("fatlog-log-bigger-than", NULL, MODIFIABLE_CONFIG, -1, LLONG_MAX, server.fatlog_log_bigger_than, 16*1024, INTEGER_CONFIG, NULL, NULL),
	createLongLongConfig("fatlog-log-bigger-than", NULL, MODIFIABLE_CONFIG, -1, LLONG_MAX, server.fatlog_log_bigger_than, 16*1024, MEMORY_CONFIG, NULL, NULL),

	"summary": "Show helpful text about the different subcommands",
	"summary": "Returns helpful text about the different subcommands",

		/* Record query length for fatlog. */
		c->cmd_query_length += c->qb_pos - qb_pos_start;

Introduce commandlog heavytraffic to record big response packet. #336

Introduce commandlog heavytraffic to record big response packet. #336

Conversation

CharlesChen888 commented Apr 19, 2024 • edited Loading

Why fat log?

Alternative solution?

Detail introduction

About implementation

madolson commented Apr 21, 2024

CharlesChen888 commented Apr 22, 2024 • edited Loading

hwware commented Apr 24, 2024 • edited Loading

CharlesChen888 commented Apr 25, 2024

madolson commented May 6, 2024

CharlesChen888 commented May 10, 2024

codecov bot commented May 10, 2024 • edited Loading

Codecov Report

soloestoy commented May 11, 2024 • edited Loading

madolson commented May 13, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

CharlesChen888 May 29, 2024 • edited Loading

Choose a reason for hiding this comment

madolson commented May 27, 2024 • edited Loading

CharlesChen888 commented May 28, 2024

hwware commented May 30, 2024 • edited Loading

CharlesChen888 commented Jun 5, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hwware commented Jun 26, 2024 • edited Loading

zuiderkwast left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

PingXie commented Jun 27, 2024

soloestoy commented Jun 27, 2024 • edited Loading

CharlesChen888 commented Jun 27, 2024

CharlesChen888 commented Jun 27, 2024

hwware commented Jun 27, 2024

hwware commented Jun 27, 2024

zuiderkwast commented Jul 2, 2024 • edited Loading

hwware commented Jul 2, 2024

PingXie commented Jul 2, 2024

enjoy-binbin commented Jul 3, 2024

soloestoy commented Jul 3, 2024 • edited Loading

zuiderkwast commented Jul 3, 2024 • edited Loading

kyle-yh-kim Jul 4, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

madolson commented Jul 8, 2024 • edited Loading

soloestoy commented Jul 24, 2024

CharlesChen888 commented Nov 23, 2024

soloestoy commented Nov 25, 2024

Introduce `commandlog heavytraffic` to record big response packet. #336

Introduce `commandlog heavytraffic` to record big response packet. #336

CharlesChen888 commented Apr 19, 2024 •

edited

Loading

CharlesChen888 commented Apr 22, 2024 •

edited

Loading

hwware commented Apr 24, 2024 •

edited

Loading

codecov bot commented May 10, 2024 •

edited

Loading

soloestoy commented May 11, 2024 •

edited

Loading

CharlesChen888 May 29, 2024 •

edited

Loading

madolson commented May 27, 2024 •

edited

Loading

hwware commented May 30, 2024 •

edited

Loading

hwware commented Jun 26, 2024 •

edited

Loading

soloestoy commented Jun 27, 2024 •

edited

Loading

zuiderkwast commented Jul 2, 2024 •

edited

Loading

soloestoy commented Jul 3, 2024 •

edited

Loading

zuiderkwast commented Jul 3, 2024 •

edited

Loading

kyle-yh-kim Jul 4, 2024 •

edited

Loading

madolson commented Jul 8, 2024 •

edited

Loading