Limit tracking custom errors (e.g. from LUA) while allowing non custom errors to be tracked normally #500

KarthikSubbarao · 2024-05-15T08:59:48Z

Implementing the change proposed here: #487

In this PR, we stop tracking new custom error messages (e.g. from Lua) once the number of error messages in the errors RAX reaches the limit of 128. Any additional custom error prefix is instead counted under a single new counter, errorstat_ERRORSTATS_OVERFLOW. Errors that are not flagged as custom (e.g. MOVED / CLUSTERDOWN) continue to be tracked as usual.

This will address the issue of spammed error messages / memory usage of the errors RAX. Additionally, we will not have to execute CONFIG RESETSTAT to restore error stats functionality because normal error messages continue to be tracked.

Example:

# Errorstats
.
.
.
errorstat_127:count=2
errorstat_128:count=2
errorstat_ERR:count=1
errorstat_ERRORSTATS_OVERFLOW:count=2
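
The tracking rule described above can be sketched as a small model. This is a minimal illustration in Python, not the server's C implementation: the `ErrorStats` class, its `track` method, and the `custom` flag are hypothetical names, a plain dict stands in for the errors RAX, and the `>=` comparison against 128 is an assumption about exactly where the limit applies. It also assumes, following a later commit in this PR, that a custom prefix that is already tracked keeps incrementing after the limit is reached.

```python
ERRORSTATS_LIMIT = 128                    # limit named in the PR description
OVERFLOW_PREFIX = "ERRORSTATS_OVERFLOW"   # counter name from the PR description


class ErrorStats:
    """Hypothetical model of per-prefix error counting with an overflow bucket."""

    def __init__(self, limit=ERRORSTATS_LIMIT):
        self.limit = limit
        self.counts = {}  # stands in for the errors RAX

    def track(self, prefix, custom=False):
        # Only NEW custom prefixes are redirected once the limit is reached.
        # Non-custom errors and already-tracked prefixes are counted normally.
        is_new = prefix not in self.counts
        if custom and is_new and len(self.counts) >= self.limit:
            prefix = OVERFLOW_PREFIX
        self.counts[prefix] = self.counts.get(prefix, 0) + 1

    def info_lines(self):
        # Mimics the errorstat_<prefix>:count=<n> shape shown in the example.
        return [f"errorstat_{p}:count={c}" for p, c in self.counts.items()]


stats = ErrorStats()
for i in range(1, 129):
    stats.track(str(i), custom=True)   # 128 distinct custom prefixes ...
    stats.track(str(i), custom=True)   # ... each seen twice
stats.track("ERR")                     # non-custom: still tracked past the limit
stats.track("129", custom=True)        # new custom prefix: goes to overflow
stats.track("130", custom=True)        # new custom prefix: goes to overflow
print("\n".join(stats.info_lines()[-4:]))
```

The last four printed lines have the same shape as the example output above. The model iterates in insertion order, which may differ from the order the server reports.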

KarthikSubbarao · 2024-05-15T09:25:25Z

(Force-pushed because the bot recommended it here: I did not include the commit sign-off originally. It also seemed acceptable because the PR has not been reviewed yet.)

codecov · 2024-05-18T13:29:03Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.25%. Comparing base (752b6ee) to head (08e97ba).
Report is 19 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable     #500      +/-   ##
============================================
+ Coverage     70.20%   70.25%   +0.04%     
============================================
  Files           111      112       +1     
  Lines         60242    60587     +345     
============================================
+ Hits          42295    42567     +272     
- Misses        17947    18020      +73

| Files | Coverage Δ | |
|---|---|---|
| src/networking.c | `88.75% <100.00%> (+3.38%)` | ⬆️ |
| src/script_lua.c | `90.15% <100.00%> (+0.01%)` | ⬆️ |
| src/server.c | `88.47% <ø> (-0.10%)` | ⬇️ |
| src/server.h | `100.00% <ø> (ø)` | |

... and 33 files with indirect coverage changes

ranshid · 2024-05-19T10:26:23Z

@KarthikSubbarao Overall LGTM. I do think we need to add more tests so that future changes will not introduce regressions.
For example, let's include tests verifying that function and redis.call cases ARE getting into errorStats.

srgsanky

lgtm. My comments are minor.

KarthikSubbarao · 2024-05-20T00:28:47Z

> @KarthikSubbarao Overall LGTM. I do think we need to add more tests so that future changes will not introduce regressions. For example, let's include tests verifying that function and redis.call cases ARE getting into errorStats.

@ranshid - When functions are used and contain Lua code that calls server.error_reply with custom error messages, they will still be caught by this new logic once we are past the 128 limit.
If the error is not from Lua (e.g. a syntax error in server.call), it will continue to be tracked.

Did you mean adding test cases where functions (with Lua) are tracked under errorstat_LUA_ERRORSTATS_OVERFLOW when the errors RAX is over the limit?

Example:

for ((i=1; i<=128; i++))                                                            
do
    ./valkey-cli EVAL "return server.error_reply('$i a');" 0
done

% cat customerr.lua 
#!lua name=mylib
redis.register_function(
  'custom_error',
  function() return server.error_reply('customerror 0') end
)

cat customerr.lua | ./valkey-cli -x FUNCTION LOAD REPLACE

% ./valkey-cli                                                                        
127.0.0.1:6379> fcall custom_error 0
(error) customerror 0
127.0.0.1:6379> info errorstats
errorstat_1:count=1
.
.
.
errorstat_128:count=1
errorstat_LUA_ERRORSTATS_OVERFLOW:count=1
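
A test or monitoring script that wants to check output like the above can parse the `errorstat_<prefix>:count=<n>` lines into a dictionary. The following is a minimal sketch in Python; `parse_errorstats` is a hypothetical helper, not part of valkey-cli or the test suite, and it assumes the text was captured from `./valkey-cli info errorstats`.

```python
def parse_errorstats(info_text):
    """Map each error prefix to its count from INFO ERRORSTATS-style text."""
    stats = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line.startswith("errorstat_"):
            continue  # skip section headers, blank lines, and elision dots
        name, _, fields = line.partition(":")
        prefix = name[len("errorstat_"):]
        for field in fields.split(","):
            key, _, value = field.partition("=")
            if key == "count":
                stats[prefix] = int(value)
    return stats


sample = """# Errorstats
errorstat_1:count=1
errorstat_128:count=1
errorstat_LUA_ERRORSTATS_OVERFLOW:count=1
"""
parsed = parse_errorstats(sample)
print(parsed)
```

With the sample text, the function's custom error lands in the overflow counter rather than under its own `customerror` prefix, which is the behavior the example demonstrates.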

ranshid · 2024-05-20T05:59:25Z

> > @KarthikSubbarao Overall LGTM. I do think we need to add more tests so that future changes will not introduce regressions. For example, let's include tests verifying that function and redis.call cases ARE getting into errorStats.
>
> @ranshid - When functions are used and contain Lua code that calls server.error_reply with custom error messages, they will still be caught by this new logic once we are past the 128 limit. If the error is not from Lua (e.g. a syntax error in server.call), it will continue to be tracked.
>
> Did you mean adding test cases where functions (with Lua) are tracked under errorstat_LUA_ERRORSTATS_OVERFLOW when the errors RAX is over the limit?
>
> Example:
>
> for ((i=1; i<=128; i++))
> do
>     ./valkey-cli EVAL "return server.error_reply('$i a');" 0
> done
>
> % cat customerr.lua
> #!lua name=mylib
> redis.register_function(
>   'custom_error',
>   function() return server.error_reply('customerror 0') end
> )
>
> cat customerr.lua | ./valkey-cli -x FUNCTION LOAD REPLACE
>
> % ./valkey-cli
> 127.0.0.1:6379> fcall custom_error 0
> (error) customerror 0
> 127.0.0.1:6379> info errorstats
> errorstat_1:count=1
> .
> .
> .
> errorstat_128:count=1
> errorstat_LUA_ERRORSTATS_OVERFLOW:count=1

@KarthikSubbarao I think this is somewhat problematic. Functions are more like modules IMO and I think we should allow function errors to overflow. I think maybe we can flag the client (or check if we are in the context of eval/evalsha) in order to enforce the overflow?

KarthikSubbarao · 2024-05-20T15:30:06Z

> @KarthikSubbarao I think this is somewhat problematic. Functions are more like modules IMO and I think we should allow function errors to overflow. I think maybe we can flag the client (or check if we are in the context of eval/evalsha) in order to enforce the overflow?

Functions still use Lua and the same APIs, such as server.error_reply, to reply with custom errors. Because of this, functions can still spam the error stats section of the INFO output, just as EVAL/EVALSHA can.

To handle this, once a server has exceeded the limit, any additional custom errors raised from functions are tracked under errorstat_LUA_ERRORSTATS_OVERFLOW.

IMO, this behavior makes sense - but I am also curious to hear from others

KarthikSubbarao · 2024-05-20T21:05:08Z

Sorry for the force push. I did not include the sign off on the previous commit and the DCO check required this

madolson · 2024-06-25T02:40:57Z

@valkey-io/core-team Since @PingXie mentioned this in the other thread, I thought it would be good to try to get consensus on this change. The tl;dr is that in Redis it was possible to spam the errorstat section with custom errors generated from Lua scripts, because you can return arbitrary errors from Lua scripts. @enjoy-binbin implemented a workaround that would clear the errorstat radix tree if its size got above a certain threshold. In order for a user to resume seeing errors, they would need to call CONFIG RESETSTAT. However, this is an admin command, and restricted on many systems, so a user might not be able to get out of this state easily. So instead this change stops reporting errors from Lua scripts above a certain threshold.

Ping suggested:

> I guess that depends on the definition of "lua errors". If they are interpreted as "custom errors" then agreed but do we see value in interpreting them as the origin of the errors? Meaning any errors, both custom and normal, generated by any scripts go to its own RAX. If we go down this path, we would have to create a dedicated errorstat section just for scripts ( errorstat_script_*?). Since this is already a breaking change, I think it would be worthwhile expanding the discussion a bit.

I think this is important to get right for 8, so let's settle this here if possible.
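
For contrast, the earlier workaround summarized in the tl;dr can be sketched as follows. This is a rough model in Python built only from the description in this comment: `ClearingErrorStats`, `track`, and `reset_stat` are hypothetical names, the threshold of 128 and the exact comparison are assumptions, and the real server may keep additional state or a marker entry that is not modeled here.

```python
class ClearingErrorStats:
    """Hypothetical model: clear everything at the threshold, then stop tracking."""

    def __init__(self, limit=128):  # threshold value is an assumption
        self.limit = limit
        self.counts = {}            # stands in for the errorstat radix tree
        self.enabled = True

    def track(self, prefix):
        if not self.enabled:
            return                  # nothing is tracked until a reset
        if prefix not in self.counts and len(self.counts) >= self.limit:
            self.counts.clear()     # drop all entries, custom or not
            self.enabled = False
            return
        self.counts[prefix] = self.counts.get(prefix, 0) + 1

    def reset_stat(self):
        # Stands in for the admin running CONFIG RESETSTAT.
        self.counts.clear()
        self.enabled = True


old = ClearingErrorStats()
for i in range(129):
    old.track(f"custom{i}")         # script spam pushes the tree past the threshold
old.track("MOVED")                  # a normal error is now silently dropped
print(old.enabled, old.counts)

old.reset_stat()
old.track("MOVED")                  # tracked again only after the reset
print(old.enabled, old.counts)
```

The point of the sketch is the middle step: once tracking is disabled, even ordinary errors such as MOVED go unrecorded until someone with admin access resets the stats, which is the state this PR is meant to avoid.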

PingXie · 2024-06-25T05:04:36Z

> @valkey-io/core-team Since @PingXie mentioned this in the other thread, I thought it would be good to try to get consensus on this change. The tl;dr is that in Redis it was possible to spam the errorstat section with custom errors generated from Lua scripts, because you can return arbitrary errors from Lua scripts. @enjoy-binbin implemented a workaround that would clear the errorstat radix tree if its size got above a certain threshold. In order for a user to resume seeing errors, they would need to call CONFIG RESETSTAT. However, this is an admin command, and restricted on many systems, so a user might not be able to get out of this state easily. So instead this change stops reporting errors from Lua scripts above a certain threshold.
>
> Ping suggested:
>
> > I guess that depends on the definition of "lua errors". If they are interpreted as "custom errors" then agreed but do we see value in interpreting them as the origin of the errors? Meaning any errors, both custom and normal, generated by any scripts go to its own RAX. If we go down this path, we would have to create a dedicated errorstat section just for scripts ( errorstat_script_*?). Since this is already a breaking change, I think it would be worthwhile expanding the discussion a bit.
>
> I think this is important to get right for 8, so let's settle this here if possible.

I got convinced by you on errorstat_script_* being a more breaking change :-).

madolson · 2024-07-01T21:07:27Z

From our meeting this morning, we are directionally approved to update the behavior here but we'll decide locally in the PR for whatever makes sense.

zuiderkwast

LGTM. A few minor suggestions/questions.

madolson · 2024-07-09T01:26:35Z

Sorry for the late consensus binding. Binbin mentioned in the other thread that he was good with this approach, and I just had a small comment on top of Viktor's. Once they are addressed, this should be good to merge.

zuiderkwast

Just one nit, then I'm happy.

KarthikSubbarao · 2024-07-10T06:55:32Z

This run might have failed on a flaky test: https://github.com/valkey-io/valkey/actions/runs/9868728468/job/27251202927?pr=500

I re-ran it locally multiple times and the tests all passed

enjoy-binbin

thanks, LGTM.

zuiderkwast

Great, thanks.

KarthikSubbarao force-pushed the lua branch from 14c9484 to daae0d8 Compare May 15, 2024 09:24

ranshid reviewed May 16, 2024

srgsanky reviewed May 19, 2024

KarthikSubbarao added 4 commits May 20, 2024 20:47

Add support for limiting custom LUA errors when over 128 error messages while allowing non LUA errors to function as usual

b6e0f19

Signed-off-by: Karthik Subbarao <[email protected]>

Minor clean up + update documentation

c86cb62

Signed-off-by: Karthik Subbarao <[email protected]>

Rename LUA Error Stat overflow error message

bc0eaef

Signed-off-by: Karthik Subbarao <[email protected]>

Update documentation, minor refactor, additional test case

1c77f69

Signed-off-by: Karthik Subbarao <[email protected]>

KarthikSubbarao force-pushed the lua branch from 3ceea58 to 1c77f69 Compare May 20, 2024 21:02

KarthikSubbarao added 2 commits May 21, 2024 04:06

update tests

8206c4a

Signed-off-by: KarthikSubbarao <[email protected]>

Add tests for Valkey Functions

a03333b

Signed-off-by: KarthikSubbarao <[email protected]>

KarthikSubbarao requested review from srgsanky and ranshid May 22, 2024 18:21

madolson requested review from enjoy-binbin and removed request for srgsanky June 21, 2024 22:02

KarthikSubbarao force-pushed the lua branch from 25cac59 to a03333b Compare June 21, 2024 23:22

KarthikSubbarao added 2 commits June 21, 2024 16:26

Merge branch 'unstable' into lua

d09beca

Fix typo from unstable merge

6d324c8

Signed-off-by: KarthikSubbarao <[email protected]>

madolson linked an issue Jun 24, 2024 that may be closed by this pull request

[NEW] Handle spamming of custom LUA error messages in the INFO ERRORSTATS section with continued tracking of non LUA errors stats #487

Closed

madolson added the major-decision-pending label Jun 25, 2024

enjoy-binbin mentioned this pull request Jul 1, 2024

[NEW] Handle spamming of custom LUA error messages in the INFO ERRORSTATS section with continued tracking of non LUA errors stats #487

Closed

madolson removed the major-decision-pending label Jul 1, 2024

madolson added the major-decision-approved label Jul 1, 2024

KarthikSubbarao added 2 commits July 1, 2024 12:09

Merge branch 'valkey-io:unstable' into lua

676d87c

Address some of the format suggestions

d054ab1

Signed-off-by: KarthikSubbarao <[email protected]>

clang-format-check suggestions

05e946f

Signed-off-by: KarthikSubbarao <[email protected]>

KarthikSubbarao force-pushed the lua branch from 5f1339b to 05e946f Compare July 8, 2024 17:08

zuiderkwast reviewed Jul 8, 2024

madolson reviewed Jul 9, 2024

KarthikSubbarao added 3 commits July 9, 2024 21:14

Revert "clang-format-check suggestions"

23f598d

This reverts commit 05e946f. Signed-off-by: KarthikSubbarao <[email protected]>

Allow custom errors already tracked to be incremented when past limit + update overflow error prefix name

fe6e574

Signed-off-by: KarthikSubbarao <[email protected]>

fmt changes

d849928

Signed-off-by: KarthikSubbarao <[email protected]>

zuiderkwast reviewed Jul 10, 2024

Reorder check

54ce6db

Signed-off-by: KarthikSubbarao <[email protected]>

enjoy-binbin reviewed Jul 10, 2024

fmt changes

85d003e

Signed-off-by: KarthikSubbarao <[email protected]>

hpatro approved these changes Jul 10, 2024

KarthikSubbarao changed the title ~~Limit custom LUA error stats while allowing non LUA error stats to function normally~~ Limit tracking custom error prefixes (e.g. from LUA) while allowing non custom errors to be tracked normally Jul 10, 2024

KarthikSubbarao changed the title ~~Limit tracking custom error prefixes (e.g. from LUA) while allowing non custom errors to be tracked normally~~ Limit tracking custom errors (e.g. from LUA) while allowing non custom errors to be tracked normally Jul 10, 2024

enjoy-binbin approved these changes Jul 11, 2024

enjoy-binbin added the release-notes and needs-doc-pr labels Jul 11, 2024

zuiderkwast approved these changes Jul 11, 2024

madolson approved these changes Jul 11, 2024

madolson reviewed Jul 11, 2024

Update src/server.h

08e97ba

Signed-off-by: Madelyn Olson <[email protected]>

madolson merged commit 418901d into valkey-io:unstable Jul 15, 2024
19 checks passed
