Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NEW] Handle spamming of custom LUA error messages in the INFO ERRORSTATS section with continued tracking of non LUA errors stats #487

Closed
KarthikSubbarao opened this issue May 11, 2024 · 17 comments · Fixed by #500

Comments

@KarthikSubbarao
Copy link
Member

KarthikSubbarao commented May 11, 2024

The problem/use-case that the feature addresses

When using LUA to reply with custom error messages, the prefix of the messages be included in the INFO ERRORSTATS section as a unique entry. This can spam the INFO ALL command or any INFO command variant that includes the ERRORSTATS section.

redis/redis#13141 address this spamming of the INFO command's ERRORSTATS section and the memory usage of the errors RAX. However, it has the following drawbacks:

  1. To re-enable error stats tracking, we need to use CONFIG RESETSTAT which will clear command stat counters and error stat counters. So, we will lose visibility in both of these.
  2. When misuse is detected (> 128 error types), we prevent tracking of regular (non LUA) error types. e.g. MOVED, CLUSTERDOWN, etc.
127.0.0.1:6379> EVAL "return server.error_reply('3692896699 a');" 0
(error) 3692896699 a 
127.0.0.1:6379> EVAL "return server.error_reply('4164681858 b');" 0
(error) 4164681858 b
127.0.0.1:6379> EVAL "return server.error_reply('5558209445 c');" 0
(error) 5558209445 c
127.0.0.1:6379> EVAL "return server.error_reply('8877681120 d');" 0
(error) 8877681120 d
127.0.0.1:6379> info errorstats
# Errorstats
errorstat_3692896699:count=1
errorstat_4164681858:count=1
errorstat_5558209445:count=1
errorstat_8877681120:count=1

Description of the feature

The proposal in this issue is to prevent tracking new error messages from LUA if the number of LUA error messages is greater than 128. Instead, we will track any additional LUA error type in a new counter: errorstat_LUA and if a non-LUA error (e.g. MOVED / CLUSTERDOWN) occurs, they will continue to be tracked as usual.

This will address the issue of spammed error messages / memory usage of the errors RAX. Additionally, we will not have to execute CONFIG RESETSTAT to restore functionality - as normal error messages continue to be tracked.

Based on feedback and thoughts from this discussion, I can propose a PR to support this feature.

Alternatives you've considered

NA

@KarthikSubbarao KarthikSubbarao changed the title [NEW] Config to disable including custom error messages in the INFO ERRORSTATS section [NEW] Config to disable including custom error messages (LUA) in the INFO ERRORSTATS section May 11, 2024
@enjoy-binbin
Copy link
Member

does this redis/redis#13141 can somehow solve your problem?

@madolson
Copy link
Member

madolson commented May 12, 2024

does this redis/redis#13141 can somehow solve your problem?

I really don't understand why Oran marked that as not a breaking change, it's possible someone had more than 128 errors. I feel like he was just trying to merge stuff before the license change.

I think this should solve the issue @KarthikSubbarao brought up though.

@enjoy-binbin
Copy link
Member

I really don't understand why Oran marked that as not a breaking change, it's possible someone had more than 128 errors.

yes, that is indeed a potential breaking change. I have thought about add a new configuration item and defaulting to infinity to avoid the breaking change, but Oran does't seem to mention it and probable doesn't want to add new a configuration item for this.

I feel like he was just trying to merge stuff before the license change.

Looking at it now, it may indeed be.

@KarthikSubbarao
Copy link
Member Author

KarthikSubbarao commented May 13, 2024

does this redis/redis#13141 can somehow solve your problem?

@enjoy-binbin - (Just read the top comment) Yes, this will solve the issue of spamming INFO ERRORSTATS and reduce additional data stored on error stats.

For completeness - with redis/redis#13141, ERRORSTATS will be disabled when misuse is detected. This means non LUA errors (which do not spam) will also be disabled until we use CONFIG RESETSTAT.

Whereas, the idea proposed in this issue was to have a config to disable custom LUA error messages while continuing to allow regular Error stats to function normally.

But the main issue we wanted to address is misuse (spamming), so this should work. Thank you

Whereas, the idea proposed in this issue was to have a config to disable custom LUA error messages while continuing to allow regular Error stats to function normally.

If we think this is useful, we can consider making this improvement - by adding onto the existing logic

@srgsanky
Copy link
Contributor

srgsanky commented May 14, 2024

Resetting all the error count will also lose visibility into other errors like MOVED etc that can help with debugging issues in the cluster.

Is it possible to record all custom errors from lua in a separate RAX (based on scriptIsRunning() and if the error originated from luaRedisErrorReplyCommand) and show the custom errors only when using INFO ERRORSTATS explicitly?

@KarthikSubbarao
Copy link
Member Author

KarthikSubbarao commented May 14, 2024

Resetting all the error count will also lose visibility into other errors like MOVED etc that can help with debugging issues in the cluster.

Yes, looks like with the current implementation (from redis/redis#13141):

  1. To re-enable error stats tracking, we need to use CONFIG RESETSTAT which will clear command stat counters and error stat counters. So, we will lose visibility in both of these.
  2. When misuse is detected (> 128 error types), we prevent tracking of regular (non LUA) error types. e.g. MOVED, CLUSTERDOWN, etc.

These issues aside, the PR addresses the concerns we have perfectly.

Is it possible to record all custom errors from lua in a separate RAX (based on scriptIsRunning() and if the error originated from luaRedisErrorReplyCommand) and show the custom errors only when using INFO ERRORSTATS explicitly?

The proposal in this issue is to prevent tracking new error messages from LUA if the number of LUA error messages is greater than 128. Instead, we will track any additional LUA error type in a new counter: errorstat_LUA and if a non-LUA error (MOVED / CLUSTERDOWN) occurs, they will continue to be tracked as usual.

This will address the issue of spammed error messages / memory usage of the errors RAX. Additionally, we will not have to execute CONFIG RESETSTAT to restore functionality - as normal error messages continue to be tracked.

@madolson
Copy link
Member

I like the idea of tracking lua errors in a separate RAX, however it can't just be based off of scriptIsRunning, since the script could generate normal error messages. I can't think of anything off the top of my head to tell if it was a user generated error message or not.

@KarthikSubbarao KarthikSubbarao changed the title [NEW] Config to disable including custom error messages (LUA) in the INFO ERRORSTATS section [NEW] Handle spamming of custom error messages (LUA) in the INFO ERRORSTATS section with continued functioning of non LUA errors May 15, 2024
@KarthikSubbarao KarthikSubbarao changed the title [NEW] Handle spamming of custom error messages (LUA) in the INFO ERRORSTATS section with continued functioning of non LUA errors [NEW] Handle spamming of custom LUA error messages in the INFO ERRORSTATS section with continued tracking of non LUA errors stats May 15, 2024
@ranshid
Copy link
Member

ranshid commented May 15, 2024

@KarthikSubbarao Thank you for driving this! I am reading through the PR and will provide technical comments there (if I have any) but wanted to ask here:
Why is that so different than the modules case? I think this will provide solution for LUA, but modules can cause the same kind of missuse right? I feel this error stats mechanism lacks some better structure than just be based on string parsing. Maybe we should consider improving that in the future?

@KarthikSubbarao
Copy link
Member Author

KarthikSubbarao commented May 15, 2024

@ranshid Thank you. Modules are similar in that they can result in custom error messages - but I would place Modules closer to the Valkey server in that it is the responsibility of those running the server to ensure that the modules they load do not lead to spamming the ERRORSTATS section. EVAL / EVALSHA commands, however, are completely customer driven. This is why the Issue here started off focusing only on LUA.

But (if required) we can indeed add support for preventing misuse by Modules by passing a flag in the VM_ReplyWithError and VM_ReplyWithErrorFormat APIs. I can update the PR if so

@PingXie
Copy link
Member

PingXie commented Jun 22, 2024

I like the idea of tracking lua errors in a separate RAX, however it can't just be based off of scriptIsRunning, since the script could generate normal error messages. I can't think of anything off the top of my head to tell if it was a user generated error message or not.

I guess that depends on the definition of "lua errors". If they are interpreted as "custom errors" then agreed but do we see value in interpreting them as the origin of the errors? Meaning any errors, both custom and normal, generated by any scripts go to its own RAX. If we go down this path, we would have to create a dedicated errorstat section just for scripts ( errorstat_script_*?). Since this is already a breaking change, I think it would be worthwhile expanding the discussion a bit.

@madolson
Copy link
Member

If we go down this path, we would have to create a dedicated errorstat section just for scripts ( errorstat_script_*

I don't like this approach. I don't think we should have a special section just for scripts, since we try to treat the scripts as more or less normal clients. It also makes this definitely a break changing for users that are using this feature, as opposed to just a breaking change for users that are misusing the feature.

@PingXie
Copy link
Member

PingXie commented Jun 25, 2024

It also makes this definitely a break changing for users that are using this feature, as opposed to just a breaking change for users that are misusing the feature.

Good point.

I like the idea of tracking lua errors in a separate RAX, however it can't just be based off of scriptIsRunning

Looks like addReplyErrorFormatEx is what is used to record the custom error in the COB?

https://github.com/valkey-io/valkey/blob/unstable/src/script_lua.c#L620.

We could pass a flag to it to differentiate custom lua errors?

@madolson
Copy link
Member

madolson commented Jun 25, 2024

We could pass a flag to it to differentiate custom lua errors?

Yes. They could be emulating a normal error though.

@PingXie
Copy link
Member

PingXie commented Jun 25, 2024

We could pass a flag to it to differentiate custom lua errors?

Yes. They could be emulating a normal error though.

I wouldn't consider these "emulated" errors "normal" anymore ... I think it is fine to lump them all into "custom" errors, as far as storage is concerned so we don't lose other more important errors.

@enjoy-binbin
Copy link
Member

Yes, looks like with the current implementation (from redis/redis#13141):

  1. To re-enable error stats tracking, we need to use CONFIG RESETSTAT which will clear command stat counters and error stat counters. So, we will lose visibility in both of these.
  2. When misuse is detected (> 128 error types), we prevent tracking of regular (non LUA) error types. e.g. MOVED, CLUSTERDOWN, etc.

Yes, I didn't realize that config resetstat is an admin command and may not be easy to call.
And, i thought about continuing to track existing errors when the limit was reached, but in the implementation at the time, we couldn't add new error codes once the number was reached, so the code does not make a big different.

I think the difficulty here is how many error codes we can allow the script to customize. The scenes I have seen so far are all misuse, maybe there is a case that user do need to use a lot error code.

One way is that we set ERROR_STATS_NUMBER as a configuration item?

btw, what PR #500 does is that when it encounters a miuse, it ensures that other normal errors can be counted instead of being disabled. I guess i am ok with that, if there is no miuse, everyone will be happy, if there is a miuse, at least you will want to check the errorstats. People still need to call CONFIG RESETSTATS to clear the miuse errorstats.

@madolson
Copy link
Member

madolson commented Jul 8, 2024

I think the difficulty here is how many error codes we can allow the script to customize. The scenes I have seen so far are all misuse, maybe there is a case that user do need to use a lot error code.

I've only seen on the order of like 10-15 custom error codes, I think you're right that this is likely only a misuse. I suppose my concern is that I would like the "normal" errors to continue working for debug purposes. I'm kind of okay with the tradeoff that #500 proposes, we continue to report normal errors but we stop reporting LUA errors.

@enjoy-binbin Just to understand your PoV, are you okay with #500, I'll try to followup and try to get it ready to merge this week if that is the case.

@enjoy-binbin
Copy link
Member

I'm kind of okay with the tradeoff that #500 proposes, we continue to report normal errors but we stop reporting LUA errors.

yean, i am ok with that.

madolson pushed a commit that referenced this issue Jul 15, 2024
…m errors to be tracked normally (#500)

Implementing the change proposed here:
#487

In this PR, we prevent tracking new custom error messages (e.g. LUA) if
the number of error messages (in the errors RAX) is greater than 128.
Instead, we will track any additional custom error prefix in a new
counter: `errorstat_ERRORSTATS_OVERFLOW ` and if any non-custom flagged
errors (e.g. MOVED / CLUSTERDOWN) occur, they will continue to be
tracked as usual.

This will address the issue of spammed error messages / memory usage of
the errors RAX. Additionally, we will not have to execute `CONFIG
RESETSTAT` to restore error stats functionality because normal error
messages continue to be tracked.

Example:
```
# Errorstats
.
.
.
errorstat_127:count=2
errorstat_128:count=2
errorstat_ERR:count=1
errorstat_ERRORSTATS_OVERFLOW:count=2
```

---------

Signed-off-by: Karthik Subbarao <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants