From fa304295999c0ef75179bac2b08a50a7171ae350 Mon Sep 17 00:00:00 2001 From: Jonah Calvo Date: Fri, 18 Aug 2023 13:11:39 -0500 Subject: [PATCH 1/6] Update documentation for new AD settings Signed-off-by: Jonah Calvo --- .../pipelines/configuration/processors/anomaly-detector.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/_data-prepper/pipelines/configuration/processors/anomaly-detector.md b/_data-prepper/pipelines/configuration/processors/anomaly-detector.md index 2010c53856..768d8ab283 100644 --- a/_data-prepper/pipelines/configuration/processors/anomaly-detector.md +++ b/_data-prepper/pipelines/configuration/processors/anomaly-detector.md @@ -18,6 +18,10 @@ You can configure the anomaly detector processor by specifying a key and the opt | :--- | :--- | :--- | | `keys` | Yes | A non-ordered `List` that is used as input to the ML algorithm to detect anomalies in the values of the keys in the list. At least one key is required. | `mode` | Yes | The ML algorithm (or model) used to detect anomalies. You must provide a mode. See [random_cut_forest mode](#random_cut_forest-mode). +| `identification_keys` | No | If provided, anomalies will be detected within each unique instance of this key. For example, providing `ip` here will have anomalies detected seperately for each unique IP address. +| `cardinality_limit` | No | If using `identification_keys`, a new ML model will be created for every degree of cardinality. This can cause a large amount of memory usage, so setting a limit to the number of models is useful. Defaults to 5000. +| `verbose` | No | By default, the RCF algorithm will alert once on a level shift. For example, if latency is consistently 50-100 and jumps to consistently ~1000, only one anomaly will be detected. Setting `verbose` to `true` will alert many times for such a shift. 
+ ### Keys From a9caf4d9e1b15f9cd9b9745fa7ef1870ffefe504 Mon Sep 17 00:00:00 2001 From: Jonah Calvo Date: Fri, 18 Aug 2023 16:29:24 -0500 Subject: [PATCH 2/6] update wording for verbose Signed-off-by: Jonah Calvo --- .../pipelines/configuration/processors/anomaly-detector.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/processors/anomaly-detector.md b/_data-prepper/pipelines/configuration/processors/anomaly-detector.md index 768d8ab283..2a819175b0 100644 --- a/_data-prepper/pipelines/configuration/processors/anomaly-detector.md +++ b/_data-prepper/pipelines/configuration/processors/anomaly-detector.md @@ -20,7 +20,7 @@ You can configure the anomaly detector processor by specifying a key and the opt | `mode` | Yes | The ML algorithm (or model) used to detect anomalies. You must provide a mode. See [random_cut_forest mode](#random_cut_forest-mode). | `identification_keys` | No | If provided, anomalies will be detected within each unique instance of this key. For example, providing `ip` here will have anomalies detected seperately for each unique IP address. | `cardinality_limit` | No | If using `identification_keys`, a new ML model will be created for every degree of cardinality. This can cause a large amount of memory usage, so setting a limit to the number of models is useful. Defaults to 5000. -| `verbose` | No | By default, the RCF algorithm will alert once on a level shift. For example, if latency is consistently 50-100 and jumps to consistently ~1000, only one anomaly will be detected. Setting `verbose` to `true` will alert many times for such a shift. +| `verbose` | No | RCF will try to auto learn and reduce the number of anomalies. For example if latency is consistently 50-100 and jumps to consistently ~1000, only the first few points after the transition will be detected (unless there are other spikes/anomalies). 
Likewise, for repeated spikes to the same level, RCF will likely eliminate many of the spikes after a few initial ones. The goal of this default setting is to minimize alerts. Setting verbose to true will alert consistently on these repeated cases and may be useful in detecting anomalous behavior that lasts an extended period of time. ### Keys From 7add23c0c12b4691a5e14ba61cb16885dd70782a Mon Sep 17 00:00:00 2001 From: Jonah Calvo Date: Thu, 14 Sep 2023 12:01:26 -0500 Subject: [PATCH 3/6] Update _data-prepper/pipelines/configuration/processors/anomaly-detector.md Co-authored-by: Melissa Vagi Signed-off-by: Jonah Calvo --- .../pipelines/configuration/processors/anomaly-detector.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/processors/anomaly-detector.md b/_data-prepper/pipelines/configuration/processors/anomaly-detector.md index 2a819175b0..3020d7ae8f 100644 --- a/_data-prepper/pipelines/configuration/processors/anomaly-detector.md +++ b/_data-prepper/pipelines/configuration/processors/anomaly-detector.md @@ -18,7 +18,7 @@ You can configure the anomaly detector processor by specifying a key and the opt | :--- | :--- | :--- | | `keys` | Yes | A non-ordered `List` that is used as input to the ML algorithm to detect anomalies in the values of the keys in the list. At least one key is required. | `mode` | Yes | The ML algorithm (or model) used to detect anomalies. You must provide a mode. See [random_cut_forest mode](#random_cut_forest-mode). -| `identification_keys` | No | If provided, anomalies will be detected within each unique instance of this key. For example, providing `ip` here will have anomalies detected seperately for each unique IP address. +| `identification_keys` | No | If provided, anomalies will be detected within each unique instance of this key. For example, if you provide the `ip` field, anomalies will be detected separately for each unique IP address. 
| `cardinality_limit` | No | If using `identification_keys`, a new ML model will be created for every degree of cardinality. This can cause a large amount of memory usage, so setting a limit to the number of models is useful. Defaults to 5000. | `verbose` | No | RCF will try to auto learn and reduce the number of anomalies. For example if latency is consistently 50-100 and jumps to consistently ~1000, only the first few points after the transition will be detected (unless there are other spikes/anomalies). Likewise, for repeated spikes to the same level, RCF will likely eliminate many of the spikes after a few initial ones. The goal of this default setting is to minimize alerts. Setting verbose to true will alert consistently on these repeated cases and may be useful in detecting anomalous behavior that lasts an extended period of time. From 6446c98f236f5d2ede6ca2ec75f13b716fa0193a Mon Sep 17 00:00:00 2001 From: Jonah Calvo Date: Thu, 14 Sep 2023 12:01:36 -0500 Subject: [PATCH 4/6] Update _data-prepper/pipelines/configuration/processors/anomaly-detector.md Co-authored-by: Melissa Vagi Signed-off-by: Jonah Calvo --- .../pipelines/configuration/processors/anomaly-detector.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/processors/anomaly-detector.md b/_data-prepper/pipelines/configuration/processors/anomaly-detector.md index 3020d7ae8f..1027abad12 100644 --- a/_data-prepper/pipelines/configuration/processors/anomaly-detector.md +++ b/_data-prepper/pipelines/configuration/processors/anomaly-detector.md @@ -19,7 +19,7 @@ You can configure the anomaly detector processor by specifying a key and the opt | `keys` | Yes | A non-ordered `List` that is used as input to the ML algorithm to detect anomalies in the values of the keys in the list. At least one key is required. | `mode` | Yes | The ML algorithm (or model) used to detect anomalies. You must provide a mode. 
See [random_cut_forest mode](#random_cut_forest-mode). | `identification_keys` | No | If provided, anomalies will be detected within each unique instance of this key. For example, if you provide the `ip` field, anomalies will be detected separately for each unique IP address. -| `cardinality_limit` | No | If using `identification_keys`, a new ML model will be created for every degree of cardinality. This can cause a large amount of memory usage, so setting a limit to the number of models is useful. Defaults to 5000. +| `cardinality_limit` | No | If using the `identification_keys` settings, a new ML model will be created for every degree of cardinality. This can cause a large amount of memory usage, so it is helpful to set a limit on the number of models. Default limit is 5000. | `verbose` | No | RCF will try to auto learn and reduce the number of anomalies. For example if latency is consistently 50-100 and jumps to consistently ~1000, only the first few points after the transition will be detected (unless there are other spikes/anomalies). Likewise, for repeated spikes to the same level, RCF will likely eliminate many of the spikes after a few initial ones. The goal of this default setting is to minimize alerts. Setting verbose to true will alert consistently on these repeated cases and may be useful in detecting anomalous behavior that lasts an extended period of time. 
From 418aa5b933a0b53b690ea85f52acb2085f031066 Mon Sep 17 00:00:00 2001 From: Jonah Calvo Date: Thu, 14 Sep 2023 12:02:40 -0500 Subject: [PATCH 5/6] Update _data-prepper/pipelines/configuration/processors/anomaly-detector.md Co-authored-by: Melissa Vagi Signed-off-by: Jonah Calvo --- .../pipelines/configuration/processors/anomaly-detector.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/_data-prepper/pipelines/configuration/processors/anomaly-detector.md b/_data-prepper/pipelines/configuration/processors/anomaly-detector.md index 1027abad12..d5184118ba 100644 --- a/_data-prepper/pipelines/configuration/processors/anomaly-detector.md +++ b/_data-prepper/pipelines/configuration/processors/anomaly-detector.md @@ -20,7 +20,7 @@ You can configure the anomaly detector processor by specifying a key and the opt | `mode` | Yes | The ML algorithm (or model) used to detect anomalies. You must provide a mode. See [random_cut_forest mode](#random_cut_forest-mode). | `identification_keys` | No | If provided, anomalies will be detected within each unique instance of this key. For example, if you provide the `ip` field, anomalies will be detected separately for each unique IP address. | `cardinality_limit` | No | If using the `identification_keys` settings, a new ML model will be created for every degree of cardinality. This can cause a large amount of memory usage, so it is helpful to set a limit on the number of models. Default limit is 5000. -| `verbose` | No | RCF will try to auto learn and reduce the number of anomalies. For example if latency is consistently 50-100 and jumps to consistently ~1000, only the first few points after the transition will be detected (unless there are other spikes/anomalies). Likewise, for repeated spikes to the same level, RCF will likely eliminate many of the spikes after a few initial ones. The goal of this default setting is to minimize alerts. 
Setting verbose to true will alert consistently on these repeated cases and may be useful in detecting anomalous behavior that lasts an extended period of time. +| `verbose` | No | RCF will try to automatically learn and reduce the number of anomalies detected. For example, if latency is consistently between 50 and 100, and then suddenly jumps to around 1000, only the first few points after the transition will be detected (unless there are other spikes/anomalies). Similarly, for repeated spikes to the same level, RCF will likely eliminate many of the spikes after a few initial ones. This is because the default setting is to minimize the number of alerts detected. Setting the `verbose` setting to `true` will cause RCF to consistently detect these repeated cases, which may be useful for detecting anomalous behavior that lasts an extended period of time. ### Keys From 62b99dcfae445dc916a707be410bf2e06c06a67b Mon Sep 17 00:00:00 2001 From: Jonah Calvo Date: Thu, 14 Sep 2023 12:05:48 -0500 Subject: [PATCH 6/6] Remove 'few' from description Signed-off-by: Jonah Calvo --- .../pipelines/configuration/processors/anomaly-detector.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/_data-prepper/pipelines/configuration/processors/anomaly-detector.md b/_data-prepper/pipelines/configuration/processors/anomaly-detector.md index d5184118ba..9628bb6caf 100644 --- a/_data-prepper/pipelines/configuration/processors/anomaly-detector.md +++ b/_data-prepper/pipelines/configuration/processors/anomaly-detector.md @@ -20,7 +20,7 @@ You can configure the anomaly detector processor by specifying a key and the opt | `mode` | Yes | The ML algorithm (or model) used to detect anomalies. You must provide a mode. See [random_cut_forest mode](#random_cut_forest-mode). | `identification_keys` | No | If provided, anomalies will be detected within each unique instance of this key. 
For example, if you provide the `ip` field, anomalies will be detected separately for each unique IP address. | `cardinality_limit` | No | If using the `identification_keys` settings, a new ML model will be created for every degree of cardinality. This can cause a large amount of memory usage, so it is helpful to set a limit on the number of models. Default limit is 5000. -| `verbose` | No | RCF will try to automatically learn and reduce the number of anomalies detected. For example, if latency is consistently between 50 and 100, and then suddenly jumps to around 1000, only the first few points after the transition will be detected (unless there are other spikes/anomalies). Similarly, for repeated spikes to the same level, RCF will likely eliminate many of the spikes after a few initial ones. This is because the default setting is to minimize the number of alerts detected. Setting the `verbose` setting to `true` will cause RCF to consistently detect these repeated cases, which may be useful for detecting anomalous behavior that lasts an extended period of time. +| `verbose` | No | RCF will try to automatically learn and reduce the number of anomalies detected. For example, if latency is consistently between 50 and 100, and then suddenly jumps to around 1000, only the first one or two data points after the transition will be detected (unless there are other spikes/anomalies). Similarly, for repeated spikes to the same level, RCF will likely eliminate many of the spikes after a few initial ones. This is because the default setting is to minimize the number of alerts detected. Setting the `verbose` setting to `true` will cause RCF to consistently detect these repeated cases, which may be useful for detecting anomalous behavior that lasts an extended period of time. ### Keys @@ -73,4 +73,4 @@ ad-pipeline: When you run the anomaly detector processor, the processor extracts the value for the `latency` key, and then passes the value through the RCF ML algorithm. 
You can configure any key that comprises integers or real numbers as values. In the following example, you can configure `bytes` or `latency` as the key for an anomaly detector.
-`{"ip":"1.2.3.4", "bytes":234234, "latency":0.2}`
\ No newline at end of file
+`{"ip":"1.2.3.4", "bytes":234234, "latency":0.2}`
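
For reference, a pipeline that combines the three settings documented in this series might look like the following sketch. It is not taken from the patch. The `http` source and `stdout` sink are placeholders chosen for illustration, and the list form of `identification_keys` is an assumption based on the plural option name. The `ip` and `latency` fields come from the sample event above, and `5000` is the documented default for `cardinality_limit`.

```yaml
ad-pipeline:
  source:
    http:                             # placeholder source (assumption)
  processor:
    - anomaly_detector:
        keys: ["latency"]             # numeric field passed to the ML algorithm
        mode:
          random_cut_forest:
        identification_keys: ["ip"]   # assumed list form; anomalies detected separately per unique IP
        cardinality_limit: 5000       # documented default, written out explicitly
        verbose: true                 # keep detecting sustained shifts and repeated spikes
  sink:
    - stdout:                         # placeholder sink (assumption)
```

With `verbose` left unset, the description in the table suggests that only the first one or two data points after a sustained shift would be detected.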