Skip to content

Commit

Permalink
New trendline ppl command (WMA) (opensearch-project#872)
Browse files Browse the repository at this point in the history
* WMA implementation

Signed-off-by: Andy Kwok <[email protected]>

* Update test cases

Signed-off-by: Andy Kwok <[email protected]>

* Update tests

Signed-off-by: Andy Kwok <[email protected]>

* Refactor code

Signed-off-by: Andy Kwok <[email protected]>

* Addres comments

Signed-off-by: Andy Kwok <[email protected]>

* Update doc

Signed-off-by: Andy Kwok <[email protected]>

* Update example

Signed-off-by: Andy Kwok <[email protected]>

* Update readme

Signed-off-by: Andy Kwok <[email protected]>

* Update scalafmt

Signed-off-by: Andy Kwok <[email protected]>

* Update grammar rule

Signed-off-by: Andy Kwok <[email protected]>

* Address review comments

Signed-off-by: Andy Kwok <[email protected]>

* Address review comments

Signed-off-by: Andy Kwok <[email protected]>

---------

Signed-off-by: Andy Kwok <[email protected]>
  • Loading branch information
andy-k-improving authored Nov 14, 2024
1 parent a80aa04 commit 439cf3e
Show file tree
Hide file tree
Showing 11 changed files with 658 additions and 37 deletions.
12 changes: 12 additions & 0 deletions DEVELOPER_GUIDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,11 @@ To execute the unit tests, run the following command:
```
sbt test
```
To run a specific unit test in SBT, use the testOnly command with the full path of the test class:
```
sbt "; project pplSparkIntegration; test:testOnly org.opensearch.flint.spark.ppl.PPLLogicalPlanTrendlineCommandTranslatorTestSuite"
```


## Integration Test
The integration test is defined in the `integration` directory of the project. The integration tests will automatically trigger unit tests and will only run if all unit tests pass. If you want to run the integration test for the project, you can do so by running the following command:
Expand All @@ -23,6 +28,13 @@ If you get integration test failures with error message "Previous attempts to fi
3. Run `sudo ln -s $HOME/.docker/desktop/docker.sock /var/run/docker.sock` or `sudo ln -s $HOME/.docker/run/docker.sock /var/run/docker.sock`
4. If you use Docker Desktop, as an alternative of `3`, check mark the "Allow the default Docker socket to be used (requires password)" in advanced settings of Docker Desktop.

Running only a selected set of integration test suites is possible with the following command:
```
sbt "; project integtest; it:testOnly org.opensearch.flint.spark.ppl.FlintSparkPPLTrendlineITSuite"
```
This command runs only the specified test suite within the integtest submodule.


### AWS Integration Test
The `aws-integration` folder contains tests for cloud server providers. For instance, test against AWS OpenSearch domain, configure the following settings. The client will use the default credential provider to access the AWS OpenSearch domain.
```
Expand Down
1 change: 1 addition & 0 deletions docs/ppl-lang/PPL-Example-Commands.md
Original file line number Diff line number Diff line change
Expand Up @@ -65,6 +65,7 @@ _- **Limitation: new field added by eval command with a function cannot be dropp
- `source = table | where cidrmatch(ip, '192.169.1.0/24')`
- `source = table | where cidrmatch(ipv6, '2003:db8::/32')`
- `source = table | trendline sma(2, temperature) as temp_trend`
- `source = table | trendline sort timestamp wma(2, temperature) as temp_trend`

#### **IP related queries**
[See additional command details](functions/ppl-ip.md)
Expand Down
64 changes: 58 additions & 6 deletions docs/ppl-lang/ppl-trendline-command.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,8 +3,7 @@
**Description**
Using ``trendline`` command to calculate moving averages of fields.


### Syntax
### Syntax - SMA (Simple Moving Average)
`TRENDLINE [sort <[+|-] sort-field>] SMA(number-of-datapoints, field) [AS alias] [SMA(number-of-datapoints, field) [AS alias]]...`

* [+|-]: optional. The plus [+] stands for ascending order and NULL/MISSING first and a minus [-] stands for descending order and NULL/MISSING last. **Default:** ascending order and NULL/MISSING first.
Expand All @@ -13,8 +12,6 @@ Using ``trendline`` command to calculate moving averages of fields.
* field: mandatory. the name of the field the moving average should be calculated for.
* alias: optional. the name of the resulting column containing the moving average.

And the moment only the Simple Moving Average (SMA) type is supported.

It is calculated like

f[i]: The value of field 'f' in the i-th data-point
Expand All @@ -23,7 +20,7 @@ It is calculated like

SMA(t) = (1/n) * Σ(f[i]), where i = t-n+1 to t

### Example 1: Calculate simple moving average for a timeseries of temperatures
#### Example 1: Calculate simple moving average for a timeseries of temperatures

The example calculates the simple moving average over temperatures using two datapoints.

Expand All @@ -41,7 +38,7 @@ PPL query:
| 15| 258|2023-04-06 17:07:...| 14.5|
+-----------+---------+--------------------+----------+

### Example 2: Calculate simple moving averages for a timeseries of temperatures with sorting
#### Example 2: Calculate simple moving averages for a timeseries of temperatures with sorting

The example calculates two simple moving average over temperatures using two and three datapoints sorted descending by device-id.

Expand All @@ -58,3 +55,58 @@ PPL query:
| 12| 1492|2023-04-06 17:07:...| 12.5| 13.0|
| 12| 1492|2023-04-06 17:07:...| 12.0|12.333333333333334|
+-----------+---------+--------------------+------------+------------------+


### Syntax - WMA (Weighted Moving Average)
`TRENDLINE sort <[+|-] sort-field> WMA(number-of-datapoints, field) [AS alias] [WMA(number-of-datapoints, field) [AS alias]]...`

* [+|-]: optional. The plus [+] stands for ascending order and NULL/MISSING first and a minus [-] stands for descending order and NULL/MISSING last. **Default:** ascending order and NULL/MISSING first.
* sort-field: mandatory. this field specifies the ordering of data poients when calculating the nth_value aggregation.
* number-of-datapoints: mandatory. number of datapoints to calculate the moving average (must be greater than zero).
* field: mandatory. the name of the field the moving averag should be calculated for.
* alias: optional. the name of the resulting column containing the moving average.

It is calculated like

f[i]: The value of field 'f' in the i-th data point
n: The number of data points in the moving window (period)
t: The current time index
w[i]: The weight assigned to the i-th data point, typically increasing for more recent points

WMA(t) = ( Σ from i=t−n+1 to t of (w[i] * f[i]) ) / ( Σ from i=t−n+1 to t of w[i] )

#### Example 1: Calculate weighted moving average for a timeseries of temperatures

The example calculates the simple moving average over temperatures using two datapoints.

PPL query:

os> source=t | trendline sort timestamp wma(2, temperature) as temp_trend;
fetched rows / total rows = 5/5
+-----------+---------+--------------------+----------+
|temperature|device-id| timestamp|temp_trend|
+-----------+---------+--------------------+----------+
| 12| 1492|2023-04-06 17:07:...| NULL|
| 12| 1492|2023-04-06 17:07:...| 12.0|
| 13| 256|2023-04-06 17:07:...| 12.6|
| 14| 257|2023-04-06 17:07:...| 13.6|
| 15| 258|2023-04-06 17:07:...| 14.6|
+-----------+---------+--------------------+----------+

#### Example 2: Calculate simple moving averages for a timeseries of temperatures with sorting

The example calculates two simple moving average over temperatures using two and three datapoints sorted descending by device-id.

PPL query:

os> source=t | trendline sort - device-id wma(2, temperature) as temp_trend_2 wma(3, temperature) as temp_trend_3;
fetched rows / total rows = 5/5
+-----------+---------+--------------------+------------+------------------+
|temperature|device-id| timestamp|temp_trend_2| temp_trend_3|
+-----------+---------+--------------------+------------+------------------+
| 15| 258|2023-04-06 17:07:...| NULL| NULL|
| 14| 257|2023-04-06 17:07:...| 14.3| NULL|
| 13| 256|2023-04-06 17:07:...| 13.3| 13.6|
| 12| 1492|2023-04-06 17:07:...| 12.3| 12.6|
| 12| 1492|2023-04-06 17:07:...| 12.0| 12.16|
+-----------+---------+--------------------+------------+------------------+
Loading

0 comments on commit 439cf3e

Please sign in to comment.