Merge pull request #123 from getdozer/ttl-sample

added ttl sql sample
getdozer · Oct 30, 2023 · 923bd33
2 parents d3ed937 + 9c949b7
commit 923bd33
Showing 5 changed files with 169 additions and 14 deletions.
diff --git a/README.md b/README.md
@@ -40,6 +40,7 @@ Refer to the [Installation section](https://getdozer.io/docs/installation) for i
 |                  | [Using Sub queries](./sql/sub-queries)                                   | How to use sub queries in Dozer                                              |
 |                  | [Using UNIONs](./sql/union)                                              | How to combine data using `UNION` in Dozer                                   |
 |                  | [Using Window Functions](./sql/window-functions)                         | Use `Hop` and `Tumble` Windows                                               |
+|                  | [Using TTL](./sql/ttl)                                                   | Use `TTL` to manage memory usage                                             |
 |                  |                                                                          |                                                                              |
 | Use Cases        | [Flight Microservices](./usecases/pg-flights)                            | Build APIs over multiple microservices.                                      |
 |                  | [Scaling Ecommerce](./usecases/scaling-ecommerce)                        | Profile and benchmark Dozer using an ecommerce data set                      |

diff --git a/sql/README.md b/sql/README.md
@@ -2,7 +2,7 @@
 
 This is a comprehensive guide showcasing different types of queries possible with Dozer SQL.
 
-## Dataset 
+## Dataset
 
 We will be using two tables throughout this guide. These tables are from [NYC - TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). To download these run the command,
 
@@ -16,8 +16,7 @@ This table is contained in a parquet file under `data/trips/fhvhv_tripdata_2022-
 
 ![table_1_image](/sql/images/table_1.png)
 
-
-### Table 2: taxi_zone_lookup 
+### Table 2: taxi_zone_lookup
 
 This table is contained in a csv file under `data/zones/taxi_zone_lookup.csv`.
 
@@ -36,20 +35,22 @@ Hence, the basic statement structure is,
 ```sql
  SELECT A INTO C FROM B;
 ```
+
 The datatypes and casting compatible with Dozer SQL are described in the [documentation for datatypes and casting](https://getdozer.io/docs/transforming-data/data-types).
 
 Dozer SQL also supports primitive scalar function described in [documentation for scalar functions](https://getdozer.io/docs/transforming-data/scalar-functions).
 
 ## Table of contents
 
-Let us start with basic Dozer SQL queries and move towards more complex queries. 
-
-| Sr.no | Query type | Description                                                          |
-| ----- | ---------- | -------------------------------------------------------------------- |
-|   1   |   [Filtering](./filtering/README.md)    |  A simple select operation with a `WHERE` clause  |
-|   2   |   [Aggregation](./aggregation/README.md)   |  Multiple queries each describing a specifc aggregation on the data  |
-|   3   |   [JOIN](./join/README.md)   |  Query to JOIN the tables based on `LocationID` |
-|   4   |   [CTEs](./cte/README.md)   |  Query with two CTE tables JOINed after filtering |
-|   5   |   [Sub queries](./sub-queries/README.md)   |  Multiple queries describing nested `SELECT` statements |
-|   6   |   [UNION](./union/README.md)   |  A `UNION` peformed inside a CTE, followed by a `JOIN` |
-|   7   |   [Window functions](./window-functions/README.md)   |  Queries describing the use of `TUMBLE` and `HOP` |
+Let us start with basic Dozer SQL queries and move towards more complex queries.
+
+| Sr.no | Query type                                       | Description                                                        |
+| ----- | ------------------------------------------------ | ------------------------------------------------------------------ |
+| 1     | [Filtering](./filtering/README.md)               | A simple select operation with a `WHERE` clause                    |
+| 2     | [Aggregation](./aggregation/README.md)           | Multiple queries each describing a specific aggregation on the data |
+| 3     | [JOIN](./join/README.md)                         | Query to JOIN the tables based on `LocationID`                     |
+| 4     | [CTEs](./cte/README.md)                          | Query with two CTE tables JOINed after filtering                   |
+| 5     | [Sub queries](./sub-queries/README.md)           | Multiple queries describing nested `SELECT` statements             |
+| 6     | [UNION](./union/README.md)                       | A `UNION` performed inside a CTE, followed by a `JOIN`             |
+| 7     | [Window functions](./window-functions/README.md) | Queries describing the use of `TUMBLE` and `HOP`                   |
+| 8     | [TTL](./ttl/README.md)                           | Queries describing the use of `TTL`                                |
diff --git a/sql/images/ttl_graph.png b/sql/images/ttl_graph.png
diff --git a/sql/ttl/README.md b/sql/ttl/README.md
@@ -0,0 +1,107 @@
+# TTL function example
+
+This example shows how to use the Time To Live (TTL) function in Dozer SQL.
+
+The TTL function provides a way to manage memory usage in Dozer, particularly when dealing with large streams of data. Setting a TTL ensures that only recent data is held in memory, balancing data retention against memory efficiency. TTL is based on each record's timestamp, so eviction is driven by the timestamps in the data.
+
+To read more about the TTL function, see the [documentation](https://getdozer.io/docs/transforming-data/windowing#ttl).
+
+Here we describe two queries that use only fresh data, with a TTL of 5 minutes:
+
+- Query to calculate the sum of tips obtained for a particular pickup location over 2-minute windows.
+
+- Query to calculate the sum of tips obtained for a particular pickup location over 3-minute windows, where a new window starts every 1 minute so that the windows overlap.
+  i.e. each 3-minute window shares,
+  - its first 2 minutes with the previous window
+  - its last 2 minutes with the next window
+  - every minute with two other windows, so each trip falls into three windows
+
+## SQL Query and Structure
+
+### Query 1
+
+```sql
+  SELECT t.PULocationID as location, SUM(t.tips) AS total_tips, t.window_start as start, t.window_end AS end
+  INTO table1
+  FROM TTL(TUMBLE(trips, pickup_datetime, '2 MINUTES'), pickup_datetime, '5 MINUTES') t
+  GROUP BY t.PULocationID, t.window_start, t.window_end;
+```
+
+### Query 2
+
+```sql
+  SELECT t.PULocationID as location, SUM(t.tips) AS total_tips, t.window_start as start, t.window_end AS end
+  INTO table2
+  FROM TTL(HOP(trips, pickup_datetime, '1 MINUTE', '3 MINUTES'), pickup_datetime, '5 MINUTES') t
+  GROUP BY t.PULocationID, t.window_start, t.window_end;
+```
+
+![ttl_graph](../images/ttl_graph.png)
+
+## Running
+
+### Dozer
+
+To run Dozer, navigate to the TTL folder `/sql/ttl` and use the following command:
+
+```bash
+dozer run
+```
+
+To remove the cache directory, use
+
+```bash
+dozer clean
+```
+
+### Dozer Live
+
+To run with Dozer live, replace `run` with `live`
+
+```bash
+dozer live
+```
+
+Dozer live automatically deletes the cache upon stopping the program.
+
+## Querying Dozer
+
+The Dozer API lets us use `filter`, `limit`, `order_by` and `skip` at the endpoints. For this example, let's order the data in descending order of `total_tips`.
+
+Execute the following commands in bash to get the results from the `REST` and `gRPC` APIs.
+
+### Query 1
+
+**`REST`**
+
+```bash
+curl -X POST http://localhost:8080/tumble_ttl/query \
+--header 'Content-Type: application/json' \
+--data-raw '{"$order_by": {"total_tips": "desc"}}'
+```
+
+**`gRPC`**
+
+```bash
+grpcurl -d '{"endpoint": "tumble_ttl", "query": "{\"$order_by\": {\"total_tips\": \"desc\"}}"}' \
+-plaintext localhost:50051 \
+dozer.common.CommonGrpcService/query
+```
+
+### Query 2
+
+**`REST`**
+
+```bash
+curl -X POST http://localhost:8080/hop_ttl/query \
+--header 'Content-Type: application/json' \
+--data-raw '{"$order_by": {"total_tips": "desc"}}'
+```
+
+**`gRPC`**
+
+```bash
+grpcurl -d '{"endpoint": "hop_ttl", "query": "{\"$order_by\": {\"total_tips\": \"desc\"}}"}' \
+-plaintext localhost:50051 \
+dozer.common.CommonGrpcService/query
+```
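
The REST calls in the README above could also be issued from a script. The following is a minimal sketch using only the Python standard library, assuming Dozer is running locally on port 8080 as in the `curl` examples. The helper names `build_query_request` and `run_query` are hypothetical and are not part of any Dozer client library.

```python
import json
import urllib.request


def build_query_request(endpoint, query, base_url="http://localhost:8080"):
    """Build a POST request for <base_url>/<endpoint>/query with a JSON body.

    `endpoint` is an endpoint path from dozer-config.yaml without the leading
    slash (for example "tumble_ttl"); `query` is a dict such as
    {"$order_by": {"total_tips": "desc"}}.
    """
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{endpoint}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(endpoint, query):
    """Send the request and decode the JSON response (needs a running Dozer)."""
    with urllib.request.urlopen(build_query_request(endpoint, query)) as resp:
        return json.loads(resp.read())
```

With Dozer running, `run_query("hop_ttl", {"$order_by": {"total_tips": "desc"}})` should mirror the second `curl` command; the shape of the returned JSON is not shown here because it depends on the Dozer version.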
diff --git a/sql/ttl/dozer-config.yaml b/sql/ttl/dozer-config.yaml
@@ -0,0 +1,46 @@
+app_name: ttl-sample
+version: 1
+
+connections:
+  - config: !LocalStorage
+      details:
+        path: ../data
+      tables:
+        - !Table
+          name: trips
+          config: !Parquet
+            path: trips
+            extension: .parquet
+    name: ny_taxi
+
+sources:
+  - name: trips
+    table_name: trips
+    connection: ny_taxi
+
+sql: |
+
+  -- get the total tips for each location in a 2 minute window
+
+  SELECT t.PULocationID as location, SUM(t.tips) AS total_tips, t.window_start as start, t.window_end AS end
+  INTO table1
+  FROM TTL(TUMBLE(trips, pickup_datetime, '2 MINUTES'), pickup_datetime, '5 MINUTES') t
+  GROUP BY t.PULocationID, t.window_start, t.window_end;
+
+  -- get the total tips for each location over 3 minute windows that start every 1 minute
+
+  SELECT t.PULocationID as location, SUM(t.tips) AS total_tips, t.window_start as start, t.window_end AS end
+  INTO table2
+  FROM TTL(HOP(trips, pickup_datetime, '1 MINUTE', '3 MINUTES'), pickup_datetime, '5 MINUTES') t
+  GROUP BY t.PULocationID, t.window_start, t.window_end;
+
+endpoints:
+  - name: tumble_ttl
+    path: /tumble_ttl
+    table_name: table1
+
+  - name: hop_ttl
+    path: /hop_ttl
+    table_name: table2
+
+cache_max_map_size: 2147483648
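
To make the window parameters in the two queries concrete, here is a small Python sketch of how a single `pickup_datetime` could map to `TUMBLE` and `HOP` windows, plus a simple TTL check. It rests on assumptions that the source does not confirm: windows are aligned to the Unix epoch, intervals are half-open `[start, end)`, and a record counts as expired once it is more than the TTL older than the newest timestamp seen. Dozer's actual alignment and eviction policy may differ, and the function names are illustrative only.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # assumed alignment origin for window starts


def tumble(ts, size):
    """Return the one [start, end) window of length `size` that contains `ts`."""
    start = ts - (ts - EPOCH) % size
    return start, start + size


def hop(ts, hop_size, window_size):
    """Return every [start, end) window containing `ts`.

    A new window starts every `hop_size`; each lasts `window_size`.
    """
    start = ts - (ts - EPOCH) % hop_size  # latest window start at or before ts
    windows = []
    while start + window_size > ts:
        windows.append((start, start + window_size))
        start -= hop_size
    return sorted(windows)


def within_ttl(ts, newest_seen, ttl):
    """Assumed TTL rule: keep records at most `ttl` older than the newest one."""
    return newest_seen - ts <= ttl


pickup = datetime(2022, 1, 1, 0, 4, 30)
tumble_window = tumble(pickup, timedelta(minutes=2))
hop_windows = hop(pickup, timedelta(minutes=1), timedelta(minutes=3))
```

Under these assumptions, a pickup at 00:04:30 falls into a single 2-minute tumbling window starting at 00:04:00, and into three 3-minute hopping windows starting at 00:02:00, 00:03:00 and 00:04:00.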