Merge pull request #123 from getdozer/ttl-sample
added ttl sql sample
supergi0 authored Oct 30, 2023
2 parents d3ed937 + 9c949b7 commit 923bd33
Showing 5 changed files with 169 additions and 14 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -40,6 +40,7 @@ Refer to the [Installation section](https://getdozer.io/docs/installation) for i
| | [Using Sub queries](./sql/sub-queries) | How to use sub queries in Dozer |
| | [Using UNIONs](./sql/union) | How to combine data using `UNION` in Dozer |
| | [Using Window Functions](./sql/window-functions) | Use `Hop` and `Tumble` Windows |
| | [Using TTL](./sql/ttl) | Use `TTL` to manage memory usage |
| | | |
| Use Cases | [Flight Microservices](./usecases/pg-flights) | Build APIs over multiple microservices. |
| | [Scaling Ecommerce](./usecases/scaling-ecommerce) | Profile and benchmark Dozer using an ecommerce data set |
29 changes: 15 additions & 14 deletions sql/README.md
@@ -2,7 +2,7 @@

This is a comprehensive guide showcasing different types of queries possible with Dozer SQL.

## Dataset

We will be using two tables throughout this guide. These tables are from [NYC - TLC Trip Record Data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). To download them, run the command:

@@ -16,8 +16,7 @@ This table is contained in a parquet file under `data/trips/fhvhv_tripdata_2022-

![table_1_image](/sql/images/table_1.png)


### Table 2: taxi_zone_lookup

This table is contained in a csv file under `data/zones/taxi_zone_lookup.csv`.

@@ -36,20 +35,22 @@ Hence, the basic statement structure is,
```sql
SELECT A INTO C FROM B;
```

The datatypes and casting compatible with Dozer SQL are described in the [documentation for datatypes and casting](https://getdozer.io/docs/transforming-data/data-types).

Dozer SQL also supports primitive scalar functions, described in the [documentation for scalar functions](https://getdozer.io/docs/transforming-data/scalar-functions).

## Table of contents

Let us start with basic Dozer SQL queries and move towards more complex queries.

| Sr.no | Query type                                       | Description                                                         |
| ----- | ------------------------------------------------ | ------------------------------------------------------------------- |
| 1     | [Filtering](./filtering/README.md)               | A simple select operation with a `WHERE` clause                     |
| 2     | [Aggregation](./aggregation/README.md)           | Multiple queries each describing a specific aggregation on the data |
| 3     | [JOIN](./join/README.md)                         | Query to JOIN the tables based on `LocationID`                      |
| 4     | [CTEs](./cte/README.md)                          | Query with two CTE tables JOINed after filtering                    |
| 5     | [Sub queries](./sub-queries/README.md)           | Multiple queries describing nested `SELECT` statements              |
| 6     | [UNION](./union/README.md)                       | A `UNION` performed inside a CTE, followed by a `JOIN`              |
| 7     | [Window functions](./window-functions/README.md) | Queries describing the use of `TUMBLE` and `HOP`                    |
| 8     | [TTL](./ttl/README.md)                           | Queries describing the use of `TTL`                                 |
Binary file added sql/images/ttl_graph.png
107 changes: 107 additions & 0 deletions sql/ttl/README.md
@@ -0,0 +1,107 @@
# TTL function example

This example shows how to use the Time To Live (TTL) function in Dozer SQL.

The TTL function provides a way to manage memory usage in Dozer, particularly when dealing with large streams of data. Setting a TTL ensures that only relevant (recent) data is held in memory, balancing data retention against memory efficiency. TTL is based on the record's timestamp, so eviction follows the time recorded in the data.

To read more about the TTL function, see the [documentation](https://getdozer.io/docs/transforming-data/windowing#ttl).
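
As a rough mental model only (this is not Dozer's implementation), timestamp-based TTL can be pictured as a buffer that tracks the newest record timestamp seen so far and drops anything older than that timestamp minus the TTL. The `TtlBuffer` class and its method names are hypothetical and written in Python purely for illustration.

```python
from datetime import datetime, timedelta


class TtlBuffer:
    """Hypothetical illustration of timestamp-based eviction (not Dozer code)."""

    def __init__(self, ttl: timedelta):
        self.ttl = ttl
        self.records = []   # list of (timestamp, payload) tuples
        self.latest = None  # newest record timestamp seen so far

    def insert(self, timestamp: datetime, payload) -> None:
        # Advance the high-water mark using the record's own timestamp.
        if self.latest is None or timestamp > self.latest:
            self.latest = timestamp
        self.records.append((timestamp, payload))
        # Evict everything older than (latest - ttl).
        cutoff = self.latest - self.ttl
        self.records = [(ts, p) for ts, p in self.records if ts >= cutoff]


buffer = TtlBuffer(ttl=timedelta(minutes=5))
base = datetime(2022, 1, 1, 0, 0)
buffer.insert(base, "trip A")
buffer.insert(base + timedelta(minutes=2), "trip B")
buffer.insert(base + timedelta(minutes=6), "trip C")  # moves the cutoff to 00:01
print([p for _, p in buffer.records])  # ['trip B', 'trip C']
```

In this sketch "trip A" is dropped once a record six minutes newer arrives, because it falls outside the 5-minute TTL.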

Here we describe two queries that only use fresh data obtained over a 5-minute window:

- Query to calculate the sum of tips obtained for a particular pickup location over a 2-minute window.

- Query to calculate the sum of tips obtained for a particular pickup location over a 3-minute window, where the windows overlap by 1 minute,
  i.e. the 3 minutes are divided into:
  - 1 minute overlapping with the previous window
  - 1 minute non-overlapping
  - 1 minute overlapping with the next window
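
To make the window shapes concrete, the following Python sketch computes which windows a single pickup time falls into. It assumes the argument order `HOP(table, time_column, hop_size, window_size)` and window boundaries aligned to the Unix epoch; both are assumptions rather than confirmed Dozer behaviour, and `tumble_window` and `hop_windows` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tumble_window(ts, size):
    """Return the single (start, end) tumbling window that contains ts."""
    start = ts - (ts - EPOCH) % size
    return start, start + size


def hop_windows(ts, hop, size):
    """Return every (start, end) hopping window that contains ts, oldest first."""
    # Latest window start at or before ts, aligned to a multiple of the hop size.
    start = ts - (ts - EPOCH) % hop
    windows = []
    while start + size > ts:  # window end is exclusive
        windows.append((start, start + size))
        start -= hop
    return list(reversed(windows))


pickup = datetime(2022, 1, 1, 0, 4, 30, tzinfo=timezone.utc)
print(tumble_window(pickup, timedelta(minutes=2)))
for window in hop_windows(pickup, timedelta(minutes=1), timedelta(minutes=3)):
    print(window)
```

Under these assumptions, the 2-minute tumbling window for a pickup at 00:04:30 is 00:04 to 00:06, and a 1-minute hop with a 3-minute size places the same pickup in three hopping windows.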

## SQL Query and Structure

### Query 1

```sql
SELECT t.PULocationID AS location, SUM(t.tips) AS total_tips, t.window_start AS start, t.window_end AS end
INTO table1
FROM TTL(TUMBLE(trips, pickup_datetime, '2 MINUTES'), pickup_datetime, '5 MINUTES') t
GROUP BY t.PULocationID, t.window_start, t.window_end;
```

### Query 2

```sql
SELECT t.PULocationID AS location, SUM(t.tips) AS total_tips, t.window_start AS start, t.window_end AS end
INTO table2
FROM TTL(HOP(trips, pickup_datetime, '1 MINUTE', '3 MINUTES'), pickup_datetime, '5 MINUTES') t
GROUP BY t.PULocationID, t.window_start, t.window_end;
```

![ttl_graph](../images/ttl_graph.png)

## Running

### Dozer

To run Dozer, navigate to the TTL folder `/sql/ttl` and use the following command:

```bash
dozer run
```

To remove the cache directory, use

```bash
dozer clean
```

### Dozer Live

To run with Dozer Live, replace `run` with `live`:

```bash
dozer live
```

Dozer Live automatically deletes the cache when the program stops.

## Querying Dozer

The Dozer API lets us use `filter`, `limit`, `order_by` and `skip` at the endpoints. For this example, let's order the data in descending order of `total_tips`.

Execute the following commands in bash to get the results from the `REST` and `gRPC` APIs.

### Query 1

**`REST`**

```bash
curl -X POST http://localhost:8080/tumble_ttl/query \
--header 'Content-Type: application/json' \
--data-raw '{"$order_by": {"total_tips": "desc"}}'
```
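
The same REST request can also be assembled programmatically. The Python sketch below only builds the request and does not send it; `build_query_request` is a hypothetical helper, and `$limit` is assumed to follow the same `$`-prefixed convention as the `$order_by` key used above.

```python
import json
import urllib.request


def build_query_request(endpoint, query, base_url="http://localhost:8080"):
    """Build (but do not send) a POST request for a Dozer REST query endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/{endpoint}/query",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "$limit" is an assumption; only "$order_by" appears in the commands above.
query = {"$order_by": {"total_tips": "desc"}, "$limit": 5}
request = build_query_request("tumble_ttl", query)
print(request.full_url)
print(request.data.decode("utf-8"))

# With Dozer running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     rows = json.load(response)
```

Keeping construction separate from sending makes it easy to inspect the payload before a Dozer instance is available.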

**`gRPC`**

```bash
grpcurl -d '{"endpoint": "tumble_ttl", "query": "{\"$order_by\": {\"total_tips\": \"desc\"}}"}' \
-plaintext localhost:50051 \
dozer.common.CommonGrpcService/query
```
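
The `query` field in the gRPC payload is itself a JSON document passed as a string, which is why its inner quotes are escaped. As a small sketch, the escaping can be generated in Python by encoding twice instead of writing it by hand; `grpc_query_payload` is a hypothetical helper name.

```python
import json


def grpc_query_payload(endpoint, query):
    """Return the JSON text for grpcurl's -d flag; `query` is embedded as a JSON string."""
    return json.dumps({"endpoint": endpoint, "query": json.dumps(query)})


payload = grpc_query_payload("tumble_ttl", {"$order_by": {"total_tips": "desc"}})
print(payload)
```

The printed text matches the argument given to `grpcurl -d` above and can be pasted in its place.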

### Query 2

**`REST`**

```bash
curl -X POST http://localhost:8080/hop_ttl/query \
--header 'Content-Type: application/json' \
--data-raw '{"$order_by": {"total_tips": "desc"}}'
```

**`gRPC`**

```bash
grpcurl -d '{"endpoint": "hop_ttl", "query": "{\"$order_by\": {\"total_tips\": \"desc\"}}"}' \
-plaintext localhost:50051 \
dozer.common.CommonGrpcService/query
```
46 changes: 46 additions & 0 deletions sql/ttl/dozer-config.yaml
@@ -0,0 +1,46 @@
app_name: ttl-sample
version: 1

connections:
  - config: !LocalStorage
      details:
        path: ../data
      tables:
        - !Table
          name: trips
          config: !Parquet
            path: trips
            extension: .parquet
    name: ny_taxi

sources:
  - name: trips
    table_name: trips
    connection: ny_taxi

sql: |
  -- get the total tips for each location in a 2 minute window
  SELECT t.PULocationID AS location, SUM(t.tips) AS total_tips, t.window_start AS start, t.window_end AS end
  INTO table1
  FROM TTL(TUMBLE(trips, pickup_datetime, '2 MINUTES'), pickup_datetime, '5 MINUTES') t
  GROUP BY t.PULocationID, t.window_start, t.window_end;
  -- get the total tips for each location where every window of 3 minutes overlaps by 1 minute
  SELECT t.PULocationID AS location, SUM(t.tips) AS total_tips, t.window_start AS start, t.window_end AS end
  INTO table2
  FROM TTL(HOP(trips, pickup_datetime, '1 MINUTE', '3 MINUTES'), pickup_datetime, '5 MINUTES') t
  GROUP BY t.PULocationID, t.window_start, t.window_end;
endpoints:
  - name: tumble_ttl
    path: /tumble_ttl
    table_name: table1

  - name: hop_ttl
    path: /hop_ttl
    table_name: table2

cache_max_map_size: 2147483648
