Commit 0e41b8b

Merge pull request pinot-contrib#279 from noramullen1/images-cleanup
Give images more descriptive names
noramullen1 authored Jan 22, 2024
2 parents ee82dcf + bb4d19a commit 0e41b8b
Showing 13 changed files with 11 additions and 12 deletions.
File renamed without changes
Binary file removed .gitbook/assets/balanced.png
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
File renamed without changes
13 changes: 6 additions & 7 deletions operators/operating-pinot/instance-assignment.md
@@ -64,7 +64,7 @@ After configuring the server tags, the Tag-Based Instance Assignment can be enab

On top of the Tag-Based Instance Assignment, we can also control the number of servers assigned to each table by configuring the `numInstances` in the InstanceAssignmentConfig. This is useful when we want to serve multiple tables of different sizes on the same set of servers. For example, suppose we have 30 servers hosting hundreds of tables for different analytics use cases; we don’t want to use all 30 servers for each table, especially for the tiny tables with only megabytes of data.

![](../../.gitbook/assets/control.png)
![](../../.gitbook/assets/control-instance-assigment.png)

{% code title="TableConfig for Table 1:" %}
```javascript
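
As a rough illustration of what that block configures, here is a minimal sketch of the `instanceAssignmentConfigMap` section of the table config; the server tag and the instance count are illustrative, and the exact nesting may differ from the collapsed example:

```javascript
{
  "instanceAssignmentConfigMap": {
    "OFFLINE": {
      "tagPoolConfig": {
        "tag": "Tag1_OFFLINE"          // illustrative server tag
      },
      "replicaGroupPartitionConfig": {
        "numInstances": 5              // use only 5 of the tagged servers for this table
      }
    }
  }
}
```
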
@@ -88,7 +88,7 @@ On top of the Tag-Based Instance Assignment, we can also control the number of s

In order to use the [Replica-Group Segment Assignment](segment-assignment.md#replica-group-segment-assignment), the servers need to be assigned to multiple replica-groups of the table, which is where the Replica-Group Instance Assignment comes into the picture. Enable it and configure the `numReplicaGroups` and `numInstancesPerReplicaGroup` in the InstanceAssignmentConfig, and Pinot will assign the instances accordingly.

![](../../.gitbook/assets/replica.png)
![](../../.gitbook/assets/replica-instance-assignment.png)

{% code title="TableConfig for Table 1:" %}
```javascript
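
For orientation, a minimal sketch of a replica-group based `instanceAssignmentConfigMap` (two replica-groups of three instances each; the tag and the counts are illustrative and may not match the collapsed example):

```javascript
{
  "instanceAssignmentConfigMap": {
    "OFFLINE": {
      "tagPoolConfig": {
        "tag": "Tag1_OFFLINE"              // illustrative server tag
      },
      "replicaGroupPartitionConfig": {
        "replicaGroupBased": true,
        "numReplicaGroups": 2,             // illustrative: 2 replica-groups
        "numInstancesPerReplicaGroup": 3   // illustrative: 3 servers per replica-group
      }
    }
  }
}
```
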
@@ -116,7 +116,7 @@ Similar to the Replica-Group Segment Assignment, in order to use the [Partitione

(Note: The `numPartitions` configured here does not have to match the actual number of partitions for the table, in case the number of partitions has changed for some reason. If they do not match, the table partitions will be assigned to the server partitions in a round-robin fashion. For example, if there are 2 server partitions but 4 table partitions, table partitions 1 and 3 will be assigned to server partition 1, and table partitions 2 and 4 will be assigned to server partition 2.)

![](../../.gitbook/assets/partition.png)
![](../../.gitbook/assets/partition-instance-assignment.png)

{% code title="TableConfig for Table 1:" %}
```javascript
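
For orientation, a minimal sketch of a partitioned replica-group `instanceAssignmentConfigMap` (two replica-groups, two partitions, two instances per partition; all values are illustrative and may not match the collapsed example):

```javascript
{
  "instanceAssignmentConfigMap": {
    "OFFLINE": {
      "tagPoolConfig": {
        "tag": "Tag1_OFFLINE"            // illustrative server tag
      },
      "replicaGroupPartitionConfig": {
        "replicaGroupBased": true,
        "numReplicaGroups": 2,           // illustrative values throughout
        "numPartitions": 2,
        "numInstancesPerPartition": 2
      }
    }
  }
}
```
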
@@ -150,11 +150,10 @@ For LLC real-time table, all the stream events are split into several stream par

Without explicitly configuring the replica-group based instance assignment, the replicas of the stream partitions will be evenly spread over all the available instances as shown in the following diagram:

![](../../.gitbook/assets/llc.png)
![](../../.gitbook/assets/low-level-consumer-assignment.png)

With replica-group based instance assignment, the stream partitions will be evenly spread over the instances within the replica-group:
With replica-group based instance assignment, the stream partitions will be evenly spread over the instances within the replica group.

![](../../.gitbook/assets/llc\_replica.png)

## Pool-Based Instance Assignment

@@ -166,7 +165,7 @@ To use the Pool-Based Instance Assignment, each server should be assigned to a p

(Note: A table can have more replicas than the number of pools in the cluster, in which case the replica-groups will be assigned to the pools in a round-robin fashion, and the servers within a pool can host more than one replica of the table. It is still okay to shut down a whole pool without bringing down the table because there are other replicas hosted by servers from other pools.)

![](../../.gitbook/assets/pool.png)
![](../../.gitbook/assets/pool-instance-assignment.png)

{% code title="Helix InstanceConfig for Server 1:" %}
```javascript
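
For orientation, a sketch of how a server's pool membership is typically expressed, assuming the Helix InstanceConfig layout described in the Pinot instance assignment docs; the server ID, tag, and pool number are illustrative and the exact fields may differ:

```javascript
{
  "id": "Server_10.0.0.1_7000",          // illustrative server ID
  "listFields": {
    "TAG_LIST": [
      "Tag1_OFFLINE"                     // illustrative server tag
    ]
  },
  "mapFields": {
    "pool": {
      "Tag1_OFFLINE": "1"                // pool number for this tag (illustrative)
    }
  }
}
```

On the table side, pool-based selection is then typically enabled with a `poolBased` flag inside the `tagPoolConfig` of the InstanceAssignmentConfig.
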
6 changes: 3 additions & 3 deletions operators/operating-pinot/segment-assignment.md
@@ -12,7 +12,7 @@ Segment assignment refers to the strategy of assigning each segment from a table

Balanced Segment Assignment is the default assignment strategy, where each segment is assigned to the server with the fewest segments already assigned. With this strategy, each server ends up with a balanced query load, and each query is routed to all the servers. It requires minimal configuration and works well for small use cases.

![](../../.gitbook/assets/Balanced.png)
![](../../.gitbook/assets/balanced-segment-assignment.png)

## Replica-Group Segment Assignment

@@ -22,15 +22,15 @@ Replica-Group Segment Assignment is introduced to solve the horizontal scalabili

When executing queries, each query is routed only to the servers within a single replica-group. To scale up the cluster, more replica-groups can be added without affecting the fanout of the query, leaving query performance untouched while increasing the overall throughput linearly.

![](../../.gitbook/assets/ReplicaGroup.png)
![](../../.gitbook/assets/replica-group-segment-assignment.png)
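
To confine each query to a single replica-group at query time, the routing section of the table config selects a replica-group-aware instance selector. A sketch, assuming the `instanceSelectorType` option from Pinot's routing configuration:

```javascript
{
  "routing": {
    "instanceSelectorType": "replicaGroup"   // route each query to servers of one replica-group
  }
}
```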

## Partitioned Replica-Group Segment Assignment

To further increase query performance, we can reduce the number of segments processed for each query by partitioning the data and using the Partitioned Replica-Group Segment Assignment.

Partitioned Replica-Group Segment Assignment extends the Replica-Group Segment Assignment by assigning the segments from the same partition to the same set of servers. To serve a query that hits only one partition (e.g. `SELECT * FROM myTable WHERE memberId = 123`, where `myTable` is partitioned on the `memberId` column), the query only needs to be routed to the servers for the target partition, which can significantly reduce the number of segments to be processed. This strategy is especially useful for achieving high throughput and low latency for use cases that filter on an ID field.

![](../../.gitbook/assets/Partitioned.png)
![](../../.gitbook/assets/partitioned-segment-assignement.png)
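
Partition-aware pruning is switched on in the routing section of the table config. A sketch, assuming the `segmentPrunerTypes` option from Pinot's routing configuration:

```javascript
{
  "routing": {
    "segmentPrunerTypes": ["partition"]   // prune segments whose partition cannot match the filter
  }
}
```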

## Configure Segment Assignment

4 changes: 2 additions & 2 deletions operators/operating-pinot/tuning/routing.md
@@ -51,9 +51,9 @@ Apart from the ascending time, Apache Pinot can also take advantage of other dis
To make this pruning more efficient, each segment should contain as few partitions as possible, ideally exactly one. More formally, given a partition function `p`, for every segment `s` and any pair of rows `r1` and `r2` in `s`, it should hold that `p(r1) = p(r2)`. For example, in a table configured to have 3 partitions on the `memberId` column, using `modulo` as the partition function, a segment that contains a row with `memberId` = 101 may also contain another row with `memberId` = 2 and another with `memberId` = 335, but it should not contain a row with `memberId` = 336 or `memberId` = 334.
{% endhint %}

Data cannot always be partitioned by a dimension column or even when it is, not all queries can take advantage of the distribution. But when this optimization can be applied, a lot of segments can be pruned. The current implementation for partitioning only works for **EQUALITY** and **IN** filter (e.g. `memberId = xx`, `memberId IN (x, y, z)`). Below diagram gives the example of data partitioned on member id while the query includes an equality filter on member id.
Data cannot always be partitioned by a dimension column or even when it is, not all queries can take advantage of the distribution. But when this optimization can be applied, a lot of segments can be pruned. The current implementation for partitioning only works for **EQUALITY** and **IN** filter (e.g. `memberId = xx`, `memberId IN (x, y, z)`). Below diagram gives the example of data partitioned on member ID while the query includes an equality filter on member ID.

![](../../../.gitbook/assets/partitioning.png)
![](../../../.gitbook/assets/partition-on-member-id.png)

Apache Pinot currently supports the `Modulo`, `Murmur`, `ByteArray`, and `HashCode` hash functions, and partitioning can be enabled by setting the following configuration in the table config.
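
The configuration block itself is not shown in this diff; a representative sketch, assuming the `segmentPartitionConfig` layout from Pinot's partitioning docs and the 3-partition `memberId` example above:

```javascript
{
  "tableIndexConfig": {
    "segmentPartitionConfig": {
      "columnPartitionMap": {
        "memberId": {
          "functionName": "Modulo",   // one of Modulo, Murmur, ByteArray, HashCode
          "numPartitions": 3          // illustrative partition count
        }
      }
    }
  }
}
```

Routing then needs the partition pruner enabled (see the `segmentPrunerTypes` routing option) for queries filtering on `memberId` to benefit.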

