
Commit

Merge pull request #14 from DanRoscigno/update2
update
DanRoscigno authored Oct 6, 2023
2 parents 27f8dd3 + 91b87a0 commit 672b19e
Showing 17 changed files with 963 additions and 372 deletions.
2 changes: 1 addition & 1 deletion versioned_docs/version-3.0/administration/Configuration.md
@@ -1769,7 +1769,7 @@ BE static parameters are as follows.

#### user_function_dir

- **Default**: `${STARROCKS_HOME}/lib/udfi`
- **Default**: `${STARROCKS_HOME}/lib/udf`
- **Unit**: N/A
- **Description**: The directory used to store User-defined Functions (UDFs).

2 changes: 1 addition & 1 deletion versioned_docs/version-3.1/administration/Configuration.md
@@ -1752,7 +1752,7 @@ BE static parameters are as follows.

#### user_function_dir

- **Default**: `${STARROCKS_HOME}/lib/udfi`
- **Default**: `${STARROCKS_HOME}/lib/udf`
- **Unit**: N/A
- **Description**: The directory used to store User-defined Functions (UDFs).

@@ -0,0 +1,32 @@

**Before starting CNs**, add the following configuration items in the CN configuration file **cn.conf**:

```Properties
starlet_port = <starlet_port>
storage_root_path = <storage_root_path>
```

#### starlet_port

The CN heartbeat service port for the StarRocks shared-data cluster. Default value: `9070`.

#### storage_root_path

The storage volume directory that the locally cached data depends on, and the medium type of the storage. Multiple volumes are separated by semicolons (;). If the storage medium is SSD, add `,medium:ssd` at the end of the directory. If the storage medium is HDD, add `,medium:hdd` at the end of the directory. Example: `/data1,medium:hdd;/data2,medium:ssd`.

The default value for `storage_root_path` is `${STARROCKS_HOME}/storage`.
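
The value format described above can be illustrated with a short Python sketch. This is only an illustration of the documented syntax, not the parser StarRocks uses; the names `StorageVolume` and `parse_storage_root_path` are hypothetical, and treating an empty value as "no volumes" is an assumption.

```python
from typing import List, NamedTuple, Optional


class StorageVolume(NamedTuple):
    """One entry of storage_root_path (hypothetical helper type)."""
    path: str
    medium: Optional[str]  # "ssd", "hdd", or None when no medium is given


def parse_storage_root_path(value: str) -> List[StorageVolume]:
    """Split a storage_root_path value into (path, medium) entries."""
    volumes = []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # assumption: an empty value yields no volumes
        path, _, option = entry.partition(",")
        medium = None
        if option:
            key, _, val = option.partition(":")
            if key.strip() != "medium" or val.strip() not in ("ssd", "hdd"):
                raise ValueError(f"unsupported option: {option!r}")
            medium = val.strip()
        volumes.append(StorageVolume(path.strip(), medium))
    return volumes


# The example value from the description above.
print(parse_storage_root_path("/data1,medium:hdd;/data2,medium:ssd"))
```
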

Local cache is effective when queries are frequent and the data being queried is recent, but there are cases in which you may want to turn off the local cache completely:

- In a Kubernetes environment with CN pods that scale up and down in number on demand, the pods may not have storage volumes attached.
- When the data being queried is in a data lake in remote storage and most of it is archive (old) data, infrequent queries result in a low cache hit ratio, and the benefit may not justify maintaining the cache.

To turn off the data cache, set:

```Properties
storage_root_path =
```

> **NOTE**
>
> The data is cached under the directory **`<storage_root_path>/starlet_cache`**.
@@ -0,0 +1,20 @@
This topic describes how to deploy and use a shared-data StarRocks cluster. This feature is supported from v3.0 for S3 compatible storage and v3.1 for Azure Blob Storage.

> **NOTE**
>
> StarRocks version 3.1 brings some changes to the shared-data deployment and configuration. Please use this document if you are running version 3.1 or higher.
>
> If you are running version 3.0, please use the [3.0 documentation](https://docs.starrocks.io/en-us/3.0/deployment/deploy_shared_data).

The shared-data StarRocks cluster is specifically engineered for the cloud on the premise of separation of storage and compute. It allows data to be stored in object storage (for example, AWS S3, Google GCS, Azure Blob Storage, and MinIO). You can achieve not only cheaper storage and better resource isolation, but also elastic scalability for your cluster. The query performance of the shared-data StarRocks cluster aligns with that of a shared-nothing StarRocks cluster when the local disk cache is hit.

In version 3.1 and higher, the StarRocks shared-data cluster is made up of Frontend Engines (FEs) and Compute Nodes (CNs). The CNs replace the classic Backend Engines (BEs) in shared-data clusters.

Compared to the classic shared-nothing StarRocks architecture, separation of storage and compute offers a wide range of benefits. By decoupling these components, StarRocks provides:

- Inexpensive and seamlessly scalable storage.
- Elastically scalable compute. Because data is not stored in Compute Nodes (CNs), scaling can be done without data migration or shuffling across nodes.
- Local disk cache for hot data to boost query performance.
- Asynchronous data ingestion into object storage, allowing a significant improvement in loading performance.

98 changes: 98 additions & 0 deletions versioned_docs/version-3.1/assets/commonMarkdown/sharedDataUse.md
@@ -0,0 +1,98 @@

For more information on how to create a storage volume for other object storages and set the default storage volume, see [CREATE STORAGE VOLUME](../../sql-reference/sql-statements/Administration/CREATE%20STORAGE%20VOLUME.md) and [SET DEFAULT STORAGE VOLUME](../../sql-reference/sql-statements/Administration/SET%20DEFAULT%20STORAGE%20VOLUME.md).

### Create a database and a cloud-native table

After you create a default storage volume, you can then create a database and a cloud-native table using this storage volume.

Currently, shared-data StarRocks clusters support the following table types:

- Duplicate Key table
- Aggregate table
- Unique Key table
- Primary Key table (Currently, the primary key persistent index is not supported.)

The following example creates a database `cloud_db` and a table `detail_demo` based on the Duplicate Key table type, enables the local disk cache, sets the hot data validity duration to one month, and disables asynchronous data ingestion into object storage:

```SQL
CREATE DATABASE cloud_db;
USE cloud_db;
CREATE TABLE IF NOT EXISTS detail_demo (
    recruit_date  DATE           NOT NULL COMMENT "YYYY-MM-DD",
    region_num    TINYINT        COMMENT "range [-128, 127]",
    num_plate     SMALLINT       COMMENT "range [-32768, 32767] ",
    tel           INT            COMMENT "range [-2147483648, 2147483647]",
    id            BIGINT         COMMENT "range [-2^63 + 1 ~ 2^63 - 1]",
    password      LARGEINT       COMMENT "range [-2^127 + 1 ~ 2^127 - 1]",
    name          CHAR(20)       NOT NULL COMMENT "range char(m),m in (1-255) ",
    profile       VARCHAR(500)   NOT NULL COMMENT "upper limit value 65533 bytes",
    ispass        BOOLEAN        COMMENT "true/false")
DUPLICATE KEY(recruit_date, region_num)
DISTRIBUTED BY HASH(recruit_date, region_num)
PROPERTIES (
    "storage_volume" = "def_volume",
    "datacache.enable" = "true",
    "datacache.partition_duration" = "1 MONTH",
    "enable_async_write_back" = "false"
);
```

> **NOTE**
>
> The default storage volume is used when you create a database or a cloud-native table in a shared-data StarRocks cluster if no storage volume is specified.

In addition to the regular table `PROPERTIES`, you need to specify the following `PROPERTIES` when creating a table for a shared-data StarRocks cluster:

#### datacache.enable

Whether to enable the local disk cache.

- `true` (Default) When this property is set to `true`, the data to be loaded is simultaneously written into the object storage and the local disk (as the cache for query acceleration).
- `false` When this property is set to `false`, the data is loaded only into the object storage.

> **NOTE**
>
> In version 3.0 this property was named `enable_storage_cache`.
>
> To enable the local disk cache, you must specify the directory of the disk in the CN configuration item `storage_root_path`.

#### datacache.partition_duration

The validity duration of the hot data. When the local disk cache is enabled, all data is loaded into the cache. When the cache is full, StarRocks deletes the less recently used data from the cache. When a query needs to scan the deleted data, StarRocks checks if the data is within the duration of validity. If the data is within the duration, StarRocks loads the data into the cache again. If the data is not within the duration, StarRocks does not load it into the cache. This property is a string value that can be specified with the following units: `YEAR`, `MONTH`, `DAY`, and `HOUR`, for example, `7 DAY` and `12 HOUR`. If it is not specified, all data is cached as the hot data.
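
The reload decision described above can be modeled with a small Python sketch. It is a simplified illustration, not StarRocks' implementation: the function names are hypothetical, the unit lengths for `MONTH` and `YEAR` are approximations chosen for the example, and measuring validity from a partition's timestamp to the current time is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional

# Approximate unit lengths in hours, chosen only for this illustration.
_UNIT_HOURS = {"HOUR": 1, "DAY": 24, "MONTH": 24 * 30, "YEAR": 24 * 365}


def parse_duration(value: str) -> timedelta:
    """Convert a value such as '7 DAY' or '12 HOUR' into a timedelta."""
    count, unit = value.split()
    return timedelta(hours=int(count) * _UNIT_HOURS[unit.upper()])


def should_reload_into_cache(
    partition_time: datetime, now: datetime, duration: Optional[str]
) -> bool:
    """Decide whether evicted data is loaded into the cache again on a scan."""
    if duration is None:
        # Property not specified: all data is cached as hot data.
        return True
    # Assumption: data is "within the duration" if its partition time is
    # no older than `duration` relative to the current time.
    return now - partition_time <= parse_duration(duration)


now = datetime(2023, 10, 6)
print(should_reload_into_cache(datetime(2023, 10, 1), now, "7 DAY"))
print(should_reload_into_cache(datetime(2023, 9, 1), now, "7 DAY"))
```
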

> **NOTE**
>
> In version 3.0 this property was named `storage_cache_ttl`.
>
> This property is available only when `datacache.enable` is set to `true`.

#### enable_async_write_back

Whether to allow data to be written into object storage asynchronously. Default: `false`.

- `true` When this property is set to `true`, the load task returns success as soon as the data is written into the local disk cache, and the data is written into the object storage asynchronously. This allows better loading performance, but it also puts data reliability at risk if the system fails before the data reaches object storage.
- `false` (Default) When this property is set to `false`, the load task returns success only after the data is written into both object storage and the local disk cache. This guarantees higher availability but leads to lower loading performance.

### View table information

You can view the information of tables in a specific database using `SHOW PROC "/dbs/<db_id>"`. See [SHOW PROC](../../sql-reference/sql-statements/Administration/SHOW%20PROC.md) for more information.

Example:

```Plain
mysql> SHOW PROC "/dbs/xxxxx";
+---------+-------------+----------+---------------------+--------------+--------+--------------+--------------------------+--------------+---------------+------------------------------+
| TableId | TableName   | IndexNum | PartitionColumnName | PartitionNum | State  | Type         | LastConsistencyCheckTime | ReplicaCount | PartitionType | StoragePath                  |
+---------+-------------+----------+---------------------+--------------+--------+--------------+--------------------------+--------------+---------------+------------------------------+
| 12003   | detail_demo | 1        | NULL                | 1            | NORMAL | CLOUD_NATIVE | NULL                     | 8            | UNPARTITIONED | s3://xxxxxxxxxxxxxx/1/12003/ |
+---------+-------------+----------+---------------------+--------------+--------+--------------+--------------------------+--------------+---------------+------------------------------+
```

The `Type` of a table in a shared-data StarRocks cluster is `CLOUD_NATIVE`. In the field `StoragePath`, StarRocks returns the object storage directory where the table is stored.

### Load data into a shared-data StarRocks cluster

Shared-data StarRocks clusters support all loading methods provided by StarRocks. See [Overview of data loading](../../loading/Loading_intro.md) for more information.

### Query in a shared-data StarRocks cluster

Tables in a shared-data StarRocks cluster support all types of queries provided by StarRocks. See StarRocks [SELECT](../../sql-reference/sql-statements/data-manipulation/SELECT.md) for more information.
@@ -0,0 +1,13 @@

The usage of shared-data StarRocks clusters is also similar to that of a classic shared-nothing StarRocks cluster, except that the shared-data cluster uses storage volumes and cloud-native tables to store data in object storage.

### Create default storage volume

You can use the built-in storage volumes that StarRocks automatically creates, or you can manually create and set the default storage volume. This section describes how to manually create and set the default storage volume.

> **NOTE**
>
> If your shared-data StarRocks cluster is upgraded from v3.0, you do not need to define a default storage volume because StarRocks created one with the object storage-related properties you specified in the FE configuration file **fe.conf**. You can still create new storage volumes with other object storage resources and set the default storage volume differently.

To give your shared-data StarRocks cluster permission to store data in your object storage, you must reference a storage volume when you create databases or cloud-native tables. A storage volume consists of the properties and credential information of the remote data storage. If you have deployed a new shared-data StarRocks cluster and have disallowed StarRocks from creating a built-in storage volume (by setting `enable_load_volume_from_conf` to `false`), you must define a default storage volume before you can create databases and tables in the cluster.
