From 4667c849867e42da6db1e0df39700e04bb79bd56 Mon Sep 17 00:00:00 2001
From: ClownXC
Date: Thu, 19 Dec 2024 09:18:25 +0800
Subject: [PATCH] imap doc

---
 .../separated-cluster-deployment.md | 200 +++++++++---------
 .../separated-cluster-deployment.md | 200 +++++++++---------
 2 files changed, 202 insertions(+), 198 deletions(-)

diff --git a/docs/en/seatunnel-engine/separated-cluster-deployment.md b/docs/en/seatunnel-engine/separated-cluster-deployment.md
index 91215eb459a..a4bb3bac4d1 100644
--- a/docs/en/seatunnel-engine/separated-cluster-deployment.md
+++ b/docs/en/seatunnel-engine/separated-cluster-deployment.md
@@ -182,105 +182,7 @@ seatunnel:
     classloader-cache-mode: true
 ```
 
-### 4.6 Persistence Configuration of IMap (This parameter is invalid on the Worker node)
-
-:::tip
-
-Since in the separated cluster mode, only the Master node stores IMap data and the Worker node does not store IMap data, the Worker service will not read this parameter item.
-
-:::
-
-In SeaTunnel, we use IMap (a distributed Map that can implement the writing and reading of data across nodes and processes. For detailed information, please refer to [hazelcast map](https://docs.hazelcast.com/imdg/4.2/data-structures/map)) to store the state of each task and its task, so that after the node where the task is located fails, the state information of the task before can be obtained on other nodes, thereby recovering the task and realizing the fault tolerance of the task.
-
-By default, the information of IMap is only stored in the memory, and we can set the number of replicas of IMap data. For specific reference (4.1 Setting the number of backups of data in IMap), if the number of replicas is 2, it means that each data will be simultaneously stored in 2 different nodes. Once the node fails, the data in IMap will be automatically replenished to the set number of replicas on other nodes. But when all nodes are stopped, the data in IMap will be lost. When the cluster nodes are started again, all previously running tasks will be marked as failed and need to be recovered manually by the user through the seatunnel.sh -r instruction.
-
-To solve this problem, we can persist the data in IMap to an external storage such as HDFS, OSS, etc. In this way, even if all nodes are stopped, the data in IMap will not be lost, and when the cluster nodes are started again, all previously running tasks will be automatically recovered.
-
-The following describes how to use the MapStore persistence configuration. For detailed information, please refer to [hazelcast map](https://docs.hazelcast.com/imdg/4.2/data-structures/map)
-
-**type**
-
-The type of IMap persistence, currently only supports `hdfs`.
-
-**namespace**
-
-It is used to distinguish the data storage locations of different businesses, such as the OSS bucket name.
-
-**clusterName**
-
-This parameter is mainly used for cluster isolation. We can use it to distinguish different clusters, such as cluster1, cluster2, which is also used to distinguish different businesses.
-
-**fs.defaultFS**
-
-We use the hdfs api to read and write files, so providing the hdfs configuration is required for using this storage.
-
-If you use HDFS, you can configure it like this:
-
-```yaml
-map:
-  engine*:
-    map-store:
-      enabled: true
-      initial-mode: EAGER
-      factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
-      properties:
-        type: hdfs
-        namespace: /tmp/seatunnel/imap
-        clusterName: seatunnel-cluster
-        storage.type: hdfs
-        fs.defaultFS: hdfs://localhost:9000
-```
-
-If there is no HDFS and your cluster has only one node, you can configure it like this to use local files:
-
-```yaml
-map:
-  engine*:
-    map-store:
-      enabled: true
-      initial-mode: EAGER
-      factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
-      properties:
-        type: hdfs
-        namespace: /tmp/seatunnel/imap
-        clusterName: seatunnel-cluster
-        storage.type: hdfs
-        fs.defaultFS: file:///
-```
-
-If you use OSS, you can configure it like this:
-
-```yaml
-map:
-  engine*:
-    map-store:
-      enabled: true
-      initial-mode: EAGER
-      factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
-      properties:
-        type: hdfs
-        namespace: /tmp/seatunnel/imap
-        clusterName: seatunnel-cluster
-        storage.type: oss
-        block.size: block size(bytes)
-        oss.bucket: oss://bucket name/
-        fs.oss.accessKeyId: OSS access key id
-        fs.oss.accessKeySecret: OSS access key secret
-        fs.oss.endpoint: OSS endpoint
-```
-
-Notice: When using OSS, make sure that the following jars are in the lib directory.
-
-```
-aliyun-sdk-oss-3.13.2.jar
-hadoop-aliyun-3.3.6.jar
-jdom2-2.0.6.jar
-netty-buffer-4.1.89.Final.jar
-netty-common-4.1.89.Final.jar
-seatunnel-hadoop3-3.1.4-uber.jar
-```
-
-### 4.7 Job Scheduling Strategy
+### 4.6 Job Scheduling Strategy
 
 When resources are insufficient, the job scheduling strategy can be configured in the following two modes:
 
@@ -382,6 +284,106 @@ TCP is the way we recommend to use in a standalone SeaTunnel Engine cluster.
 
 On the other hand, Hazelcast provides some other service discovery methods. For details, please refer to [hazelcast network](https://docs.hazelcast.com/imdg/4.1/clusters/setting-up-clusters).
 
+### 5.3 Persistence Configuration of IMap (This parameter is invalid on the Worker node)
+
+:::tip
+
+Since in the separated cluster mode, only the Master node stores IMap data and the Worker node does not store IMap data, the Worker service will not read this parameter item.
+
+:::
+
+In SeaTunnel, we use IMap (a distributed Map that can implement the writing and reading of data across nodes and processes. For detailed information, please refer to [hazelcast map](https://docs.hazelcast.com/imdg/4.2/data-structures/map)) to store the state of each task and its task, so that after the node where the task is located fails, the state information of the task before can be obtained on other nodes, thereby recovering the task and realizing the fault tolerance of the task.
+
+By default, the information of IMap is only stored in the memory, and we can set the number of replicas of IMap data. For specific reference (4.1 Setting the number of backups of data in IMap), if the number of replicas is 2, it means that each data will be simultaneously stored in 2 different nodes. Once the node fails, the data in IMap will be automatically replenished to the set number of replicas on other nodes. But when all nodes are stopped, the data in IMap will be lost. When the cluster nodes are started again, all previously running tasks will be marked as failed and need to be recovered manually by the user through the seatunnel.sh -r instruction.
+
+To solve this problem, we can persist the data in IMap to an external storage such as HDFS, OSS, etc. In this way, even if all nodes are stopped, the data in IMap will not be lost, and when the cluster nodes are started again, all previously running tasks will be automatically recovered.
+
+The following describes how to use the MapStore persistence configuration. For detailed information, please refer to [hazelcast map](https://docs.hazelcast.com/imdg/4.2/data-structures/map)
+
+**type**
+
+The type of IMap persistence, currently only supports `hdfs`.
+
+**namespace**
+
+It is used to distinguish the data storage locations of different businesses, such as the OSS bucket name.
+
+**clusterName**
+
+This parameter is mainly used for cluster isolation. We can use it to distinguish different clusters, such as cluster1, cluster2, which is also used to distinguish different businesses.
+
+**fs.defaultFS**
+
+We use the hdfs api to read and write files, so providing the hdfs configuration is required for using this storage.
+
+If you use HDFS, you can configure it like this:
+
+```yaml
+map:
+  engine*:
+    map-store:
+      enabled: true
+      initial-mode: EAGER
+      factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
+      properties:
+        type: hdfs
+        namespace: /tmp/seatunnel/imap
+        clusterName: seatunnel-cluster
+        storage.type: hdfs
+        fs.defaultFS: hdfs://localhost:9000
+```
+
+If there is no HDFS and your cluster has only one node, you can configure it like this to use local files:
+
+```yaml
+map:
+  engine*:
+    map-store:
+      enabled: true
+      initial-mode: EAGER
+      factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
+      properties:
+        type: hdfs
+        namespace: /tmp/seatunnel/imap
+        clusterName: seatunnel-cluster
+        storage.type: hdfs
+        fs.defaultFS: file:///
+```
+
+If you use OSS, you can configure it like this:
+
+```yaml
+map:
+  engine*:
+    map-store:
+      enabled: true
+      initial-mode: EAGER
+      factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
+      properties:
+        type: hdfs
+        namespace: /tmp/seatunnel/imap
+        clusterName: seatunnel-cluster
+        storage.type: oss
+        block.size: block size(bytes)
+        oss.bucket: oss://bucket name/
+        fs.oss.accessKeyId: OSS access key id
+        fs.oss.accessKeySecret: OSS access key secret
+        fs.oss.endpoint: OSS endpoint
+```
+
+Notice: When using OSS, make sure that the following jars are in the lib directory.
+
+```
+aliyun-sdk-oss-3.13.2.jar
+hadoop-aliyun-3.3.6.jar
+jdom2-2.0.6.jar
+netty-buffer-4.1.89.Final.jar
+netty-common-4.1.89.Final.jar
+seatunnel-hadoop3-3.1.4-uber.jar
+```
+
+
+
 ## 6. Starting the SeaTunnel Engine Master Node
 
 It can be started using the `-d` parameter through the daemon.
diff --git a/docs/zh/seatunnel-engine/separated-cluster-deployment.md b/docs/zh/seatunnel-engine/separated-cluster-deployment.md
index bdc369ff8c0..e39b9604a39 100644
--- a/docs/zh/seatunnel-engine/separated-cluster-deployment.md
+++ b/docs/zh/seatunnel-engine/separated-cluster-deployment.md
@@ -186,105 +186,7 @@ seatunnel:
     classloader-cache-mode: true
 ```
 
-### 4.6 IMap持久化配置(该参数在Worker节点无效)
-
-:::tip
-
-由于在分离集群模式下,只有Master节点存储Imap数据,Worker节点不存储Imap数据,所以Worker服务不会读取该参数项。
-
-:::
-
-在SeaTunnel中,我们使用IMap(一种分布式的Map,可以实现数据跨节点跨进程的写入的读取 有关详细信息,请参阅 [Hazelcast Map](https://docs.hazelcast.com/imdg/4.2/data-structures/map)) 来存储每个任务及其task的状态,以便在任务所在节点宕机后,可以在其他节点上获取到任务之前的状态信息,从而恢复任务实现任务的容错。
-
-默认情况下Imap的信息只是存储在内存中,我们可以设置Imap数据的复本数,具体可参考(4.1 Imap中数据的备份数设置),如果复本数是2,代表每个数据会同时存储在2个不同的节点中。一旦节点宕机,Imap中的数据会重新在其它节点上自动补充到设置的复本数。但是当所有节点都被停止后,Imap中的数据会丢失。当集群节点再次启动后,所有之前正在运行的任务都会被标记为失败,需要用户手工通过seatunnel.sh -r 指令恢复运行。
-
-为了解决这个问题,我们可以将Imap中的数据持久化到外部存储中,如HDFS、OSS等。这样即使所有节点都被停止,Imap中的数据也不会丢失,当集群节点再次启动后,所有之前正在运行的任务都会被自动恢复。
-
-下面介绍如何使用 MapStore 持久化配置。有关详细信息,请参阅 [Hazelcast Map](https://docs.hazelcast.com/imdg/4.2/data-structures/map)
-
-**type**
-
-imap 持久化的类型,目前仅支持 `hdfs`。
-
-**namespace**
-
-它用于区分不同业务的数据存储位置,如 OSS 存储桶名称。
-
-**clusterName**
-
-此参数主要用于集群隔离, 我们可以使用它来区分不同的集群,如 cluster1、cluster2,这也用于区分不同的业务。
-
-**fs.defaultFS**
-
-我们使用 hdfs api 读写文件,因此使用此存储需要提供 hdfs 配置。
-
-如果您使用 HDFS,可以像这样配置:
-
-```yaml
-map:
-  engine*:
-    map-store:
-      enabled: true
-      initial-mode: EAGER
-      factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
-      properties:
-        type: hdfs
-        namespace: /tmp/seatunnel/imap
-        clusterName: seatunnel-cluster
-        storage.type: hdfs
-        fs.defaultFS: hdfs://localhost:9000
-```
-
-如果没有 HDFS,并且您的集群只有一个节点,您可以像这样配置使用本地文件:
-
-```yaml
-map:
-  engine*:
-    map-store:
-      enabled: true
-      initial-mode: EAGER
-      factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
-      properties:
-        type: hdfs
-        namespace: /tmp/seatunnel/imap
-        clusterName: seatunnel-cluster
-        storage.type: hdfs
-        fs.defaultFS: file:///
-```
-
-如果您使用 OSS,可以像这样配置:
-
-```yaml
-map:
-  engine*:
-    map-store:
-      enabled: true
-      initial-mode: EAGER
-      factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
-      properties:
-        type: hdfs
-        namespace: /tmp/seatunnel/imap
-        clusterName: seatunnel-cluster
-        storage.type: oss
-        block.size: block size(bytes)
-        oss.bucket: oss://bucket name/
-        fs.oss.accessKeyId: OSS access key id
-        fs.oss.accessKeySecret: OSS access key secret
-        fs.oss.endpoint: OSS endpoint
-```
-
-注意:使用OSS 时,确保 lib目录下有这几个jar.
-
-```
-aliyun-sdk-oss-3.13.2.jar
-hadoop-aliyun-3.3.6.jar
-jdom2-2.0.6.jar
-netty-buffer-4.1.89.Final.jar
-netty-common-4.1.89.Final.jar
-seatunnel-hadoop3-3.1.4-uber.jar
-```
-
-### 4.7 作业调度策略
+### 4.6 作业调度策略
 
 当资源不足时,作业调度策略可以配置为以下两种模式:
 
@@ -388,6 +290,106 @@ TCP 是我们建议在独立 SeaTunnel Engine 集群中使用的方式。
 
 另一方面,Hazelcast 提供了一些其他的服务发现方法。有关详细信息,请参阅 [Hazelcast Network](https://docs.hazelcast.com/imdg/4.1/clusters/setting-up-clusters)
 
+### 5.3 IMap持久化配置(该参数在Worker节点无效)
+
+:::tip
+
+由于在分离集群模式下,只有Master节点存储Imap数据,Worker节点不存储Imap数据,所以Worker服务不会读取该参数项。
+
+:::
+
+在SeaTunnel中,我们使用IMap(一种分布式的Map,可以实现数据跨节点跨进程的写入的读取 有关详细信息,请参阅 [Hazelcast Map](https://docs.hazelcast.com/imdg/4.2/data-structures/map)) 来存储每个任务及其task的状态,以便在任务所在节点宕机后,可以在其他节点上获取到任务之前的状态信息,从而恢复任务实现任务的容错。
+
+默认情况下Imap的信息只是存储在内存中,我们可以设置Imap数据的复本数,具体可参考(4.1 Imap中数据的备份数设置),如果复本数是2,代表每个数据会同时存储在2个不同的节点中。一旦节点宕机,Imap中的数据会重新在其它节点上自动补充到设置的复本数。但是当所有节点都被停止后,Imap中的数据会丢失。当集群节点再次启动后,所有之前正在运行的任务都会被标记为失败,需要用户手工通过seatunnel.sh -r 指令恢复运行。
+
+为了解决这个问题,我们可以将Imap中的数据持久化到外部存储中,如HDFS、OSS等。这样即使所有节点都被停止,Imap中的数据也不会丢失,当集群节点再次启动后,所有之前正在运行的任务都会被自动恢复。
+
+下面介绍如何使用 MapStore 持久化配置。有关详细信息,请参阅 [Hazelcast Map](https://docs.hazelcast.com/imdg/4.2/data-structures/map)
+
+**type**
+
+imap 持久化的类型,目前仅支持 `hdfs`。
+
+**namespace**
+
+它用于区分不同业务的数据存储位置,如 OSS 存储桶名称。
+
+**clusterName**
+
+此参数主要用于集群隔离, 我们可以使用它来区分不同的集群,如 cluster1、cluster2,这也用于区分不同的业务。
+
+**fs.defaultFS**
+
+我们使用 hdfs api 读写文件,因此使用此存储需要提供 hdfs 配置。
+
+如果您使用 HDFS,可以像这样配置:
+
+```yaml
+map:
+  engine*:
+    map-store:
+      enabled: true
+      initial-mode: EAGER
+      factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
+      properties:
+        type: hdfs
+        namespace: /tmp/seatunnel/imap
+        clusterName: seatunnel-cluster
+        storage.type: hdfs
+        fs.defaultFS: hdfs://localhost:9000
+```
+
+如果没有 HDFS,并且您的集群只有一个节点,您可以像这样配置使用本地文件:
+
+```yaml
+map:
+  engine*:
+    map-store:
+      enabled: true
+      initial-mode: EAGER
+      factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
+      properties:
+        type: hdfs
+        namespace: /tmp/seatunnel/imap
+        clusterName: seatunnel-cluster
+        storage.type: hdfs
+        fs.defaultFS: file:///
+```
+
+如果您使用 OSS,可以像这样配置:
+
+```yaml
+map:
+  engine*:
+    map-store:
+      enabled: true
+      initial-mode: EAGER
+      factory-class-name: org.apache.seatunnel.engine.server.persistence.FileMapStoreFactory
+      properties:
+        type: hdfs
+        namespace: /tmp/seatunnel/imap
+        clusterName: seatunnel-cluster
+        storage.type: oss
+        block.size: block size(bytes)
+        oss.bucket: oss://bucket name/
+        fs.oss.accessKeyId: OSS access key id
+        fs.oss.accessKeySecret: OSS access key secret
+        fs.oss.endpoint: OSS endpoint
+```
+
+注意:使用OSS 时,确保 lib目录下有这几个jar.
+
+```
+aliyun-sdk-oss-3.13.2.jar
+hadoop-aliyun-3.3.6.jar
+jdom2-2.0.6.jar
+netty-buffer-4.1.89.Final.jar
+netty-common-4.1.89.Final.jar
+seatunnel-hadoop3-3.1.4-uber.jar
+```
+
+
+
 ## 6. 启动 SeaTunnel Engine Master 节点
 
 可以通过守护进程使用 `-d` 参数启动。