## Overview

Originally, FATE used EggRoll as its underlying computing engine; the following picture illustrates the overall architecture.

As the figure above shows, EggRoll provides both computing and storage resources. However, the architecture differs slightly when a different backend is used.

In FATE v1.5.0, a user can select Spark as the underlying computing engine. However, Spark itself is an in-memory computing engine without the ability to persist data, so to use FATE on Spark, HDFS also needs to be deployed to provide persistence. For example, a user needs to upload their data to HDFS through FATE before further processing, and the output data of every component is also stored in HDFS.
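As a minimal sketch of such an upload, the configuration below uses the commonly documented upload fields (`file`, `head`, `partition`, `table_name`, `namespace`) plus a `storage_engine` set to `HDFS`; the exact field names and values are assumptions based on typical FATE upload examples and may differ between versions:

```json
{
  "file": "examples/data/breast_hetero_guest.csv",
  "head": 1,
  "partition": 8,
  "table_name": "breast_hetero_guest",
  "namespace": "experiment",
  "storage_engine": "HDFS"
}
```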

Currently, the verified Spark version is 2.4.1 and the verified Hadoop version is 2.7.4.

The following picture shows the architecture of FATE on Spark:

In the current implementation, the fate_flow service uses the spark-submit binary to submit jobs to the Spark cluster. Within the configuration of a FATE job, a user can also specify the configuration for the Spark application. Here is an example:

```json
{
  "initiator": {
    "role": "guest",
    "party_id": 10000
  },
  "job_parameters": {
    "spark_run": {
      "executor-memory": "4G",
      "total-executor-cores": 4
    },
    ...
  }
}
```

The above configuration limits the maximum memory and the number of cores that can be used by the executors. For more about the supported "spark_run" parameters, please refer to this page.
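For illustration only, a "spark_run" section could also carry other options that correspond to spark-submit flags; the extra keys below (`num-executors`, `executor-cores`, `driver-memory`) are assumptions and should be checked against the supported-parameter list for the deployed FATE version:

```json
{
  "job_parameters": {
    "spark_run": {
      "num-executors": 2,
      "executor-cores": 2,
      "executor-memory": "4G",
      "driver-memory": "2G"
    }
  }
}
```

Conceptually, each key/value pair is forwarded to spark-submit as the corresponding command-line option (for example, `"executor-memory": "4G"` would become `--executor-memory 4G`).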