- Different TaskRoles compose a Framework
- Same Tasks compose a TaskRole
- A User Service executed by all Tasks in a TaskRole
-
Prepare Framework
-
Upload Framework Executable to HDFS
Upload the Example Framework Executable to HDFS:
hadoop fs -mkdir -p /ExampleFramework/ hadoop fs -put -f ExampleFramework.sh /ExampleFramework/
-
Write Framework Description File
Just use the Example Framework Description File.
Example Framework Description Explanation: • The Example Framework with Version 1 contains 1 TaskRole named LRSMaster. • LRSMaster contains 2 Tasks and they will be executed for LRSMaster's TaskService. • LRSMaster's TaskService with Version 1 is defined by its EntryPoint, SourceLocations and Resource. • The EntryPoint and SourceLocations defines the Service's corresponding Executable which needs to be ran inside Containers. • The Resource defines the Container Resource Guarantee / Limitation.
-
-
Launch Framework
Launcher Service need to be started before Launch Framework.
See README to Start Launcher Service
See Root URI to get {LauncherAddress}
HTTP PUT the Framework Description File as json to:
http://{LauncherAddress}/v1/Frameworks/ExampleFramework
For example, with curl, you can execute below cmd line:
curl -X PUT http://{LauncherAddress}/v1/Frameworks/ExampleFramework -d @ExampleFramework.json --header "Content-Type: application/json"
-
Monitor Framework
Check all the Requested Frameworks by:
http://{LauncherAddress}/v1/Frameworks
Check ExampleFramework by:
http://{LauncherAddress}/v1/Frameworks/ExampleFramework
LauncherInterfaces:
- RestAPI
- Submit Framework Description
LauncherService:
- One Central Service
- Manages all Frameworks for the whole Cluster.
LauncherAM:
- Per-Framework Service
- Manage Tasks for a single Framework by customized feature requirement
Launcher Service can be configured by LauncherConfiguration. You can check the Type, Specification and FeatureUsage inside it.
And we also provide a default configuration for you to refer: Default LauncherConfiguration File.
- All APIs are IDEMPOTENT and STATELESS, to allowed trivial Work Preserving Client Restart. In other words, User do not need to worry about call one API multiple times by different Client instance (such as Client Restart, etc).
- All APIs are DISTRIBUTED THREAD SAFE, to allow multiple distributed Client instances to access. In other words, User do not need to worry about call them at the same time in Multiple Threads/Processes/Nodes.
- LauncherService can only handle a finite, limited request volume. User should try to minimize its overall request frequency and payload, so that the LauncherService is not overloaded. To achieve this, User can centralize requests, space out requests, filter respond and so on.
- Completed Frameworks will ONLY be retained in recent FrameworkCompletedRetainSec, in case Client miss to delete the Framework after FrameworkCompleted. One exclusion is the Framework Launched by DataDeployment, it will be retained until the corresponding FrameworkDescriptionFile deleted in the DataDeployment. To avoid missing the CompletedFrameworkStatus, the polling interval seconds of Client should be less than FrameworkCompletedRetainSec. Check the FrameworkCompletedRetainSec by GET LauncherStatus.
Configure it as webServerAddress inside LauncherConfiguration File.
- Refer Data Model for the Type of HTTP Request and Response.
Headers | Description |
---|---|
UserName | Specifies which User send the Request. It is effective iff webServerAclEnable is true, see Framework ACL. |
Request
PUT /v1/Frameworks/{FrameworkName}
Type: application/json
Body: FrameworkDescriptor
Description
Add a NOT Requested Framework or Update a Requested Framework.
- Add a NOT Requested Framework: Framework will be Added and Launched (Now it is Requested).
- Update a Requested Framework:
- If FrameworkVersion unchanged:
- Framework will be Updated to the FrameworkDescription on the fly (i.e. Work Preserving).
- To Update Framework on the fly, it is better to use the corresponding PartialUpdate (such as PUT TaskNumber) than PUT the entire FrameworkDescription here. Because, partially update the FrameworkDescription can avoid the Race Condition (or Transaction Conflict) between two PUT Requests. Besides, the behaviour is undefined when change parameters in FrameworkDescription which is not supported by PartialUpdate.
- Else:
- Framework will be NonRolling Upgraded to new FrameworkVersion. (i.e. Not Work Preserving).
- NonRolling Upgrade can be used to change parameters in FrameworkDescription which is not supported by PartialUpdate (such as Framework Queue).
- NonRolling Upgrade should be triggered by change FrameworkVersion, instead of DELETE then PUT with the same FrameworkVersion.
- If FrameworkVersion unchanged:
- User is responsible and free to specify the FrameworkName of the Framework, however, the FrameworkName should respect the Framework ACL.
- After Accepted Response, its corresponding Status (such as FrameworkStatus and AggregatedFrameworkStatus) exists immediately, too. However, the Status may not be updated according to the Request (FrameworkDescriptor) immediately. So, to check whether it has been updated, Client still needs to poll the GET Status APIs.
Response
HttpStatusCode | Body | Description |
---|---|---|
Accepted(202) | NULL | The Request has been recorded for backend to process, not that the processing of the Request has been completed. |
BadRequest(400) | ExceptionMessage | The Request validation failed. So, Client is expected to not retry for this non-transient failure and then correct the Request. |
Forbidden(403) | ExceptionMessage | The Request authorization failed. So, Client is expected to not retry for this non-transient failure and then correct the Request or ask Administrator to grant the Request privilege. This Response may happen only if webServerAclEnable is true, see Framework ACL. |
TooManyRequests(429) | ExceptionMessage | The Request is rejected due to the New Total TaskNumber will exceed the Max Total TaskNumber if backend accepted it. So, the Client is expected to retry for this transient failure or migrate the whole Framework to another Cluster. |
ServiceUnavailable(503) | ExceptionMessage | The Request cannot be recorded for backend to process. In our system, this only happens when target Cluster's Zookeeper is down for a long time. So, the Client is expected to retry for this transient failure or migrate the whole Framework to another Cluster. |
Request
DELETE /v1/Frameworks/{FrameworkName}
Description
Delete a Framework, no matter it is Requested or not.
Notes:
- Framework will be Stopped and Deleted (Now it is NOT Requested).
- After Accepted Response, its corresponding Status does not exist immediately, too.
- Only recently completed Frameworks will be kept, if Client miss to DELETE the Framework after FrameworkCompleted. One exclusion is the Framework Launched by DataDeployment, it will be kept until the corresponding FrameworkDescriptionFile deleted in the DataDeployment.
Response
HttpStatusCode | Body | Description |
---|---|---|
Accepted(202) | NULL | Same as PUT Framework |
BadRequest(400) | ExceptionMessage | Same as PUT Framework |
Forbidden(403) | ExceptionMessage | Same as PUT Framework |
ServiceUnavailable(503) | ExceptionMessage | Same as PUT Framework |
Request
GET /v1/Frameworks/{FrameworkName}/FrameworkStatus
Description
Get the FrameworkStatus of a Requested Framework
Recipes:
- User Level RetryPolicy (Based on FrameworkState, ApplicationExitCode, ApplicationDiagnostic, ApplicationExitType)
- Directly Monitor Underlay YARN Application by YARN CLI or RestAPI (Based on ApplicationId or ApplicationTrackingUrl)
Response
HttpStatusCode | Body | Description |
---|---|---|
OK(200) | FrameworkStatus | |
NotFound(404) | ExceptionMessage | Specified Framework has not been Requested yet. So, Client is expected to not retry for this non-transient failure and then PUT the corresponding Framework first. |
ServiceUnavailable(503) | ExceptionMessage | Same as PUT Framework |
Request
PUT /v1/Frameworks/{FrameworkName}/TaskRoles/{TaskRoleName}/TaskNumber
Type: application/json
Body: UpdateTaskNumberRequest
Description
Update TaskNumber for a Requested Framework
Response
HttpStatusCode | Body | Description |
---|---|---|
Accepted(202) | NULL | Same as PUT Framework |
BadRequest(400) | ExceptionMessage | Same as PUT Framework |
Forbidden(403) | ExceptionMessage | Same as PUT Framework |
NotFound(404) | ExceptionMessage | Same as GET FrameworkStatus |
TooManyRequests(429) | ExceptionMessage | Same as PUT Framework |
ServiceUnavailable(503) | ExceptionMessage | Same as PUT Framework |
Request
PUT /v1/Frameworks/{FrameworkName}/ExecutionType
Type: application/json
Body: UpdateExecutionTypeRequest
Description
Update ExecutionType for a Requested Framework
Notes:
- It is ignored to change Framework ExecutionType from STOP to START, if previous STOP has already caused the Framework of current FrameworkVersion entered FINAL_STATES, i.e. FRAMEWORK_COMPLETED. So, to ensure starting the Framework again after STOP, just change the FrameworkVersion instead.
Response
HttpStatusCode | Body | Description |
---|---|---|
Accepted(202) | NULL | Same as PUT Framework |
BadRequest(400) | ExceptionMessage | Same as PUT Framework |
Forbidden(403) | ExceptionMessage | Same as PUT Framework |
NotFound(404) | ExceptionMessage | Same as GET FrameworkStatus |
ServiceUnavailable(503) | ExceptionMessage | Same as PUT Framework |
Request
PUT /v1/Frameworks/{FrameworkName}/MigrateTasks/{ContainerId}
Type: application/json
Body: MigrateTaskRequest
Description
Migrate a Task from current Container to another Container for a Requested Framework And new Container and old Container will satisfy the AntiAffinityLevel constraint.
Notes:
- User is responsible for implement Health/Perf Measurement of the Service based on Monitoring TaskStatuses or self-contained communication. And if found some Health/Perf degradations, User can migrate it by calling this API with corresponding ContainerId as parameter.
- Currently, only support Any AntiAffinityLevel.
Response
HttpStatusCode | Body | Description |
---|---|---|
Accepted(202) | NULL | Same as PUT Framework |
BadRequest(400) | ExceptionMessage | Same as PUT Framework |
Forbidden(403) | ExceptionMessage | Same as PUT Framework |
NotFound(404) | ExceptionMessage | Same as GET FrameworkStatus |
ServiceUnavailable(503) | ExceptionMessage | Same as PUT Framework |
Request
PUT /v1/Frameworks/{FrameworkName}/ApplicationProgress
Type: application/json
Body: OverrideApplicationProgressRequest
Description
Update ApplicationProgress for a Requested Framework
Notes:
- If User does not call this API. Default ApplicationProgress is used, and it is calculated as CompletedTaskCount / TotalTaskCount.
- User is responsible for implement Progress Measurement of the Service based on Monitoring Task logs or self-contained communication. And then feedback the Progress by calling this API to Override the default ApplicationProgress.
Response
HttpStatusCode | Body | Description |
---|---|---|
Accepted(202) | NULL | Same as PUT Framework |
BadRequest(400) | ExceptionMessage | Same as PUT Framework |
Forbidden(403) | ExceptionMessage | Same as PUT Framework |
NotFound(404) | ExceptionMessage | Same as GET FrameworkStatus |
ServiceUnavailable(503) | ExceptionMessage | Same as PUT Framework |
Request
GET /v1/Frameworks/{FrameworkName}
Description
Get the FrameworkInfo of a Requested Framework
FrameworkInfo = SummarizedFrameworkInfo + AggregatedFrameworkRequest + AggregatedFrameworkStatus
Response
HttpStatusCode | Body | Description |
---|---|---|
OK(200) | FrameworkInfo | |
NotFound(404) | ExceptionMessage | Same as GET FrameworkStatus |
ServiceUnavailable(503) | ExceptionMessage | Same as PUT Framework |
Request
GET /v1/Frameworks
QueryParameter | Description |
---|---|
UserName | Filter the result to only return Frameworks whose UserName equals the given value. |
Description
Get the SummarizedFrameworkInfos of all Requested Frameworks
A Framework's SummarizedFrameworkInfo consists selected fields from its Status and Request
Response
HttpStatusCode | Body | Description |
---|---|---|
OK(200) | SummarizedFrameworkInfos | |
BadRequest(400) | ExceptionMessage | Same as PUT Framework |
Forbidden(403) | ExceptionMessage | Same as PUT Framework |
ServiceUnavailable(503) | ExceptionMessage | Same as PUT Framework |
Request
GET /v1/Frameworks/{FrameworkName}/AggregatedFrameworkStatus
Description
Get the AggregatedFrameworkStatus of a Requested Framework
AggregatedFrameworkStatus = FrameworkStatus + all TaskRoles' (TaskRoleStatus + TaskStatuses)
TaskStatuses Recipes:
- ServiceDecovery (Based on TaskRoleName, ContainerHostName, ContainerIPAddress, ServiceId)
- TaskLogForwarding (Based on ContainerLogHttpAddress)
- MasterSlave and MigrateTask (Based on ContainerId)
- DataPartition (Based on TaskIndex) (Note TaskIndex will not change after Task Restart, Migrated or Upgraded)
Response
HttpStatusCode | Body | Description |
---|---|---|
OK(200) | AggregatedFrameworkStatus | |
NotFound(404) | ExceptionMessage | Same as GET FrameworkStatus |
ServiceUnavailable(503) | ExceptionMessage | Same as PUT Framework |
Request
GET /v1/Frameworks/{FrameworkName}/FrameworkRequest
Description
Get the FrameworkRequest of a Requested Framework
Current FrameworkDescriptor for the Framework is included in FrameworkRequest and it can reflect latest updates.
Response
HttpStatusCode | Body | Description |
---|---|---|
OK(200) | FrameworkRequest | |
NotFound(404) | ExceptionMessage | Same as GET FrameworkStatus |
ServiceUnavailable(503) | ExceptionMessage | Same as PUT Framework |
Request
GET /v1/Frameworks/{FrameworkName}/AggregatedFrameworkRequest
Description
Get the AggregatedFrameworkRequest of a Requested Framework
AggregatedFrameworkRequest = FrameworkRequest + all other feedback Request
Response
HttpStatusCode | Body | Description |
---|---|---|
OK(200) | AggregatedFrameworkRequest | |
NotFound(404) | ExceptionMessage | Same as GET FrameworkStatus |
ServiceUnavailable(503) | ExceptionMessage | Same as PUT Framework |
Request
GET /v1/LauncherRequest
Description
Get the LauncherRequest
Response
HttpStatusCode | Body | Description |
---|---|---|
OK(200) | LauncherRequest | |
ServiceUnavailable(503) | ExceptionMessage | Same as PUT Framework |
Request
GET /v1/LauncherStatus
Description
Get the LauncherStatus
Current LauncherConfiguration is included in LauncherStatus and it can reflect latest updates.
Response
HttpStatusCode | Body | Description |
---|---|---|
OK(200) | LauncherStatus | |
ServiceUnavailable(503) | ExceptionMessage | Same as PUT Framework |
Request
PUT /v1/LauncherRequest/ClusterConfiguration
Type: application/json
Body: ClusterConfiguration
Description
Update the ClusterConfiguration for all Frameworks on the fly
Besides the cluster information provided by YARN, Administrator can use this API to provide external information about current cluster configuration, which helps Launcher to schedule Task based on that. And below features depend on it:
- taskGpuType
Response
HttpStatusCode | Body | Description |
---|---|---|
Accepted(202) | NULL | Same as PUT Framework |
BadRequest(400) | ExceptionMessage | Same as PUT Framework |
Forbidden(403) | ExceptionMessage | Same as PUT Framework |
ServiceUnavailable(503) | ExceptionMessage | Same as PUT Framework |
Request
GET /v1/LauncherRequest/ClusterConfiguration
Description
Get the ClusterConfiguration
Response
HttpStatusCode | Body | Description |
---|---|---|
OK(200) | ClusterConfiguration | |
ServiceUnavailable(503) | ExceptionMessage | Same as PUT Framework |
Request
PUT /v1/LauncherRequest/AclConfiguration
Type: application/json
Body: AclConfiguration
Description
Update the AclConfiguration
It takes effects immediately after the Response iff webServerAclEnable is true, see Framework ACL.
Response
HttpStatusCode | Body | Description |
---|---|---|
OK(200) | NULL | |
BadRequest(400) | ExceptionMessage | Same as PUT Framework |
Forbidden(403) | ExceptionMessage | Same as PUT Framework |
ServiceUnavailable(503) | ExceptionMessage | Same as PUT Framework |
Request
GET /v1/LauncherRequest/AclConfiguration
Description
Get the AclConfiguration
Response
HttpStatusCode | Body | Description |
---|---|---|
OK(200) | AclConfiguration | |
ServiceUnavailable(503) | ExceptionMessage | Same as PUT Framework |
You can check the Type, Specification and FeatureUsage inside Launcher Data Model:
../src/main/java/com/microsoft/frameworklauncher/common/model/*
For example:
A Framework is Defined and Requested by FrameworkDescriptor data structure. To find the feature usage inside FrameworkDescriptor, you can refer the comment inside FrameworkDescriptor.
Launcher sets up below EnvironmentVariables for each User Service to use:
- Used to locate itself during the whole Framework life cycle. So, they will not be changed after Migration or Restart.
EnvironmentVariable | Description |
---|---|
LAUNCHER_ADDRESS | |
FRAMEWORK_NAME | |
FRAMEWORK_VERSION | |
TASKROLE_NAME | |
TASK_INDEX | |
SERVICE_NAME | |
SERVICE_VERSION |
- Used to locate itself during a specific execution attempt. So, they will be changed to a different one after Migration or Restart.
EnvironmentVariable | Description |
---|---|
APP_ID | Framework's current associated Application ID. |
CONTAINER_ID | Task's current associated Container ID. |
- Used to get the assigned Resource, only set when corresponding feature is enabled.
EnvironmentVariable | Description |
---|---|
CONTAINER_IP | Only set when generateInstanceHostList is enabled. |
CONTAINER_GPUS | Only set when gpuNumber is greater than 0. It is a Number, each bit of this number represents a Gpu. for example, 3 represents gpu0 and gpu1 |
CONTAINER_PORTS | Only set when portDefinitions is set, the format is: portLabel1:port1,port2,port3;portLabel2:port4,port5,port6 |
You can check the all the defined ExitStatus by: ExitType, RetryPolicyDescriptor, RetryPolicyState, ExitDiagnostics.
Recipes:
- Your LauncherClient can depend on the ExitStatus Convention
- If your Service failed, the Service can optionally return the ExitCode of USER_APP_TRANSIENT_ERROR and USER_APP_NON_TRANSIENT_ERROR to help FancyRetryPolicy to identify your Service's TRANSIENT_NORMAL and NON_TRANSIENT ExitType. If neither ExitCode is returned, the Service is considered to exit due to UNKNOWN ExitType.
Framework ACL specifies which Users/Groups are able to access a specific FrameworkName, no matter the FrameworkName exists or not. So, essentially, it is the ACL for Users/Groups against FrameworkName's Namespace.
Framework ACL helps to:
- Avoid one User/Group to occupy the FrameworkName (by Add Framework) reserved for other User/Group.
- Avoid one User/Group to modify the Framework (by Update Framework) launched by other User/Group.
- No naming conflict among UserNames and GroupNames.
- UserNames and GroupNames satisfy regex ^[A-Za-z0-9\-._]{1,254}$.
Framework ACL is enabled iff the webServerAclEnable is true, check it by GET LauncherStatus.
- If it is disabled, any User/Group can Read and Write the whole namespace. We assume it is enabled for simplicity below.
- Administrator can always Read and Write the whole namespace, this fact will be omitted for simplicity below.
Namespace Privilege:
Namespace Write Privilege: Add or Update Framework in the Namespace
Namespace Read Privilege: Get Framework in the Namespace
Namespace Mechanism:
{Namespace}~(AnyName)
- It is pre-created, so no need to create the namespace in advance.
- All Users/Groups can Read the namespace.
- Initially, only the User named {Namespace} can Write the namespace. To grant the Write Privilege to more Users, see PUT AclConfiguration.
- Based on the Assumption and Namespace Mechanism, the suggested Usage Pattern for Users/Groups can be derived below.
User Usage Pattern:
The private namespace for the User named {UserName} is:
{UserName}~(AnyName)
- It is pre-created, so no need to create the namespace in advance.
- Only this User can Write the namespace. However, all other Users/Groups can Read the namespace.
For example, User UA can fully control the namespace:
UA~(AnyName)
Group Usage Pattern:
The private namespace for the Group named {GroupName} is:
The shared namespace for the Users belongs to {GroupName} is:
{GroupName}~(AnyName)
- It is pre-created, so no need to create the namespace in advance.
- Initially, no User can Write the namespace. Administrator needs to add the UserNames belongs to the {GroupName} to the namespace {GroupName}, see PUT AclConfiguration. Only then, these Users can Write the namespace. However, all other Users/Groups can Read the namespace.
For example, Administrator adds User UA and UB belongs to Group GA to the namespace GA, and then UA and UB can work together to fully control the namespace:
GA~(AnyName)
Launcher does not enforce Users/Groups how to further partition his private namespace or how to avoid naming conflict within his private namespace. Different Users/Groups might choose to use his private namespace differently, depending on their exact requirement, scenario and assumption.
So, here, we just provide the best practices for two scenarios:
Batch Framework:
Batch Framework User tends to Add a new Framework each time instead of Update the existing one.
So, he should ensure the name is not reused within his private namespace each time to call PUT Framework.
Service Framework:
Service Framework User tends to frequently Update the existing Framework instead of Add a new one. And he tends to specify a well-known name which he might want to expose to the Users of the Service itself.
So, he should ensure the name is reused within his small set well-known Services each time to call PUT Framework.
Anyway, If he really want to ensure to Add a new one, he needs to check his small set well-known Services, and then pick a new one.
-
The Initial Working Directory of your EntryPoint is the root directory of the EntryPoint. Your Service can read data anywhere, however it can ONLY write data under the Initial Working Directory with the Service Directory excluded. And if the Source is a ZIP file, it will be uncompressed before starting your Service. For example:
EntryPoint=HbaseRS.zip/start.bat SourceLocations=hdfs:///HbaseRS.zip, hdfs:///HbaseCom <- HbaseRS.zip is a ZIP file
The two Sources HbaseRS.zip and HbaseCom will be downloaded (and uncompressed) to local machine as below structure:
./ <- The Initial Working Directory ├─HbaseRS.zip <- Service Directory <- HbaseRS.zip is a directory uncompressed from original ZIP file └─HbaseCom <- Service Directory
-
Launcher will not restart the succeeded Task (i.e. the process started by EntryPoint ends with exit code 0) in any RetryPolicy. So, if you want to always restart Service on the same machine irrespective of its exit code, you need to warp the original EntryPoint by another script, such as:
while true; do # call the original EntryPoint done
-
Increase the replication number your data and binary on target HDFS (Higher ReplicationNumber means faster downloading, higher availability and higher durability).
hadoop fs -setrep -w <ReplicationNumber> <HDFS Path>
-
Do not modify your data and binary on target HDFS. To use new data and binary, upload them to a different HDFS Path and then change the FrameworkVersion and SourceLocations.