-
Notifications
You must be signed in to change notification settings - Fork 83
feat: Toil engine for CWL #317
feat: Toil engine for CWL #317
Conversation
Looks like my attempt to resolve the merge conflict on Github broke the tests; I will have to pull down the branch and fix the merge. Any advice for how to deal with the CDK call argument count tests? My changes conflict with any changes in |
@@ -0,0 +1,20 @@ | |||
--- | |||
name: Demo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use DemoCwl or similar so that the deployed artifacts have naming that is easier to follow in testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will change this to be unique.
jobQueueArn: string; | ||
// This is actually a pattern that matches all ARNs for potentially relevant | ||
// definitions, since Toil makes its own definitions. | ||
toilJobArn: string; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
toilJobDefinitionArnPattern would be a more meaningful name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will rename this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, a similar change may want to be made to cromwellJobArn
, which is also a pattern for engine-managed job definitions:
amazon-genomics-cli/packages/cdk/lib/roles/cromwell-engine-role.ts
Lines 17 to 26 in 8bda674
const cromwellJobArn = Arn.format( | |
{ | |
account: Aws.ACCOUNT_ID, | |
region: Aws.REGION, | |
partition: Aws.PARTITION, | |
resource: "job-definition/*", | |
service: "batch", | |
}, | |
scope as Stack | |
); |
new PolicyStatement({ | ||
effect: Effect.ALLOW, | ||
actions: ["batch:DescribeJobDefinitions", "batch:ListJobs", "batch:DescribeJobs", "batch:DescribeJobQueues", "batch:DescribeComputeEnvironments"], | ||
resources: ["*"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think some of these can have a more limited scope. For example, you know the job queue so possibly batch:DescribeJobQueues
can be limited to that arn. You may also be able to limit batch:DescribeJobDefinitions
to the toilJobDeinitionArnPattern
. You may have to split the policies to do this but in general the number of actions with resources: ["*"]
should be as small as possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a copy-paste of the policy that Cromwell is using:
amazon-genomics-cli/packages/cdk/lib/roles/policies/cromwell-batch-policy.ts
Lines 13 to 17 in 8bda674
new PolicyStatement({ | |
effect: Effect.ALLOW, | |
actions: ["batch:DescribeJobDefinitions", "batch:ListJobs", "batch:DescribeJobs", "batch:DescribeJobQueues", "batch:DescribeComputeEnvironments"], | |
resources: ["*"], | |
}), |
Testing any change here is going to take something like an hour of turnaround time to tear down and redeploy the stack, so I'm very reluctant to undertake myself the project of determining what the minimum permission set to use Batch actually is.
I think instead of copy-pasting the code, I will try and base this off the Cromwell policy more directly, so that when the Cromwell permission set gets tightened, Toil will also be fixed.
new PolicyStatement({ | ||
effect: Effect.ALLOW, | ||
actions: ["sdb:*"], | ||
resources: ["*"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to know this arn and restrict to it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ARNs that Toil needs follow a pattern that I will restrict to.
statements: [ | ||
new PolicyStatement({ | ||
effect: Effect.ALLOW, | ||
actions: ["sdb:*"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably more comments on this later but why not use DynamoDB?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now Toil has code that requires SimpleDB. Now that S3 is strongly consistent, we have a project to remove the SimpleDB dependency and just use S3, but it is not yet complete, so currently Toil needs SimpleDB to work.
statements: [ | ||
new PolicyStatement({ | ||
effect: Effect.ALLOW, | ||
actions: ["s3:*"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is far too permissive. It could be used to do literally anything to a customers buckets. This must be limited to a pattern or single bucket.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Toil is going to make a buckets and a SimpleDB domains with names starting with toil-cwl-
for each workflow, for scratch. (I think we might also use toil-
if the workflow isn't a CWL workflow.) It also I believe uses a toil
SimpleDB domain.
We probably should be able to grant permissions for just those prefixes.
This will cause trouble if the user wants to specify their own --jobStore
and tell Toil to use a different bucket, but we don't necessarily have to support that.
|
||
return { | ||
...commonBatchProps, | ||
// We only use one Batch from the stack for the Toil jobs. The server |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only use one Batch compute environment and queue
|
||
export interface ToilEngineConstructProps extends EngineOptions { | ||
/** | ||
* AWS Batch JobQueue to use for running workflows. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it used for running workflows or workflow tasks (or both)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just for the tasks; I can note that.
|
||
export class ToilEngineConstruct extends EngineConstruct { | ||
public readonly engine: SecureService; | ||
public readonly adapterRole: IRole; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If Toil can speak WES without needing to use a Lambda why is there an adapterRole
and adapterLogGroup
? Perhaps these need renaming?
Is adapterRole
used at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left adapterRole
and adapterLogGroup
in under the impression that the type system required them to be there. Toil doesn't use them, and on looking at this again it seems like adapterRole
(now?) never makes it out to the base engine interface.
But EngineOutputs
always has an adapterLogGroup
right now, and the base EngineConstruct
defines a method that has to return an EngineOutputs
. I think to get rid of it I would need to not make ToilEngineConstruct
an EngineConstruct
, or else remove the field from the base EngineOutputs
and duplicate it and any code that uses it in all the other engines. Or else add a new intermediate AdaptedEngineOutputs
and AdaptedEngineConstruct
and change the other engines to use those.
TOIL_AWS_BATCH_JOB_ROLE_ARN: this.jobRole.roleArn, | ||
}); | ||
|
||
// TODO: Move log group creation into service construct and make it a property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anything blocking this from happening right now? If not either move it or remove the TODO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment came along with the other code I duplicated from the Cromwell engine.
I could probably jut drop it here. There won't be an adapter log group to move if I manage to defeat the type system and get rid of it.
// TODO: Move log group creation into service construct and make it a property | ||
this.engine = this.getEngineServiceDefinition(props.vpc, engineContainer, this.engineLogGroup); | ||
// This is unused because we have no adapter, but a log group is required. | ||
this.adapterLogGroup = new LogGroup(this, "AdapterLogGroup"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another way to think about the adapterLogGroup would be the LogGroup for WES logs. The name reflects the notion that all of our engines don't understand WES natively. Is there a way that WES related logging from Toil can go here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would involve finding some kind of Python logging
module backend that could send logs to a CloudWatch log group instead of standard error, and adding it to Toil, and configuring it to pick up messages from WES-related modules, I think. It seems like making the notion of an adapter optional would be a lot easier.
protected getOutputs(): EngineOutputs { | ||
return { | ||
accessLogGroup: this.apiProxy.accessLogGroup, | ||
adapterLogGroup: this.adapterLogGroup, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you cannot make an adapterLogGroup
from Toil then you could make this field of EngineOutputs
optional or alternatively make a super class that doesn't have this property?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, I suppose fields can be optional, maybe that is the right approach to take.
// their Docker images. | ||
engine := m.contextEnv.EngineName | ||
// Each engine has its own section in imageRefs, and for now we assume they | ||
// all care about a WES adapter image. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should assume this. Toils is an example of an engine with no WES Adapter image so rather than faking it we should handle the edge case where there is no WES adapter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can make it support passing the WES adapter for the engines that need it only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good. Some comments on improvements that could be made. Most important is paring back the s3:*
on *
permission. This won't fly with security or with account admins who need to install AGC.
We will also need some basic additions to the github pages docs. This really just means adding a toil.md
file in site/content/en/docs/Workflow engines
. You can follow the basic pattern of the other engine pages.
@@ -46,7 +46,7 @@ func TestManager_Deploy(t *testing.T) { | |||
mockClients.ssmMock.EXPECT().GetCustomTags().Return(testTags) | |||
mockClients.ecrClientMock.EXPECT().VerifyImageExists(environment.CommonImages["NEXTFLOW"]).Return(nil) | |||
clearContext := mockClients.cdkMock.EXPECT().ClearContext(filepath.Join(testHomeDir, ".agc/cdk/apps/context")).Return(nil) | |||
mockClients.cdkMock.EXPECT().DeployApp(filepath.Join(testHomeDir, ".agc/cdk/apps/context"), gomock.Len(42), testContextName3).After(clearContext).Return(mockClients.progressStream1, nil) | |||
mockClients.cdkMock.EXPECT().DeployApp(filepath.Join(testHomeDir, ".agc/cdk/apps/context"), gomock.Len(30), testContextName3).After(clearContext).Return(mockClients.progressStream1, nil) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this a constant so you don't have to change it multiple times
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do that.
@@ -125,7 +126,7 @@ func (o *initProjectOpts) validateProject() error { | |||
func BuildProjectInitCommand() *cobra.Command { | |||
vars := initProjectVars{} | |||
cmd := &cobra.Command{ | |||
Use: "init project_name --workflow-type {wdl|nextflow|snakemake}", | |||
Use: "init project_name --workflow-type {cwl|wdl|nextflow|snakemake}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor ux nit, consider putting in alphabetic order
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do that.
packages/engines/toil/Dockerfile
Outdated
# Add rabbitmq repository | ||
ADD rabbitmq.repo /etc/yum.repos.d/rabbitmq.repo | ||
|
||
# Sadly pre-importing keys doesn't seem to save any time whan we use yum later, so don't so it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/whan/when/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix this.
packages/engines/toil/README.md
Outdated
@@ -0,0 +1,48 @@ | |||
## Toil AWS Mirror | |||
|
|||
A Toil mono-container WES server for use with Amazon AGC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our marketing people prefer we use "Amazon Genomics CLI" in full in all user facing docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll call it that.
packages/engines/toil/README.md
Outdated
To push this to an Amazon ECR repo, where AGC can get at it, you can do something like: | ||
|
||
```bash | ||
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 318423852362.dkr.ecr.us-west-2.amazonaws.com |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recommend changing --region us-west-2
to --region <your-deployment-region>
and remove account numbers and replace with either <your-account-number>
or 123456789012
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can do this, although then I won't be able to copy-paste this and execute it anymore.
packages/engines/toil/README.md
Outdated
```bash | ||
aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 318423852362.dkr.ecr.us-west-2.amazonaws.com | ||
docker build -t adamnovak/toil-agc . | ||
docker tag adamnovak/toil-agc:latest 318423852362.dkr.ecr.us-west-2.amazonaws.com/adamnovak/toil-agc:latest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replace account numbers and regions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will do this.
33d9f46
to
216499e
Compare
c35cd8a
to
4a90087
Compare
@markjschreiber I've added some Toil documentation like you asked for. I generated the documentation and it looks right to me. |
As discussed at the meeting today, the AGC project is unwilling to take this without locking down the IAM actions granted to the Toil engine, beyond just restricting access to certain buckets/domains. I've added what I think is going to be close to the least-privilege set of permissions. I got these by looking at the full lists of possible permissions, and keeping the ones that appeared to correspond to Boto3 Client calls we make, or to Boto3 Resource objects or collections we access or things we do to those objects and collections. I still need to:
I don't think we'll be able to guarantee actual minimality with a reasonable time investment. We'd need to put together a workflow set that exercises every Boto3 call in Toil's job store, and try it once per permission, with that permission left out. Or we'd need to inventory every call into Boto3, trace them through the Boto3 Resource code to the API calls that will be made (accounting for any laziness), and then determine what IAM actions those API calls need allowed, either from the documentation or by experiment. For example, Toil at one point says:
Presumably we need It was mentioned that CloudWatch can log API operations, but I'm not sure that's the same as logging IAM action checks hit. |
5e276c4
to
c017154
Compare
OK, I had to make some changes to the permission set I initially proposed, but I've now tested this and it can successfully run the example |
A rebased version of this PR has been merged to HEAD. |
Issue #, if available: N/A
Description of Changes
This PR contains almost the minimum changes necessary to add Toil as an engine to run CWL.
It also adds the ability to override the Docker image names used when deploying a context (because this is necessary to actually use the Toil engine with Docker images with names in one's own namespace), and the ability to pass only relevant Docker images to the CDK (because otherwise the CDK commands grow with each engine and each new engine has to touch the part of the tests that counts how long the commands are).
Description of how you validated changes
I can't actually run this as submitted, because it doesn't have #231.
You can't actually read the logs of the successful workflow out, because pulling logs over WES is a separate feature (partly blocked by #314), and I'm going to include what I have for that in a separate PR.
But I tested commit 822e51a and I was able to run the included hello world workflow.
Checklist
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license