The Amazon Genomics CLI (`agc`) allows users to orchestrate workflow execution using AWS Batch. See the Workbench documentation for information on installing and using the `agc` to configure and run workflows. The following section provides additional information on deploying a project using the `agc`.
Once you have installed and authenticated with the `agc`, you can deploy a context using an `agc` project YAML file. This file must be named `agc-project.yaml`.
An example `agc-project.yaml` file that has the workflow, reference data source, and both on-demand and spot contexts configured using Cromwell as the engine is provided here. This will create an `agc` project named `humanwgsAGC`, with either (or both) a `spotContext` or an `onDemandContext`. The `spotContext` will allow you to run workflows using AWS spot instances, which can result in substantial cost savings relative to using on-demand instances.
Note that deploying a context will incur costs even if you are not actively running workflows; ensure that contexts that are not in use are destroyed to avoid incurring ongoing costs.
To deploy the `agc` project using the template file, first copy the template file to a file named `agc-project.yaml` (`cp agc-project.template.yaml agc-project.yaml`).
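For orientation, the sketch below shows the general shape of such a project file, assuming the structure described above (project name `humanwgsAGC`, Cromwell as the WDL engine, spot and on-demand contexts). The `sourceURL` value is a placeholder; the provided template file remains the authoritative version.

```yaml
name: humanwgsAGC
schemaVersion: 1
data:
  - location: s3://dnastack-resources
    readOnly: true
workflows:
  humanwgs:
    type:
      language: wdl
      version: 1.0
    sourceURL: <path or URL to the workflow>  # placeholder; see the template
contexts:
  onDemandContext:
    engines:
      - type: wdl
        engine: cromwell
  spotContext:
    requestSpotInstances: true
    engines:
      - type: wdl
        engine: cromwell
```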
In the data
section of the agc-project.yaml
file, add any additional s3 buckets that the workflow will require access to, for example the bucket containing sample input data. Make sure that you do not remove the section granting access to the s3://dnastack-resources bucket; this is where reference datasets are hosted.
data:
- location: s3://dnastack-resources
readOnly: true
- location: s3://<sample_data_bucket_name>
readOnly: true
Then, from the directory containing the `agc-project.yaml` file, run:

agc context deploy --context ${context}

Where `${context}` is either `spotContext` or `onDemandContext`.
If you want both spot and on-demand contexts, all contexts can be deployed at once by running:
agc context deploy --all
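Since deployed contexts incur costs even when idle (see the note above), it is worth tearing down contexts you are not using. A sketch, assuming `destroy` mirrors the `--context`/`--all` flags shown for `deploy` (check `agc context destroy --help` if your version differs):

```bash
# Destroy a single idle context (spotContext or onDemandContext)
agc context destroy --context ${context}

# Or destroy all deployed contexts in the project at once
agc context destroy --all
```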
Note that the `miniwdl` engine run via AWS is currently not supported for this workflow.
See the resource requirements section for information on the minimum requirements for running the workflow. Typically, in a new AWS environment, additional vCPU quota will be required.
- Navigate to the AWS console.
- In the top right corner, select the region where your `agc` deployment is located.
- Navigate to EC2.
- In the menu on the left, select 'Limits'.
- Filter the limits by searching for "Standard". The current limit field indicates the number of vCPUs that you currently have access to.
- Spot instance limit: `All Standard (A, C, D, H, I, M, R, T, Z) Spot Instance Requests`
- On-demand instance limit: `Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances`
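If you prefer the command line, the same limits can be read via the Service Quotas API. A sketch, assuming the standard EC2 quota codes for these two limits (`L-34B43A08` for the spot limit, `L-1216C47A` for the on-demand limit):

```bash
# Current vCPU quota for "All Standard ... Spot Instance Requests"
aws service-quotas get-service-quota \
  --service-code ec2 --quota-code L-34B43A08 \
  --query "Quota.Value" --region <region>

# Current vCPU quota for "Running On-Demand Standard ... instances"
aws service-quotas get-service-quota \
  --service-code ec2 --quota-code L-1216C47A \
  --query "Quota.Value" --region <region>
```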
If the number of vCPUs available to the context you plan to run the workflow in is less than the limits specified in the resource requirements section, you will need to request additional quota before you can run the workflow.
- Continuing from the steps outlined in checking the current quota, select the service you want to request an increase for.
- In the top right corner, select 'Request limit increase'.
- Fill out the appropriate fields in the request form, ensuring that the region you select is the region where you have deployed your `agc` and where your data is located.

256 vCPUs are recommended for running trio data.
Low quota increase requests are typically fulfilled within one to two hours.
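A quota increase can also be requested through the same Service Quotas API rather than the console form. A sketch, reusing the on-demand quota code assumed above:

```bash
# Request 256 vCPUs of on-demand Standard-instance quota
# (the amount recommended above for running trio data)
aws service-quotas request-service-quota-increase \
  --service-code ec2 --quota-code L-1216C47A \
  --desired-value 256 --region <region>
```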
Fill out any information missing in the inputs file. Ensure that all data files used by the workflow are at locations that have been configured in the `agc-project.yaml` file; see the section on granting access to other data files for more information.
See the inputs section of the main README for more information on the structure of the `inputs.json` file. Note that you only need to fill out the `queueArn` corresponding to the context you are submitting the workflow to (spot or on-demand).
To determine available zones in AWS, look for the `ZoneName` attribute output by the following command:

aws ec2 describe-availability-zones --region <region>

For example, the zones in region us-east-2 are "us-east-2a us-east-2b us-east-2c".
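To pull out just the zone names, the CLI's `--query` flag can filter the response; for example:

```bash
# Print only the ZoneName values for a region (us-east-2 shown here)
aws ec2 describe-availability-zones --region us-east-2 \
  --query "AvailabilityZones[].ZoneName" --output text
```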
Note that if you are using a `miniwdl` engine, you can skip these steps; workflows run via miniwdl will run exclusively in the job queue to which they are submitted.
- Visit the AWS console.
- Navigate to the Batch service.
- In the lefthand sidebar, select "Compute environments". Note the name of the compute environment with the provisioning model SPOT (if you have deployed a context using spot instances) and the name of the compute environment with provisioning model "EC2" (if you have deployed a context that does not use spot instances).
- In the lefthand sidebar, select "Job queues".
- Clicking into an individual queue will show information about the compute environment ("Compute environment order"). Identify the job queue with the compute environment name that matches the name you identified for the SPOT compute environment; copy the Amazon Resource Name (ARN) for this job queue. This is the value that should be used for the `aws_spot_queue_arn`. Repeat this process to find the ARN for the `aws_on_demand_queue_arn`.
- If `preemptible = true`, only the `aws_spot_queue_arn` is required.
- If `preemptible = false`, only the `aws_on_demand_queue_arn` is required.
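As an alternative to clicking through the console, the job queue ARNs and their attached compute environments can be listed with the AWS CLI, which makes it easier to match the SPOT and EC2 queues to the right input. For example:

```bash
# List each job queue's name, ARN, and attached compute environments
aws batch describe-job-queues --region <region> \
  --query "jobQueues[].{name:jobQueueName,arn:jobQueueArn,computeEnvs:computeEnvironmentOrder[].computeEnvironment}"
```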
From the directory where your `agc-project.yaml` is located, run:

agc workflow run humanwgs --context <context> --inputsFile <input_file_path.json>

The running workflow can be monitored via `agc workflow` commands, or via the AWS console.
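For example, a run can be checked from the same directory; the flags below are assumed to mirror the `--context` flag used by `agc workflow run` (consult `agc workflow status --help` and `agc logs workflow --help` if your `agc` version differs):

```bash
# Show the status of recent workflow runs in the chosen context
agc workflow status --context <context>

# Stream the logs for a specific run of the workflow
agc logs workflow humanwgs --run <run_id>
```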
AWS reference data is hosted in the `us-west-2` region in the bucket `s3://dnastack-resources`.

To use the AWS reference data, add the following to the `data` section of your `agc-project.yaml`:
data:
- location: s3://dnastack-resources
readOnly: true
The AWS input file template has paths to the reference files in S3 prefilled. The template `agc-project.template.yaml` file has this section filled out already.
S3 buckets other than the reference bucket can be accessed by adding additional data blocks to the `agc-project.yaml` file. See the `agc` documentation for more details on adding additional data sources. All inputs referenced in the `inputs.json` file will need to be at locations that have been configured in the `agc-project.yaml`.