Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add EMR runtime for flink operator blueprint #485

Merged
merged 25 commits into from
Apr 25, 2024

Conversation

mithun008
Copy link
Contributor

What does this PR do?

🛑 Please open an issue first to discuss any significant work and flesh out details/direction - we would hate for your time to be wasted.
Consult the CONTRIBUTING guide for submitting pull-requests.

Added blue print to support EMR Flink operator on EKS.

Motivation

EMR Flink operator enhances the open source Flink operator by addition AWS specific capabilities for scaling, logging and monitoring.

More

  • [ x] Yes, I have tested the PR using my local account setup (Provide any test evidence report under Additional Notes)
  • [ x] Mandatory for new blueprints. Yes, I have added a example to support my blueprint PR
  • [ x] Mandatory for new blueprints. Yes, I have updated the website/docs or website/blog section for this feature
  • [ x] Yes, I ran pre-commit run -a with this PR. Link for installing pre-commit locally

For Moderators

  • E2E Test successfully complete before merge?

Additional Notes

|------|---------|
| <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >= 1.0.0 |
| <a name="requirement_aws"></a> [aws](#requirement\_aws) | >= 3.72 |
| <a name="requirement_helm"></a> [helm](#requirement\_helm) | >= 2.4.1 |

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to use Helm >= v3.8.0 where OCI support is enabled by default. https://helm.sh/docs/topics/registries/

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the Terraform provider version and the latest version is 2.13.0 https://registry.terraform.io/providers/hashicorp/helm/latest/docs/guides/v2-upgrade-guide and it uses latest Helm version 3.13.2 https://github.com/hashicorp/terraform-provider-helm/blob/8d8f0503c39f9fb2d718805e5a619ef56728271b/go.mod#L13

@mithun008 Upgrade this to 2.13 in versions.tf file and rerun it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

enable_aws_load_balancer_controller = true

#---------------------------------------
# AWS for FluentBit - DaemonSet

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use FluentBit and DaemonSet. Instead, you may set up monitoring configurations to archive application logs to S3/CW.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are absolutely right @mengfwanaws! We use FluentBit to capture the logs from other k8s add-ons and apps running on the EKS cluster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done




flinkVersion: v1_17

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emr-7.0.0 should use v1_18.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

# add service account
# serviceAccount: flink-team-a-sa

executionRoleArn: arn:aws:iam::681921237057:role/emr-eks-flink-flink-team-a-20240406012025932700000008

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who's this 681921237057? Shall we dynamically set it up for customer acct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed it and added comments as well as docs.

taskmanager.numberOfTaskSlots: "2"
state.savepoints.dir: file:///flink/data/checkpoint/savepoints
state.checkpoints.dir: file:///flink/data/checkpoint/checkpoints
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deprecated config. Use high-availability.type: kubernetes instead.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use fluentbit. Please refer to monitoring configuration instead.

3. Automatically tunes Autoscaler configurations based on historical trends of observed metrics.
4. Faster Flink Job Restart during scaling or Failure Recovery
5. IRSA (IAM Roles for Service Accounts) Native Integration
6. Apache Airflow Integration

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see an example in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Copy link
Collaborator

@vara-bonthu vara-bonthu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, @mithun008! I've suggested some changes.

Overall, I would like you to align the blueprint more closely with emr-eks-karpenter. Additionally, we should include more examples, which can be added in subsequent PRs.

enable_aws_load_balancer_controller = true

#---------------------------------------
# AWS for FluentBit - DaemonSet
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are absolutely right @mengfwanaws! We use FluentBit to capture the logs from other k8s add-ons and apps running on the EKS cluster.

README.md Outdated Show resolved Hide resolved
|------|---------|
| <a name="requirement_terraform"></a> [terraform](#requirement\_terraform) | >= 1.0.0 |
| <a name="requirement_aws"></a> [aws](#requirement\_aws) | >= 3.72 |
| <a name="requirement_helm"></a> [helm](#requirement\_helm) | >= 2.4.1 |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the Terraform provider version and the latest version is 2.13.0 https://registry.terraform.io/providers/hashicorp/helm/latest/docs/guides/v2-upgrade-guide and it uses latest Helm version 3.13.2 https://github.com/hashicorp/terraform-provider-helm/blob/8d8f0503c39f9fb2d718805e5a619ef56728271b/go.mod#L13

@mithun008 Upgrade this to 2.13 in versions.tf file and rerun it

# For example only - please evaluate for your environment
force_destroy = true

attach_deny_insecure_transport_policy = true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are defaults i think now

streaming/emr-eks-flink/addons.tf Show resolved Hide resolved
"karpenter.sh/discovery" = local.name
}
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove extra lines

}

#import module vpc
module "vpc" {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this to vpc.tf file

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow the existing emr-eks-karpenter blueprint as guidance https://github.com/awslabs/data-on-eks/blob/main/analytics/terraform/emr-eks-karpenter/vpc.tf



# deploy a helm chart for flink-kubernetes-operator
resource "helm_release" "flink_kubernetes_operator" {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this and use this add-on from EKS Data addons https://github.com/aws-ia/terraform-aws-eks-data-addons/blob/main/emr-flink-operator.tf. Update the addon in this repo if required

@@ -0,0 +1,9 @@
# create output for flink operator role arn
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add this

################################################################################
# EKS Managed Node Group
################################################################################

output "configure_kubectl" {
  description = "Configure kubectl: make sure you're logged in with the correct AWS profile and run the following command to update your kubeconfig"
  value       = "aws eks --region ${local.region} update-kubeconfig --name ${module.eks.cluster_name}"
}

@@ -0,0 +1,36 @@

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this file and move these providers to main.tf

@@ -143,11 +139,27 @@ module "eks_blueprints_addons" {
# Data on EKS Kubernetes Addons
#---------------------------------------------------------------
module "eks_data_addons" {

depends_on = [module.flink_irsa_jobs, module.flink_irsa_operator]
#source = "[email protected]:mithun008/terraform-aws-eks-data-addons/"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


executionRoleArn: arn:aws:iam::xxxxxxxxx:role/emr-eks-flink-flink-team-a-20240406012025932700000008
# Replace this execution role ARN with your own
executionRoleArn: arn:aws:iam::681921237057:role/emr-eks-flink-flink-team-a-20240406012025932700000008

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should substitute this with variables, or provide instructions on how to generate the role arn (including the permissions).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the instructions in the website docs to substitute these values.

state.savepoints.dir: s3://emr-flink-data/savepoints

# Replace this execution role ARN with your own
executionRoleArn: arn:aws:iam::681921237057:role/emr-eks-flink-flink-team-a-20240406012025932700000008

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

@mithun008
Copy link
Contributor Author

@vara-bonthu I have addressed your comments. Thank you!

@@ -0,0 +1,27 @@
# create output for flink operator role arn
output "flink_jobs_role_arn" {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flink_job_execution_role_arn

Comment on lines 11 to 27
output "flink_checkpoint_path" {
value = "s3://${module.s3_bucket.s3_bucket_id}/checkpoints"
description = "S3 path for checkpoint data"
}
output "flink_savepoint_path" {
value = "s3://${module.s3_bucket.s3_bucket_id}/savepoints"
description = "S3 path for savepoint data"
}
output "flink_jobmanager_path" {
value = "s3://${module.s3_bucket.s3_bucket_id}/jobmanager"
description = "S3 path for jobmanager data"
}

output "flink_logs_path" {
value = "s3://${module.s3_bucket.s3_bucket_id}/logs"
description = "S3 path for logs"
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove all of these and add only s3bucet id


jobmanager.scheduler: adaptive
# Replace with s3 bucket in your own account
state.checkpoints.dir: s3://emr-flink-data/checkpoints
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<<REPLACE_WITH_S3_BUCKET_ID>>

# backend "s3" {
# bucket = "doeks-github-actions-e2e-test-state"
# region = "us-west-2"
# key = "e2e/flink/terraform.tfstate"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

emr-eks-flink

Copy link
Collaborator

@vara-bonthu vara-bonthu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left few minor comments otherwise its good

# EMR Flink operator
#---------------------------------------------------------------
enable_emr_flink_operator = true
emr_flink_operator_helm_config = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace this bloc with

  emr_flink_operator_helm_config = {
    repository               = "oci://public.ecr.aws/emr-on-eks"
    operatorExecutionRoleArn = module.flink_irsa_operator.iam_role_arn
  }
  

You can remove the repository and operatorExecutionRoleArn once you update these in Data addons repo

@@ -0,0 +1,64 @@
# NOTE: Make sure you you replce <ENTER_S3_BUCKET> with your flink_operator_bucket output before running this job
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check spelling and repeated words

@@ -0,0 +1,74 @@
# NOTE: Make sure you you replce <ENTER_S3_BUCKET> and with your S3 Bucket before running this job.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check spelling and repeated words

@@ -0,0 +1,75 @@
# NOTE: Make sure you you replce <ENTER_S3_BUCKET> with your flink_operator_bucket output before running this job
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check spelling and repeated words

Copy link
Collaborator

@vara-bonthu vara-bonthu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@vara-bonthu vara-bonthu merged commit ad20de8 into awslabs:main Apr 25, 2024
36 of 39 checks passed
ovaleanu pushed a commit to ovaleanu/data-on-eks that referenced this pull request Aug 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants