[BUG] Stops Tailing for ECS EC2 #31101

Open
gala-bsmith opened this issue Nov 14, 2024 · 0 comments

Agent Environment

root@ip-[REDACTED]:/# agent status
Getting the status from the agent.


===============
Agent (v7.59.0)
===============

  Status date: 2024-11-14 18:58:31.577 UTC (1731610711577)
  Agent start: 2024-11-14 18:56:59.147 UTC (1731610619147)
  Pid: 381
  Go Version: go1.19.5
  Python Version: 3.8.16
  Build arch: amd64
  Agent flavor: agent
  Check Runners: 4
  Log Level: info

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    System time: 2024-11-14 18:58:31.577 UTC (1731610711577)

  Host Info
  =========
    bootTime: 2024-11-13 21:02:17 UTC (1731531737000)
    hostId: ec2a7065-c340-a4cf-ae2a-96b25e6a4722
    kernelArch: x86_64
    kernelVersion: 4.14.353-270.569.amzn2.x86_64
    os: linux
    platform: ubuntu
    platformFamily: debian
    platformVersion: 22.04
    procs: 136
    uptime: 21h54m45s

  Hostnames
  =========
    ec2-hostname: ip-10-11-68-34.ec2.internal
    host_aliases: [i-06f9d1c7fd31a3cd2]
    instance-id: i-06f9d1c7fd31a3cd2
    socket-fqdn: ip-[REDACTED].ec2.internal.
    socket-hostname: ip-[REDACTED].ec2.internal
    hostname provider:
    unused hostname providers:
      'hostname' configuration/environment: hostname is empty
      'hostname_file' configuration/environment: 'hostname_file' configuration is not enabled

  Metadata
  ========
    agent_version: 7.59.0
    cloud_provider: AWS
    config_apm_dd_url:
    config_dd_url:
    config_logs_dd_url:
    config_logs_socks5_proxy_address:
    config_no_proxy: []
    config_process_dd_url:
    config_proxy_http:
    config_proxy_https:
    config_site:
    feature_apm_enabled: true
    feature_cspm_enabled: false
    feature_cws_enabled: false
    feature_logs_enabled: true
    feature_networks_enabled: false
    feature_networks_http_enabled: false
    feature_networks_https_enabled: false
    feature_otlp_enabled: false
    feature_process_enabled: false
    feature_processes_container_enabled: true
    feature_usm_go_tls_enabled: false
    feature_usm_java_tls_enabled: false
    flavor: agent
    install_method_installer_version: docker
    install_method_tool: docker
    install_method_tool_version: docker
    logs_transport: HTTP

=========
Collector
=========

  Running Checks
  ==============

    container
    ---------
      Instance ID: container [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/container.d/conf.yaml.default
      Total Runs: 6
      Metric Samples: Last Run: 69, Total: 414
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 3ms
      Last Execution Date : 2024-11-14 18:58:17 UTC (1731610697000)
      Last Successful Execution Date : 2024-11-14 18:58:17 UTC (1731610697000)


    docker
    ------
      Instance ID: docker [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/docker.d/conf.yaml.default
      Total Runs: 6
      Metric Samples: Last Run: 16, Total: 96
      Events: Last Run: 1, Total: 4
      Service Checks: Last Run: 1, Total: 6
      Average Execution Time : 4ms
      Last Execution Date : 2024-11-14 18:58:24 UTC (1731610704000)
      Last Successful Execution Date : 2024-11-14 18:58:24 UTC (1731610704000)


    ecs_fargate (3.3.0)
    -------------------
      Instance ID: ecs_fargate:fed08cd1baa0bef1 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/ecs_fargate.d/conf.yaml.default
      Total Runs: 5
      Metric Samples: Last Run: 61, Total: 305
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 5
      Average Execution Time : 10ms
      Last Execution Date : 2024-11-14 18:58:22 UTC (1731610702000)
      Last Successful Execution Date : 2024-11-14 18:58:22 UTC (1731610702000)

========
JMXFetch
========

  Information
  ==================
  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  Transactions
  ============
    Cluster: 0
    ClusterRole: 0
    ClusterRoleBinding: 0
    CronJob: 0
    CustomResource: 0
    CustomResourceDefinition: 0
    DaemonSet: 0
    Deployment: 0
    Dropped: 0
    HighPriorityQueueFull: 0
    Ingress: 0
    Job: 0
    Namespace: 0
    Node: 0
    PersistentVolume: 0
    PersistentVolumeClaim: 0
    Pod: 0
    ReplicaSet: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Role: 0
    RoleBinding: 0
    Service: 0
    ServiceAccount: 0
    StatefulSet: 0
    VerticalPodAutoscaler: 0

  Transaction Successes
  =====================
    Total number: 14
    Successes By Endpoint:
      check_run_v1: 5
      intake: 4
      series_v2: 5

  On-disk storage
  ===============
    On-disk storage is disabled. Configure `forwarder_storage_max_size_in_bytes` to enable it.

  API Keys status
  ===============
    API key ending with 4c456: API Key valid

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - 4c456

==========
Logs Agent
==========
    Reliable: Sending compressed logs in HTTPS to agent-http-intake.logs.datadoghq.com on port 443
    BytesSent: 6575
    EncodedBytesSent: 700
    LogsProcessed: 5
    LogsSent: 5

  ecsfargate
  ----------
    - Type: ecsfargate
      Service: example-ecs
      Source: example-source
      Status: Pending
      Bytes Read: 0
      Pipeline Latency:
        Average Latency (ms): 0
        24h Average Latency (ms): 0
        Peak Latency (ms): 0
        24h Peak Latency (ms): 0


=============
Process Agent
=============

  Version: 7.59.0
  Status date: 2024-11-14 18:58:32.007 UTC (1731610712007)
  Process Agent Start: 2024-11-14 18:56:59.449 UTC (1731610619449)
  Pid: 384
  Go Version: go1.19.5
  Build arch: amd64
  Log Level: info
  Enabled Checks: [container rtcontainer]
  Allocated Memory: 15,334,328 bytes
  Hostname:
  System Probe Process Module Status: Not running

  =================
  Process Endpoints
  =================
    https://process.datadoghq.com - API Key ending with:
        - 4c456

  =========
  Collector
  =========
    Last collection time: 2024-11-14 18:58:29
    Docker socket: /var/run/docker.sock
    Number of processes: 0
    Number of containers: 3
    Process Queue length: 0
    RTProcess Queue length: 0
    Connections Queue length: 0
    Event Queue length: 0
    Pod Queue length: 0
    Process Bytes enqueued: 0
    RTProcess Bytes enqueued: 0
    Connections Bytes enqueued: 0
    Event Bytes enqueued: 0
    Pod Bytes enqueued: 0
    Drop Check Payloads: []

=========
APM Agent
=========
  Status: Running
  Pid: 382
  Uptime: 92 seconds
  Mem alloc: 10,193,920 bytes
  Hostname:
  Receiver: 0.0.0.0:8126
  Endpoints:
    https://trace.agent.datadoghq.com

  Receiver (previous minute)
  ==========================
    No traces received in the previous minute.


  Writer (previous minute)
  ========================
    Traces: 0 payloads, 0 traces, 0 events, 0 bytes
    Stats: 0 payloads, 0 stats buckets, 0 bytes

==========
Aggregator
==========
  Checks Metric Sample: 973
  Dogstatsd Metric Sample: 903
  Event: 4
  Events Flushed: 4
  Number Of Flushes: 5
  Series Flushed: 1,140
  Service Check: 28
  Service Checks Flushed: 28

=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 902
  Metric Parse Errors: 0
  Service Check Packets: 0
  Service Check Parse Errors: 0
  Udp Bytes: 140,994
  Udp Packet Reading Errors: 0
  Udp Packets: 531
  Uds Bytes: 0
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 0
  Unterminated Metric Errors: 0

=============
Autodiscovery
=============
  Enabled Features
  ================
    docker
    ecsfargate

====
OTLP
====

  Status: Not enabled
  Collector status: Not running

Describe what happened:
About a week ago, we stood up an ECS service on EC2 for a request my team had, with datadog-agent running as a sidecar in the task. We found that whenever the ECS_FARGATE environment variable is set, the agent uses the ecsfargate endpoints to tail logs, regardless of the variable's value. Only when the variable is removed completely does the agent stop ecsfargate tailing and use the docker tailer alone.

Describe what you expected:
We expected that setting ECS_FARGATE to false would stop the agent from tailing the ecsfargate endpoints and leave only docker tailing, but it still tails ecsfargate.
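
To make the expectation concrete, here is a minimal sketch of the two ways a startup script could interpret the variable. It is only an assumption that something in the agent's startup path behaves like the first function; this has not been confirmed against the agent source, and both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not taken from the Datadog Agent.

# Presence check: any non-empty value counts as "on", including "false".
fargate_by_presence() {
    if [ -n "${ECS_FARGATE:-}" ]; then echo enabled; else echo disabled; fi
}

# Value check: only an explicit truthy value counts as "on".
fargate_by_value() {
    case "${ECS_FARGATE:-}" in
        true|TRUE|True|1) echo enabled ;;
        *) echo disabled ;;
    esac
}
```

With ECS_FARGATE=false, the presence check reports enabled while the value check reports disabled. The behavior described here matches the first; the expected behavior matches the second.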

Steps to reproduce the issue:

  1. Set up your ECS service with the following task definition:
{
    "taskDefinitionArn": "arn:aws:ecs:us-east-1:[REDACTED]:task-definition/dev-[REDACTED]:40",
    "containerDefinitions": [
        {
            "name": "datadog",
            "image": "datadog/agent:latest",
            "cpu": 256,
            "memory": 512,
            "portMappings": [
                {
                    "name": "datadog-agent-8126-tcp",
                    "containerPort": 8126,
                    "hostPort": 8126,
                    "protocol": "tcp"
                }
            ],
            "essential": true,
            "environment": [
                {
                    "name": "ECS_FARGATE", 
                    "value": "false"
                },
                {
                    "name": "DD_APM_ENABLED",
                    "value": "true"
                },
                {
                    "name": "DD_LOGS_ENABLED",
                    "value": "true"
                }
            ],
            "mountPoints": [
                {
                    "sourceVolume": "docker_sock",
                    "containerPath": "/var/run/docker.sock",
                    "readOnly": false
                },
                {
                    "sourceVolume": "proc",
                    "containerPath": "/host/proc",
                    "readOnly": false
                },
                {
                    "sourceVolume": "cgroup",
                    "containerPath": "/host/sys/fs/cgroup",
                    "readOnly": false
                },
                {
                    "sourceVolume": "pointdir",
                    "containerPath": "/opt/datadog-agent/run",
                    "readOnly": false
                },
                {
                    "sourceVolume": "containers_root",
                    "containerPath": "/var/lib/docker/containers",
                    "readOnly": true
                }
            ],
            "volumesFrom": [],
            "linuxParameters": {
                "initProcessEnabled": false
            },
            "secrets": [
                {
                    "name": "DD_API_KEY",
                    "valueFrom": "arn:aws:secretsmanager:us-east-1:[REDACTED]:secret:prod/datadog-ii1DyM:apikey::"
                }
            ],
            "startTimeout": 30,
            "stopTimeout": 120,
            "user": "0",
            "privileged": false,
            "readonlyRootFilesystem": false,
            "interactive": false,
            "pseudoTerminal": false,
            "logConfiguration": {
                "logDriver": "json-file",
                "options": {}
            },
            "healthCheck": {
                "command": [
                    "CMD-SHELL",
                    "agent health"
                ],
                "interval": 10,
                "timeout": 5,
                "retries": 3
            },
            "systemControls": []
        },
        {
            "name": "[REDACTED]",
            "image": "registry.gitlab.com/[REDACTED]:d6ab19eb",
            "repositoryCredentials": {
                "credentialsParameter": "arn:aws:secretsmanager:us-east-1:[REDACTED]:secret:gitlab-auth-YQdQIE"
            },
            "cpu": 256,
            "memory": 512,
            "portMappings": [
                {
                    "name": "[REDACTED]-80-tcp",
                    "containerPort": 80,
                    "hostPort": 80,
                    "protocol": "tcp"
                }
            ],
            "essential": true,
            "environment": [
                {
                    "name": "DD_AGENT_HOST",
                    "value": "datadog"
                },
                {
                    "name": "DD_VERSION",
                    "value": "d6ab19eb"
                },
                {
                    "name": "DD_SERVICE",
                    "value": "[REDACTED]-ecs"
                },
                {
                    "name": "DD_LOGS_INJECTION",
                    "value": "true"
                },
                {
                    "name": "ENABLE_TRACING",
                    "value": "true"
                },
                {
                    "name": "DD_ENV",
                    "value": "dev"
                },
                {
                    "name": "DD_LOGS_ENABLED",
                    "value": "true"
                },
                {
                    "name": "NODE_ENV",
                    "value": "development"
                },
                {
                    "name": "AWS_SECRETS_ID",
                    "value": "dev/[REDACTED]"
                }
            ],
            "mountPoints": [],
            "volumesFrom": [],
            "linuxParameters": {
                "initProcessEnabled": false
            },
            "startTimeout": 30,
            "stopTimeout": 120,
            "user": "0",
            "privileged": false,
            "readonlyRootFilesystem": true,
            "interactive": false,
            "pseudoTerminal": false,
            "dockerLabels": {
                "com.datadoghq.tags.service": "[REDACTED]-ecs",
                "com.datadoghq.tags.version": "1.0.0",
                "com.datadoghq.ad.logs": "[{\"source\": \"example-source\", \"service\": \"[REDACTED]-ecs\"}]",
                "com.datadoghq.tags.env": "dev"
            },
            "logConfiguration": {
                "logDriver": "json-file",
                "options": {}
            },
            "healthCheck": {
                "command": [
                    "CMD-SHELL",
                    "exit 0"
                ],
                "interval": 30,
                "timeout": 5,
                "retries": 3,
                "startPeriod": 60
            },
            "systemControls": []
        }
    ],
    "family": "dev-[REDACTED]",
    "taskRoleArn": "arn:aws:iam::[REDACTED]:role/dev-[REDACTED]-ECSTaskRole",
    "executionRoleArn": "arn:aws:iam::[REDACTED]:role/dev-[REDACTED]-20241015183150583800000003",
    "networkMode": "awsvpc",
    "revision": 40,
    "volumes": [
        {
            "name": "cgroup",
            "host": {
                "sourcePath": "/sys/fs/cgroup/"
            }
        },
        {
            "name": "pointdir",
            "host": {
                "sourcePath": "/opt/datadog-agent/run"
            }
        },
        {
            "name": "proc",
            "host": {
                "sourcePath": "/proc/"
            }
        },
        {
            "name": "containers_root",
            "host": {
                "sourcePath": "/var/lib/docker/containers/"
            }
        },
        {
            "name": "docker_sock",
            "host": {
                "sourcePath": "/var/run/docker.sock"
            }
        }
    ],
    "status": "ACTIVE",
    "requiresAttributes": [
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.17"
        },
        {
            "name": "com.amazonaws.ecs.capability.task-iam-role"
        },
        {
            "name": "ecs.capability.container-health-check"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
        },
        {
            "name": "ecs.capability.task-eni"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.29"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.24"
        },
        {
            "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
        },
        {
            "name": "ecs.capability.secrets.asm.environment-variables"
        },
        {
            "name": "ecs.capability.private-registry-authentication.secretsmanager"
        },
        {
            "name": "ecs.capability.container-ordering"
        },
        {
            "name": "com.amazonaws.ecs.capability.logging-driver.json-file"
        }
    ],
    "placementConstraints": [],
    "compatibilities": [
        "EC2"
    ],
    "requiresCompatibilities": [
        "EC2"
    ],
    "cpu": "1024",
    "memory": "2048",
    "runtimePlatform": {
        "cpuArchitecture": "X86_64",
        "operatingSystemFamily": "LINUX"
    },
    "registeredAt": "2024-11-13T20:59:32.005Z",
    "registeredBy": "arn:aws:sts::[REDACTED]",
    "tags": []
}
  2. Go to the EC2 instance and run docker ps, then docker logs <datadog_container> | grep -i tail.
  3. Look for a tailer being opened and then closed in the docker logs. This should not be happening.
    It happens because the agent switches back and forth between docker tailing and ecsfargate tailing. But remember, ECS_FARGATE is set to false, so this should never have happened. The workaround is simply to remove the environment variable.
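
The log check in the steps above can be wrapped in a small helper. This is a sketch only: the function name tail_lines is hypothetical, <datadog_container> is the same placeholder used in the steps, and the 2>&1 redirect is an added assumption so that lines written to stderr are searched too.

```shell
#!/bin/sh
# Keep only log lines that mention "tail", ignoring case (same filter as
# the grep in the reproduction steps).
tail_lines() {
    grep -i tail
}

# Usage on the EC2 host (commented out so the snippet runs without Docker):
#   docker ps
#   docker logs <datadog_container> 2>&1 | tail_lines
```

If the filtered output shows tailers for the same container being opened and closed repeatedly, that is the back-and-forth behavior described in this report.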

Additional environment details (Operating System, Cloud provider, etc):
Cloud: AWS ECS
Operating System: Linux

I feel this is a bug. If it is not, the behavior should be documented more clearly.
Thanks.
