[BUG] Overlay network not found on worker node #11894

Open
thormme opened this issue Jun 7, 2024 · 18 comments

Comments

@thormme

thormme commented Jun 7, 2024

Description

Issue:
Swarm worker hosts fail to attach to manager node overlay networks unless a container has been manually started and attached to the network using docker run --network swarm-overlay

Expected Behavior:
A container started on the worker should automatically attach to the overlay network, and the network should be visible in the docker network ls output.

$> docker network ls
8e3c351af333   bridge             bridge    local
0cbc0420c111   docker_gwbridge    bridge    local
x8gb7mz6s222   swarm-overlay      overlay   swarm
c09ad17a7321   host               host      local
keth4xuub123   ingress            overlay   swarm
d8baa27f3654   none               null      local

Workaround:
The only solution I have found is to downgrade to an earlier version (2.21.0-1) of docker-compose-plugin

sudo apt list -a docker-compose-plugin
sudo apt install docker-compose-plugin=2.21.0-1~debian.11~bullseye
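
A minimal sketch of making the downgrade stick, assuming a Debian or Ubuntu host where `apt-mark` is available. The function name `pin_compose_plugin` is hypothetical; the version string is whatever `apt list -a docker-compose-plugin` reports for your distribution.

```shell
# Hypothetical helper: install one exact docker-compose-plugin build and hold it.
# The caller supplies the full apt version string; nothing here guesses it.
pin_compose_plugin() {
  version="$1"
  if [ -z "$version" ]; then
    echo "usage: pin_compose_plugin <apt-version-string>" >&2
    return 2
  fi
  # --allow-downgrades lets apt-get move to an older build without prompting.
  sudo apt-get install -y --allow-downgrades "docker-compose-plugin=${version}" || return 1
  # Hold the package so a routine upgrade does not undo the downgrade.
  sudo apt-mark hold docker-compose-plugin
}
```

Usage would be `pin_compose_plugin '2.21.0-1~debian.11~bullseye'`. Running `sudo apt-mark unhold docker-compose-plugin` releases the hold once a fixed version is available.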

I believe this is the same issue as #11387, but I couldn't find any open bugs with the same issue.

Thanks for any help with this!

Steps To Reproduce

I created a custom overlay network on the swarm manager node.

...
  service:
    image: service-image
    container_name: service
    networks:
      - swarm-overlay
    restart: unless-stopped
...
networks:
  swarm-overlay:
    attachable: true
    driver: overlay

This correctly created the network and attached the relevant container to it.

I then joined a worker host to the swarm and attempted to connect a container to the overlay network.

...
worker-service:
    image: worker-image
    container_name: worker-service
    networks:
      swarm-overlay:
        aliases:
          - host1-worker-service
    restart: unless-stopped
...
networks:
  swarm-overlay:
    external: true
    driver: overlay

docker compose up -d worker-service
This errors with:

Error response from daemon: network swarm-overlay not found

Compose Version

docker-compose-plugin/bullseye 2.27.1-1~debian.11~bullseye
Docker Compose version v2.27.1

Docker Environment

Client: Docker Engine - Community
 Version:    26.1.4
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.14.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.27.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 12
  Running: 5
  Paused: 0
  Stopped: 7
 Images: 31
 Server Version: 26.1.4
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: active
  NodeID: 2brhg9vzj8m47oyo40ie5yj0u
  Is Manager: false
  Node Address: 1.2.3.4
  Manager Addresses:
   4.3.2.1:2377
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: d2d58213f83a351ca8f528a95fbd145f5654e957
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.10.0-28-cloud-amd64
 Operating System: Debian GNU/Linux 11 (bullseye)
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 13.42GiB
 Name: cloud-machine
 ID: 6c0ae974-1ba3-450a-ab03-d31b31c6097f
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Anything else?

No response

@ndeloof
Contributor

ndeloof commented Jun 10, 2024

This isn't the same issue as #11387 as here this is the docker engine reporting error: Error response from daemon: network swarm-overlay not found

Can you please confirm you can use docker run --network swarm-overlay ... to run an equivalent container on the worker node with this swarm setup?

@jsunstrom

I'm running into this exact same issue using Docker Compose 2.27.0. I can confirm that I can use docker run -it --name alpine1 --network test-net alpine from the official documentation. I walked through the entirety of the "Use an overlay network for standalone containers" and it worked as expected.

However, using docker compose files, I also get the message Error response from daemon: network <my network name here> not found when running docker compose up -d.

@ambretanmay

ambretanmay commented Jun 11, 2024

I am having the exact same issue.
Docker Compose version v2.27.1
@ndeloof docker run --network swarm-overlay works and compose doesn't

@inql

inql commented Jun 27, 2024

btw is the downgrade workaround needed for both leader and worker node?

@ambretanmay

@inql I have not tested this as our scripts set versions for all nodes.

@michaelmcandrew

michaelmcandrew commented Jul 3, 2024

Hey there, also affected by this bug.

If you don't want to downgrade, another workaround is to create a container and attach it to the network. The network then appears in the list and docker compose no longer complains:

docker run -dit --name keep-alive --restart=always --network <network_name> alpine

Adding --restart=always will ensure that it survives restarts of the docker daemon, etc.
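
The keep-alive workaround can be wrapped so that it only starts a container when the network is not yet listed locally. This is a sketch under assumptions: the function names `network_visible` and `ensure_overlay_attached` are hypothetical, the `alpine` image and `keep-alive` container name come from the command above, and it does not handle a stopped container that already uses that name.

```shell
# Returns 0 when the local engine lists a network with exactly this name.
# The name filter matches substrings, so grep -x enforces an exact match.
network_visible() {
  docker network ls --filter "name=$1" --format '{{.Name}}' | grep -qxF -- "$1"
}

# Starts a long-lived placeholder container only if the network is not listed.
ensure_overlay_attached() {
  net="$1"
  holder="${2:-keep-alive}"   # container name used in the comment above
  if network_visible "$net"; then
    return 0
  fi
  docker run -dit --name "$holder" --restart=always --network "$net" alpine >/dev/null
}
```

On a worker node this would be called as `ensure_overlay_attached <network_name>` before `docker compose up -d`.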

My versions in case it is useful:

docker version

Client: Docker Engine - Community
Version: 27.0.3
API version: 1.46
Go version: go1.21.11
Git commit: 7d4bcd8
Built: Sat Jun 29 00:02:50 2024
OS/Arch: linux/amd64
Context: default

Server: Docker Engine - Community
Engine:
Version: 27.0.3
API version: 1.46 (minimum version 1.24)
Go version: go1.21.11
Git commit: 662f78c
Built: Sat Jun 29 00:02:50 2024
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.7.18
GitCommit: ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
runc:
Version: 1.7.18
GitCommit: v1.1.13-0-g58aa920
docker-init:
Version: 0.19.0
GitCommit: de40ad0

docker compose version

Docker Compose version v2.28.1

@kulpsin

kulpsin commented Jul 4, 2024

Sorry, I did not realise that @michaelmcandrew had already mentioned this above, but at least this comment confirms his findings: #11894 (comment)

I tested this issue and noticed that if there is a running container connected to the external overlay network (started with docker run ... and with the network visible in docker network ls), then Compose is able to connect to the external overlay network.

So, without knowing anything about the internals, the problem might have something to do with not checking for available external overlay networks but instead checking just internal networks (visible with docker network ls).

So as an additional workaround, it is possible to first start a "dummy" container on the workers, for example:

$ docker compose up -d
Error response from daemon: network <overlay-network> not found
$ docker run -dit --rm --name dummy-network-container --network <overlay-network> alpine
43924b1b25ac73373aac9120b55ac46fc1de3435ce26485682e11d6c06671936
$ docker compose up -d
[+] Running 1/0
 ✔ Container worker-service  Started
$ _
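
The same sequence can be sketched as a retry wrapper. The function name `up_with_dummy` is hypothetical, and matching on the text `network ... not found` is an assumption based on the error quoted in this thread; any other failure is returned unchanged rather than masked.

```shell
# Try compose first; if it fails with the "network ... not found" error,
# start a throwaway container on the overlay network and try once more.
up_with_dummy() {
  net="$1"
  if out=$(docker compose up -d 2>&1); then
    printf '%s\n' "$out"
    return 0
  fi
  printf '%s\n' "$out" >&2
  case "$out" in
    *"network "*" not found"*) ;;   # the error reported in this issue
    *) return 1 ;;                  # some other failure: do not retry
  esac
  docker run -dit --rm --name dummy-network-container --network "$net" alpine >/dev/null || return 1
  docker compose up -d
}
```

Because the dummy container is started with `--rm`, it is removed when it stops, so it will not come back after a daemon restart.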

I also checked downgrading, and on Ubuntu 22.04 it worked, so I think I will be using the downgraded version for now myself.
sudo apt-get remove docker-compose-plugin && sudo apt-get install docker-compose-plugin=2.21.0-1~ubuntu.22.04~jammy

$ docker version
Client: Docker Engine - Community
 Version:           27.0.3
 API version:       1.46
 Go version:        go1.21.11
 Git commit:        7d4bcd8
 Built:             Sat Jun 29 00:02:33 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          27.0.3
  API version:      1.46 (minimum version 1.24)
  Go version:       go1.21.11
  Git commit:       662f78c
  Built:            Sat Jun 29 00:02:33 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.18
  GitCommit:        ae71819c4f5e67bb4d5ae76a6b735f29cc25774e
 runc:
  Version:          1.7.18
  GitCommit:        v1.1.13-0-g58aa920
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

$ docker compose version
Docker Compose version v2.28.1

@ndeloof
Contributor

ndeloof commented Jul 4, 2024

@kulpsin docker network ls indeed does not detect overlay networks created on another swarm node until they are used by some container (I'm not sure about the reason, but that's what we get from the engine API). So Docker Compose can't check that the network exists; instead it should detect that swarm is enabled and ignore the error (assuming container creation will fail if the network is actually missing). See

compose/pkg/compose/create.go

Lines 1334 to 1340 in 11d5ecd

	if enabled {
		// Swarm nodes do not register overlay networks that were
		// created on a different node unless they're in use.
		// So we can't preemptively check network exists, but
		// networkAttach will later fail anyway if network actually doesn't exists
		return nil
	}

I'm not sure why this doesn't work as expected; I need to set up a test environment and try to reproduce this bug.
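
The intended decision can be approximated from the command line. This is a sketch of the behaviour described in this comment, not Compose's actual code: the function name `resolve_external_network` and its three result words are hypothetical, and it uses `docker info` to read the local swarm state.

```shell
# Prints "found", "skip-check" or "missing" for an external network name.
resolve_external_network() {
  net="$1"
  # Step 1: is a network with exactly this name listed by the local engine?
  if docker network ls --filter "name=$net" --format '{{.Name}}' | grep -qxF -- "$net"; then
    echo found
    return 0
  fi
  # Step 2: on an active swarm node the list may simply be incomplete,
  # so skip the check and let the container attach fail later if needed.
  state=$(docker info --format '{{.Swarm.LocalNodeState}}' 2>/dev/null)
  if [ "$state" = "active" ]; then
    echo skip-check
    return 0
  fi
  echo missing
  return 1
}
```

On an affected worker, `resolve_external_network swarm-overlay` would be expected to print `skip-check` before any container uses the network, and `found` afterwards.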

@jhrotko
Contributor

jhrotko commented Jul 18, 2024

With the original compose.yml, Compose generated a network named swarm-netword-overlay_swarm-overlay
[Screenshot: generated network name]
...and then the worker was not able to find the external network, as expected.

Adding name: swarm-overlay to the network definition made it work for me with v2.28.1, running
docker compose up -d

...
  service:
    image: service-image
    container_name: service
    networks:
      - swarm-overlay
    restart: unless-stopped
...
networks:
  swarm-overlay:
    name: swarm-overlay  # <---- added
    attachable: true
    driver: overlay

After this, docker network ls shows the following result:
[Screenshot: docker network ls output]

and now the worker is referencing the right network:
[Screenshot: worker attached to the network]
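
The naming rule behind this can be sketched as a small function: without an explicit `name:`, Compose prefixes the network key with the project name. The function name `compose_network_name` is hypothetical, and the sketch ignores any normalisation Compose applies to the project name.

```shell
# Prints the network name to expect for a Compose network definition.
#   $1 = project name, $2 = key under "networks:", $3 = optional "name:" value
compose_network_name() {
  project="$1"
  key="$2"
  explicit="${3:-}"
  if [ -n "$explicit" ]; then
    # An explicit name is used as-is, with no project prefix.
    printf '%s\n' "$explicit"
  else
    printf '%s_%s\n' "$project" "$key"
  fi
}
```

For example, `compose_network_name swarm-netword-overlay swarm-overlay` prints the prefixed name shown above, while passing `swarm-overlay` as the third argument prints the bare name that the worker's `external: true` entry refers to. On the manager, `docker network ls --filter label=com.docker.compose.network=swarm-overlay` should list whichever name was actually created.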

@michaelmcandrew

To flesh out my steps to reproduce a bit more, since they are slightly different from the ones mentioned above, I created a swarm network on the lead node with docker network create --driver overlay test --attachable.

This network was not visible on the worker node (expected I think because nothing was connected).

However, I was not able to connect to it with the below networks section in a compose.yaml on the worker node.

networks:
  test:
    external: true

I created the following container on the worker node docker run -dit --name keep-alive --network test --restart=always alpine

I was then able to connect using the above networks section in a compose.yaml on the worker node.

Hope that helps with the reproduction!

@tuxthepenguin84

I created the following container on the worker node docker run -dit --name keep-alive --network test --restart=always alpine

Thanks this worked for me.

@tuxthepenguin84

Is this a bug in compose? I would expect some degree of feature parity between docker and docker compose.

@ndeloof
Contributor

ndeloof commented Oct 23, 2024

@tuxthepenguin84 docker compose does some client-side validation before running containers, and as such checks that the target network exists. docker run just fails if the network is not found, without any preliminary validation.
Can you please confirm the issue persists with the latest version? AFAIK we had a fix for it.

@tuxthepenguin84

It appears to me the issue still persists, at least for me and my use case.

Docker Compose version v2.29.7
Client: Docker Engine - Community
 Version:           27.3.1
 API version:       1.47
 Go version:        go1.22.7
 Git commit:        ce12230
 Built:             Fri Sep 20 11:41:00 2024
 OS/Arch:           linux/amd64
 Context:           default
[+] Running 3/3
 ✔ Container proxy2-nginx-exporter  Removed                                                                                                        0.5s
 ✔ Container proxy2                 Removed                                                                                                        1.8s
 ✔ Network proxy_default            Removed                                                                                                        0.4s
[+] Running 2/3
 ✔ Network proxy_default            Created                                                                                                        0.8s
 ⠸ Container proxy2                 Starting                                                                                                       2.3s
 ✔ Container proxy2-nginx-exporter  Started                                                                                                        2.0s
Error response from daemon: could not find a network matching network mode jf5y7525s7qqt0333lfolwruk: network jf5y7525s7qqt0333lfolwruk not found
[
    {
        "Name": "ai",
        "Id": "jf5y7525s7qqt0333lfolwruk",
        "Created": "2024-10-06T20:26:15.848600039Z",
        "Scope": "swarm",
        "Driver": "overlay",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "10.0.3.0/24",
                    "Gateway": "10.0.3.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": true,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": null,
        "Options": {
            "com.docker.network.driver.overlay.vxlanid_list": "4099"
        },
        "Labels": null
    }
]

The network is there.

services:
  proxy2:
    image: nginx:latest
    container_name: proxy2
    restart: unless-stopped
    networks: ['ai', 'collaboration', 'core', 'garage', 'health', 'iot', 'olivetin', 'media', 'metrics', 'proxy', 'security', 'sprinklers']
    ports:
      - 443:443
    volumes:
      - /containers/proxy/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - /containers/proxy/nginx/conf.d:/etc/nginx/conf.d:ro
      - /containers/proxy/dhparams.pem:/etc/ssl/dhparams.pem:ro
      - /certs/delchampsio/fullchain.pem:/etc/ssl/delchampsio/fullchain.pem:ro
      - /certs/delchampsio/privkey.pem:/etc/ssl/delchampsio/privkey.pem:ro
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro

  proxy2-nginx-exporter:
    image: nginx/nginx-prometheus-exporter:latest
    container_name: proxy2-nginx-exporter
    restart: unless-stopped
    ports:
      - 9113:9113
    command:
      - --nginx.scrape-uri=http://proxy2:8080/nginx_status

networks:
  ai:
    name: ai
    driver: overlay
    external: true
  collaboration:
    name: collaboration
    driver: overlay
    external: true
  core:
    name: core
    driver: overlay
    external: true
  garage:
    name: garage
    driver: overlay
    external: true
  health:
    name: health
    driver: overlay
    external: true
  iot:
    name: iot
    driver: overlay
    external: true
  olivetin:
    name: olivetin
    driver: overlay
    external: true
  media:
    name: media
    driver: overlay
    external: true
  metrics:
    name: metrics
    driver: overlay
    external: true
  proxy:
    name: proxy
    driver: overlay
    external: true
  security:
    name: security
    driver: overlay
    external: true
  sprinklers:
    name: sprinklers
    driver: overlay
    external: true

If I run the following and get a container up and running on that "missing" network, I can get the container started with compose

docker run -dit --rm --name dummy-network-container --network ai alpine

Let me know if you need more info or want me to try something, I'm happy to help out and work on getting this fixed.

@ndeloof
Contributor

ndeloof commented Oct 25, 2024

@tuxthepenguin84 could you please give the binary from #12233 a try (binaries are available at the bottom of https://github.com/docker/compose/actions/runs/11513518822)?

This adds some debug output to the network resolution logic that will help diagnose this issue.
Run it as docker compose --verbose --progress=plain up

@tuxthepenguin84

Thanks I'll try that out and report back.

@aek

aek commented Nov 14, 2024

@ndeloof I have the issue with the compose plugin version v2.27.0 running on Ubuntu Server 24.04 with ARM Arch

Here is the output of testing the binary from #12233

/etc/salt/docker/test # /etc/salt/docker/docker-compose-linux-aarch64 --verbose --progress=plain up -d
DEBU[0000] search network "axel5" by name returned: 0   
DEBU[0000] search network "axel5" by ID succeeded       
DEBU[0000] networks matching name "axel5" after strict filtering: 0 
DEBU[0000] no match, swarm is enabled: true             
 Container test-dummy-1  Recreate
DEBU[0005] otel error                                    error="<nil>"
 Container test-dummy-1  Recreated
 Container test-dummy-1  Starting
 Container test-dummy-1  Started
DEBU[0010] otel error                                    error="<nil>"
DEBU[0010] otel error                                    error="<nil>"

This version properly creates the network

Here is my docker info output

/etc/salt/docker/test # docker info
Client:
 Version:    26.1.5
 Context:    default
 Debug Mode: false
 Plugins:
  compose: Docker Compose (Docker Inc.)
    Version:  v2.27.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 11
  Running: 6
  Paused: 0
  Stopped: 5
 Images: 13
 Server Version: 27.3.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: active
  NodeID: mi4aclsip2vfc0fmdk0lizvoi
  Is Manager: false
  Node Address: 172.31.41.5
  Manager Addresses:
   172.31.45.225:2377
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 57f17b0a6295a39009d861b89e3b3b87b005ca27
 runc version: v1.1.14-0-g2c9f560
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.8.0-1016-aws
 Operating System: Ubuntu 24.04.1 LTS
 OSType: linux
 Architecture: aarch64
 CPUs: 4
 Total Memory: 7.582GiB
 Name: ip-172-31-41-5
 ID: aebad7d3-d242-435a-a215-9e10a8a1a6b1
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Labels:
  salt-minion=dd6de55b-6f41-4cfd-924f-1231ed03995b
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

will try with the latest and report

@aek

aek commented Nov 14, 2024

My issue was that I had 2 versions of docker compose:

  • version 2.29 in Ubuntu Server on the host
  • version 2.27 in Alpine Linux, in a container with docker.sock bind mounted

I run my compose commands inside the Alpine container, so they used compose CLI version 2.27, because that's the version that ships with Alpine 3.20.

I fixed it by installing the latest version from edge like this:

apk add docker-cli docker-cli-compose  --repository=https://dl-cdn.alpinelinux.org/alpine/edge/community
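
Since the host and a container can carry different Compose versions, a quick check of the binary that actually runs the command may help. A sketch, assuming a `sort` that supports `-V`; the function names are hypothetical, and the minimum version is a parameter because this thread does not state which release contains the fix.

```shell
# True when version $1 is greater than or equal to version $2.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# True when the compose plugin found on this PATH is at least version $1.
compose_is_at_least() {
  min="$1"
  current=$(docker compose version --short 2>/dev/null) || return 2
  current=${current#v}   # tolerate a leading "v"
  version_at_least "$current" "$min"
}
```

Running `compose_is_at_least <minimum-version>` both on the host and inside the container shows whether the two environments disagree.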
