Envoy's lb_policy: ROUND_ROBIN is not balancing cluster hosts #37369

Open
rookrunner opened this issue Nov 26, 2024 · 5 comments

rookrunner commented Nov 26, 2024

Title: Envoy's lb_policy: ROUND_ROBIN is not balancing cluster hosts

Description:

I have an Envoy configuration like this:

  - name: testserverpolling
    address:
      socket_address:
        address: 127.0.0.1
        port_value: 8129
        protocol: TCP
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          codec_type: AUTO
          stat_prefix: testserverpolling
          route_config:
            name: testserverpolling
            virtual_hosts:
            - name: testserverpolling
              domains:
              - "*"
              routes:
              - match:
                  prefix: "/"
                route:
                  cluster: testserverpolling
                  timeout:
                    seconds: 1800
                decorator:
                  operation: testserverpolling
              require_tls: NONE
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          server_name: envoy-ingress-testserverpolling
          drain_timeout:
            seconds: 1800
          common_http_protocol_options:
            idle_timeout: 1800s
          access_log:
          - name: envoy.access_loggers.file
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog
              path: "/var/DebugTrace/envoylb/envoylb_testserverpolling.log"
          generate_request_id: true
  - name: testserverpolling
    type: STRICT_DNS
    connect_timeout:
      seconds: 1800
    lb_policy: ROUND_ROBIN
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
        common_tls_context:
          tls_params:
            tls_minimum_protocol_version: "TLSv1_2"
          tls_certificates:
          - certificate_chain:
              filename: "/certs/cert.pem"
            private_key:
              filename: "/certs/key.pem"
            password:
              filename: "/etc/pkencryptkey/password"
    load_assignment:
      cluster_name: testserverpolling
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: test-ckey-svc-headless.{{ .Release.Namespace }}.{{ .Values.global.servicedomainName }}.
                    port_value: 8124
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http_protocol_options: { }

I have 6 replicas of the pods behind the service test-ckey-svc-headless.{{ .Release.Namespace }}.{{ .Values.global.servicedomainName }}.
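(For reference, the endpoint list behind that headless service can be checked with something like the following sketch; the testns namespace is taken from the kubectl commands further down:)

kubectl -n testns get endpoints test-ckey-svc-headless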

My expectation was that when I send 6 requests to Envoy, it would evenly distribute one request to each of the pods. But it is not distributing evenly; some pods do not get any requests at all.

[testvm ~]$ kubectl exec -it test-sd-ckey-0 -n testns -- bash -c "wc -l /var/DebugTrace/envoy/test_polling_service.log"
3 /var/DebugTrace/envoy/test_polling_service.log
[testvm ~]$ kubectl exec -it test-sd-ckey-1 -n testns -- bash -c "wc -l /var/DebugTrace/envoy/test_polling_service.log"
0 /var/DebugTrace/envoy/test_polling_service.log
[testvm ~]$ kubectl exec -it test-sd-ckey-2 -n testns -- bash -c "wc -l /var/DebugTrace/envoy/test_polling_service.log"
2 /var/DebugTrace/envoy/test_polling_service.log
[testvm ~]$ kubectl exec -it test-sd-ckey-3 -n testns -- bash -c "wc -l /var/DebugTrace/envoy/test_polling_service.log"
0 /var/DebugTrace/envoy/test_polling_service.log
[testvm ~]$ kubectl exec -it test-sd-ckey-4 -n testns -- bash -c "wc -l /var/DebugTrace/envoy/test_polling_service.log"
1 /var/DebugTrace/envoy/test_polling_service.log
[testvm ~]$ kubectl exec -it test-sd-ckey-5 -n testns -- bash -c "wc -l /var/DebugTrace/envoy/test_polling_service.log"
0 /var/DebugTrace/envoy/test_polling_service.log

Is there anything wrong in my configuration? Can anyone please help?

I tried setting max_requests_per_connection = 1 in the cluster section, but that didn't help either.
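A minimal sketch of that attempt, assuming the cluster-level field was used (newer Envoy versions expose the same setting under common_http_protocol_options inside typed_extension_protocol_options):

  - name: testserverpolling
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    # Deprecated cluster-level placement (assumed to be what was tried):
    max_requests_per_connection: 1
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        # Non-deprecated equivalent placement:
        common_http_protocol_options:
          max_requests_per_connection: 1
        explicit_http_config:
          http_protocol_options: {}

Only one of the two placements would normally be used; both are shown here just to illustrate where the field lives.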

@rookrunner rookrunner added the triage Issue requires triage label Nov 26, 2024
@RyanTheOptimist RyanTheOptimist added bug area/load balancing and removed triage Issue requires triage labels Nov 26, 2024
@RyanTheOptimist
Contributor

@wbpcode @tonya11en @nezdolik

@tonya11en
Member

Are there log files showing the behavior?

@wbpcode
Member

wbpcode commented Nov 27, 2024

How much concurrency are you using? Note that round robin is a thread-local load balancer; it only ensures that requests on the same thread are distributed to the hosts in turn.
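A quick way to observe this is to run Envoy with a single worker thread, so that every request shares one round-robin sequence (a sketch only, reusing the flags shown elsewhere in this thread):

/usr/local/bin/envoy --config-path /tmp/envoy.yaml --base-id 0 --concurrency 1 --drain-time-s 30 --drain-strategy immediate --parent-shutdown-time-s 40 --restart-epoch 1

With --concurrency 8, each of the 8 workers keeps its own round-robin position, so 6 requests spread across different workers will not necessarily land on 6 different hosts.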

@rookrunner
Author

@wbpcode I am using a concurrency of 8.
I referred to this: https://www.envoyproxy.io/docs/envoy/latest/faq/load_balancing/concurrency_lb

/Envoy $ ps -aef | grep envoy

1001050+ 436 293 0 Nov26 pts/0 00:04:13 /usr/local/bin/envoy --config-path /tmp/envoy_bkp.yaml --base-id 0 --concurrency 8 --drain-time-s 30 --drain-strategy immediate --parent-shutdown-time-s 40 --restart-epoch 1
1001050+ 23935 23808 0 03:28 pts/2 00:00:00 grep envoy

@rookrunner
Author

@tonya11en

I tried running Envoy with debug logging and got this log.
/usr/local/bin/envoy --config-path /tmp/envoy.yaml --base-id 0 --concurrency 8 --drain-time-s 30 --log-level debug --drain-strategy immediate --parent-shutdown-time-s 40 --restart-epoch 1
It does not have much info, though.

[2024-11-27 03:52:00.102][24477][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:302] dns resolution for test-ckey-svc-headless.testns.svc.cluster.local. completed with status 0
[2024-11-27 03:52:00.102][24477][debug][upstream] [source/common/upstream/upstream_impl.cc:469] transport socket match, socket default selected for host with address 192.168.46.68:8002
[2024-11-27 03:52:00.102][24477][debug][upstream] [source/common/upstream/upstream_impl.cc:469] transport socket match, socket default selected for host with address 192.168.61.84:8002
[2024-11-27 03:52:00.102][24477][debug][upstream] [source/common/upstream/upstream_impl.cc:469] transport socket match, socket default selected for host with address 192.168.27.220:8002
[2024-11-27 03:52:00.102][24477][debug][upstream] [source/common/upstream/upstream_impl.cc:469] transport socket match, socket default selected for host with address 192.168.28.217:8002
[2024-11-27 03:52:00.102][24477][debug][upstream] [source/common/upstream/upstream_impl.cc:469] transport socket match, socket default selected for host with address 192.168.45.147:8002
[2024-11-27 03:52:00.102][24477][debug][upstream] [source/common/upstream/upstream_impl.cc:469] transport socket match, socket default selected for host with address 192.168.40.191:8002
[2024-11-27 03:52:00.102][24477][debug][upstream] [source/extensions/clusters/strict_dns/strict_dns_cluster.cc:193] DNS refresh rate reset for test-ckey-svc-headless.testns.svc.cluster.local.,, refresh rate 5000 ms

This is how I am sending requests to port 8129 exposed by Envoy:
for i in $(seq 1 6); do echo "Attempt #$i"; curl -vvv http://127.0.0.1:8129/ack_route; echo; done
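Another way to see where Envoy itself routed the requests is the per-host counters on the admin interface (a sketch; no admin section is shown in the config above, so port 9901 is only an assumed default):

curl -s http://127.0.0.1:9901/clusters | grep 'testserverpolling::.*::rq_total'

Each matching line is one resolved pod IP with its rq_total counter, so the counts should sum to the 6 requests from the loop above.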
