Hi,
I have been experimenting with the Consul Helm chart on a minikube cluster to build a POC in which a Nomad cluster connects to a Consul server running on Kubernetes (Nomad itself is not running on Kubernetes).
values.yaml:
global:
  enabled: true
  name: consul
  datacenter: dc1
  tls:
    enabled: true
  acls:
    manageSystemACLs: true
server:
  enabled: true
  exposeService:
    enabled: true
    type: NodePort
  replicas: 1
ui:
  enabled: true
  service:
    type: NodePort
I exposed the Consul servers using the exposeService Helm value, which created a NodePort service in the cluster.
Then I tried to connect a Consul agent running on my host machine to the cluster as a client.
This is the config.json I used for the client:
{
  "server": false,
  "data_dir": "DATA_DIR",
  "log_level": "INFO",
  "datacenter": "dc1",
  "retry_join": ["Minikube host address"],
  "advertise_addr": "Host machine address on the minikube interface",
  "bind_addr": "0.0.0.0",
  "ports": {
    "serf_lan": 31267,
    "server": 32449
  }
}
kubectl get svc:
consul consul-expose-servers NodePort 10.100.172.33 <none> 8501:31057/TCP,8301:31267/TCP,8300:32449/TCP,8502:32528/TCP
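For reference, this is how I picked the port numbers in config.json. It is only a small Python sketch that parses the PORT(S) column from the kubectl output above. The function name parse_port_mappings is my own, and the only assumption is that 8301 and 8300 are Consul's default Serf LAN and server RPC ports.

```python
def parse_port_mappings(ports_column: str) -> dict:
    """Map each service port to its NodePort from a kubectl PORT(S) string."""
    mappings = {}
    for entry in ports_column.split(","):
        # Each entry looks like "8301:31267/TCP".
        pair, _, _protocol = entry.strip().partition("/")
        service_port, _, node_port = pair.partition(":")
        mappings[int(service_port)] = int(node_port)
    return mappings


# PORT(S) column copied from the `kubectl get svc` output above.
PORTS = "8501:31057/TCP,8301:31267/TCP,8300:32449/TCP,8502:32528/TCP"

node_ports = parse_port_mappings(PORTS)
serf_lan_node_port = node_ports[8301]    # Serf LAN gossip (Consul default 8301)
server_rpc_node_port = node_ports[8300]  # server RPC (Consul default 8300)
print(serf_lan_node_port, server_rpc_node_port)  # prints "31267 32449"
```

These are the two values I put under "ports" in the client config.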
This is the output that I’m getting when I run the agent:
==> Starting Consul agent...
Version: '1.14.4'
Build Date: '2023-01-26 15:47:10 +0000 UTC'
Node ID: '9e5b8521-0444-9863-f669-2d8f231c0502'
Node name: 'DESKTOP'
Datacenter: 'dc1' (Segment: '')
Server: false (Bootstrap: false)
Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: -1, gRPC-TLS: -1, DNS: 8600)
Cluster Addr: 192.168.56.1 (LAN: 31267, WAN: 8302)
Gossip Encryption: false
Auto-Encrypt-TLS: false
HTTPS TLS: Verify Incoming: false, Verify Outgoing: false, Min Version: TLSv1_2
gRPC TLS: Verify Incoming: false, Min Version: TLSv1_2
Internal RPC TLS: Verify Incoming: false, Verify Outgoing: false (Verify Hostname: false), Min Version: TLSv1_2
==> Log data will now stream in as it occurs:
2023-01-27T23:34:46.713+0200 [INFO] agent.client.serf.lan: serf: EventMemberJoin: DESKTOP 192.168.56.1
2023-01-27T23:34:46.713+0200 [INFO] agent.router: Initializing LAN area manager
2023-01-27T23:34:46.718+0200 [INFO] agent: Started DNS server: address=127.0.0.1:8600 network=udp
2023-01-27T23:34:46.718+0200 [INFO] agent: Started DNS server: address=127.0.0.1:8600 network=tcp
2023-01-27T23:34:46.718+0200 [INFO] agent: Starting server: address=127.0.0.1:8500 network=tcp protocol=http
2023-01-27T23:34:46.718+0200 [INFO] agent: started state syncer
2023-01-27T23:34:46.718+0200 [INFO] agent: Consul agent running!
2023-01-27T23:34:46.718+0200 [INFO] agent: Retry join is supported for the following discovery methods: cluster=LAN discovery_methods="aliyun aws azure digitalocean gce hcp k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere"
2023-01-27T23:34:46.719+0200 [INFO] agent: Joining cluster...: cluster=LAN
2023-01-27T23:34:46.719+0200 [INFO] agent: (LAN) joining: lan_addresses=["192.168.59.106"]
2023-01-27T23:34:46.718+0200 [WARN] agent.router.manager: No servers available
2023-01-27T23:34:46.719+0200 [ERROR] agent.anti_entropy: failed to sync remote state: error="No known Consul servers"
2023-01-27T23:34:46.721+0200 [INFO] agent.client.serf.lan: serf: EventMemberJoin: consul-server-0 172.17.0.6
2023-01-27T23:34:46.724+0200 [INFO] agent: (LAN) joined: number_of_nodes=1
2023-01-27T23:34:46.724+0200 [INFO] agent.client: adding server: server="consul-server-0 (Addr: tcp/172.17.0.6:8300) (DC: dc1)"
2023-01-27T23:34:46.724+0200 [INFO] agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=1
2023-01-27T23:34:48.714+0200 [INFO] agent.client.memberlist.lan: memberlist: Suspect consul-server-0 has failed, no acks received
2023-01-27T23:34:51.716+0200 [INFO] agent.client.memberlist.lan: memberlist: Suspect consul-server-0 has failed, no acks received
2023-01-27T23:34:52.590+0200 [WARN] agent: Check is now critical: check=_nomad-check-0687652acd1317d58ee6cc066752a52ca9970ff2
2023-01-27T23:34:52.715+0200 [INFO] agent.client.memberlist.lan: memberlist: Marking consul-server-0 as failed, suspect timeout reached (0 peer confirmations)
2023-01-27T23:34:52.716+0200 [INFO] agent.client.serf.lan: serf: EventMemberFailed: consul-server-0 172.17.0.6
2023-01-27T23:34:52.716+0200 [INFO] agent.client: removing server: server="consul-server-0 (Addr: tcp/172.17.0.6:8300) (DC: dc1)"
2023-01-27T23:34:52.716+0200 [INFO] agent.router.manager: shutting down
2023-01-27T23:34:52.716+0200 [WARN] agent: [core][Channel #1 SubChannel #3] grpc: addrConn.createTransport failed to connect to {
"Addr": "dc1-172.17.0.6:8300",
"ServerName": "consul-server-0",
"Attributes": null,
"BalancerAttributes": null,
"Type": 0,
"Metadata": null
}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->172.17.0.6:8300: operation was canceled"
2023-01-27T23:34:52.845+0200 [WARN] agent: Check socket connection failed: check=_nomad-check-f2e9612ea567df3dbc29b35be3b594c8677d2285 error="dial tcp 0.0.0.0:4647: connectex: No connection could be made because the target machine actively refused it."
2023-01-27T23:34:52.845+0200 [WARN] agent: Check is now critical: check=_nomad-check-f2e9612ea567df3dbc29b35be3b594c8677d2285
2023-01-27T23:34:54.848+0200 [WARN] agent: Check socket connection failed: check=_nomad-check-c6cb6923aae3381410b855a6ff74be6e48515a94 error="dial tcp 0.0.0.0:4648: connectex: No connection could be made because the target machine actively refused it."
2023-01-27T23:34:54.848+0200 [WARN] agent: Check is now critical: check=_nomad-check-c6cb6923aae3381410b855a6ff74be6e48515a94
2023-01-27T23:34:55.714+0200 [INFO] agent.client.memberlist.lan: memberlist: Suspect consul-server-0 has failed, no acks received
2023-01-27T23:34:57.369+0200 [INFO] agent.client.serf.lan: serf: EventMemberJoin: consul-server-0 172.17.0.6
2023-01-27T23:34:57.369+0200 [INFO] agent.client: adding server: server="consul-server-0 (Addr: tcp/172.17.0.6:8300) (DC: dc1)"
2023-01-27T23:34:58.203+0200 [WARN] agent: Check is now critical: check=_nomad-check-d14931138a79b2e5023d96974631b2f4abd0e9f8
2023-01-27T23:34:58.648+0200 [ERROR] agent.client: RPC failed to server: method=Catalog.NodeServiceList server=172.17.0.6:8300 error="rpc error getting client: failed to get conn: dial tcp <nil>->172.17.0.6:8300: i/o timeout"
2023-01-27T23:34:58.648+0200 [ERROR] agent.anti_entropy: failed to sync remote state: error="rpc error getting client: failed to get conn: dial tcp <nil>->172.17.0.6:8300: i/o timeout"
2023-01-27T23:35:01.713+0200 [INFO] agent.client.memberlist.lan: memberlist: Suspect consul-server-0 has failed, no acks received
2023-01-27T23:35:04.597+0200 [WARN] agent: Check is now critical: check=_nomad-check-0687652acd1317d58ee6cc066752a52ca9970ff2
2023-01-27T23:35:04.849+0200 [WARN] agent: Check socket connection failed: check=_nomad-check-f2e9612ea567df3dbc29b35be3b594c8677d2285 error="dial tcp 0.0.0.0:4647: connectex: No connection could be made because the target machine actively refused it."
2023-01-27T23:35:04.849+0200 [WARN] agent: Check is now critical: check=_nomad-check-f2e9612ea567df3dbc29b35be3b594c8677d2285
2023-01-27T23:35:05.714+0200 [INFO] agent.client.memberlist.lan: memberlist: Marking consul-server-0 as failed, suspect timeout reached (0 peer confirmations)
2023-01-27T23:35:05.714+0200 [INFO] agent.client.serf.lan: serf: EventMemberFailed: consul-server-0 172.17.0.6
2023-01-27T23:35:05.714+0200 [INFO] agent.client: removing server: server="consul-server-0 (Addr: tcp/172.17.0.6:8300) (DC: dc1)"
2023-01-27T23:35:05.714+0200 [INFO] agent.router.manager: shutting down
2023-01-27T23:35:05.715+0200 [WARN] agent: [core][Channel #1 SubChannel #4] grpc: addrConn.createTransport failed to connect to {
"Addr": "dc1-172.17.0.6:8300",
"ServerName": "consul-server-0",
"Attributes": null,
"BalancerAttributes": null,
"Type": 0,
"Metadata": null
}. Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->172.17.0.6:8300: operation was canceled"
2023-01-27T23:35:06.853+0200 [WARN] agent: Check socket connection failed: check=_nomad-check-c6cb6923aae3381410b855a6ff74be6e48515a94 error="dial tcp 0.0.0.0:4648: connectex: No connection could be made because the target machine actively refused it."
2023-01-27T23:35:06.854+0200 [WARN] agent: Check is now critical: check=_nomad-check-c6cb6923aae3381410b855a6ff74be6e48515a94
2023-01-27T23:35:07.611+0200 [INFO] agent: Caught: signal=interrupt
2023-01-27T23:35:07.611+0200 [INFO] agent: Gracefully shutting down agent...
2023-01-27T23:35:07.611+0200 [INFO] agent.client: client starting leave
2023-01-27T23:35:07.612+0200 [INFO] agent.client.serf.lan: serf: EventMemberLeave: DESKTOP 192.168.56.1
2023-01-27T23:35:07.715+0200 [INFO] agent.client.memberlist.lan: memberlist: Suspect consul-server-0 has failed, no acks received
2023-01-27T23:35:10.206+0200 [WARN] agent: Check is now critical: check=_nomad-check-d14931138a79b2e5023d96974631b2f4abd0e9f8
2023-01-27T23:35:10.612+0200 [INFO] agent: Graceful exit completed
2023-01-27T23:35:10.612+0200 [INFO] agent: Requesting shutdown
2023-01-27T23:35:10.612+0200 [INFO] agent.client: shutting down client
2023-01-27T23:35:10.617+0200 [INFO] agent: consul client down
2023-01-27T23:35:10.617+0200 [INFO] agent: shutdown complete
2023-01-27T23:35:10.617+0200 [INFO] agent: Stopping server: protocol=DNS address=127.0.0.1:8600 network=tcp
2023-01-27T23:35:10.617+0200 [INFO] agent: Stopping server: protocol=DNS address=127.0.0.1:8600 network=udp
2023-01-27T23:35:10.617+0200 [INFO] agent: Stopping server: address=127.0.0.1:8500 network=tcp protocol=http
2023-01-27T23:35:10.618+0200 [INFO] agent: Waiting for endpoints to shut down
2023-01-27T23:35:10.618+0200 [INFO] agent: Endpoints down
2023-01-27T23:35:10.618+0200 [INFO] agent: Exit code: code=0
The client successfully joins the server, but it then tries to reach the Consul server's internal pod IP (172.17.0.6) on the RPC port (8300), and errors like these start to occur:
Err: connection error: desc = "transport: Error while dialing dial tcp <nil>->172.17.0.6:8300: operation was canceled"
2023-01-27T23:34:58.648+0200 [ERROR] agent.anti_entropy: failed to sync remote state: error="rpc error getting client: failed to get conn: dial tcp <nil>->172.17.0.6:8300: i/o timeout"
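To make the mismatch explicit, here is a rough Python sketch that pulls the join address and the server address learned through gossip out of two lines copied from the log above. The regular expressions and the helper name joined_and_advertised are my own and only match the log format shown in this post.

```python
import re

# Two lines copied verbatim from the agent log above.
LOG = '''\
2023-01-27T23:34:46.719+0200 [INFO] agent: (LAN) joining: lan_addresses=["192.168.59.106"]
2023-01-27T23:34:46.724+0200 [INFO] agent.client: adding server: server="consul-server-0 (Addr: tcp/172.17.0.6:8300) (DC: dc1)"
'''

JOIN_RE = re.compile(r'lan_addresses=\["([^"]+)"\]')
SERVER_RE = re.compile(r'adding server: server="(\S+) \(Addr: tcp/([\d.]+):(\d+)\)')


def joined_and_advertised(log: str):
    """Return (join_address, [(server_name, ip, port), ...]) found in the log."""
    join_match = JOIN_RE.search(log)
    join_address = join_match.group(1) if join_match else None
    servers = [(name, ip, int(port)) for name, ip, port in SERVER_RE.findall(log)]
    return join_address, servers


join_address, servers = joined_and_advertised(LOG)
for name, ip, port in servers:
    if ip != join_address:
        # The server is known by an address other than the one the client joined on.
        print(f"{name} advertises {ip}:{port}, but the join address was {join_address}")
```

So the join itself goes through the minikube address, but every later RPC is sent to the pod IP and default port that the server advertises.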
How can I make the server advertise the IP of the node it is running on instead of the pod IP?
Is this topology even possible? If not, what alternatives can I use to achieve this?
P.S. I don’t plan on having network access to all pods.