Consul RPC Communication error

I have a Consul cluster of 3 servers running on AWS in US-East. I also have a Consul agent running in the same VPC as the servers that is able to communicate with the cluster. However, when I try to set up a Windows Consul agent outside of the VPC, it is unable to communicate with the cluster and I get the following errors:

2024-02-06T17:25:41.986Z [ERROR] agent.client: RPC failed to server: method=Catalog.NodeServiceList server=10.1.129.176:8300 error="rpc error getting client: failed to get conn: dial tcp 172.31.79.106:0->10.1.129.176:8300: i/o timeout"
2024-02-06T17:25:41.986Z [ERROR] agent.anti_entropy: failed to sync remote state: error="rpc error getting client: failed to get conn: dial tcp 172.31.79.106:0->10.1.129.176:8300: i/o timeout"
2024-02-06T17:25:41.986Z [ERROR] agent.client: RPC failed to server: method=ConnectCA.Roots server=10.1.129.176:8300 error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
2024-02-06T17:25:51.987Z [ERROR] agent.client: RPC failed to server: method=ConnectCA.Roots server=10.1.131.14:8300 error="rpc error getting client: failed to get conn: dial tcp 172.31.79.106:0->10.1.131.14:8300: i/o timeout"
2024-02-06T17:25:56.437Z [ERROR] agent.client: RPC failed to server: method=Catalog.NodeServiceList server=10.1.127.110:8300 error="rpc error getting client: failed to get conn: dial tcp 172.31.79.106:0->10.1.127.110:8300: i/o timeout"
2024-02-06T17:25:56.437Z [ERROR] agent.client: RPC failed to server: method=ConnectCA.Roots server=10.1.127.110:8300 error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
2024-02-06T17:25:56.437Z [ERROR] agent.anti_entropy: failed to sync remote state: error="rpc error getting client: failed to get conn: dial tcp 172.31.79.106:0->10.1.127.110:8300: i/o timeout"
2024-02-06T17:26:06.438Z [ERROR] agent.client: RPC failed to server: method=ConnectCA.Roots server=10.1.129.176:8300 error="rpc error getting client: failed to get conn: dial tcp 172.31.79.106:0->10.1.129.176:8300: i/o timeout"
2024-02-06T17:26:06.438Z [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=10.1.129.176:8300 error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
2024-02-06T17:26:06.438Z [ERROR] agent: Coordinate update error: error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"

The IP addresses in the 10.1.0.0/16 CIDR block belong to the Consul servers in the cluster. I am not sure why the Consul agent is using the servers' private IP addresses for communication when the retry_join parameter in its config provides the public IP address of the server. I have also verified that ports 8300-8302 allow inbound TCP/UDP traffic on both the Consul servers and the agents. Any help or guidance would be appreciated.

Hi @mefqpq193,

Consul agents require every other agent's advertised address to be routable. In your case, the server agents are advertising their private IP addresses; as a result, all the other nodes will use those IPs for communication.

There is an option to make the agents advertise a different IP (in your case, the public IP), but that would mean the agents inside the same VPC also have to use the public IPs to form the cluster.

ref: Agents - CLI Reference | Consul | HashiCorp Developer
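
For example, a minimal sketch of that option on the server side (with <private_ip> and <public_ip> as placeholders for your actual addresses):

bind_addr      = "<private_ip>"   # interface Consul binds to inside the VPC
advertise_addr = "<public_ip>"    # address other agents will dial for RPC/Serf

With advertise_addr set to a routable address, the other agents dial that address on ports 8300-8302 instead of the private one.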

In general, Consul expects the clusters to have flat/routable networks.

However, there are Enterprise features that would make these kinds of topologies easier.
ref: Network Segments Overview | Consul | HashiCorp Developer

I have tried changing the address as recommended, but to no avail. All the required ports are also open on both the server and the agent. Here is the server config file, followed by the agent config file:

datacenter          = "US"
server              = true
bootstrap_expect    = 3
data_dir            = "/opt/consul/data"
bind_addr           = "<private_ip>"
advertise_addr      = "<private_ip>"
advertise_addr_wan  = "<public_ip>"
client_addr         = "0.0.0.0"
log_level           = "INFO"
ui_config {
  enabled = true
  content_path = "/ui/"
}

auto_encrypt {
  allow_tls = true
}

tls {
  defaults {
    ca_file         = "/etc/consul.d/consul-agent-ca.pem"
    cert_file       = "/etc/consul.d/dc1-server-consul-0.pem"
    key_file        = "/etc/consul.d/dc1-server-consul-0-key.pem"
    verify_incoming = true
    verify_outgoing = true
  }
}
# AWS cloud join
retry_join          = ["provider=aws tag_key=Environment-Name tag_value=us-east-1-consul service=ec2 addr_type=private_v4 region=us-east-1"]
retry_join_wan      = ["provider=aws tag_key=Environment-Name tag_value=us-east-1-consul service=ec2 addr_type=public_v4 region=us-east-1"]

# Max connections for the HTTP API
limits {
  http_max_conns_per_client = 128
}
performance {
    raft_multiplier = 1
}

acl {
  enabled        = true
  default_policy = "allow"
  enable_token_persistence = true
  tokens {
    initial_management = "<key>"
  }
}

encrypt = "<key>"

And the agent config file:

datacenter  = "US"
data_dir    = "/opt/consul/data"
client_addr = "127.0.0.1"
node_name   = "test-node"
log_level   = "INFO"
ui_config {
  enabled = false
}
server         = false
bind_addr      = "0.0.0.0" # Listen on all IPv4
advertise_addr = "<public_ip>"
encrypt        = "<key>"
ports {
  serf_lan = 8301
  serf_wan = 8302
}

auto_encrypt {
  tls = true
}

tls {
  defaults {
    verify_incoming = true
    verify_outgoing = true
    ca_file         = "/etc/consul.d/consul-agent-ca.pem"
  }

  internal_rpc {
    verify_server_hostname = true
  }
}

retry_join = ["<consul_server_ip>"]

leave_on_terminate         = true
rejoin_after_leave         = true
enable_local_script_checks = true

domain = "<domain>"

Hi @mefqpq193,

I see you have set the <public_ip> as advertise_addr_wan, which only applies to WAN federation between datacenters. You need to set it as the advertise_addr instead.

Please note that if the servers can only talk to the clients over their public IPs, you must do the same for the clients, so that every agent in the cluster can reach every other agent at its advertised IP. The downside is that your cross-node traffic will then always go over the public internet, which may not be what you intend.
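
Concretely, that would look something like the following (the placeholder names are only for illustration; substitute your real addresses):

# server
bind_addr      = "<private_ip>"
advertise_addr = "<server_public_ip>"

# outside client
advertise_addr = "<client_public_ip>"
retry_join     = ["<server_public_ip>"]

This is only a sketch of the public-IP approach: every agent advertises an address the others can reach, but all agent-to-agent traffic then flows over those public addresses.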

I would recommend exploring options that give all of your agents IP addresses that are routable from every subnet.