Service Discovery issues

Hello,

I hope you can help, or at least point me in the right direction. I have four host machines (Docker containers) and I am running some tests against this environment, but my Consul behaves strangely.

  • consul.hello.com host: consul agent server
  • user-service-1.hello.com: consul agent client + REST API + service discovery + health check
  • user-service-2.hello.com: identical to user-service-1, same docker container
  • hello-service-1.hello.com: consul agent client + REST API that calls user-service

Everything works fine: when I start my Docker stack, the REST services register themselves properly:
[Screenshot: Consul UI, services summary]

user-service-1 details:

When hello-service asks for the URL of the user-service, Consul returns two URLs: user-service-1.hello.com and user-service-2.hello.com.

Then I stopped the user-service-1.hello.com container completely, and Consul recognized this event properly:

BUT when hello-service asks for the URL of the user-service, Consul still returns user-service-1.hello.com, even though this “machine” has been stopped completely. Not just the REST service on this host, but the host itself has been “turned off”:

user-service-1: no answer:

[root@consul.hello.com]# wget -q -S -O - https://user-service-1.hello.com:8443/actuator/health
^C

user-service-2: OK:

[root@consul.hello.com]# wget -q -S -O - https://user-service-2.hello.com:8443/actuator/health
  HTTP/1.1 200 
  Content-Type: application/vnd.spring-boot.actuator.v3+json
  Transfer-Encoding: chunked
  Date: Tue, 12 Mar 2024 20:22:40 GMT
  Connection: close
{"status":"UP"}

The REST services are Java Spring-Boot applications.

I have two questions:

  1. As I see from Java, when hello-service asks Consul for the user-service endpoint URL, it returns two URLs and my Java app needs to pick one of them at random (some kind of round-robin algorithm). Can I configure Consul to return only one URL based on a random selection?

  2. Why does Consul not recognize that user-service-1 is dead? Consul only recognizes that the Consul agent client is not running there, but it still offers the dead URL in the response. Why?

Hi @zappee,

When the agent is down, the service instances registered against those agents won’t be included in the DNS response. Are you sure you are querying the Consul DNS?

Could you share the output of the following command from one of your Consul nodes?

dig @localhost -p 8600 user-service.service.consul

The above command assumes the following, and you should make the necessary changes if you have customised them:

  • The Consul DNS port is 8600
  • The Consul domain is consul
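
For anyone who prefers to run the same lookup from Java rather than dig, here is a minimal sketch using the JDK's built-in JNDI DNS provider. The class and method names (ConsulDnsLookup, serviceName, resolve) are made up for illustration, and the host, port, and domain are the same assumptions as above (localhost, 8600, consul). If the agent's domain setting has been customised, pass that value instead.

```java
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

// Hypothetical helper, not part of Consul or Spring.
class ConsulDnsLookup {

    /** Builds the service lookup name, e.g. user-service.service.consul. */
    static String serviceName(String service, String consulDomain) {
        return service + ".service." + consulDomain;
    }

    /** Queries the given DNS server for the A records of a service. */
    static List<String> resolve(String service, String consulDomain, String dnsHost, int dnsPort)
            throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.dns.DnsContextFactory");
        // Assumption: the Consul DNS interface listens on dnsHost:dnsPort (8600 by default).
        env.put(Context.PROVIDER_URL, "dns://" + dnsHost + ":" + dnsPort);
        DirContext ctx = new InitialDirContext(env);
        try {
            Attributes attrs = ctx.getAttributes(serviceName(service, consulDomain), new String[] {"A"});
            List<String> addresses = new ArrayList<>();
            Attribute aRecords = attrs.get("A");
            if (aRecords != null) {
                NamingEnumeration<?> values = aRecords.getAll();
                while (values.hasMore()) {
                    addresses.add(String.valueOf(values.next()));
                }
            }
            return addresses;
        } finally {
            ctx.close();
        }
    }
}
```

Calling ConsulDnsLookup.resolve("user-service", "consul", "localhost", 8600) on a Consul node should list the same addresses that the dig command prints in its answer section.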

I logged in to the Consul agent server and ran the commands. Result:

(1)

[root@consul.hello.com]# dig @localhost -p 8600 consul.hello.com

; <<>> DiG 9.18.24 <<>> @localhost -p 8600 consul.hello.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 11861
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;consul.hello.com.		IN	A

;; AUTHORITY SECTION:
hello.com.		0	IN	SOA	ns.hello.com. hostmaster.hello.com. 1710290864 3600 600 86400 0

;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(localhost) (UDP)
;; WHEN: Wed Mar 13 00:47:44 UTC 2024
;; MSG SIZE  rcvd: 95

(2)

[root@consul.hello.com]# dig @localhost -p 8600 user-service-2.hello.com

; <<>> DiG 9.18.24 <<>> @localhost -p 8600 user-service-2.hello.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 49274
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;user-service-2.hello.com.	IN	A

;; AUTHORITY SECTION:
hello.com.		0	IN	SOA	ns.hello.com. hostmaster.hello.com. 1710290970 3600 600 86400 0

;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(localhost) (UDP)
;; WHEN: Wed Mar 13 00:49:30 UTC 2024
;; MSG SIZE  rcvd: 103

Sorry, but I do not clearly understand your question regarding DNS. I use Docker containers, and the communication between the containers (hosts) goes through the Docker network, inside Docker.

Hi @zappee,

You should read more about the Service Discovery aspect of Consul. Check out the following documentation that would help you understand this better:

The TLDR is that if you want to talk only to healthy instances of your service, you should query those services against Consul (in this case, Consul DNS). Consul will remove unhealthy instances from the responses, thereby ensuring that the client application only gets healthy instances in the DNS lookup.

This would require you to make your container talk to Consul DNS. This doc talks about VM hosts; you will have to find out how to do it with your current setup.

ref: https://developer.hashicorp.com/consul/tutorials/networking/dns-forwarding

I hope this helps!

@zappee It sounds like you need to configure Spring Boot to only query services that have health checks in the passing state. (e.g., /v1/health/service/:name?passing)

I believe you can do this by setting spring.cloud.consul.discovery.query-passing = true.

See Spring Cloud Consul’s Common application properties for a list of supported configuration parameters.
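
To see what that endpoint returns outside of Spring, here is a rough sketch that calls the health API directly with the JDK HTTP client. The class and method names are hypothetical, and the base URL http://localhost:8500 is an assumption (the default HTTP port of a local agent); adjust it if your agent listens elsewhere or requires TLS.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of Spring Cloud Consul.
class ConsulHealthQuery {

    /** Builds the URI of the health endpoint, filtered to instances whose checks are passing. */
    static URI passingInstancesUri(String consulBaseUrl, String serviceName) {
        return URI.create(consulBaseUrl + "/v1/health/service/" + serviceName + "?passing");
    }

    /** Returns the raw JSON body listing the passing instances of the service. */
    static String fetchPassingInstances(String consulBaseUrl, String serviceName)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(passingInstancesUri(consulBaseUrl, serviceName))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

With user-service-1 stopped, ConsulHealthQuery.fetchPassingInstances("http://localhost:8500", "user-service") should only list user-service-2, whereas the same request without ?passing would still include the critical instance.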

Hi all,

@Ranjandas
Thanks for sharing the docs with me. I went through them 2-3 times and now I understand better how DNS and the related Consul “stuff” work.

But it is still not clear how this will solve my issue. I use Java to implement the REST services, which means there is an extra layer between Consul and my code: Spring Boot. So I am not calling the Consul endpoints directly. The DNS doc mentions this:

The Consul DNS is the primary interface for discovering services registered in the Consul catalog. The DNS enables you to look up services and nodes registered with Consul using terminal commands instead of making HTTP API requests to Consul.

This is useful for understanding how it works and helps me test things, but my “production” code uses Java Spring.

What I am not sure about is this: I have read in the docs that HashiCorp recommends using hostnames that end with .consul, because this FQDN suffix does some magic. So in my case, should my hostnames look like this?

  • user-service-1.hello.com.consul
  • user-service-2.hello.com.consul
  • hello-service-1.hello.com.consul

I hope that this is not mandatory because I would like to keep using my original FQDN naming convention which is <xxxx>.hello.com.

I saw this .consul hostname suffix earlier, when I configured the Consul agent server and clients and had to turn off verify_server_hostname under tls. The relevant agent server settings I use:

{
  "datacenter": "consul",
  "node_name": "${CONSUL_NODE_NAME}",
  "server_name": "${FQDN}",
  "domain": "${DOMAIN}",
  "tls": {
    "defaults": {
      "key_file": "${KEYSTORE_HOME}/${FQDN}.pem",
      "cert_file": "${KEYSTORE_HOME}/${FQDN}.crt",
      "ca_file": "${KEYSTORE_HOME}/ca.crt",
      "verify_incoming": false,
      "verify_outgoing": true,
      "verify_server_hostname": false
    }
  }
}

where:

[DEBUG]             node name: "node-hello-service-1.hello.com"
[DEBUG]    consul server host: "consul.hello.com"
[DEBUG]         keystore home: "/tmp"
[DEBUG]                  fqdn: "hello-service-1.hello.com"
[DEBUG]                domain: "hello.com"

But I am still not sure how the .consul DNS suffix works; I just turned off the verify_server_hostname flag.

@blake
Thanks for the help. I am going to test spring.cloud.consul.discovery.query-passing = true and let you know the result.

Hi @zappee,

I overlooked the fact that you were using spring-boot. The recommendation that @blake shared should work in that case.

Thanks to @blake, I was able to solve the 1st issue.

Unfortunately, I am not able to solve the 2nd issue, which is related to the round-robin DNS response. Consul always returns two endpoint URLs in the same order. I have checked many different docs, and the best I have found is this: limit the size of the result

I set a_record_limit to 1, but as far as I can see, this setting has no effect in my case. I still get back two URLs in the response, not one:

"dns_config": {
   "a_record_limit": 1
}

I also tried turning off Java DNS caching, but it did not help.

What else can I try to make the round-robin work properly?

I use this configuration in the Consul server and client agents:

  "dns_config": {
    "service_ttl": {
      "*": "0s"
    },
    "node_ttl": "0s",
    "a_record_limit": 1,
    "allow_stale": false,
    "recursor_strategy": "random",
    "udp_answer_limit": 1,
    "use_cache": false
  }

This is the spring config I use everywhere:

# hashicorp consul
spring.config.import=consul:
spring.cloud.consul.host=localhost
spring.cloud.consul.port=8500
spring.cloud.consul.discovery.instanceId=${spring.application.name}
spring.cloud.consul.discovery.tags=${spring.application.version}
spring.cloud.consul.discovery.scheme=https
spring.cloud.consul.discovery.health-check-interval=2s
spring.cloud.consul.discovery.health-check-timeout=2s
spring.cloud.consul.discovery.health-check-critical-timeout=10s
spring.cloud.consul.discovery.query-passing=true

Java code I use to get the endpoint URI:

private Optional<URI> getServiceUrl(String serviceId) {
    // Look up the requested service (the name was previously hardcoded to "user-service").
    List<ServiceInstance> serviceInstances = discoveryClient.getInstances(serviceId);
    log.debug(
            "service instances: {}",
            serviceInstances.stream().map(x -> x.getUri().toString()).collect(Collectors.joining(", ")));

    // Always takes the first instance, in the order returned by the discovery client.
    return serviceInstances
            .stream()
            .findFirst()
            .map(ServiceInstance::getUri);
}

The result is always the same, in the same order:

c.r.g.t.rest.controller.HelloController  : service instances: https://user-service-1.hello.com:8443, https://user-service-2.hello.com:8443
c.r.g.t.rest.controller.HelloController  : service instances: https://user-service-1.hello.com:8443, https://user-service-2.hello.com:8443

Java uses a List to store the response from Consul discovery, and List is an ordered collection, so it keeps the order:

List<ServiceInstance> serviceInstances = discoveryClient.getInstances("user-service");

Okay, I can simulate round-robin at the Java (client) level, but this does not look very nice. The different types of “proxy” that HashiCorp implemented are more sophisticated, and it would be great if I could use them. HashiCorp invested time to implement them properly, so I would like to use them.

List<ServiceInstance> serviceInstances = discoveryClient.getInstances("user-service");
// nextInt(0) throws IllegalArgumentException, so handle an empty list explicitly.
Optional<URI> uri = serviceInstances.isEmpty()
        ? Optional.empty()
        : Optional.of(serviceInstances.get(ThreadLocalRandom.current().nextInt(serviceInstances.size())).getUri());
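
The snippet above picks an instance at random, which is not strictly round-robin. A minimal sketch of a true client-side round-robin, using only the JDK, could look like the following. RoundRobinSelector is a made-up helper name, not a Spring or Consul class, and it assumes the discovery client keeps returning the instances in a stable order, as the log output above suggests. As far as I understand, the dns_config options (a_record_limit and the others) only apply to Consul's DNS interface, while discoveryClient.getInstances goes through the HTTP API, which would explain why those settings make no difference here.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: cycles through whatever list of candidates it is given.
class RoundRobinSelector<T> {

    // Shared counter, so concurrent callers still advance through the list.
    private final AtomicInteger counter = new AtomicInteger();

    /** Returns the next candidate in rotation, or empty if there are none. */
    Optional<T> next(List<T> candidates) {
        if (candidates.isEmpty()) {
            return Optional.empty();
        }
        // floorMod keeps the index non-negative even after the counter overflows.
        int index = Math.floorMod(counter.getAndIncrement(), candidates.size());
        return Optional.of(candidates.get(index));
    }
}
```

Kept as a single shared field in the controller, it would be used as selector.next(discoveryClient.getInstances("user-service")).map(ServiceInstance::getUri), so consecutive calls alternate between the two instances.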

What am I missing here?
Why does Consul not round-robin between the two available REST endpoint URLs in the response, as written in the official docs?

My docker containers are based on Alpine Linux.

Thanks a lot in advance.