Wrong Consul token sometimes used for deregistering checks/services

We’ve been using Nomad since 0.5 times with the same Consul token in the consul stanza in the nomad config.

On Dec 8th (must’ve been the last 0.x release before 1.0, because we’ve upgraded to 1.0 the next day) we’ve had the incident for the first time that Consul suddenly couldn’t deregister a service, after it was taken down:

Dec  8 17:01:06 h-0058 consul[2769]:     2020-12-08T17:01:06.618+0100 [WARN]  agent: Check deregistration blocked by ACLs: check=_nomad-check-c60a65be6daf993b18bfac2c19166ab978e7e968 accessorID=00000000-0000-0000-0000-000000000002

The token with accessorID 2 is the anonymous default token on the host. Not what is configured in nomad.hcl. It only happened to some services, not all.

Back then, I could fix it by restarting Consul (not Nomad). I don’t remember the version of Consul back then, but it must’ve been the latest one at that time since we use the Hashi PPA.

When updating from 1.0.0 to 1.0.1 and draining a different node, this happened again, but since the server was rebooted anyways, it wasn’t a problem.

Today with Nomad 1.0.1 and Consul 1.9.1 I ran into it while deploying an application with ~80 groups. 2 of them just can’t be deregistered while the rest went fine:

Dec 20 15:00:56 h-0058 consul[3641]:     2020-12-20T15:00:56.252+0100 [INFO]  agent: Synced check: check=_nomad-check-ee227a773016853048375e80bd1a416b4c8c4657
Dec 20 15:01:13 h-0058 consul[3641]:     2020-12-20T15:01:13.215+0100 [ERROR] agent.client: RPC failed to server: method=Catalog.Deregister server=10.5.10.1:8300 error="rpc error making call: rpc error making call: Permission denied"
Dec 20 15:01:13 h-0058 consul[3641]:     2020-12-20T15:01:13.289+0100 [WARN]  agent: Service deregistration blocked by ACLs: service=_nomad-task-d85ecefc-abcb-3e09-8259-15a8625a40c0-backend-open-contact-reg-proxy-backend-open-contact-http accessorID=00000000-0000-0000-0000-000000000002
Dec 20 15:01:13 h-0058 consul[3641]:     2020-12-20T15:01:13.291+0100 [ERROR] agent.client: RPC failed to server: method=Catalog.Deregister server=10.5.10.2:8300 error="rpc error making call: rpc error making call: Permission denied"
Dec 20 15:01:13 h-0058 consul[3641]:     2020-12-20T15:01:13.292+0100 [WARN]  agent: Service deregistration blocked by ACLs: service=_nomad-task-f4621866-e591-43ac-10b6-999ed8b091ce-backend-nask-reg-proxy-backend-nask-http accessorID=00000000-0000-0000-0000-000000000002
Dec 20 15:01:13 h-0058 consul[3641]:     2020-12-20T15:01:13.292+0100 [ERROR] agent.client: RPC failed to server: method=Catalog.Deregister server=10.5.10.3:8300 error="rpc error making call: Permission denied"
Dec 20 15:01:13 h-0058 consul[3641]:     2020-12-20T15:01:13.293+0100 [WARN]  agent: Check deregistration blocked by ACLs: check=_nomad-check-a139d98f3198cba5bf9623a66347bd599d886bb2 accessorID=00000000-0000-0000-0000-000000000002
Dec 20 15:01:13 h-0058 consul[3641]:     2020-12-20T15:01:13.294+0100 [ERROR] agent.client: RPC failed to server: method=Catalog.Deregister server=10.5.10.1:8300 error="rpc error making call: rpc error making call: Permission denied"
Dec 20 15:01:13 h-0058 consul[3641]:     2020-12-20T15:01:13.295+0100 [WARN]  agent: Check deregistration blocked by ACLs: check=_nomad-check-f4e4fbe704944bf340048474cef18c19f9a4208b accessorID=00000000-0000-0000-0000-000000000002
Dec 20 15:02:44 h-0058 consul[3641]:     2020-12-20T15:02:44.963+0100 [ERROR] agent.client: RPC failed to server: method=Catalog.Deregister server=10.5.10.2:8300 error="rpc error making call: rpc error making call: Permission denied"
Dec 20 15:02:44 h-0058 consul[3641]:     2020-12-20T15:02:44.965+0100 [WARN]  agent: Service deregistration blocked by ACLs: service=_nomad-task-d85ecefc-abcb-3e09-8259-15a8625a40c0-backend-open-contact-reg-proxy-backend-open-contact-http accessorID=00000000-0000-0000-0000-000000000002
Dec 20 15:02:44 h-0058 consul[3641]:     2020-12-20T15:02:44.966+0100 [ERROR] agent.client: RPC failed to server: method=Catalog.Deregister server=10.5.10.3:8300 error="rpc error making call: Permission denied"
Dec 20 15:02:44 h-0058 consul[3641]:     2020-12-20T15:02:44.966+0100 [WARN]  agent: Service deregistration blocked by ACLs: service=_nomad-task-f4621866-e591-43ac-10b6-999ed8b091ce-backend-nask-reg-proxy-backend-nask-http accessorID=00000000-0000-0000-0000-000000000002
Dec 20 15:02:44 h-0058 consul[3641]:     2020-12-20T15:02:44.968+0100 [ERROR] agent.client: RPC failed to server: method=Catalog.Deregister server=10.5.10.1:8300 error="rpc error making call: rpc error making call: Permission denied"
Dec 20 15:02:44 h-0058 consul[3641]:     2020-12-20T15:02:44.968+0100 [WARN]  agent: Check deregistration blocked by ACLs: check=_nomad-check-f4e4fbe704944bf340048474cef18c19f9a4208b accessorID=00000000-0000-0000-0000-000000000002
Dec 20 15:02:44 h-0058 consul[3641]:     2020-12-20T15:02:44.969+0100 [ERROR] agent.client: RPC failed to server: method=Catalog.Deregister server=10.5.10.2:8300 error="rpc error making call: rpc error making call: Permission denied"
Dec 20 15:02:44 h-0058 consul[3641]:     2020-12-20T15:02:44.969+0100 [WARN]  agent: Check deregistration blocked by ACLs: check=_nomad-check-a139d98f3198cba5bf9623a66347bd599d886bb2 accessorID=00000000-0000-0000-0000-000000000002

Nomad just says many times this:

Dec 20 15:00:37 h-0058 nomad[3664]:     2020-12-20T15:00:37.947+0100 [DEBUG] consul.sync: sync complete: registered_services=2 deregistered_services=0 registered_checks=1 deregistered_checks=0

and this:

Dec 20 15:00:42 h-0058 nomad[3664]:     2020-12-20T15:00:42.462+0100 [DEBUG] consul.sync: sync complete: registered_services=0 deregistered_services=2 registered_checks=0 deregistered_checks=0

No errors. And the service/group is long gone.

So again, for some reason it’s using the wrong Consul token for 2 / ~80 groups.

The fact it’s the same node is a coincidence, since all nodes are configured the same way via Ansible and the upgrade drain when updating to 1.0.1 happened on a different one.

This time, restarting Consul and Nomad didn’t help. I had to drain and reboot the node. Can’t really tell wha’s going wrong there. Is Nomad submitting the wrong token or is Consul somehow “forgetting” about it?

Hi @hynek! That’s very strange… I’m wondering if you’re running into the issue described in consul/#8079. I don’t see any changes to our Consul ACL handling that that should be triggering this especially. Would you be willing to open a GitHub issue describing this?

Nomad doesn’t really do much with the consul.token other than read it from config and pass it along to Consul’s own api/ library. We do read the value from the config object again in a few places for detecting ACLs being enabled and template, but at first glance I don’t see a path to changing the value in the established Consul API client.

What exactly do you need additionally to what I’ve provided? Because I gave you all I have. :man_shrugging:

The linked issue looks indeed like my problem: it seems like nomad sometimes forgets to pass a token and consul uses it’s default (or none if none is defined as in the issue).

Another option of course is that Consul screws up the token handling and drops the passed one on the floor.

We are seeing this issue as well. Running Nomad v1.0.1 and Consul v1.9.1

1 Like

@hynek we opened an issue on consul to track it: Unable to deregister orphan checks and service instances after accessorid token is lost and set to anonymous · Issue #9577 · hashicorp/consul · GitHub

If you have more context on the issue feel free to add more details there.

1 Like