1.7.4 snapshot issue

Consul 1.7.4 will only let me save a snapshot if I explicitly specify -http-addr=<cluster leader>.

If I don’t explicitly target the cluster leader then I get this error on the agent:
agent.http: Request error: method=GET url=/v1/snapshot?dc=test from=192.168.1.1:53874 error="failed to decode response: EOF"

And this error on the leader: agent.server.rpc: failed to read byte: conn=from=192.168.1.2:33561 error="tls: first record does not look like a TLS handshake"

The cli error is: Error saving snapshot: Unexpected response code: 500 (failed to decode response: EOF)

Has anybody else seen this? I could save snapshots successfully 100% of the time with 1.7.3.

Thanks for reporting @billyaustin84! Could you show me your client and server configuration? I have an idea of what might be going on.

Hi @i0rek, thanks for the reply! Config snippets are as follows:

Server:
Common

{
  "acl": {
    "default_policy": "deny",
    "down_policy": "extend-cache",
    "enable_token_persistence": true,
    "enable_token_replication": true,
    "enabled": true,
    "tokens": {
      "agent": "<agent_token>",
      "master": "<master_token>"
    }
  },
  "ca_file": "/etc/hashicorp/ca.crt",
  "cert_file": "/etc/hashicorp/cert.crt",
  "data_dir": "/var/lib/consul",
  "datacenter": "primary-dc",
  "domain": "consul",
  "enable_syslog": true,
  "key_file": "/etc/hashicorp/key.crt",
  "ports": {
    "https": 8501
  },
  "primary_datacenter": "primary-dc",
  "server": true,
  "telemetry": {
    "prometheus_retention_time": "30s"
  },
  "ui": true,
  "retry_join": ["consul.service.primary-dc.consul"]
}

Node specific

{
  "bind_addr": "192.168.1.1",
  "client_addr": "192.168.1.1 127.0.0.1",
  "node_name": "node01"
}

Client:
Common

{
  "acl": {
    "default_policy": "deny",
    "down_policy": "extend-cache",
    "enabled": true,
    "tokens": {
      "default": "<default_token>"
    }
  },
  "data_dir": "/var/lib/consul",
  "datacenter": "primary-dc",
  "enable_syslog": true,
  "primary_datacenter": "primary-dc",
  "start_join": ["consul.service.primary-dc.consul"]
}

Node specific

{
  "bind_addr": "192.168.1.2",
  "client_addr": "127.0.0.1",
  "node_name": "client01"
}

Any help is gratefully received!

It would have been nice to get to the bottom of this, but I have recently upgraded to 1.8.0 and can confirm that the snapshot behaviour I saw with 1.7.4 no longer occurs.
I haven’t changed my config(s) between 1.7.4 and 1.8.0 (and hadn’t between 1.7.3 and 1.7.4).
All is now working as expected.

Hi Billy,

Thanks for reporting! I checked your config and I can’t see anything strange. I am glad it is working again. If I come across other issues like that, I will try to update this thread.

We have the exact same issue in Consul 1.7.4. Any ideas? Or is the answer to upgrade or downgrade?

I upgraded to 1.8.0 around 3 weeks ago. The problem doesn’t appear in this version, and my snapshot script has worked fine since then.

The workaround we used in 1.7.4 was to target the snapshot directly at the cluster leader with: consul snapshot save -http-addr=<cluster leader> file.dump
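
For anyone scripting the leader-targeting workaround, here is a minimal sketch of working out the leader's address first. It assumes the local agent's HTTP API is reachable at http://127.0.0.1:8500, that /v1/status/leader returns the leader's RPC address as a quoted host:port string (for example "192.168.1.1:8300"), and that the leader's HTTP API listens on the same host on port 8500. The function name leader_http_addr is made up for illustration. With ACLs set to deny, the snapshot call itself will still need a suitably privileged token.

```shell
#!/bin/sh
# Turn the leader's RPC address (as returned by /v1/status/leader,
# e.g. "192.168.1.1:8300", JSON quotes included) into an address
# usable with -http-addr. The HTTP port (default 8500) is an assumption;
# pass a second argument to override it.
leader_http_addr() {
  rpc_addr=$(printf '%s' "$1" | tr -d '"')   # strip the JSON quotes
  [ -n "$rpc_addr" ] || return 1             # empty string: no leader known
  host=${rpc_addr%:*}                        # drop the trailing :port
  printf '%s:%s\n' "$host" "${2:-8500}"
}

# Usage (not run here):
#   leader=$(curl -s http://127.0.0.1:8500/v1/status/leader)
#   consul snapshot save -http-addr="$(leader_http_addr "$leader")" file.dump
```

This only helps if the leader's HTTP API is reachable from where the script runs, i.e. the server's client_addr includes an address other than 127.0.0.1.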

Right - FWIW, using -stale seems to work as well in 1.7.4 but that may not work for everyone.

Given that the command states it can give stale data, it’s a somewhat less than ideal workaround! :laughing:
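
If -stale is the only option available, one way to get a feel for how far behind a snapshot might be is to look at the Raft index recorded in it. This is a minimal sketch, assuming that consul snapshot inspect prints an Index line as a label followed by a number (the exact layout may differ between versions). The helper name snapshot_index is made up for illustration.

```shell
#!/bin/sh
# Read `consul snapshot inspect` output on stdin and print the value
# from the first line whose label is "Index".
# Assumes a "Index   <number>" layout, which may vary by version.
snapshot_index() {
  awk '$1 == "Index" { print $2; exit }'
}

# Usage (not run here):
#   consul snapshot save -stale file.dump
#   consul snapshot inspect file.dump | snapshot_index
```

The printed index can then be compared by hand with the commit_index that consul info reports on the leader. A large gap would suggest the stale snapshot is missing recent writes.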