1.7.4 snapshot issue

billyaustin84 · June 11, 2020, 7:08pm

1.7.4 will only allow me to save a snapshot if I explicitly specify -http-addr=<cluster leader>

If I don’t explicitly target the cluster leader then I get this error on the agent:
agent.http: Request error: method=GET url=/v1/snapshot?dc=test from=192.168.1.1:53874 error="failed to decode response: EOF"

And this error on the leader: agent.server.rpc: failed to read byte: conn=from=192.168.1.2:33561 error="tls: first record does not look like a TLS handshake"

The CLI error is: Error saving snapshot: Unexpected response code: 500 (failed to decode response: EOF)

Has anybody else seen this? I could save snapshots successfully 100% of the time with 1.7.3.

i0rek · June 12, 2020, 8:00am

Thanks for reporting @billyaustin84! Could you show me your client and server configuration? I have an idea of what might be going on.

billyaustin84 · June 12, 2020, 10:46am

Hi @i0rek, thanks for the reply! Config snippets are as follows:

Server:
Common

{
  "acl": {
    "default_policy": "deny",
    "down_policy": "extend-cache",
    "enable_token_persistence": true,
    "enable_token_replication": true,
    "enabled": true,
    "tokens": {
      "agent": "<agent_token>",
      "master": "<master_token>"
    }
  },
  "ca_file": "/etc/hashicorp/ca.crt",
  "cert_file": "/etc/hashicorp/cert.crt",
  "data_dir": "/var/lib/consul",
  "datacenter": "primary-dc",
  "domain": "consul",
  "enable_syslog": true,
  "key_file": "/etc/hashicorp/key.crt",
  "ports": {
    "https": 8501
  },
  "primary_datacenter": "primary-dc",
  "server": true,
  "telemetry": {
    "prometheus_retention_time": "30s"
  },
  "ui": true,
  "retry_join": ["consul.service.primary-dc.consul"]
}

Node specific

{
  "bind_addr": "192.168.1.1",
  "client_addr": "192.168.1.1 127.0.0.1",
  "node_name": "node01"
}

Client:
Common

{
  "acl": {
    "default_policy": "deny",
    "down_policy": "extend-cache",
    "enabled": true,
    "tokens": {
      "default": "<default_token>"
    }
  },
  "data_dir": "/var/lib/consul",
  "datacenter": "primary-dc",
  "enable_syslog": true,
  "primary_datacenter": "primary-dc",
  "start_join": ["consul.service.primary-dc.consul"]
}

Node specific

{
  "bind_addr": "192.168.1.2",
  "client_addr": "127.0.0.1",
  "node_name": "client01"
}

Any help is gratefully received!

billyaustin84 · June 23, 2020, 7:10pm

It would have been nice to get to the bottom of this, but I have recently upgraded to 1.8.0 and can confirm that the snapshot behaviour I experienced with 1.7.4 no longer occurs.
I haven’t changed my config(s) between 1.7.4 and 1.8.0 (and hadn’t from 1.7.3 into 1.7.4).
All is now working as expected.

i0rek · June 23, 2020, 8:38pm

Hi Billy,

Thanks for reporting! I checked your config and I can’t see anything strange. I am glad it is working again. If I come across other issues like this, I will update this thread.

acornies · July 15, 2020, 1:35pm

We have the exact same issue in Consul 1.7.4. Any ideas? Or is the answer to upgrade or downgrade?

billyaustin84 · July 15, 2020, 1:42pm

I upgraded to 1.8.0 around 3 weeks ago. The problem doesn’t appear in this version, and my snapshot script has worked fine since then.

The workaround we used in 1.7.4 was to target the snapshot directly at the cluster leader with consul snapshot save -http-addr=<cluster leader> file.dump
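
For anyone scripting this workaround, here is a minimal sketch of resolving the leader before saving. It assumes the local agent's HTTP API is reachable on 127.0.0.1:8500 and that the leader serves plain HTTP on port 8500 as well (adjust for HTTPS on 8501 and your own client_addr). The function names leader_http_addr and save_from_leader are hypothetical helpers, not part of Consul. The /v1/status/leader endpoint returns the leader's RPC address (port 8300 by default) as a JSON string, so the sketch swaps that port for the HTTP one.

```shell
#!/bin/sh
# Sketch only: helper names, addresses, and ports below are assumptions.

# Convert the /v1/status/leader response (a JSON string such as
# "192.168.1.1:8300") into an HTTP API address by replacing the RPC port.
# $1 = raw response, $2 = HTTP port (defaults to 8500).
leader_http_addr() {
  rpc_addr=$(printf '%s' "$1" | tr -d '"')
  host=${rpc_addr%:*}
  # An empty response means no leader is currently known.
  [ -n "$host" ] || return 1
  printf '%s:%s\n' "$host" "${2:-8500}"
}

# Ask the local agent who the leader is, then save the snapshot against it.
# $1 = output file. A token with snapshot permissions is expected in
# CONSUL_HTTP_TOKEN when ACLs are enabled.
save_from_leader() {
  leader=$(curl -fsS "http://127.0.0.1:8500/v1/status/leader") || return 1
  addr=$(leader_http_addr "$leader" 8500) || return 1
  consul snapshot save -http-addr="$addr" "$1"
}
```

Usage would be something like save_from_leader file.dump from a cron job or backup script.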

acornies · July 15, 2020, 1:56pm

Right - FWIW, using -stale seems to work as well in 1.7.4 but that may not work for everyone.
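
If targeting the leader is not practical, the -stale route could be wrapped as a fallback. A minimal sketch, where save_with_stale_fallback is a hypothetical helper name: it tries the default snapshot (which is forwarded to the leader) first and only retries with -stale if that fails. Keep in mind that a stale snapshot may be served by a non-leader server and can lag behind the leader.

```shell
#!/bin/sh
# Sketch only: the helper name is an assumption, not part of Consul.

# Try a normal snapshot first; fall back to -stale if it fails.
# $1 = output file.
save_with_stale_fallback() {
  consul snapshot save "$1" && return 0
  echo "default snapshot failed; retrying with -stale" >&2
  consul snapshot save -stale "$1"
}
```

Logging the fallback to stderr makes it easy to spot in backup logs when a snapshot was taken in stale mode.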

billyaustin84 · July 15, 2020, 2:05pm

Given that the command states it can return stale data, it’s a somewhat less-than-ideal workaround!
