Hi guys,
I am looking to upgrade our Consul instances; the servers and agents are on version 1.8.6.
We use Consul only for service discovery (no service mesh, no ACLs, no encryption).
My upgrade path based on the documentation looks like this:
Unfortunately their usual response is to refuse to commit, and respond with some flavour of "you can test it yourself if you like but we still recommend what's in the docs".
Even when pressed, they won't give me solid technical justifications for the specific intermediate versions they have selected.
I have personally chosen to ignore the instructions to upgrade from 1.8.1 to latest 1.8.x before moving onwards in an upgrade I worked on, because I was able to determine from changelogs and supporting documentation, that that only applied to certain Enterprise licensing configurations. We moved straight from that to the latest 1.10.x at the time.
From 1.10.x to 1.14, I have no personal experience to share, although I see no reason why hopping straight to 1.14 couldn't work, having reviewed changelogs - hence why I was trying, unsuccessfully, to get a yes or no from HashiCorp about whether there were any actual technical blockers.
During testing:
Able to upgrade from 1.8.9 to 1.16.3 - didn't face any issues.
Observations:
Backward compatibility with 1.8.9 is not possible.
A new Consul server on 1.8.9 is not able to join a cluster that has already been upgraded to 1.16.3.
Error: [ERROR] agent.server.raft: failed to restore snapshot: error="failed to restore snapshot 4-16384-1702446548539: Unrecognized msg type 31"
Note: The server shows up in consul members, but it looks like data restoration is failing.
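If you script the rollout, it may be worth guarding against this kind of accidental downgrade before a server joins. Below is a minimal Python sketch, assuming plain MAJOR.MINOR.PATCH version strings (no suffixes such as "+ent"). `parse_version` and `is_downgrade` are hypothetical helper names of mine, not part of Consul. Note that comparing the raw strings would rank 1.8.9 above 1.16.3, which is why the versions are parsed into integer tuples first.

```python
def parse_version(text):
    """Turn '1.16.3' (or 'v1.16.3') into a tuple of ints: (1, 16, 3).

    Assumes a plain MAJOR.MINOR.PATCH string with no build suffix.
    """
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def is_downgrade(joining, cluster):
    """True when the joining server is older than the cluster it wants to join."""
    return parse_version(joining) < parse_version(cluster)


if __name__ == "__main__":
    # As text, "1.8.9" sorts after "1.16.3", so compare parsed tuples instead.
    print(is_downgrade("1.8.9", "1.16.3"))
```

A check like this only flags the ordering; it says nothing about which specific version hops are supported.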
Other errors during the upgrade:
2023-12-11T12:08:23.379Z [WARN] agent: using enable-script-checks without ACLs and without allow_write_http_from is DANGEROUS, use enable-local-script-checks instead, see Protecting Consul from RCE Risk in Specific Configurations
2023-12-11T12:08:23.385Z [WARN] agent.auto_config: using enable-script-checks without ACLs and without allow_write_http_from is DANGEROUS, use enable-local-script-checks instead, see Protecting Consul from RCE Risk in Specific Configurations
2023-12-11T12:06:25.494Z [ERROR] agent.server.cert-manager: failed to handle cache update event: error="leaf cert watch returned an error: rpc error making call: Connect must be enabled in order to use this endpoint"
023-12-11T12:06:13.047Z [WARN] agent: error getting server health from server: server=consul-2 error="context deadline exceeded"
2023-12-11T12:06:17.916Z [WARN] agent.leaf-certs: handling error in Manager.Notify: error="rpc error making call: Connect must be enabled in order to use this endpoint" index=1
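To see which subsystems are complaining during an upgrade, it can help to group log lines like these by level and component. Here is a rough Python sketch, assuming each line has the shape `<timestamp> [LEVEL] <component>: <message>` as in the excerpts above. `parse_line` and `summarize` are hypothetical helpers of mine, and the sample messages are shortened copies of the lines in this post.

```python
import re
from collections import Counter

# Assumed line shape: "<timestamp> [LEVEL] <component>: <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\S+)\s+\[(?P<level>[A-Z]+)\]\s+(?P<component>[\w.\-]+):\s+(?P<message>.*)$"
)


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def summarize(lines):
    """Count log lines per (level, component) pair, skipping unparseable lines."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry:
            counts[(entry["level"], entry["component"])] += 1
    return counts


# Shortened copies of the log lines quoted in this post.
SAMPLE = [
    "2023-12-11T12:08:23.379Z [WARN] agent: using enable-script-checks without ACLs",
    "2023-12-11T12:08:23.385Z [WARN] agent.auto_config: using enable-script-checks without ACLs",
    "2023-12-11T12:06:25.494Z [ERROR] agent.server.cert-manager: failed to handle cache update event",
    "2023-12-11T12:06:17.916Z [WARN] agent.leaf-certs: handling error in Manager.Notify",
]

if __name__ == "__main__":
    for (level, component), count in sorted(summarize(SAMPLE).items()):
        print(level, component, count)
```

This only summarizes what was logged; whether a given warning matters still has to be judged per message.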
We are getting the following errors:
On 1.8.9:
2024-03-19T20:33:42.915Z [ERROR] agent.server.rpc: unrecognized RPC byte: byte=8 conn=from=x.x.x.x:44244
On 1.16.4, with respect to the WAN:
Whenever we restart Consul, the WAN is sometimes unable to connect to other instances (transient).
Deleting the instance and adding a new one fixed the issue.
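One way to catch this after a restart might be to scan the output of `consul members -wan` for members that are not alive. The Python sketch below assumes the usual whitespace-separated table with a header row containing `Node` and `Status` columns. `non_alive_members` is a hypothetical helper, and the node names and addresses in the sample are made up for illustration.

```python
def non_alive_members(members_output):
    """Return (node, status) pairs for members whose status is not 'alive'.

    Assumes a header row naming the columns and whitespace-separated fields.
    """
    lines = [line for line in members_output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    node_col = header.index("Node")
    status_col = header.index("Status")
    result = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) <= max(node_col, status_col):
            continue  # skip short or malformed rows
        if fields[status_col] != "alive":
            result.append((fields[node_col], fields[status_col]))
    return result


# Illustrative sample only; names and addresses are invented.
SAMPLE = """\
Node          Address        Status  Type    Build   Protocol  DC
consul-1.dc1  10.0.0.1:8302  alive   server  1.16.4  2         dc1
consul-2.dc1  10.0.0.2:8302  failed  server  1.16.4  2         dc1
consul-1.dc2  10.0.1.1:8302  alive   server  1.16.4  2         dc2
"""

if __name__ == "__main__":
    print(non_alive_members(SAMPLE))
```

In practice the text would come from running the command on a server, for example via `subprocess.run(["consul", "members", "-wan"], capture_output=True, text=True)`.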
On the 1.16.4 Consul server:
2024-03-12T20:33:46.101Z [WARN] agent: [core][Channel #1 SubChannel #61] grpc: addrConn.createTransport failed to connect to {Addr: "x.x.x.x:8300", ServerName: "aaaaaaaa", }. Err: connection error: desc = "error reading server preface: EOF"
While doing the upgrade we got an ACL error even though ACLs are not enabled.
2024-03-19T17:36:18.967Z [ERROR] agent.server.raft: failed to restore snapshot: id=153483-900786585-1710868811563 last-index=911702585 last-term=153343 size-in-bytes=1800899 error="failed inserting acl token: missing value for index 'accessor'"
Can you please advise?
Is it safe to ignore the errors above?