Failed to start Consul server: Failed to start Raft: permissions test failed


Currently we are facing an issue with Consul on k8s. It becomes unstable after a while, and I am not sure what the cause is, but the following error is shown on the servers:

[ERROR] agent: Error starting agent: error="Failed to start Consul server: Failed to start Raft: permissions test failed: open /consul/data/raft/snapshots/permTest: interrupted system call"

kubernetes: 1.18.8
consul-helm: 0.25.0
consul-k8s: 0.18.1
consul: 1.8.5

We are also running another cluster with the same Consul versions but on Kubernetes 1.17.11, and it's stable there, so my guess is that the issue is related to the Kubernetes version.

Best regards,

What environment are you running k8s in? I have not seen that error before, and I can't find many results for it on Google either. I'm guessing this is an issue with the underlying platform, unfortunately.


Thanks for the response.

We are running Consul on AKS. I resolved the issue by removing the storageClass of the server, which was set to Azure File storage.
As far as I understand, this means that if all Consul servers go down, the state of the cluster and its configuration is lost. Are there any other drawbacks?
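For reference, the change described here would roughly correspond to the following consul-helm values excerpt. This is a hypothetical sketch: the `azurefile` class name is an assumption (check `kubectl get storageclass` for the actual name in your cluster), and setting `server.storageClass` to null makes the chart fall back to whatever default storage class the cluster has.

```yaml
# values.yaml for the consul-helm chart (hypothetical excerpt)
server:
  # Previously pinned to the Azure File storage class (assumed name):
  # storageClass: azurefile
  # Removed so the PVCs use the cluster's default storage class instead:
  storageClass: null
```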


If that setting means the disks are no longer persistent, then yes: if all the servers go down, you will lose all your cluster data. I don't think there are other drawbacks (although that's a pretty big one!).

I have exactly the same issue. However, I would prefer to keep using persistent storage.

kubernetes 1.18.8
consul-k8s: 0.21.0
consul: 1.9.0

We first used Azure Disk, but it only allows ReadWriteOnce, meaning that when we upgrade the AKS cluster, the Azure Disk is claimed by Consul on an old k8s node and cannot be claimed by Consul on a new node until it is released, which takes a very long time (5+ minutes).

Azure File allows ReadWriteMany and works fine with any k8s version lower than 1.18.8. One major change with k8s 1.18.8 on AKS is the switch from Ubuntu 16.04.5 to 18.04.5 on the nodes.
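To illustrate the access-mode difference: the in-tree Azure Disk provisioner only supports ReadWriteOnce, while Azure File supports ReadWriteMany, so a claim like the one below can be mounted on more than one node at a time. The names here are hypothetical; a minimal sketch might look like:

```yaml
# Hypothetical StorageClass/PVC pair showing the access modes discussed above.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefile
provisioner: kubernetes.io/azure-file
parameters:
  skuName: Standard_LRS
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: consul-server-data   # hypothetical claim name
spec:
  accessModes:
    - ReadWriteMany          # not possible with kubernetes.io/azure-disk
  storageClassName: azurefile
  resources:
    requests:
      storage: 10Gi
```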

Any idea what might cause this issue and how to solve this?

We are facing the same issue. It seems that Azure Files causes instability for Consul. We changed the Azure Files performance tier from Standard to Premium, which helped a little (a handful of failures per day instead of dozens), but it is still quite unstable.
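For anyone wanting to try the same mitigation: switching the Azure Files tier from Standard to Premium amounts to provisioning from a storage class with the `Premium_LRS` SKU. A hedged sketch (the class name is made up; your provisioner may be the CSI driver `file.csi.azure.com` instead of the in-tree one):

```yaml
# Hypothetical storage class for Premium-tier Azure Files shares.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: azurefile-premium
provisioner: kubernetes.io/azure-file
parameters:
  skuName: Premium_LRS   # Premium tier; Standard tier would be Standard_LRS
```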

We are running AKS 1.18.8, Consul server 1.9.1 deployed via Helm chart 0.27.0.

Googling around, I found a possibly related issue: ipfs add on azure filestore results in interrupted system call · Issue #7720 · ipfs/go-ipfs · GitHub.