Unstable deployment on K8s with the Helm chart

Hello,

I have been trying out multiple deployment modes for Consul. I have successfully deployed a multi-DC setup via WAN federation using autoscaling groups in AWS, and now I’m moving on to try deploying on top of K8s.

The experience with ASGs was pretty smooth; in the end I had a stable setup where I could add/remove nodes at will and the datacenters reacted gracefully (e.g., consensus held up flawlessly).

However, I have been battling several issues on K8s that make me wonder whether running on top of K8s in a production scenario is feasible, let alone better than ASGs:

  1. changing the extraConfig setting doesn’t seem to trigger a pod refresh
  2. when scaling up from 3 to 5 replicas, consensus was lost for some reason and Raft took over 10 minutes to re-establish it (see the diagnostic sketch after this list)
  3. when scaling down, consensus was also lost (which might be expected if the leader was killed), but regaining it again took a long time
  4. when scaling down from 5 nodes to a single node, “consensus” had still not been reached after 30 minutes
  5. upgrading my Helm chart to the latest version doesn’t refresh and upgrade the server pods; during the upgrade, only the init jobs ran and everything else stayed the same
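
For reference, this is roughly how I was checking cluster membership and Raft state during these tests, by exec'ing into one of the server pods. It's only a sketch: the pod and namespace names assume my release name "dlo" and a "consul" namespace, and I'm omitting the ACL token and TLS environment variables this particular configuration would also require:

  kubectl exec -n consul dlo-server-0 -- consul members
  kubectl exec -n consul dlo-server-0 -- consul operator raft list-peers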

I’m probably missing something that could explain these issues, as I am fairly new to working with K8s.

I added my chart values.yaml below for reference (I know most of the values are the defaults, but I wanted a single YAML with the full configuration so I could tweak it incrementally).

I also believe I read somewhere that HashiCorp officially suggests not deploying Consul on K8s, even though it is possible. Is that a fair statement? I also think that a highly dynamic environment such as K8s is not the best fit for a stateful system like Consul (at least for the server nodes).

global:
  enabled: true
  logLevel: "debug"
  logJSON: false
  name: "dlo"
  datacenter: "dlo"
  consulAPITimeout: "5s"
  enablePodSecurityPolicies: true
  recursors: []
  tls:
    enabled: true
    enableAutoEncrypt: true
    serverAdditionalDNSSANs: []
    serverAdditionalIPSANs: []
    verify: true
    httpsOnly: true
    caCert:
      secretName: null
      secretKey: null
    caKey:
      secretName: null
      secretKey: null
  acls:
    manageSystemACLs: true
    bootstrapToken:
      secretName: null
      secretKey: null
    createReplicationToken: true
    replicationToken:
      secretName: null
      secretKey: null
  gossipEncryption:
    autoGenerate: true
  federation:
    enabled: false
    createFederationSecret: false
    primaryDatacenter: null
    primaryGateways: []
    k8sAuthMethodHost: null
  metrics:
    enabled: false
    enableAgentMetrics: false
    agentMetricsRetentionTime: "1m"
    enableGatewayMetrics: true

server:
  replicas: 1
  bootstrapExpect: 1
  updatePartition: 1 # TODO: for some reason, if setting to the 'number of replicas' things get broken on minikube
  #affinity: null # for minikube, set null
  connect: true # setup root CA and certificates
  extraConfig: |
    {
      "log_level": "DEBUG",
      "log_file": "/consul/",
      "log_rotate_duration": "24h",
      "log_rotate_max_files": 7
    }

client:
  enabled: false
  affinity: null
  updateStrategy: |
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
  extraConfig: |
    {
      "log_level": "DEBUG"
    }

ui:
  enabled: true
  service:
    enabled: true
    type: LoadBalancer
    port:
      http: 80
      https: 443
  metrics:
    enabled: false
  ingress:
    enabled: false

dns:
  enabled: false

externalServers:
  enabled: false

syncCatalog:
  enabled: false

connectInject:
  enabled: false

controller:
  enabled: false

meshGateway:
  enabled: false

ingressGateways:
  enabled: false

terminatingGateways:
  enabled: false

apiGateway:
  enabled: false

webhookCertManager:
  tolerations: null

prometheus:
  enabled: false

OK, so I had completely misinterpreted the server.updatePartition setting. The pods were not being refreshed because I always set the partition value equal to the number of replicas.
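
For anyone hitting the same thing: as I understand it, the chart passes this value straight to the StatefulSet rolling-update partition, and Kubernetes only updates pods whose ordinal is greater than or equal to the partition, so keeping it equal to the replica count means no pod ever rolls. For a normal rollout it should sit at 0, e.g.:

server:
  replicas: 5
  updatePartition: 0 # 0 = roll all pods; raise it only during a staged upgrade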

I will re-run my setup next week and report back. :sweat_smile:

So, after fixing the server.updatePartition setting, the pods started to refresh as expected when I upgraded the Helm chart.

However, I keep running into a lot of Raft consensus stability issues when adding nodes.

  1. increasing from 3 replicas to 5 replicas makes Raft lose consensus for some reason. I don’t understand this: why would Raft lose consensus when adding replicas?
  2. with a 5-node deployment, changing a configuration value and then upgrading the Helm release once again causes consensus to be lost.
  3. how can I automate the server upgrade process without downtime? The official documentation says I should manipulate the server.updatePartition setting in multiple phases. However, in a “real” scenario, where deploys are managed by CI/CD tools, does that mean I need multiple commits and multiple deploys to ensure all the servers receive the upgrade? That sounds a bit unproductive. Are there any alternatives to this while still using the official Helm chart? (See the rough sketch after this list.)
  4. the Helm chart is using deprecated settings, both on the Consul side and on the K8s side. Is this a known issue?
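
For point 3, what I have in mind is something like the loop below, run from a single CI job instead of separate commits. This is only a rough sketch: the release name "dlo", the "consul" namespace and the 5-replica count are assumptions from my own setup, and in practice you would probably also want to verify Raft health between steps instead of relying on the rollout status alone.

  for partition in 4 3 2 1 0; do
    helm upgrade dlo hashicorp/consul --namespace consul --values values.yaml \
      --set server.updatePartition=$partition
    kubectl rollout status statefulset/dlo-server --namespace consul --timeout=10m
  done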

One of the issues I believe I detected is that, because pods are named consul-server-1, consul-server-2, etc., when pods are replaced they come back with the same name as the previous one (instead of getting a random suffix, for example). This causes some cluster members to see two nodes with the same name (e.g., consul-server-2) running on different IPs: one from the new pod and one from the old pod that was just replaced. Because this happens very quickly, the other nodes don’t have time to “forget” the old pod, and this results in a naming conflict.

I’m trying to understand whether deploying Consul via Helm on K8s is production ready, but so far I have found some concerning issues. I’d like to know whether this is just me missing some configuration. If it isn’t, it feels like deploying on top of VMs (an ASG, for instance) is much more stable.

Can someone point me in the right direction? Am I missing something?

Regarding the naming-conflict issue I described above (pods coming back with the same name, e.g. consul-server-2, on a new IP before the rest of the cluster has forgotten the old member):

adding the leave_on_terminate: true setting to server.extraConfig appears to solve the problem. Server nodes now shut down gracefully and inform the whole cluster, so when a pod is recreated with the same name no conflict arises. However, this won’t help if a pod is killed abruptly, e.g. due to a hardware failure.
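
For reference, this is roughly how the setting looks merged into the server.extraConfig from my values above:

server:
  extraConfig: |
    {
      "log_level": "DEBUG",
      "log_file": "/consul/",
      "log_rotate_duration": "24h",
      "log_rotate_max_files": 7,
      "leave_on_terminate": true
    }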

One less issue to figure out. :grinning:


What an invaluable hint, thanks a lot. This has actually solved our cluster update issues. Consul now updates without downtime.