Hello,
I have been trying out multiple deployment modes for Consul. I successfully deployed a multi-DC setup via WAN federation using autoscaling groups in AWS, and now I’m moving on to deploying on top of K8s.
The experience with ASGs was pretty smooth; in the end I got a stable setup where I could add/remove nodes at will and the datacenters reacted gracefully (e.g., consensus was impeccable).
However, I have been battling some issues with K8s. So far I have hit a few issues that make me wonder whether running on top of K8s in a production scenario is plausible and better than ASGs:
- changing the extraConfig setting doesn’t seem to trigger a pod refresh
- when scaling up from 3 to 5 replicas, for some reason I lost consensus and Raft took over 10 minutes to re-establish it (the values I changed for this test are sketched after this list)
- when scaling down, consensus was also lost (which might be OK if the leader was killed), but it also took a long time to recover
- when scaling down from 5 nodes to a single node, “consensus” had still not been reached after 30 minutes
- upgrading my Helm chart to the latest version doesn’t refresh or upgrade the server pods. It seems that during the upgrade only the init jobs were executed; everything else stayed the same.
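For reference, this is roughly the values override I used for the scale-up test (just a sketch; the replica counts are what I was experimenting with, and I’m not sure whether bootstrapExpect should follow the replica count or stay at the original value):

server:
  replicas: 5
  bootstrapExpect: 5   # kept equal to replicas; unsure if it should stay at 3 during the change
  updatePartition: 0   # 0 should let the StatefulSet roll all server pods

I applied this with a plain helm upgrade of the same release.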
I’m probably missing something that could explain these issues, as I am fairly new to working with K8s.
I added my config.yaml chart configuration below for reference (I know most of the values there are the defaults, but I wanted a yaml with the full configuration so I could tweak it incrementally).
I also believe I read somewhere that HashiCorp officially recommends not deploying Consul on K8s, although it is possible. Is this a fair statement? I also think that a highly dynamic environment such as K8s is not the best fit for a stateful system like Consul (at least for the server nodes); if that is the case, I have sketched the hybrid setup I would consider after the configuration below.
global:
  enabled: true
  logLevel: "debug"
  logJSON: false
  name: "dlo"
  datacenter: "dlo"
  consulAPITimeout: "5s"
  enablePodSecurityPolicies: true
  recursors: []
  tls:
    enabled: true
    enableAutoEncrypt: true
    serverAdditionalDNSSANs: []
    serverAdditionalIPSANs: []
    verify: true
    httpsOnly: true
    caCert:
      secretName: null
      secretKey: null
    caKey:
      secretName: null
      secretKey: null
  acls:
    manageSystemACLs: true
    bootstrapToken:
      secretName: null
      secretKey: null
    createReplicationToken: true
    replicationToken:
      secretName: null
      secretKey: null
  gossipEncryption:
    autoGenerate: true
  federation:
    enabled: false
    createFederationSecret: false
    primaryDatacenter: null
    primaryGateways: []
    k8sAuthMethodHost: null
  metrics:
    enabled: false
    enableAgentMetrics: false
    agentMetricsRetentionTime: "1m"
    enableGatewayMetrics: true
server:
  replicas: 1
  bootstrapExpect: 1
  updatePartition: 1 # TODO: for some reason, if setting to the 'number of replicas' things get broken on minikube
  #affinity: null # for minikube, set null
  connect: true # setup root CA and certificates
  extraConfig: |
    {
      "log_level": "DEBUG",
      "log_file": "/consul/",
      "log_rotate_duration": "24h",
      "log_rotate_max_files": 7
    }
client:
  enabled: false
  affinity: null
  updateStrategy: |
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
  extraConfig: |
    {
      "log_level": "DEBUG"
    }
ui:
  enabled: true
  service:
    enabled: true
    type: LoadBalancer
    port:
      http: 80
      https: 443
  metrics:
    enabled: false
  ingress:
    enabled: false
dns:
  enabled: false
externalServers:
  enabled: false
syncCatalog:
  enabled: false
connectInject:
  enabled: false
controller:
  enabled: false
meshGateway:
  enabled: false
ingressGateways:
  enabled: false
terminatingGateways:
  enabled: false
apiGateway:
  enabled: false
webhookCertManager:
  tolerations: null
prometheus:
  enabled: false
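And in case keeping the servers outside of K8s ends up being the recommendation, this is the rough hybrid layout I would consider instead (only a sketch; the server addresses are placeholders and I have not validated these values):

global:
  name: "dlo"
  datacenter: "dlo"
  tls:
    enabled: true
server:
  enabled: false          # servers keep running on the ASG nodes outside the cluster
externalServers:
  enabled: true
  hosts:                  # placeholder addresses for the ASG servers
    - "consul-0.example.internal"
    - "consul-1.example.internal"
    - "consul-2.example.internal"
  httpsPort: 8501
client:
  enabled: true           # only the agents run inside K8s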