TLS errors in K8s using Helm chart deployment

Hi, I’m using a Helm chart deployment.

The following errors are showing on consul-server-0, consul-server-1, and consul-server-2:

[ERROR] agent.http: Request error: method=GET url=/v1/agent/connect/ca/roots from=192.168.10.66:59128 error="No cluster leader"
[ERROR] agent.http: Request error: method=GET url=/v1/agent/connect/ca/roots from=192.168.115.28:43628 error="No cluster leader"
[ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 67f9af62-ea6f-7f10-fcf8-0ddd76e5fc66 192.168.10.68:8300}" error="dial tcp <nil>->192.168.10.68:8300: i/o timeout"
[ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter d3d1a401-c506-3cf4-4fec-3afcb7603219 192.168.115.42:8300}" error="dial tcp <nil>->192.168.115.42:8300: i/o timeout"
[WARN]  agent.server.raft: Election timeout reached, restarting election
===================================
The following are the errors for consul-connect-injector-webhook-deployment and consul-controller:

Failed to load logs: container "sidecar-injector" in pod "consul-connect-injector-webhook-deployment-7cdb8b8bcf-wtnlq" is waiting to start: PodInitializing
Reason: BadRequest (400)
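
(A note on that message: "waiting to start: PodInitializing" just means the injector/controller container has not started yet, so there are no logs for it to fetch; the pod is typically stuck in an init container that waits for the Consul servers to be reachable. To see what it is actually blocked on, something like this is usually enough:

kubectl describe pod consul-connect-injector-webhook-deployment-7cdb8b8bcf-wtnlq

The Events section and the init container statuses in the describe output show what the pod is waiting for.)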

Using the Helm chart config.yaml:

global:
  name: consul
  datacenter: dc1
  gossipEncryption:
    secretName: 'consul-gossip-encryption-key'
    secretKey: 'key'
  tls:
    enabled: true
    enableAutoEncrypt: true
    verify: false
#  acls:
#    manageSystemACLs: true
server:
  replicas: 3
  bootstrapExpect: 3
  disruptionBudget:
    enabled: true
    maxUnavailable: 0
  updatePartition:
  securityContext:
    runAsNonRoot: false
    runAsUser: 0
ui:
# service:
#    type: "LoadBalancer"
  enabled: true
connectInject:
  enabled: true
controller:
  enabled: true
prometheus:
  enabled: true
grafana:
  enabled: true
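
For reference, the gossipEncryption block above expects a Kubernetes secret that already exists under those names. A minimal sketch of creating it to match secretName: consul-gossip-encryption-key and secretKey: key, assuming the consul binary is available where kubectl runs:

kubectl create secret generic consul-gossip-encryption-key --from-literal=key=$(consul keygen)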


====================================
I used the following commands on the master node:
consul tls ca create

consul tls cert create -server -dc dc1

moved agent-ca.pem, agent-ca-key, server-consul-0-key.pem, and server-consul-0.pem to /etc/consul.d/

copied agent-ca.pem, server-consul-0-key.pem, and server-consul-0.pem to all Consul servers

systemctl restart consul
consul reload

Hi, first of all you don’t need to run any consul tls commands. This is all handled automatically for you.
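
(If you want to confirm this, with tls.enabled: true the chart generates the CA and server certificates itself and stores them as Kubernetes secrets. A quick check, assuming the release is named consul and default secret naming:

kubectl get secrets | grep consul-ca

This should list the chart-managed CA secrets, e.g. consul-ca-cert and consul-ca-key.)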

Second, the issue right now is that the servers can’t seem to reach one another (dial tcp <nil>->192.168.10.68:8300: i/o timeout). If you do kubectl get pods -o wide, is this the correct IP for one of the servers?

Also, is there a chance you re-installed Consul without deleting the PVCs? See our uninstall guide here: Uninstall | Consul by HashiCorp
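
(Roughly, the idea from that guide is to remove the release and the server PVCs together, because the PVCs carry the old Raft state between installs. A hedged sketch, assuming the release is named consul; the exact label selector can vary by chart version:

helm uninstall consul
kubectl get pvc -l chart=consul-helm
kubectl delete pvc -l chart=consul-helm

If the selector matches nothing, a plain kubectl get pvc will still show the data claims bound to consul-server-0/1/2 so they can be deleted by name.)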

@lkysow I removed the manually created TLS certs and tried reinstalling again.

And yes, I will delete and recreate the PVCs for every fresh installation.

Finally, the IP that the Consul server pod is trying to connect to (dial tcp <nil>->192.168.115.20:8300: i/o timeout) doesn’t match any pod; it doesn’t exist.

kubectl get pods -o wide

NAME                                                          READY   STATUS     RESTARTS   AGE   IP                NODE         NOMINATED NODE   READINESS GATES
consul-connect-injector-webhook-deployment-7cdb8b8bcf-828d6   0/1     Init:0/1   0          52s   192.168.112.218   tmp-k8c1w1   <none>           <none>
consul-connect-injector-webhook-deployment-7cdb8b8bcf-zb9mn   0/1     Init:0/1   0          53s   192.168.10.114    tmp-k8c1w3   <none>           <none>
consul-controller-6796bb8886-2wq2s                            0/1     Init:0/1   0          53s   192.168.115.32    tmp-k8c1w2   <none>           <none>
consul-dd5jb                                                  0/1     Running    0          52s   192.168.115.25    tmp-k8c1w2   <none>           <none>
consul-dkzv2                                                  0/1     Running    0          53s   192.168.112.222   tmp-k8c1w1   <none>           <none>
consul-server-0                                               0/1     Running    0          53s   192.168.115.39    tmp-k8c1w2   <none>           <none>
consul-server-1                                               0/1     Running    0          52s   192.168.10.68     tmp-k8c1w3   <none>           <none>
consul-server-2                                               0/1     Running    0          51s   192.168.112.230   tmp-k8c1w1   <none>           <none>
consul-webhook-cert-manager-57bb5c668d-cz8dp                  1/1     Running    0          53s   192.168.10.126    tmp-k8c1w3   <none>           <none>
consul-xs8gp                                                  0/1     Running    0          52s   192.168.10.119    tmp-k8c1w3   <none>           <none>
prometheus-server-5cbddcc44b-kqfjf                            2/2     Running    0          53s   192.168.112.213   tmp-k8c1w1   <none>           <none>

Error logs from consul-server-2:

[ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 49c04b99-25fb-3589-fb02-375e5e5590ca 192.168.115.20:8300}" error="dial tcp <nil>->192.168.115.20:8300: i/o timeout"

[ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 0b49673d-a431-b406-e30c-f719e5f5727a 192.168.10.109:8300}" error="dial tcp <nil>->192.168.10.109:8300: i/o timeout"

[ERROR] agent: Coordinate update error: error="No cluster leader"

Error log from Consul-client pod:

[ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=192.168.115.39:8300 error="rpcinsecure error making call: No cluster leader"
[ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=192.168.10.68:8300 error="rpcinsecure error making call: No cluster leader"
[ERROR] agent.auto_config: AutoEncrypt.Sign RPC failed: addr=192.168.112.230:8300 error="rpcinsecure error making call: No cluster leader"
[ERROR] agent.auto_config: No servers successfully responded to the auto-encrypt request
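
(For context: these client errors are downstream of the same problem. With enableAutoEncrypt the clients ask a server to sign their TLS certificate over RPC, and that signing requires a cluster leader, so no client can get a certificate until the election succeeds. A rough way to watch for a successful election from outside the pods:

kubectl logs -f consul-server-0 | grep -iE 'leader|election'

Once a line like "cluster leadership acquired" shows up, the clients should pick up their certificates on the next retry.)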

Error logs from consul-controller and consul-connect-injector-webhook-deployment:

Failed to load logs: container "controller" in pod "consul-controller-6796bb8886-2wq2s" is waiting to start: PodInitializing

Reason: BadRequest (400)

Hmm, I’m very curious where those IPs are coming from then. This is typically only seen in situations where there are old PVCs.
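
(One way to check whether leftover state is in play: compare the age of the server PVCs and PVs with the age of the pods, and look at the Raft directory inside a server's data volume. A hedged sketch, assuming the chart's default data mount at /consul/data:

kubectl get pvc,pv
kubectl exec consul-server-2 -- ls -la /consul/data/raft

If the claims or the raft directory predate the current install, the servers are reusing an old data directory, and the unknown voter IPs in the requestVote errors are coming from the Raft state persisted there.)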

Fixed the issue by manually deleting all the files in the persistent volume folders under /mnt/data/pv.
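
(That fits the PVC theory: with manually provisioned hostPath/local PersistentVolumes the reclaim policy is usually Retain, so deleting the PVCs alone leaves the old Consul data on disk, and a fresh install that rebinds those volumes picks the stale Raft peer list right back up. A hedged cleanup sketch for that kind of setup, with placeholder names:

kubectl get pv -o wide                  # check STATUS and RECLAIM POLICY
kubectl delete pv <released-pv-name>    # remove the Released PV objects
sudo rm -rf /mnt/data/pv/<pv-folder>/*  # clear the old data on the node before reusing the folder

Here <released-pv-name> and <pv-folder> are placeholders for the actual PV names and hostPath directories.)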
