[nomad][server] leader change per minute under high workload

Hi again :smiley:

We have Nomad servers that connect to a Consul server through a Consul client on the same node.

Here are the config and the command we use to run them:

advertise {
  http = "10.11.X.X"
  rpc  = "10.11.X.X"
  serf = "10.11.X.X"
}
server {
  enabled             = true
  enable_event_broker = true
  raft_protocol       = 3
  num_schedulers      = 128
  event_buffer_size   = 100
  raft_multiplier     = 2
  heartbeat_grace     = "1h"
}
leave_on_interrupt = true
leave_on_terminate = true
telemetry {
  publish_allocation_metrics = true
  publish_node_metrics       = true
  prometheus_metrics         = true
  disable_hostname           = true
}
limits {
  http_max_conns_per_client = 0
  rpc_max_conns_per_client  = 0
  rpc_handshake_timeout     = "20s"
}
/usr/bin/nomad agent \
  -server \
  -bind="0.0.0.0" \
  -bootstrap-expect=5 \
  -encrypt="{{redacted}}" \
  -data-dir=/nomad/data \
  -config=/nomad/config \
  -dc="dc1" \
  -node="10.11.X.X" \
  -consul-address="127.0.0.1:8500" \
  -consul-auto-advertise \
  -consul-checks-use-advertise \
  -retry-join="10.11.X.X" \
  -retry-join="10.11.X.Y" \
  -retry-join="10.11.X.Z" \
  -retry-join="10.11.X.A" \
  -retry-join="10.11.X.B" \
  -consul-server-auto-join

We also have a Golang service that communicates with the Nomad servers through the Nomad API library; the service can spawn jobs, each with a unique name and config.

We then ran a load test against our servers.
When our job count hit ~6,000 jobs,

our servers showed a symptom: a leader change roughly every minute. We checked this with a script:

2021-04-20 09:36:39 [INFO] [leader-check] 10.11.X.X.global
2021-04-20 09:37:39 [INFO] [leader-check] 10.11.X.Y.global
2021-04-20 09:38:39 [INFO] [leader-check] 10.11.X.Y.global
2021-04-20 09:39:40 [INFO] [leader-check] 10.11.X.Z.global
2021-04-20 09:40:40 [INFO] [leader-check] 10.11.X.Z.global
2021-04-20 09:41:40 [INFO] [leader-check] 10.11.X.A.global
2021-04-20 09:42:43 [INFO] [leader-check] 10.11.X.A.global
2021-04-20 09:43:43 [INFO] [leader-check] 10.11.X.A.global
2021-04-20 09:44:43 [INFO] [leader-check] 10.11.X.A.global
2021-04-20 09:45:49 [INFO] [leader-check] 10.11.X.Z.global
2021-04-20 09:46:49 [INFO] [leader-check] 10.11.X.X.global
2021-04-20 09:47:49 [INFO] [leader-check] 10.11.X.Z.global
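A leader check like the one producing the output above can be sketched against Nomad's `GET /v1/status/leader` endpoint, which returns the current leader's RPC address as a JSON-quoted string (the actual script may instead use `nomad server members`, given the `.global` names in the log). The server address is a placeholder:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

// parseLeader strips the JSON quoting from the /v1/status/leader body,
// e.g. `"10.11.0.1:4647"` -> `10.11.0.1:4647`.
func parseLeader(body string) string {
	return strings.Trim(strings.TrimSpace(body), `"`)
}

// checkLeader asks one Nomad agent who the current cluster leader is.
func checkLeader(nomadAddr string) (string, error) {
	resp, err := http.Get(nomadAddr + "/v1/status/leader")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return parseLeader(string(b)), nil
}

func main() {
	// One check; the script above presumably runs this once a minute.
	leader, err := checkLeader("http://127.0.0.1:4646") // placeholder address
	now := time.Now().Format("2006-01-02 15:04:05")
	if err != nil {
		fmt.Printf("%s [ERROR] [leader-check] %v\n", now, err)
		return
	}
	fmt.Printf("%s [INFO] [leader-check] %s\n", now, leader)
}
```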

Digging deeper, the leader was lost roughly every ~20 seconds, and Raft ran a leader election.

Then we got stuck when our job count hit ~15,000 jobs:

the leader loss was too frequent, and the job count stopped increasing (or increased by only ~1 job every 2-5 minutes).

Our server spec:

CPU: 2.5 Core
Memory: 12GB
Disk: 24 GB (IOPS: 12000)

but the average CPU and memory usage was:

Server                       CPU          Memory
10.11.X.X                  122m         1062Mi
10.11.X.Y                  61m          1072Mi
10.11.X.Z                  125m         1921Mi
10.11.X.A                  1132m        2004Mi
10.11.X.B                  124m         2102Mi

So based on the resource usage above, the Nomad servers didn't hit the limits we set.

The Nomad client nodes/instances were already idle, as we had calculated.

Any advice on how to achieve high scheduling throughput, 100-200 jobs per minute? Our target is ~100,000 jobs.

Thanks for your attention and help :bowing_man:

Hi @petrukngantuk1 :wave:

It’s kind of hard to tell without more detailed metrics. Here are a few things that you might want to look for.

  • Average CPU and memory can hide spikes and other relevant behaviours. Would you happen to have some charts or other monitoring available?

  • Do you see anything relevant in the server logs?

  • For servers it’s also important to have a fast connection between them. From our Deployment Guide:

    Nomad servers are expected to be able to communicate in high bandwidth, low latency network environments and have below 10 millisecond latencies between cluster members.

    Could you check the latency between them?

  • Can you try the same test with 3 servers? Fewer servers means less data to replicate.

Take a look at our Two Million Container Challenge. It could have some useful insight :slightly_smiling_face:

@lgfa29, well noted!

I’m curious: I just realized the point about network latency. Is there a way to minimize network latency in AWS? I guess this is one thing we still haven’t touched.

Is there any metric that represents the latency between servers, @lgfa29?

Update :

  • I collected the nomad_raft_leader_lastContact metric from an idle Nomad server with Prometheus; it’s already at ~150 ms:


(chart: nomad_raft_leader_lastContact over a 2-hour range)

  • An example of server logs with "No cluster leader" errors, from which the cluster self-recovered:
    2021-04-21T07:25:58.917Z [ERROR] worker: failed to dequeue evaluation: error="rpc error: No cluster leader"
    [... the line above repeated ~26 more times between 07:25:58.917Z and 07:25:58.925Z ...]
    2021-04-21T07:25:58.925Z [INFO]  nomad.raft: entering follower state: follower="Node at 10.7.1.55:4647 [Follower]" leader=
    2021-04-21T07:25:58.925Z [ERROR] worker: failed to dequeue evaluation: error="rpc error: No cluster leader"
    [... the line above repeated 2 more times ...]
    2021-04-21T07:26:54.113Z [ERROR] worker: failed to dequeue evaluation: error="rpc error: eval broker disabled"
    [... the line above repeated ~40 more times between 07:26:54.113Z and 07:26:54.429Z ...]
    2021-04-21T07:26:54.429Z [ERROR] worker: failed to dequeue evaluation: error="rpc error: eval broker disabled"
    2021-04-21T07:26:56.273Z [WARN]  nomad.raft: rejecting vote request since we have a leader: from=10.7.4.207:4647 leader=10.7.3.204:4647
    2021-04-21T07:26:56.441Z [WARN]  nomad.raft: rejecting vote request since we have a leader: from=10.7.1.215:4647 leader=10.7.3.204:4647
    2021-04-21T07:26:56.872Z [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader=10.7.3.204:4647

Another example:

    2021-04-21T08:10:03.314Z [WARN]  nomad.raft: failed to contact: server-id=2ac35171-4784-78ef-3232-203b9f9e2410 time=1.31669654s
    2021-04-21T08:10:09.711Z [WARN]  nomad.raft: failed to contact: server-id=f0ec2a27-10c4-8706-13ef-c322599bf281 time=1.002158884s
    2021-04-21T08:10:09.992Z [WARN]  nomad.raft: failed to contact: server-id=f0ec2a27-10c4-8706-13ef-c322599bf281 time=1.283105762s
    2021-04-21T08:10:09.992Z [WARN]  nomad.raft: failed to contact: server-id=bfcaaf21-d8bf-6db2-9354-240eed0d37ac time=1.095547878s
    2021-04-21T08:10:10.100Z [WARN]  nomad.raft: failed to contact: server-id=2ac35171-4784-78ef-3232-203b9f9e2410 time=1.001534624s
    2021-04-21T08:10:10.100Z [WARN]  nomad.raft: failed to contact: server-id=f0ec2a27-10c4-8706-13ef-c322599bf281 time=1.391086623s
    2021-04-21T08:10:10.100Z [WARN]  nomad.raft: failed to contact: server-id=2733627a-e57e-eb59-decf-05e3408ca2a1 time=1.001589765s
    2021-04-21T08:10:10.100Z [WARN]  nomad.raft: failed to contact: server-id=bfcaaf21-d8bf-6db2-9354-240eed0d37ac time=1.203528739s
    2021-04-21T08:10:10.100Z [WARN]  nomad.raft: failed to contact quorum of nodes, stepping down
    2021-04-21T08:10:10.100Z [INFO]  nomad.raft: entering follower state: follower="Node at 10.7.1.55:4647 [Follower]" leader=
    2021-04-21T08:10:10.248Z [INFO]  nomad.raft: aborting pipeline replication: peer="{Voter f0ec2a27-10c4-8706-13ef-c322599bf281 10.7.4.207:4647}"
    2021-04-21T08:10:10.290Z [ERROR] nomad.client: alloc update failed: error="leadership lost while committing log"
    2021-04-21T08:10:10.291Z [INFO]  nomad.raft: aborting pipeline replication: peer="{Voter bfcaaf21-d8bf-6db2-9354-240eed0d37ac 10.7.4.201:4647}"

  • The nomad_raft_leader_lastContact metric:

Not sure why a high workload makes the metric above jump to 1 s; we already use the m5ad.xlarge instance type.

  • Checking the latency between the Nomad servers: it’s already below 10 ms (from Consul):

(screenshot: latency between servers, from Consul)

(charts: CPU usage and memory usage over a 15-minute window)

For the others above, I’ll answer in this thread if I can collect some…

Thanks for the extra information!

For AWS you would want all your servers to be in the same region. You can split them among availability zones within the same region for reliability. Check out the network topology requirements for more info.

Looking at the charts, are they all in the same time window? It would be interesting to see what happens with CPU and memory when nomad_raft_leader_lastContact spikes.
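To catch those spikes as they happen rather than after the fact, a sketch of a Prometheus alerting rule could look like the following. The 500 ms threshold, rule names, and severity label are assumptions, not Nomad recommendations; nomad_raft_leader_lastContact is reported in milliseconds:

```yaml
groups:
  - name: nomad-raft
    rules:
      - alert: NomadRaftLeaderContactSlow
        # nomad_raft_leader_lastContact is in milliseconds;
        # 500 ms is an arbitrary example threshold.
        expr: nomad_raft_leader_lastContact > 500
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Raft leader lastContact above 500 ms on {{ $labels.instance }}"
```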

Well, IIRC the nomad_raft_leader_lastContact data alternates between null and a value based on the timestamp, so it cycles: it plots as a dotted line rather than a constant, connected line.

Anyway, we already solved the issue by upgrading to a larger instance type; the bottleneck was memory. Thanks for your help, btw!
