As mentioned in the documentation, each cluster usually has 3 or 5 server-mode agents and potentially thousands of clients. We know that when a client agent first starts, it fingerprints itself and sends that data to the servers during initial registration, and it then heartbeats periodically with the servers to maintain liveness.
So I want to know: when a large number of clients (i.e. several thousand) are started at nearly the same time (within a few seconds), will this seriously degrade the performance of the servers (8C, 16G, 50G) or affect the behavior of the cluster? And could you give me some testing suggestions to verify this situation?
I do expect there to be a performance impact on the servers, especially since you are looking to add thousands of clients at a time.
It will probably take a few minutes for all of them to finish registering. In our Two Million Container Challenge we started about 6,000 clients at once and it took about 6 minutes 30 seconds for them to finish. The Consul team went further and started 10,000 clients.
These challenges allowed us to uncover some optimization opportunities, so using Nomad 1.x is recommended.
The main bottleneck for you will probably be CPU, as the server will need to process a lot of incoming data. Additional RAM will also help as most of the internal state is kept in memory.
Looking at metrics will be important. You will want to keep an eye on CPU and memory usage, and nomad.nomad.heartbeat.active will tell you how many clients have successfully registered.
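If you want to track that gauge during a test run, one option is to poll the server's JSON metrics output and pull the value out. The sketch below only shows the parsing step, and it assumes the payload contains a `Gauges` array of objects with `Name` and `Value` fields. The sample payload and the helper name `gaugeValue` are made up for illustration, so check the shape against what your servers actually return.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// metricsSummary mirrors only the part of the metrics payload used here.
// Assumption: gauges are reported as {"Name": ..., "Value": ...} objects.
type metricsSummary struct {
	Gauges []struct {
		Name  string
		Value float64
	}
}

// gaugeValue returns the value of the named gauge and whether it was found.
func gaugeValue(payload []byte, name string) (float64, bool, error) {
	var m metricsSummary
	if err := json.Unmarshal(payload, &m); err != nil {
		return 0, false, err
	}
	for _, g := range m.Gauges {
		if g.Name == name {
			return g.Value, true, nil
		}
	}
	return 0, false, nil
}

func main() {
	// Hypothetical sample payload; a real one would come from the server.
	sample := []byte(`{"Gauges":[{"Name":"nomad.nomad.heartbeat.active","Value":5821,"Labels":{}}]}`)
	v, ok, err := gaugeValue(sample, "nomad.nomad.heartbeat.active")
	fmt.Println(v, ok, err)
}
```

Polling this every few seconds while the clients start gives you a registration curve to plot next to CPU and memory usage.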
Yes, once the clients are registered they will periodically send heartbeats and allocation status updates. These should have little performance impact on the servers, unless you expect a high rate of allocation churn. From the Two Million Container Challenge post:
Nomad schedulers automatically adjust the rate at which nodes must heartbeat before they are considered “down” and their work is rescheduled. The larger a cluster, the longer between heartbeats to ease heartbeat processing overhead on Nomad schedulers.