Can a scaling policy that scales the cluster also scale up the application? Or are they mutually exclusive?
Each policy can only have one target, so it will either scale the cluster or the app. We haven't really thought about combining them, because the query usually returns metrics for a specific target (like memory available in the cluster, or request latency of an app), but I am curious to learn about the use case you have in mind.
If there is capacity for another copy of the application on the running server, will a policy that includes config to scale an aws-asg add another copy of the application on that machine, or will it spin up another machine?
If I understand this correctly, there are actually two things going on here, and that's why each policy can only have one target.
Let’s say you have an application policy like this (simplified for brevity):
policy {
  check "avg_sessions" {
    source = "prometheus"
    query  = "open_connections / nomad_nomad_job_summary_running"

    strategy "target-value" {
      target = 10
    }
  }
}
This policy will make sure that you have an average of 10 open connections per application instance. Now let's imagine that your app is trending on Twitter, and the metric jumps to 200 connections per instance. The target-value strategy scales the count proportionally (metric / target = 200 / 10 = 20), so the Autoscaler will update your job to run roughly 20x the current number of instances.
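For reference, an application policy like this normally lives inside the job's group-level scaling block. A minimal sketch, where the group name, count, and min/max bounds are just assumptions:

group "web" {
  count = 5

  scaling {
    enabled = true
    min     = 1    # never scale below one instance
    max     = 100  # hard ceiling, even if the metric keeps climbing

    policy {
      check "avg_sessions" {
        source = "prometheus"
        query  = "open_connections / nomad_nomad_job_summary_running"

        strategy "target-value" {
          target = 10
        }
      }
    }
  }
}

Note that there is no target block here: when the policy is embedded in the job, the target is implicitly the group's count.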
But your cluster can’t handle this many instances, so some allocations will be stuck pending, waiting for new resources.
Now let’s imagine that you also have a cluster policy like this (also simplified for brevity):
policy {
  check "mem_allocated_percentage" {
    source = "prometheus"
    query  = "100 * nomad_client_allocated_memory / (nomad_client_unallocated_memory + nomad_client_allocated_memory)"

    strategy "target-value" {
      target = 70
    }
  }

  target "aws-asg" {
    aws_asg_name = "hashistack-nomad_client"
  }
}
This policy will look at the percentage of used memory in the cluster and scale up (add nodes) when more than 70% of memory is used.
With 20x more app instances trying to run, you will certainly run out of memory. The Autoscaler will detect this through the cluster policy above and add new nodes to meet the demand.
Once these new nodes are up, the pending allocations will be scheduled onto them just as Nomad normally would.
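If you want to watch this happen, the standard Nomad CLI is enough (the job name here is just an example):

$ nomad job status webapp   # allocations waiting for capacity show up as "pending"
$ nomad node status         # new clients appear here once the ASG instances join the cluster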
As you can see, two scaling events happened: both the app and the cluster scaled up. But they did so for different reasons and independently from each other: the app had too many connections to handle and the cluster ran out of memory.
So if you want multiple things to happen, you will most likely need multiple policies.
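In practice that usually means an application policy embedded in the job file (as sketched above) plus one or more cluster policy files loaded by the Autoscaler agent. A minimal agent config sketch, where the address and directory path are assumptions:

# Nomad Autoscaler agent configuration (sketch; adjust the address and path for your setup)
nomad {
  address = "http://127.0.0.1:4646"
}

policy {
  dir = "/etc/nomad-autoscaler/policies"   # cluster policy files like the one above live here
}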
The only side note here is for system jobs, which Nomad will automatically schedule so that there's one instance running per node, so they effectively scale with the cluster on their own.
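For example, a system job needs no application scaling policy at all; it grows with the cluster automatically (the job name and image are just placeholders):

job "node-exporter" {
  type = "system"   # Nomad runs one allocation per eligible client node

  group "exporter" {
    task "exporter" {
      driver = "docker"

      config {
        image = "prom/node-exporter:latest"
      }
    }
  }
}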