Hi. I use consul to monitor some service status.
Here is service health check script.
{
"ID": "serviceA",
"Name": "serviceA",
"Address": "10.154.0.26",
"Port": 443,
"Tags": [
"eu-central-1c"
],
"EnableTagOverride": false,
"Check": {
"Name": "serviceA",
"args" : ["./health_check.py", "--some flags",,,,,,,, ],
"tls_skip_verify": true,
"Interval": "10s",
"Timeout": "10s"
}
}
In most cases, it works normally. But sometimes(less then 10 times a day), this error occured.
2023/01/18 05:53:07 [WARN] agent: Check "serviceA": Timed out (10s) running check
health_check.py
is quite simple. it create bucket in aws s3, then put/get/delete object using python boto3. In most cases, it finished within 2 seconds. According to the log of health_check.py
script, when timeout occurs, the script seems to start to run with some delay. Although the interval is set to 10 seconds, it appears to start about 5 seconds later than expected.
This is time when the health_check.py
script started:(it leaves the start time as log.)
1. 22:51:00
2. 22:51:11
3. 22:51:20
4. 22:51:35 (timed out at 22:51:41)
5. 22:51:51
6. 22:52:01
7. 22:52:12
health_check.py
script seems to start every 10 seconds. Considering the timeout (22:51:41), it seems that the 4th check began around 22:51:31, which is 10s after in the 3rd check. but health_check.py
started at 22:51:35, with 4s delay. I wonder what makes this delay, which consume half of the timeout. any idea?
Here is my consul server & agent version
./consul --version
Consul v1.4.0