Hello.
We’re seeing a very odd and specific issue.
Consul v1.14.3
Nomad v1.4.3
Example jobspec connect configuration:
```
network {
  mode = "bridge"
  port "http" { to = "8080" }
  port "metrics" {}
}

service {
  name = "service"
  port = "http"
  tags = ["http", "addr:${NOMAD_HOST_ADDR_metrics}", "prometheus"]

  meta {
    metrics_port      = "${NOMAD_HOST_PORT_metrics}"
    nomad_alloc_index = "${NOMAD_ALLOC_INDEX}"
    nomad_job_name    = "${NOMAD_JOB_NAME}"
  }

  check {
    type     = "http"
    path     = "/ping"
    interval = "10s"
    timeout  = "2s"
  }

  connect {
    sidecar_service {
      tags = ["service"]

      proxy {
        expose {
          path {
            path            = "/metrics"
            protocol        = "http"
            local_path_port = 8080
            listener_port   = "metrics"
          }
        }
      }
    }
  }
}
```
When deploying for the first time, or with a new job spec, this works exactly as expected and exposes the /metrics endpoint.
However, when the job gets restarted (through an OOM kill, a reboot, or a manual stop/start), the /metrics endpoint is no longer exposed.
We get Connection Refused on the /metrics endpoint and Connection Reset on the sidecar proxy.
I cannot find any errors relating to this in Nomad, Consul, or even the Envoy proxy sidecars.
To “fix” the issue, simply redeploying with an updated spec works. Is there some difference between a restart and a redeployment that could break the job?
Thanks.
Update:
We have now found that when the job is restarted, rebooted, etc., the job spec loses the expose stanza.
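For illustration, this is roughly what the connect block from the example above ends up looking like after such a restart (a sketch of the observed behaviour, not an exact dump):

```
connect {
  sidecar_service {
    tags = ["service"]

    proxy {
      # The expose { path { ... } } block from the submitted job spec
      # is no longer present here after the restart.
    }
  }
}
```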
Filed a bug with our findings:
(GitHub issue opened 25 Jan 2023, 04:51 PM UTC; labeled type/bug)
### Nomad version
Output from `nomad version`
Nomad v1.4.3 (f464aca721d222ae9c1f3df643b3c3aaa20e2da7)
### Operating system and Environment details
```
Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal
```
Nomad config:
```
datacenter = "eu1"
data_dir   = "/opt/nomad"
log_level  = "DEBUG"
log_file   = "/var/log/nomad.log"
log_json   = true

server {
  enabled          = true
  bootstrap_expect = 5
}

client {
  enabled = true
}

server_join {
  retry_join     = ["192.168.1.1", "192.168.1.2", "192.168.1.13", "192.168.1.4", "192.168.1.5"]
  retry_max      = 3
  retry_interval = "15s"
}

acl {
  enabled = true
}

consul {
  address             = "127.0.0.1:8500"
  grpc_address        = "127.0.0.1:8502"
  server_service_name = "nomad"
  client_service_name = "nomad-client"
  auto_advertise      = true
  server_auto_join    = true
  client_auto_join    = true
  token               = ""
}

vault {
  enabled          = true
  address          = "https://vault.domain.com"
  create_from_role = "nomad-cluster"
  token            = ""
}

ui {
  enabled = true

  consul {
    ui_url = "https://consul.domain.com/ui"
  }

  vault {
    ui_url = "https://vault.domain.com/ui"
  }
}

plugin "docker" {
  config {
    logging {
      type = "loki"
    }
    auth {
      config = "/root/.docker/config.json"
    }
  }
}

telemetry {
  collection_interval        = "5s"
  publish_allocation_metrics = true
  publish_node_metrics       = true
  prometheus_metrics         = true
}
```
Consul config:
```
datacenter     = "eu1"
data_dir       = "/opt/consul"
log_level      = "DEBUG"
node_name      = "server-1"
advertise_addr = "192.168.1.1"
encrypt        = ""

tls {
  defaults {
    ca_file         = ""
    ca_path         = ""
    cert_file       = ""
    key_file        = ""
    verify_incoming = true
    verify_outgoing = true
  }

  internal_rpc {
    verify_server_hostname = true
  }
}

auto_encrypt {
  allow_tls = true
}

retry_join = ["192.168.1.1", "192.168.1.2", "192.168.1.13", "192.168.1.4", "192.168.1.5"]

acl {
  enabled                  = true
  default_policy           = "allow"
  enable_token_persistence = true
}

performance {
  raft_multiplier = 1
}

server           = true
bootstrap_expect = 5
bind_addr        = "192.168.1.1"
client_addr      = "0.0.0.0"

# Enable service mesh
connect {
  enabled = true
}

# Addresses and ports
addresses {
  grpc  = "127.0.0.1"
  https = "0.0.0.0"
  dns   = "127.0.0.1"
}

ports {
  grpc     = 8502
  grpc_tls = 8503
  http     = 8500
  https    = 8443
  dns      = 8600
}

# DNS recursion
recursors = ["1.1.1.1"]

ui_config {
  enabled = true
}
```
### Issue
When stopping and then starting a Nomad job that uses an expose stanza with a sidecar proxy, the expose stanza gets removed from the job spec in Nomad and the /metrics path is no longer exposed. The only way to "fix" the issue is to redeploy the job.
### Reproduction steps
Deploy the job, then stop and restart the service.
#### Expected Result
The /metrics endpoint remains exposed and accessible, and the expose block remains in the Nomad job spec.
#### Actual Result
The expose block gets removed from the job spec
![image](https://user-images.githubusercontent.com/53433884/214623566-3e7616b2-2952-4a57-8f27-cfa1e8365516.png)
### Job file (if appropriate)
```
job "service" {
  datacenters = ["eu1"]

  group "frontends" {
    count = 2

    network {
      mode = "bridge"
      port "http" { to = "8080" }
      port "metrics" {}
    }

    service {
      name = "service"
      port = "http"
      tags = ["http", "addr:${NOMAD_HOST_ADDR_metrics}", "prometheus"]

      meta {
        metrics_port      = "${NOMAD_HOST_PORT_metrics}"
        nomad_alloc_index = "${NOMAD_ALLOC_INDEX}"
        nomad_job_name    = "${NOMAD_JOB_NAME}"
      }

      check {
        type     = "http"
        path     = "/ping"
        interval = "10s"
        timeout  = "2s"
      }

      connect {
        sidecar_service {
          tags = ["service-frontend"]

          proxy {
            expose {
              path {
                path            = "/metrics"
                protocol        = "http"
                local_path_port = 8080
                listener_port   = "metrics"
              }
            }
          }
        }
      }
    }

    task "service-frontend" {
      driver = "docker"

      config {
        image   = ""
        command = "bundle"
        args    = ["exec", "puma", "-C", "config/puma.rb"]
        ports   = ["http"]
      }

      resources {
        memory = 513
      }
    }
  }

  group "sidekiq" {
    count = 4

    update {
      max_parallel = 1
    }

    network {
      mode = "bridge"
      port "http" { to = "9359" }
      port "metrics" {}
    }

    service {
      name = "service-sidekiq"
      port = "http"
      tags = ["http", "addr:${NOMAD_HOST_ADDR_metrics}", "prometheus"]

      meta {
        metrics_port      = "${NOMAD_HOST_PORT_metrics}"
        nomad_alloc_index = "${NOMAD_ALLOC_INDEX}"
        nomad_job_name    = "${NOMAD_JOB_NAME}"
      }

      connect {
        sidecar_service {
          tags = ["service"]

          proxy {
            expose {
              path {
                path            = "/metrics"
                protocol        = "http"
                local_path_port = 9359
                listener_port   = "metrics"
              }
            }
          }
        }
      }
    }

    task "sidekiq" {
      driver       = "docker"
      kill_timeout = "15s"

      config {
        image   = ""
        command = "bundle"
        args    = ["exec", "sidekiq", "-t", "10"]
        ports   = ["http"]
      }

      resources {
        memory = 256
      }
    }
  }
}
```
### Nomad Server logs (if appropriate)
### Nomad Client logs (if appropriate)
No relevant logs discovered.
jrasell
February 3, 2023, 10:45am
Hi @dpewsey,
Thanks for raising the issue. It looks like the fix for this has been merged into the release branches and will therefore be available in the next release.
Thanks,
jrasell and the Nomad team