Hello.
We’re seeing a very odd and specific issue.
Consul v1.14.3
Nomad v1.4.3
Example jobspec connect configuration:
```
network {
  mode = "bridge"
  port "http" { to = "8080" }
  port "metrics" {}
}

service {
  name = "service"
  port = "http"
  tags = ["http", "addr:${NOMAD_HOST_ADDR_metrics}", "prometheus"]

  meta {
    metrics_port      = "${NOMAD_HOST_PORT_metrics}"
    nomad_alloc_index = "${NOMAD_ALLOC_INDEX}"
    nomad_job_name    = "${NOMAD_JOB_NAME}"
  }

  check {
    type     = "http"
    path     = "/ping"
    interval = "10s"
    timeout  = "2s"
  }

  connect {
    sidecar_service {
      tags = ["service"]

      proxy {
        expose {
          path {
            path            = "/metrics"
            protocol        = "http"
            local_path_port = 8080
            listener_port   = "metrics"
          }
        }
      }
    }
  }
}
```
When deploying for the first time, or with a new job spec, this works exactly as expected and exposes the /metrics endpoint.
However, when the job gets restarted (through an OOM kill, a reboot, or a manual stop/start), the /metrics endpoint is no longer exposed.
We get Connection Refused on the /metrics endpoint and Connection Reset on the sidecar proxy.
I cannot find any errors relating to this in Nomad, Consul, or even the Envoy proxy sidecars.
To “fix” the issue, simply redeploying with an updated spec works. Is there some difference between a restart and a redeployment that could break the job?
Thanks.
Update:
We have now found that when the job is restarted, rebooted, etc., the job spec loses the expose stanza.
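For illustration, this is roughly what the connect block from the example above ends up looking like after such a restart (a sketch of the observed behaviour, not an exact dump):

```
connect {
  sidecar_service {
    tags = ["service"]

    proxy {
      # The expose { path { ... } } block from the submitted job spec
      # is no longer present here after the restart.
    }
  }
}
```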
Filed a bug with our findings:
(GitHub issue opened 25 Jan 2023, 04:51 PM UTC; labeled type/bug)
### Nomad version
Output from `nomad version`
Nomad v1.4.3 (f464aca721d222ae9c1f3df643b3c3aaa20e2da7)
### Operating system and Environment details
```
Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal
```
Nomad config:
```
datacenter = "eu1"
data_dir   = "/opt/nomad"
log_level  = "DEBUG"
log_file   = "/var/log/nomad.log"
log_json   = true

server {
  enabled          = true
  bootstrap_expect = 5
}

client {
  enabled = true
}

server_join {
  retry_join     = ["192.168.1.1", "192.168.1.2", "192.168.1.13", "192.168.1.4", "192.168.1.5"]
  retry_max      = 3
  retry_interval = "15s"
}

acl {
  enabled = true
}

consul {
  address             = "127.0.0.1:8500"
  grpc_address        = "127.0.0.1:8502"
  server_service_name = "nomad"
  client_service_name = "nomad-client"
  auto_advertise      = true
  server_auto_join    = true
  client_auto_join    = true
  token               = ""
}

vault {
  enabled          = true
  address          = "https://vault.domain.com"
  create_from_role = "nomad-cluster"
  token            = ""
}

ui {
  enabled = true

  consul {
    ui_url = "https://consul.domain.com/ui"
  }

  vault {
    ui_url = "https://vault.domain.com/ui"
  }
}

plugin "docker" {
  config {
    logging {
      type = "loki"
    }
    auth {
      config = "/root/.docker/config.json"
    }
  }
}

telemetry {
  collection_interval        = "5s"
  publish_allocation_metrics = true
  publish_node_metrics       = true
  prometheus_metrics         = true
}
```
Consul config:
```
datacenter     = "eu1"
data_dir       = "/opt/consul"
log_level      = "DEBUG"
node_name      = "server-1"
advertise_addr = "192.168.1.1"
encrypt        = ""

tls {
  defaults {
    ca_file         = ""
    ca_path         = ""
    cert_file       = ""
    key_file        = ""
    verify_incoming = true
    verify_outgoing = true
  }

  internal_rpc {
    verify_server_hostname = true
  }
}

auto_encrypt {
  allow_tls = true
}

retry_join = ["192.168.1.1", "192.168.1.2", "192.168.1.13", "192.168.1.4", "192.168.1.5"]

acl {
  enabled                  = true
  default_policy           = "allow"
  enable_token_persistence = true
}

performance {
  raft_multiplier = 1
}

server           = true
bootstrap_expect = 5
bind_addr        = "192.168.1.1"
client_addr      = "0.0.0.0"

# Enable service mesh
connect {
  enabled = true
}

# Addresses and ports
addresses {
  grpc  = "127.0.0.1"
  https = "0.0.0.0"
  dns   = "127.0.0.1"
}

ports {
  grpc     = 8502
  grpc_tls = 8503
  http     = 8500
  https    = 8443
  dns      = 8600
}

# DNS recursion
recursors = ["1.1.1.1"]

ui_config {
  enabled = true
}
```
### Issue
When stopping and then starting a Nomad job that uses an expose stanza with a sidecar proxy, the expose stanza gets removed from the job spec in Nomad and the /metrics path is no longer exposed. The only way to "fix" the issue is to redeploy the job.
### Reproduction steps
Deploy the job, then stop and restart the service.
#### Expected Result
The /metrics endpoint remains exposed and accessible, and the expose block remains in the Nomad job spec.
#### Actual Result
The expose block gets removed from the job spec
![image](https://user-images.githubusercontent.com/53433884/214623566-3e7616b2-2952-4a57-8f27-cfa1e8365516.png)
### Job file (if appropriate)
```
job "service" {
  datacenters = ["eu1"]

  group "frontends" {
    count = 2

    network {
      mode = "bridge"
      port "http" { to = "8080" }
      port "metrics" {}
    }

    service {
      name = "service"
      port = "http"
      tags = ["http", "addr:${NOMAD_HOST_ADDR_metrics}", "prometheus"]

      meta {
        metrics_port      = "${NOMAD_HOST_PORT_metrics}"
        nomad_alloc_index = "${NOMAD_ALLOC_INDEX}"
        nomad_job_name    = "${NOMAD_JOB_NAME}"
      }

      check {
        type     = "http"
        path     = "/ping"
        interval = "10s"
        timeout  = "2s"
      }

      connect {
        sidecar_service {
          tags = ["service-frontend"]

          proxy {
            expose {
              path {
                path            = "/metrics"
                protocol        = "http"
                local_path_port = 8080
                listener_port   = "metrics"
              }
            }
          }
        }
      }
    }

    task "service-frontend" {
      driver = "docker"

      config {
        image   = ""
        command = "bundle"
        args    = ["exec", "puma", "-C", "config/puma.rb"]
        ports   = ["http"]
      }

      resources {
        memory = 513
      }
    }
  }

  group "sidekiq" {
    count = 4

    update {
      max_parallel = 1
    }

    network {
      mode = "bridge"
      port "http" { to = "9359" }
      port "metrics" {}
    }

    service {
      name = "service-sidekiq"
      port = "http"
      tags = ["http", "addr:${NOMAD_HOST_ADDR_metrics}", "prometheus"]

      meta {
        metrics_port      = "${NOMAD_HOST_PORT_metrics}"
        nomad_alloc_index = "${NOMAD_ALLOC_INDEX}"
        nomad_job_name    = "${NOMAD_JOB_NAME}"
      }

      connect {
        sidecar_service {
          tags = ["service"]

          proxy {
            expose {
              path {
                path            = "/metrics"
                protocol        = "http"
                local_path_port = 9359
                listener_port   = "metrics"
              }
            }
          }
        }
      }
    }

    task "sidekiq" {
      driver       = "docker"
      kill_timeout = "15s"

      config {
        image   = ""
        command = "bundle"
        args    = ["exec", "sidekiq", "-t", "10"]
        ports   = ["http"]
      }

      resources {
        memory = 256
      }
    }
  }
}
```
### Nomad Server logs (if appropriate)
### Nomad Client logs (if appropriate)
No relevant logs discovered.
jrasell
February 3, 2023, 10:45am
Hi @dpewsey,
Thanks for raising the issue. It looks like the fix for this has been merged into the release branches and will therefore be available in the next release.
Thanks,
jrasell and the Nomad team