Deploying temporal.io using Nomad - need guidance on a networking issue

Hello

We are deploying Temporal using Nomad on AWS. A networking-related issue has been unresolved for the past few months, and as we start to scale up, we need to solve it to be able to deploy Temporal in a clustered configuration.

Some context

Temporal itself is made up of 4 services:

  • Worker
  • Matching
  • History
  • Frontend

Temporal has inbuilt discovery and load balancing, and each of these services can be independently scaled by deploying multiple containers. Discovery works like this: every service is given a Cassandra endpoint along with a broadcast IP and port; on startup, the service registers itself in a Cassandra table and discovers the other services via that table.

A basic clustered deployment of temporal looks like this:

All services are running as docker containers in bridge mode.

The broadcast IP is given as the underlying host’s IP (the AWS private IP). The ports are assigned randomly by Nomad. These services are also in the Consul service mesh, as each of them connects to a Cassandra instance through cql-proxy, which also runs as containers on Nomad.
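Concretely, this corresponds to the following fragments of the job file attached below (taken from the history group; the other groups are analogous):

network {
  mode = "bridge"
  port "grpc" {}   # host port assigned dynamically by Nomad
  port "gossip" {}
}

env {
  # (other env vars omitted)
  TEMPORAL_BROADCAST_ADDRESS = "${attr.unique.network.ip-address}"  # the host's private IP
  BIND_ON_IP                 = "0.0.0.0"
  HISTORY_GRPC_PORT          = "${NOMAD_PORT_grpc}"
  HISTORY_MEMBERSHIP_PORT    = "${NOMAD_PORT_gossip}"
}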

Things seem to work mostly fine: service discovery works as expected, and network connectivity across containers also works.

For example, I am able to connect from matching1 to history2. I tested this by entering the shell of the matching1 container and opening a telnet connection to history2 using 10.0.0.2:<history2 port> (this is the direct service port, not the port of the sidecar proxy). I tested this with other services as well, and things work as expected.

Now the problem…

Temporal has a condition in its matching service where it sometimes tries to reach its own container. So the matching1 container tries to establish a connection to itself, and it does this using the matching1 service’s registered IP and port: 10.0.0.1:<matching1 port>. This does not work.

I can reproduce this from the container shell as well: telnet to 10.0.0.1:<matching1 port> fails, whereas localhost:<matching1 port> works.

While Temporal retries and eventually falls back to the other matching service, which does connect, this behavior results in severe performance degradation.

Question

What causes this issue? I don’t understand bridge networking deeply enough to see why a connection from matching1 to matching1 fails, whereas a connection from matching1 to worker1 (or to any other container on the same host) works. What configuration would be needed for this connectivity to work?

Attaching the Nomad job file for reference:

variable "log_source" {
  description = "log source for datadog"
  type = string
  default = "test"
}

variable "keyspace" {
  description = "temporal keyspace"
  type = string
  default = "temporal"
}

variable "visibility_keyspace" {
  description = "temporal visibility keyspace"
  type = string
  default = "temporal_visibility"
}

variable "history_shard_count" {
  description = "number of history shards. THIS MUST NOT BE CHANGED!"
  type = number
  default = 64
}

variable "log_level" {
  description = "log level"
  type = string
  default = "info"
}

variable "temporal_server_version" {
  type = string
  default = "1.17.2"
}

job "temporal-server" {
  datacenters = [
    "ap-south-1a",
    "ap-south-1b",
    "ap-south-1c"
  ]

  type = "service"

  ### TEMPORAL HISTORY STARTS ###
  group "temporal-history" {
    count = 2

    network {
      mode = "bridge"
      port "grpc" {}
      port "gossip" {}
    }

    service {
      connect {
        sidecar_service {
          tags = ["proxy"]
          proxy {
            upstreams {
              destination_name = "cql-proxy"
              local_bind_port = 9042
            }
          }
        }
      }
    }

    task "temporal-history" {
      driver = "docker"

      template {
        data = <<EOH
matching.numTaskqueueReadPartitions:
- value: 5
  constraints: {}
matching.numTaskqueueWritePartitions:
- value: 5
  constraints: {}
EOH
        destination = "/local/config/dynamicconfig"
      }

      env {
CASSANDRA_SEEDS = "localhost"
        KEYSPACE = var.keyspace
        VISIBILITY_KEYSPACE = var.visibility_keyspace
        SKIP_SCHEMA_SETUP = true
        NUM_HISTORY_SHARDS = var.history_shard_count
        SERVICES = "history"
        LOG_LEVEL = var.log_level
        DYNAMIC_CONFIG_FILE_PATH = "/local/config/dynamicconfig"
        TEMPORAL_BROADCAST_ADDRESS = "${attr.unique.network.ip-address}"
        BIND_ON_IP = "0.0.0.0"
        HISTORY_GRPC_PORT = "${NOMAD_PORT_grpc}"
        HISTORY_MEMBERSHIP_PORT = "${NOMAD_PORT_gossip}"
      }

      config {
        image = "temporalio/server:${var.temporal_server_version}"
        ports = ["grpc", "gossip"]
        labels = {
          "com.datadoghq.ad.logs" = "[{\"source\": \"${var.log_source}\", \"service\": \"${NOMAD_GROUP_NAME}-${var.log_source}\"}]"
        }
      }

      resources {
        cpu    = 400
        memory = 400
      }
    }
  }
  ### TEMPORAL HISTORY ENDS ###

  ### TEMPORAL MATCHING STARTS ###
  group "temporal-matching" {
    count = 2

    network {
      mode = "bridge"
      port "grpc" {}
      port "gossip" {}
    }

    service {
      connect {
        sidecar_service {
          tags = ["proxy"]
          proxy {
            upstreams {
              destination_name = "cql-proxy"
              local_bind_port = 9042
            }
          }
        }
      }
    }

    task "temporal-matching" {
      driver = "docker"

      template {
        data = <<EOH
matching.numTaskqueueReadPartitions:
- value: 5
  constraints: {}
matching.numTaskqueueWritePartitions:
- value: 5
  constraints: {}
EOH
        destination = "/local/config/dynamicconfig"
      }

      env {
CASSANDRA_SEEDS = "localhost"
        KEYSPACE = var.keyspace
        VISIBILITY_KEYSPACE = var.visibility_keyspace
        SKIP_SCHEMA_SETUP = true
        NUM_HISTORY_SHARDS = var.history_shard_count
        SERVICES = "matching"
        LOG_LEVEL = "debug"
        TEMPORAL_BROADCAST_ADDRESS = "${attr.unique.network.ip-address}"
        BIND_ON_IP = "0.0.0.0"
        MATCHING_GRPC_PORT = "${NOMAD_PORT_grpc}"
        MATCHING_MEMBERSHIP_PORT = "${NOMAD_PORT_gossip}"
        DYNAMIC_CONFIG_FILE_PATH = "/local/config/dynamicconfig"
      }

      config {
        image = "temporalio/server:${var.temporal_server_version}"
        ports = ["grpc", "gossip"]
        labels = {
          "com.datadoghq.ad.logs" = "[{\"source\": \"${var.log_source}\", \"service\": \"${NOMAD_GROUP_NAME}-${var.log_source}\"}]"
        }
      }

      resources {
        cpu    = 300
        memory = 400
      }
    }
  }
  ### TEMPORAL MATCHING ENDS ###

  ### TEMPORAL WORKER STARTS ###
  group "temporal-worker" {
    count = 2

    # Temporal worker uses temporal itself to run some internal workflows
    # therefore, it needs a frontend node location
    # they've now added a property called PUBLIC_FRONTEND_ADDRESS to explicitly define where the
    # front end node can be found
    # See - https://community.temporal.io/t/server-set-frontend-ip-on-worker-service/2489
    # See - https://community.temporal.io/t/error-starting-temporal-sys-tq-scanner-workflow-workflow/271
    # See - https://github.com/temporalio/temporal/pull/671
    service {
      connect {
        sidecar_service {
          tags = [
            "sidecar-proxy"
          ]
          proxy {
            upstreams {
              destination_name = "temporal-frontend-grpc"
              local_bind_port = 9200
            }
            upstreams {
              destination_name = "cql-proxy"
              local_bind_port = 9042
            }
          }
        }
      }
    }

    network {
      mode = "bridge"
      port "grpc" {}
      port "gossip" {}
    }

    task "temporal-worker" {
      driver = "docker"

      template {
        data = <<EOH
matching.numTaskqueueReadPartitions:
- value: 5
  constraints: {}
matching.numTaskqueueWritePartitions:
- value: 5
  constraints: {}
EOH
        destination = "/local/config/dynamicconfig"
      }

      env {
CASSANDRA_SEEDS = "localhost"
        KEYSPACE = var.keyspace
        VISIBILITY_KEYSPACE = var.visibility_keyspace
        SKIP_SCHEMA_SETUP = true
        NUM_HISTORY_SHARDS = var.history_shard_count
        SERVICES = "worker"
        LOG_LEVEL = var.log_level
        DYNAMIC_CONFIG_FILE_PATH = "/local/config/dynamicconfig"
        TEMPORAL_BROADCAST_ADDRESS = "${attr.unique.network.ip-address}"
        BIND_ON_IP = "0.0.0.0"
        PUBLIC_FRONTEND_ADDRESS = "${NOMAD_UPSTREAM_ADDR_temporal_frontend_grpc}"
        WORKER_GRPC_PORT = "${NOMAD_PORT_grpc}"
        WORKER_MEMBERSHIP_PORT = "${NOMAD_PORT_gossip}"
      }

      config {
        image = "temporalio/server:${var.temporal_server_version}"
        ports = ["grpc", "gossip"]
        labels = {
          "com.datadoghq.ad.logs" = "[{\"source\": \"${var.log_source}\", \"service\": \"${NOMAD_GROUP_NAME}-${var.log_source}\"}]"
        }
      }

      resources {
        cpu    = 200
        memory = 400
      }
    }
  }
  ### TEMPORAL WORKER ENDS ###

  ### TEMPORAL FRONTEND STARTS ###
  group "temporal-frontend" {
    count = 2

    network {
      mode = "bridge"
      port "grpc" {
        to = 7233
      }
      port "gossip" {}
    }

    service {
      name = "temporal-frontend-grpc"
      port = "7233"
      check {
        type            = "grpc"
        port            = "grpc"
        interval        = "10s"
        timeout         = "2s"
        grpc_service    = "temporal.api.workflowservice.v1.WorkflowService"
      }

      connect {
        sidecar_service {
          tags = ["proxy"]
          proxy {
            upstreams {
              destination_name = "cql-proxy"
              local_bind_port = 9042
            }
          }
        }
      }
    }

    task "temporal-frontend" {
      driver = "docker"

      template {
        data = <<EOH
matching.numTaskqueueReadPartitions:
- value: 5
  constraints: {}
matching.numTaskqueueWritePartitions:
- value: 5
  constraints: {}
EOH
        destination = "/local/config/dynamicconfig"
      }

      env {
CASSANDRA_SEEDS = "localhost"
        KEYSPACE = var.keyspace
        VISIBILITY_KEYSPACE = var.visibility_keyspace
        SKIP_SCHEMA_SETUP = true
        NUM_HISTORY_SHARDS = var.history_shard_count
        SERVICES = "frontend"
        LOG_LEVEL = var.log_level
        DYNAMIC_CONFIG_FILE_PATH = "/local/config/dynamicconfig"
        TEMPORAL_BROADCAST_ADDRESS = "${attr.unique.network.ip-address}"
        BIND_ON_IP = "0.0.0.0"
        FRONTEND_GRPC_PORT = "${NOMAD_PORT_grpc}"
        FRONTEND_MEMBERSHIP_PORT = "${NOMAD_PORT_gossip}"
      }

      config {
        image = "temporalio/server:${var.temporal_server_version}"
        ports = ["grpc", "gossip"]
        labels = {
          "com.datadoghq.ad.logs" = "[{\"source\": \"${var.log_source}\", \"service\": \"${NOMAD_GROUP_NAME}-${var.log_source}\"}]"
        }
      }

      resources {
        cpu    = 200
        memory = 300
      }
    }
  }
  ### TEMPORAL FRONTEND ENDS ###
}

Hi @animeshjain,

My response assumes that 10.0.0.* are the IP addresses of the host machines; if this is incorrect, please let me know a little more about the cluster topology.

I believe the problem is that the Nomad bridge networking mode does not currently support network hairpinning. When the matching service attempts to connect to 10.0.0.1:<matching1 port>, the request must hairpin from the network namespace where the application resides, into the host machine’s namespace, and then back into the isolated network namespace of the matching application.

We currently have this PR open which would allow for hairpinning and resolve the situation you’re seeing. I don’t believe there is another workaround at the moment, although I am not familiar with Temporal and wonder if there is a way to configure how the matching service talks to itself?

Thanks,
jrasell and the Nomad team

Yes, you understood correctly. That is a bummer. I did post about this issue on the Temporal forums, linked for reference.

We probably have two options as of now:

  1. Deploy Temporal without Nomad
  2. Try to patch Temporal ourselves to route requests to the same service instance via localhost

Hi @animeshjain,

One potential workaround would be to utilise your own CNI configuration which enables hairpin networking. The Nomad bridge networking mode uses CNI with this configuration, which could be adapted as needed, although I would urge caution with this approach.

The CNI config would look something like the below, and you would reference it via the network mode parameter. Additional documentation for Nomad and CNI is available on the website.

{
  "cniVersion": "0.4.0",
  "name": "bridge-hairpin",
  "plugins": [
    {
      "type": "loopback"
    },
    {
      "type": "bridge",
      "bridge": "nomad-bridge-hairpin",
      "ipMasq": true,
      "isGateway": true,
      "forceAddress": true,
      "hairpinMode": true,
      "ipam": {
        "type": "host-local",
        "ranges": [
          [
            {
              "subnet": "172.26.164.0/20"
            }
          ]
        ],
        "routes": [
          { "dst": "0.0.0.0/0" }
        ]
      }
    },
    {
      "type": "firewall",
      "backend": "iptables",
      "iptablesAdminChainName": "nomad"
    },
    {
      "type": "portmap",
      "capabilities": {"portMappings": true},
      "snat": true
    }
  ]
}
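
As a rough sketch (not something I have tested), assuming the file above is dropped into each client’s cni_config_dir (which defaults to /opt/cni/config), a group would then reference the CNI network by its name field instead of the built-in bridge mode:

network {
  # "cni/<name>" must match the "name" field of the CNI config above
  mode = "cni/bridge-hairpin"
  port "grpc" {}
  port "gossip" {}
}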

Thanks,
jrasell and the Nomad team

Interesting. Thanks for the tip @jrasell, let me read up on this a bit more and test it out.

Hi @jrasell,

I created the CNI config on the Nomad clients in /opt/cni/config and updated my job to use mode = "cni/bridge-hairpin" instead of mode = "bridge". On running the job, I now get:

### Plan Error

1 error occurred:
  * Consul Connect sidecar requires bridge network, found "cni/bridge-hairpin" in group "temporal-history"

Any way to get around this?