Nomad bridge network cni plugin fails with many allocations on host

tommy · January 16, 2024, 6:31pm

I have a strange issue where Nomad’s use of the CNI bridge plugin fails once the number of containers on my host reaches around 30. Below that number everything works just fine, but at around 30 I start getting the error below.

This is really odd, as I’ve had more than 40 containers running successfully on the host in the past, but I am suddenly getting this error for some unknown reason. I’ve exercised my Google-fu to the max, but I can’t find a single mention of this issue.

I’m running Nomad on a single server/client node.

Setup Failure: 
  failed to setup alloc: pre-run hook "network" failed: 
  failed to configure networking for alloc: 
  failed to configure network:
  plugin type="bridge" failed (add): bridge port in error state: down

There is nothing fancy in my network configuration for the job:

    network {
      mode = "bridge"

      port "http" {
        to = 80
      }
    }

My environment:
OS: Void Linux
Nomad: 1.7.2
CNI: 1.3.0

Ranjandas · January 17, 2024, 11:05pm

Hi @tommy,

It looks like the error is coming from the bridge CNI plugin failing to bring up the host veth interface. Both the CNI plugin and Nomad retry a couple of times and finally return an error, so this is probably an underlying host issue.

Assuming you are using the default Nomad networking settings, can you share the output of the following command?

ls -1 /sys/class/net/nomad/brif/  | wc -l
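Since the error mentions a bridge port in the "down" state, it may also help to look at the state of each port rather than only the count. The following is a minimal sketch, assuming the default bridge name `nomad` and the standard sysfs layout under `/sys/class/net`; the function name `bridge_port_states` and its second argument (an alternative sysfs root) are made up for illustration.

```shell
#!/bin/sh
# bridge_port_states [BRIDGE] [SYSROOT]
# Prints "<interface> <operstate>" for every port attached to BRIDGE.
# BRIDGE defaults to "nomad" and SYSROOT to /sys/class/net; both the
# function name and the SYSROOT argument are hypothetical helpers.
bridge_port_states() {
  bridge="${1:-nomad}"
  sysroot="${2:-/sys/class/net}"
  for port in "$sysroot/$bridge/brif"/*; do
    # Skip the unexpanded glob when the bridge has no ports.
    [ -e "$port" ] || continue
    name=$(basename "$port")
    # operstate is typically "up", "down", "lowerlayerdown" or "unknown".
    state=$(cat "$sysroot/$name/operstate" 2>/dev/null || echo unknown)
    printf '%s %s\n' "$name" "$state"
  done
}
```

Running `bridge_port_states | grep -v ' up$'` would then list only the ports that are not reported as up.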

tommy · January 18, 2024, 7:52am

Thank you for responding, @Ranjandas. You are correct in assuming that I’m using the default Nomad networking settings; I don’t have anything exotic in my configuration.

Here’s the output you requested:

❯ ls -1 /sys/class/net/nomad/brif/  | wc -l

24

It was a bit surprising to me that there are only 24, as I was hoping to find leftover interfaces or something similar. I’ve heard of people running several hundred allocations on one host, so this is very odd.

I’m also inclined to think that there is something underlying on the host causing this issue, as there is only one host in my homelab environment experiencing this.

Very keen on investigating this further, so your input is much appreciated!

Ranjandas · January 18, 2024, 11:57am

Here are some of the questions I have:

Do you have other hosts in your home lab running Void Linux with the same kernel version where everything works fine?
What hardware is it: x86 servers, Raspberry Pi, etc.?
Did you check the kernel logs when things are failing?

To be honest, I don’t know what could be causing it. However, considering you agree it is not related to Nomad, I would recommend trying the following steps so that you can get closer to the issue and figure out the root cause. This would speed up the troubleshooting as well. The high-level steps are as follows.

Compile a copy of cnitool (you need Go for this):

git clone https://github.com/containernetworking/cni
cd cni/cnitool
go build cnitool.go

Create the nomad CNI conflist in the current directory. Copy the contents of this JSON to a file named nomad.conflist

Now, try to manually create namespaces and create and configure the veth pairs using cnitool.

sudo ip netns add <new-namespace-name>
sudo NETCONFPATH=. CNI_PATH=/opt/cni/bin/ ./cnitool add nomad /var/run/netns/<new-namespace-name>

You can try creating many of these by randomising the namespace name. While doing this, observe the kernel logs and see if you can find anything to help explain the root cause.
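The repetition could be scripted along these lines. This is only a sketch under the same assumptions as the commands above (`./cnitool` and `nomad.conflist` in the current directory, plugins in `/opt/cni/bin/`); the function names, the `cnitest` prefix and the `DRY_RUN` switch are made up for illustration, and it uses numbered rather than random namespace names so that cleanup is easy.

```shell
#!/bin/sh
# run: print the command when DRY_RUN=1 (the default in this sketch),
# otherwise execute it. DRY_RUN is a hypothetical switch, not a cnitool option.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# stress_bridge [COUNT] [PREFIX]
# Creates COUNT namespaces named PREFIX-1..PREFIX-COUNT and attaches each
# one to the "nomad" CNI network, stopping at the first failure.
stress_bridge() {
  count="${1:-40}"
  prefix="${2:-cnitest}"
  i=1
  while [ "$i" -le "$count" ]; do
    ns="${prefix}-${i}"
    run sudo ip netns add "$ns" || return 1
    if ! run sudo NETCONFPATH=. CNI_PATH=/opt/cni/bin/ ./cnitool add nomad "/var/run/netns/$ns"; then
      echo "cnitool add failed for namespace $ns ($i of $count)" >&2
      return 1
    fi
    i=$((i + 1))
  done
}

# cleanup_bridge [COUNT] [PREFIX]
# Detaches and deletes the namespaces created by stress_bridge.
cleanup_bridge() {
  count="${1:-40}"
  prefix="${2:-cnitest}"
  i=1
  while [ "$i" -le "$count" ]; do
    ns="${prefix}-${i}"
    run sudo NETCONFPATH=. CNI_PATH=/opt/cni/bin/ ./cnitool del nomad "/var/run/netns/$ns"
    run sudo ip netns del "$ns"
    i=$((i + 1))
  done
}
```

Calling `stress_bridge 3` prints the commands it would run; `DRY_RUN=0 stress_bridge 40` actually runs them, and `DRY_RUN=0 cleanup_bridge 40` removes the test namespaces afterwards. Following the kernel log in a second terminal while it runs should show at which namespace the port stops coming up.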

I couldn’t reproduce the issue, so my input will be minimal.

tommy · January 18, 2024, 7:57pm

Yeah, I have two more Void Linux clients running in my environment, and neither of them has this problem with the bridge failing. I haven’t stress tested either of them, though, since the failing node of mine is running far more containers than the other clients. I could do a test where I run a larger number of containers on one of the other hosts as well.

The hardware is x86, the host with the issue is an AMD Ryzen based machine.

I haven’t checked the kernel logs yet; I will try to do that. The only odd thing I’ve seen in the logs was just now: Nomad is logging the error below for some reason. I’m not sure whether this would affect the bridge network, though, since the bridge networks work well up to a certain point.

2024-01-18T19:54:38.41160 daemon.notice: Jan 18 20:54:38 nomad:     2024-01-18T20:54:38.411+0100 [ERROR] nomad: failed to reconcile: error="error removing server with duplicate ID \"c5c9e5c8-750c-ea38-da41-f498584f37ae\": need at least one voter in configuration: {[]}"

I will try to follow your suggestions for troubleshooting and post back again as soon as I’ve been able to do so.

tommy · January 19, 2024, 4:14pm

@Ranjandas I now checked kernel logs while starting a nomad job, which again failed with the bridge port down error, and below are the logs from that event. The same errors keep repeating when Nomad retries with a new container allocation.

2024-01-19T16:12:55.51155 kern.info: [12665.858023] nomad: port 17(vethc92c984f) entered blocking state
2024-01-19T16:12:55.51156 kern.info: [12665.858030] nomad: port 17(vethc92c984f) entered disabled state
2024-01-19T16:12:55.51157 kern.info: [12665.858048] vethc92c984f: entered allmulticast mode
2024-01-19T16:12:55.51158 kern.info: [12665.858182] vethc92c984f: entered promiscuous mode
2024-01-19T16:13:00.28254 kern.info: [12670.629565] vethc92c984f (unregistering): left allmulticast mode
2024-01-19T16:13:00.28257 kern.info: [12670.629570] vethc92c984f (unregistering): left promiscuous mode
2024-01-19T16:13:00.28258 kern.info: [12670.629574] nomad: port 17(vethc92c984f) entered disabled state

Update: I did another run and got this in the kernel logs. It’s almost the same, but I noticed that the port is referenced by two different veth IDs. Is that normal?

2024-01-20T08:07:40.50958 kern.info: [69950.839512] nomad: port 17(vethe84596d8) entered disabled state
2024-01-20T08:08:10.67453 kern.info: [69981.004497] nomad: port 17(vethcb4e77ba) entered blocking state
2024-01-20T08:08:10.67455 kern.info: [69981.004504] nomad: port 17(vethcb4e77ba) entered disabled state
2024-01-20T08:08:10.67456 kern.info: [69981.004522] vethcb4e77ba: entered allmulticast mode
2024-01-20T08:08:10.67457 kern.info: [69981.004628] vethcb4e77ba: entered promiscuous mode
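To see how often a port number shows up with more than one veth name, the log lines can be summarised with a small filter. This is a rough sketch that assumes the log format shown above and the bridge name `nomad`; the function name `veths_per_port` is made up. As far as I understand, the bridge hands a freed port number to the next interface that is attached, so a retry showing the same port number with a new veth name is probably not unusual on its own.

```shell
#!/bin/sh
# veths_per_port
# Reads kernel log lines on stdin and prints every "nomad" bridge port
# number that appears with more than one veth name, for example:
#   port 17: vethcb4e77ba vethe84596d8
# The function name is a hypothetical helper for this thread.
veths_per_port() {
  # Extract "<port> <veth>" pairs from lines like
  #   nomad: port 17(vethcb4e77ba) entered blocking state
  sed -n 's/.*nomad: port \([0-9][0-9]*\)(\(veth[0-9a-f]*\)).*/\1 \2/p' |
    sort -u |
    awk '
      { count[$1]++; names[$1] = names[$1] " " $2 }
      END {
        for (p in count)
          if (count[p] > 1)
            print "port " p ":" names[p]
      }'
}
```

It can be fed from wherever the kernel messages end up, for example `dmesg | veths_per_port`.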
