Better visibility and control over Docker IO

Xopherus · August 21, 2020, 2:18pm

Does the Nomad team / community have anything on the horizon to improve the experience around managing Docker IO? My team occasionally has to deal with outbound network spikes on our clients, which cause a host of networking issues for the containers running on them. One issue is related to https://github.com/hashicorp/nomad/issues/5718. During these network spikes, without fail, new Docker images fail to pull. Unfortunately it’s been very hard for us to pinpoint exactly where the network usage is coming from, because we don’t have any per-container metrics on IO usage.

Any suggestions for workarounds, or similar experiences, would be appreciated!

jrasell · August 24, 2020, 12:50pm

Hi @Xopherus, as you mention, trying to identify network issues can be very difficult, especially without the correct monitoring in place.

In addition to any monitoring specific to your server provider which may be available, the following items may help provide the better insight you’re looking for (they are Prometheus specific where possible, for consistency).

Prometheus Node Exporter - exposes hardware and OS metrics from *NIX kernels, including netstat output.
Google cAdvisor - provides resource usage and performance characteristics of running containers, including a number of network-related metrics.
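
As a rough illustration of how the cAdvisor data could be used to find the container behind an outbound spike, here is a minimal Python sketch. It assumes you have saved two scrapes of cAdvisor's Prometheus endpoint (by default `/metrics` on port 8080) a known number of seconds apart, and that the `container_network_transmit_bytes_total` counter carries a `name` label for Docker containers. The function names `tx_bytes_by_container` and `top_talkers` are made up for this example, and in a real setup you would more likely let Prometheus compute the rate for you.

```python
import re
from collections import defaultdict

# Counter exposed by cAdvisor's Prometheus endpoint (one series per interface).
METRIC = "container_network_transmit_bytes_total"

# Matches lines such as: metric{label="v",...} 123.0 [timestamp]
LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)(?:\s+\d+)?$')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def tx_bytes_by_container(exposition_text):
    """Sum transmitted bytes per container across all of its interfaces."""
    totals = defaultdict(float)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if not match or match.group(1) != METRIC:
            continue
        labels = dict(LABEL_RE.findall(match.group(2)))
        # Fall back to the cgroup id when no container name label is set.
        name = labels.get("name") or labels.get("id", "unknown")
        totals[name] += float(match.group(3))
    return dict(totals)


def top_talkers(before, after, interval_seconds):
    """Rank containers by average transmit rate (bytes/s) between two scrapes."""
    rates = {}
    for name, value in after.items():
        if name not in before:
            continue  # container appeared between scrapes; no baseline
        delta = value - before[name]
        if delta >= 0:  # a negative delta means the counter was reset
            rates[name] = delta / interval_seconds
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)
```

Calling `top_talkers(tx_bytes_by_container(first), tx_bytes_by_container(second), 60)` on two scrapes taken a minute apart returns the containers ordered by outbound bytes per second, which should make a noisy task stand out.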

I don’t believe Nomad would be the correct tool in the chain to provide additional Docker system metrics, although better control could be possible if the Docker API provides such controls.

I hope this helps.

jrasell and the Nomad team.