Using Consul-Template to update certificates for Consul with Vault as the CA

I’ve been setting up a little HashiPi stack at home just to get my head around how to use these tools. It’s all been pretty great, with a few bumps here and there.

I’ve made my way through the Learn guides for Vault, Nomad, Consul, and Consul-Template, but there aren’t (outside of a few points of integration) a lot of guides that walk you through workflows when using them all together. I’ve run into a situation and could use some advice on best practices.

I’ve got Nomad and Consul set up to use TLS. The certificates are all issued by Vault acting as their CA. Consul-Template writes these certs to disk via its Vault integration, and Nomad and Consul are set to reload when new certs are issued.

This led to some problems. Vault uses Consul for service registration, and I’m using commands after template renderings to reload Consul. With a reload command executing after each template rendering, Consul tries to reload after the cert gets rendered but before the key. This of course causes Consul to explode, which then causes Vault to start to explode, since it’s using Consul for service registration and I am trying to follow best practices by routing to names like vault.service.consul so I’m not reliant on a single named node or IP being available. This also causes Nomad to explode, and eventually all active jobs, as Vault is no longer available to generate certs or retrieve secrets. It’s actually a fairly marvelous demolition-like implosion. :stuck_out_tongue:

If I take out all the commands from the Consul cert config for CT and periodically restart Consul by hand, everything works great. But this is DevOps. Who wants to manually restart things?
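One way to guard against the half-rendered state described above, offered as a minimal sketch rather than a recommended practice: point each template's command at a small wrapper that refuses to reload until the cert and key on disk actually belong together. The script name `reload_guard.py` and the function names are hypothetical. It assumes the key is an unencrypted PEM file, and it relies on Python's standard `ssl.SSLContext.load_cert_chain`, which raises an error when the private key does not match the certificate.

```python
import ssl
import subprocess
import sys


def pair_is_consistent(cert_path, key_path):
    """Return True only if both files exist and the key matches the cert."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    try:
        # load_cert_chain raises ssl.SSLError when the private key does not
        # match the certificate, and FileNotFoundError when a file is missing.
        ctx.load_cert_chain(certfile=cert_path, keyfile=key_path)
    except OSError:  # ssl.SSLError and FileNotFoundError both subclass OSError
        return False
    return True


def reload_if_ready(cert_path, key_path, reload_cmd, run=subprocess.run):
    """Run reload_cmd only when the rendered cert/key pair is consistent."""
    if not pair_is_consistent(cert_path, key_path):
        # Mid-rotation: the other file has not been rendered yet, so skip.
        # The command attached to the next render will call us again.
        return False
    run(reload_cmd, check=True)
    return True


if __name__ == "__main__" and len(sys.argv) >= 4:
    # Hypothetical usage: reload_guard.py <cert> <key> <reload command...>
    reload_if_ready(sys.argv[1], sys.argv[2], sys.argv[3:])
    # Exit 0 even when the reload was skipped, so a skipped reload is not
    # reported as a failed command.
    sys.exit(0)
```

Used as something like `reload_guard.py agent.crt agent.key consul reload`, the first render of a rotation is skipped and the second triggers the reload. If both files happen to land before the first command runs, the reload fires twice, which should be harmless.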

So… I’m attempting some remediations, but I’m not sure what just “seems” like a good approach, or what is actually an anti-pattern. Here are my questions thus far:

  1. I presume I should be using the exec {} block in my consul-template config rather than running a command after each template rendering? Is that the best way to make sure I only reload Consul and Nomad after all of their respective certs have been rendered? If so, and (as an example) the ca.crt doesn’t need to be renewed and CT skips it, does the exec command still fire? I could be reading this wrong, but the docs seem to indicate that if ANY of the templates don’t render for whatever reason, the exec command won’t fire. Any clarity here would be appreciated.

  2. In order to reload Consul (as opposed to restarting it), Consul-Template needs access to a sufficiently-permissioned token. The Vault integration for CT has an attribute I can set to point to a vault_agent_token_file, but curiously, I can’t find a similar attribute for pointing to a Consul token file from the CT config. It seems like the token has to be exported via CLI in advance of any commands, hard-coded into the CT config, or stored in .bashrc/.bash_profile/etc, none of which is a safe or great place to store sensitive values. So what is the safest/recommended way to set the CONSUL_HTTP_TOKEN var so that consul-template won’t have any trouble accessing it from a cold boot?

  3. What’s the best way to renew Vault’s certificates? The Vault server’s certs currently have a TTL of 30d, but I am trying to get to a place where every cert everywhere is renewed every 24h. Using Consul-Template to reach into Vault to generate new certificates for Vault feels… dangerous. Is there a recommended way to “Indiana Jones” in the new certs for Vault using Consul-Template without the temple collapsing in around me?
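On question 2, a rough sketch of one possible workaround, not a documented Consul-Template feature: have the reload command go through a wrapper that reads the token from a tightly-permissioned file and sets CONSUL_HTTP_TOKEN only in the child process's environment. The function names and any token path are hypothetical. It is also worth checking whether the Consul CLI's own token-file options already cover this before adopting a wrapper.

```python
import os
import stat
import subprocess


def load_token(token_path):
    """Read a token from a file that only its owner can access."""
    mode = os.stat(token_path).st_mode
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        # Refuse files that group or other users can read, write, or execute.
        raise PermissionError(f"{token_path} must not be accessible to group/other")
    with open(token_path, encoding="utf-8") as fh:
        return fh.read().strip()


def env_with_token(token_path, base_env=None):
    """Return a copy of the environment with CONSUL_HTTP_TOKEN added."""
    env = dict(os.environ if base_env is None else base_env)
    env["CONSUL_HTTP_TOKEN"] = load_token(token_path)
    return env


def run_with_token(cmd, token_path, run=subprocess.run):
    """Run cmd (for example ["consul", "reload"]) with the token injected."""
    return run(cmd, env=env_with_token(token_path), check=True)
```

The token never lands in a shell profile or in the CT config itself; it lives in one file (mode 0600) that the service user can read at cold boot.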

Thanks in advance,
Sam

…Bueller? Still trying to figure this out.

I would also append to this question as I’ve been having trouble getting my head around how to strategize my certificate issuance.

Originally, when generating my certificates from consul-template, I was giving each agent, whether client or server, the most limited common_name and alt_names, only 127.0.0.1 for ip_address.

I have since run into situations (particularly when trying to get Prometheus to scrape metrics) where the node IP needed to be included in the certificate. This pattern, while seemingly best practice, ended up producing a lot of certs that needed to be issued and renewed, which on my RPi 4s with 4GB of RAM was causing Vault to completely collapse. I’m setting up the cluster again with the new 8GB variants and crossing my fingers.
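To make the per-agent scoping concrete, here is a small sketch of the request parameters such a cert might carry once the node IP is added. The function names are hypothetical, the `server.<dc>.consul` / `client.<dc>.consul` naming is assumed from Consul's usual TLS convention, and `common_name`, `alt_names`, `ip_sans`, and `ttl` are the Vault PKI issue parameters I'm assuming are in play. The 24h default mirrors the renewal target mentioned earlier.

```python
import ipaddress


def issue_params(node_ip, datacenter, is_server, domain="consul", ttl="24h"):
    """Build narrowly scoped PKI issue parameters for one agent."""
    ipaddress.ip_address(node_ip)  # raises ValueError on a malformed address
    role = "server" if is_server else "client"
    return {
        "common_name": f"{role}.{datacenter}.{domain}",
        "alt_names": "localhost",
        # Loopback plus the node's own address, so scrapers that connect
        # by IP (such as Prometheus) can validate the certificate.
        "ip_sans": ",".join(["127.0.0.1", node_ip]),
        "ttl": ttl,
    }


def as_template_args(params):
    """Render params as the key=value strings a template call would take."""
    return [f"{key}={value}" for key, value in params.items()]
```

For example, `issue_params("192.168.1.10", "dc1", True)` (a made-up address) yields a cert request for `server.dc1.consul` with `ip_sans` of `127.0.0.1,192.168.1.10`. One such request per agent per product is where the issuance volume comes from.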

All this has got me thinking whether I should be refactoring my cert strategy. I would like to keep a low risk tolerance, as this is my training bed to roll this stuff out at work eventually, and I care about my privacy / how hackable my local network is.

Is it a bad idea to just issue a single cert that all agents (regardless of product) can use? My thinking was that I would have one cert / role for all the nodes and an additional cert / role for mTLS. My reasoning is that I was constantly running into certs that wouldn’t renew, Vault issues, etc., causing cascading failures and a sizable cleanup process to get all the certs re-issued and deployed securely to each Pi.

How are folks keeping this manageable for their own setups?

Thanks in advance,
Sam