Vault not allowing a colon in the certificate CN

Hello,

I am generating my Kubernetes certificates with Vault through Terraform, but I’ve run into an issue: certain client certificates, such as those for the controller-manager or the scheduler, need a system: prefix in their common name, for example:

Subject: CN = system:kube-scheduler

This is the Vault PKI role, managed through Terraform, that these certificates are issued against:

resource "vault_pki_secret_backend_role" "kube1" {
        depends_on = [ vault_pki_secret_backend_intermediate_set_signed.kube1_etcd_ca ]
        backend = vault_mount.kube1_ca.path
        name = "kubernetes"
        ttl = 283824000
        allow_ip_sans = true
        key_type = "rsa"
        key_bits = "2048"
        allow_any_name = true
        allowed_domains = ["*"]
        allow_glob_domains = true
        allow_subdomains = true
}

And this is how I’m generating one of the certificates:

resource "vault_pki_secret_backend_cert" "kube1_scheduler" {
        for_each  = var.kubernetes_servers
        depends_on = [ vault_pki_secret_backend_role.kube1 ]
        backend = vault_mount.kube1_ca.path
        name = vault_pki_secret_backend_role.kube1.name
        common_name = "system:kube-scheduler"
        ttl = 283824000
}

Is there an option that allows a colon in the CN, or is this simply impossible to do in Vault?
This is the error that I’m getting:

╷
│ Error: error creating certificate system:kube-scheduler by kubernetes for PKI secret backend "kube1-ca": Error making API request.
│
│ URL: PUT https://vault-0.company.internal:8200/v1/kube1-ca/issue/kubernetes
│ Code: 400. Errors:
│
│ * common name system:kube-scheduler not allowed by this role
│
│   with vault_pki_secret_backend_cert.kube1_scheduler["kube-controlplane-2"],
│   on certificates.tf line 407, in resource "vault_pki_secret_backend_cert" "kube1_scheduler":
│  407: resource "vault_pki_secret_backend_cert" "kube1_scheduler" {
│
╵

You will need to set cn_validations to disabled on the role to allow common names that are not hostnames: PKI - Secrets Engines - HTTP API | Vault | HashiCorp Developer

That option doesn’t exist in the Terraform provider though, so unfortunately this doesn’t seem to help much:
https://registry.terraform.io/providers/hashicorp/vault/latest/docs/resources/pki_secret_backend_role

Yes, this is one of the downsides of using Terraform: providers need an update to expose each new feature of the underlying platform.

You’d need to either build your own version of terraform-provider-vault temporarily, or chase HashiCorp to add the option.
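
If you want to stay in Terraform in the meantime, one possible stopgap, sketched here but untested against your setup, is to manage that role through vault_generic_endpoint instead of vault_pki_secret_backend_role. It just writes raw JSON to an arbitrary Vault API path, so it can carry cn_validations even though the role resource can’t. The two would fight over the same path, though, so it has to replace your existing role resource rather than sit next to it (resource names below are placeholders):

resource "vault_generic_endpoint" "kube1_role" {
  depends_on = [vault_pki_secret_backend_intermediate_set_signed.kube1_etcd_ca]

  # Same path the provider-native role resource writes to: <mount>/roles/<name>
  path = "${vault_mount.kube1_ca.path}/roles/kubernetes"

  # A write to this endpoint replaces the whole role, so every option you rely on
  # has to be repeated here, plus the cn_validations the provider doesn't expose yet.
  data_json = jsonencode({
    ttl                = 283824000
    allow_ip_sans      = true
    key_type           = "rsa"
    key_bits           = 2048
    allow_any_name     = true
    allowed_domains    = ["*"]
    allow_glob_domains = true
    allow_subdomains   = true
    cn_validations     = "disabled"
  })

  # The role read-back doesn't match the written payload field-for-field,
  # so relax the drift checks instead of fighting them.
  disable_read         = true
  ignore_absent_fields = true
}

Whether that trade-off is acceptable is up to you, but it at least avoids forking the provider.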

I’ve opened a feature request for this, but it all seems a bit hopeless. It’s quite clear that HashiCorp is not particularly friendly towards self-hosted setups; the obstacles are unending. Maybe I’m wrong, I don’t know, I might be missing something, but I’ve been using all these tools, Terraform, Packer, Vault, Consul, for quite some time now.

So, if I understand this correctly, until now nobody has successfully created Kubernetes certificates using Vault through Terraform. You’d expect complete compatibility here, given that we’re not talking about some obscure software; Kubernetes is really the thing people talk about most.

It’s just been a terrible experience, to put it bluntly :slight_smile:

I think that’s an over-generalisation.

I do agree that there are parts of the products which are not as polished as I would hope, but I don’t think a lack of support for self-hosted setups is the problem.

In my opinion, the problem you are experiencing here is that Vault’s PKI support is narrowly focussed on creating TLS certificates for basic HTTPS use-cases only, and the support for more flexible use-cases is immature, missing, or incomplete.

I agree with you that the Vault PKI secrets engine is not well suited for acting as a Kubernetes cluster CA.

I have similar issues with Terraform when trying to provision clusters (Consul, Vault, Kubernetes, whatever) where one node needs information (tokens, data) from another. The code becomes terribly redundant because of this.
I don’t think it’s an over-generalisation. People (and HashiCorp) rely on whatever the cloud has to offer, which automatically solves these issues and keeps your code at least sane. There are so many hacks you have to repeat over and over just to set up basic things in a self-hosted environment. I’m not saying I’m doing everything right; I’m sure there are many things I could do more elegantly even within these limits, but it’s still terribly easy to lose perspective. Public clouds abstract away a huge amount of logic, and that’s why it all seems reasonable (and it is, from the perspective of the user). It stops being reasonable when you have to build the tools yourself.

Just as a side note: I also use Proxmox. A bug was introduced in Packer where qcow2 is simply no longer accepted as a storage type, so you can no longer create virtual machines with that kind of storage at all. Under normal circumstances you’d call that a huge bug, but since it’s Proxmox, it’s low to no priority. It’s been months and nobody is taking care of it, because Proxmox and, to a lesser extent, it’s true, self-hosting don’t really matter :slight_smile:

Running Vault and Consul on plain old VMs, built and installed manually, is a valid way of self-hosting. Back when Vault and Consul were being conceived, back in the 0.x days, it may even have been the only expected target environment.

I’m not saying that’s a good thing, I’m just saying that “hard to self-host” and “hard to provision brand-new clusters ephemerally, in a fully automated manner” aren’t quite the same thing, and precision helps guide conversations in useful directions.


That’s indeed a fair point; those are two different things. Having said that, Terraform is all about automation, and I’d have expected more elegant ways to provision virtual machines (less redundant code, also when it comes to modules and so on), and better compatibility at least between HashiCorp’s own products.

Regarding Proxmox and Packer: there’s also another error where it complains if I unmount the CD while creating the image. The way I handle it now is to pin the Packer version so I don’t hit the qcow2 bug again (roughly the block sketched below), and to force it to ignore the errors at the end, meaning it doesn’t delete the virtual machine. Or I simply tell it not to unmount the CD-ROM anymore, but that’s a problem, because the clones will inherit it :slight_smile: It’s embarrassing. But OK, I’m finished with my diatribe.
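
The enforcement side of that pinning lives in the packer block at the top of the template, something along these lines (the version numbers here are just placeholders, not the exact ones I’d vouch for):

packer {
  # Reject mismatched Packer core versions (placeholder constraint, adjust to the
  # last known-good release for your setup)
  required_version = "~> 1.9"

  required_plugins {
    proxmox = {
      # Pin the Proxmox plugin to a release from before the qcow2 regression
      # (placeholder version, adjust to whatever still works for you)
      version = "= 1.1.3"
      source  = "github.com/hashicorp/proxmox"
    }
  }
}

required_version only rejects a mismatched binary, of course; the actual Packer install still has to be held at that version.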

I guess what I’m trying to do is overkill anyway, so I’ll probably just let kubeadm manage the certificates. Even if I solve this issue there’s still a lot to do, and I’m not sure it’s worth generating all the certificates myself, given the complexity of Kubernetes itself.

Later edit: I’ll still have the certificates for the etcd cluster issued through Vault, as it’s also easy to add them to the initial control plane node.
