Issue with CRL too big

Hi, I’ve been having an issue with the number of leases I have in Vault. I’m currently getting this warning:

expiration: lease count exceeds warning lease threshold: have=1216354 threshold=256000

After a long investigation, I found out that some of my servers had been stuck in a loop, generating a PKI certificate every minute for a few months.
I have found a way to revoke those useless certificates, but I’m now running into the following issue.

When trying to revoke a cert (which worked for some time), I’m now getting this error message:

Error writing data to pki_int/revoke: Error making API request.

URL: PUT https://ord-vault01-001.ludia.me:8200/v1/pki_int/revoke
Code: 500. Errors:

* 1 error occurred:
        * error encountered during CRL building: error storing CRL: put failed due to value being too large; got 1106063 bytes, max: 1048576 bytes

I have tried to rotate the CRL, but I get a similar error:

vault read /pki_int/crl/rotate
Error reading pki_int/crl/rotate: Error making API request.

URL: GET https://ord-vault01-001.ludia.me:8200/v1/pki_int/crl/rotate
Code: 500. Errors:

* 1 error occurred:
	* error encountered during CRL building: error storing CRL: put failed due to value being too large; got 1106258 bytes, max: 1048576 bytes

Also, running a tidy doesn’t help. I use the following command:

vault write pki_int/tidy tidy_cert_store=true tidy_revoked_certs=true safety_buffer=1s

I see a lot of IO happening for a few minutes, then it stops, and the problem isn’t resolved.

Any idea how I can continue my cleanup and restore CRL rotation?

Thank you

configuration:

vault version
Vault v1.8.2 (aca76f63357041a43b49f3e8c11d67358496959f)

  "storage": {
    "raft": {
      "path": "/opt/data/vault/storage",
      "node_id": "ord-vault01-001",
      "retry_join": {
        "leader_api_addr": "https://vault.ludia.me",
        "leader_ca_cer_file": "/etc/pki/tls/private/web_ord-vault01.ludia.me.ca-bundle",
        "leader_client_cert_file": "/etc/pki/tls/private/web_ord-vault01.ludia.me.crt",
        "leader_client_key_file": "/etc/pki/tls/private/web_ord-vault01.ludia.me.key"
      }
    }
  },

Try running a tidy to actually remove them.

Oh, I’m sorry, I failed to mention that tidy doesn’t help. Updating my post accordingly.

How long does the tidy run before it stops? There should be log messages.

I don’t see anything in the logs when the tidy starts. However, I do see the following error in the logs:

Oct 07 11:31:44 ord-vault01-003 vault[2255]: 2021-10-07T11:31:44.086Z [ERROR] secrets.pki.pki_c031df45.tidy: error running tidy: error="error storing CRL: put failed due to value being too large; got 1106219 bytes, max: 1048576 bytes"

Look at this, it has some suggestions on how to reduce the limit

I’ve been able to temporarily get rid of the “value being too large” error by adding this to my Vault config:

"max_entry_size": 2097152,

The rotate now works, but the tidy doesn’t seem to complete: I haven’t seen anything about tidy in the logs in 24h.
Also, revoking certificates now takes over 3 seconds per revoke (I was able to revoke 3/s when I started the process). This seems to indicate that the CRL has not been cleaned. I did the CRL rotate, but it doesn’t help.

Am I hitting a bug here?

I can’t afford to put my Vault cluster in maintenance to fiddle with /sys/raw, since it’s used in production.
And yes, I need the CRL; I can’t afford to disable it.

What parameters are you passing to tidy?

Can you measure the time from when you manually fire it to the error?
I am thinking you can bump up the timeout on the listener and it’ll run…

It’s in my initial message.
vault write pki_int/tidy tidy_cert_store=true tidy_revoked_certs=true safety_buffer=1s

Have you tried running the tidy against the CA and not the INT?

Also, 1 second may not be a valid value; try the one from the documentation, “24h”.

I created 100 certs about a week ago (some app deployment went off the rails).
I used this and it worked, removed all of them from the store:

curl \
    --header "X-Vault-Token: <token>" -X POST -d '{ "safety_buffer": "24h" }'  http://vault:8200/v1/pki/tidy

response:

{"request_id":"","lease_id":"","renewable":false,"lease_duration":0,"data":null,"wrap_info":null,"warnings":["Tidy operation successfully started. Any information from the operation will be printed to Vault's server logs."],"auth":null}

I did try on the CA, but it’s no use. There are only a handful of certs there, which are my “intermediates”.
Running the tidy with “24h” has the same result. I also get the same output as you when running with “1s”.

My main concern at this point is the CRL. I can now rotate it since I set “max_entry_size”: 2097152, but revoking more certs is very slow: about 1 revoke per 3s now.

I think the CRL is still big and doesn’t get cleaned because my certs were valid for a year. For the CRL to work, I guess it has to keep each revoked cert until it expires, so it can still tell that the cert is revoked.

  • Is there a non-disruptive way to empty the CRL, knowing the risk of no longer being able to tell that a cert is revoked?
  • Also, how can I see the size of the CRL and the number of entries in it?

Correct: tidy applies only to invalid and expired certificates.

You can hit the pki/crl API and pull down the CRL… then parse and count from there.
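A minimal sketch of that, assuming curl and openssl are available and VAULT_ADDR points at your cluster. The function names and the pki_int default are only illustrative, and if your listener uses a private CA you would also need to pass --cacert to curl. The mount's crl endpoint returns the CRL in DER form, and openssl prints one "Serial Number:" line per revoked cert, so counting those lines gives the number of entries.

```shell
#!/usr/bin/env bash

# Count revoked entries in the text printed by `openssl crl -noout -text`.
# Reads that text on stdin; each revoked cert shows up as one "Serial Number:" line.
count_revoked() {
    # grep -c still prints 0 when nothing matches, but exits 1; mask that exit code.
    grep -c 'Serial Number:' || true
}

# Download the CRL (DER) of a PKI mount and report its size and entry count.
# The default mount name is an assumption; pass your own as the first argument.
crl_stats() {
    local mount="${1:-pki_int}" tmp
    tmp="$(mktemp)" || return 1
    if ! curl -sS --fail -o "$tmp" "${VAULT_ADDR}/v1/${mount}/crl"; then
        rm -f "$tmp"
        return 1
    fi
    echo "bytes:   $(wc -c < "$tmp" | tr -d ' ')"
    echo "entries: $(openssl crl -inform DER -in "$tmp" -noout -text | count_revoked)"
    rm -f "$tmp"
}
```

Running `crl_stats pki_int` before and after a tidy should show whether the tidy actually removed anything from the CRL.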

For others following along - now would be a good time to implement rate limits/quotas to avoid getting into a position like this: Resource Quotas | Vault | HashiCorp Developer
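As a rough sketch of what such a quota could look like: rate limit quotas are written under sys/quotas/rate-limit/<name>, and the path parameter scopes the quota to a single mount. The helper name, the quota name, and the rate below are assumptions for illustration, not recommended values.

```shell
#!/usr/bin/env bash

# Create or update a rate limit quota scoped to one mount, so that a runaway
# client cannot keep hammering it. All three arguments are up to you:
#   $1 quota name (hypothetical, e.g. pki-int-limit)
#   $2 mount path (e.g. pki_int/)
#   $3 allowed requests per second
set_mount_rate_limit() {
    local name="$1" mount="$2" rate="$3"
    vault write "sys/quotas/rate-limit/${name}" path="${mount}" rate="${rate}"
}
```

For example, `set_mount_rate_limit pki-int-limit pki_int/ 10` would cap that mount at 10 requests per second. Keep in mind that the quota applies to every request on the mount, so it would also slow down a bulk revoke.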

For my current situation, is there a way to force delete (not revoke) PKI certificates? I have a lot of useless/lost certificates because of my badly behaving clients that were generating a lot of PKI requests, and I would like to delete those and then run a tidy to clean the CRL.
I’m able to identify which serials should be deleted; I just need a way to delete them without starting from scratch.

Without DB surgery I don’t think so. I think making a new PKI mount, rotating all your valid cert consumers to that, then revoking this PKI mount will be your only option.
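For what it’s worth, the first half of that (standing up the replacement mount) might look roughly like the sketch below. It assumes the root CA lives on a separate mount that can sign a new intermediate. The function name, mount names, common name, and TTL are all placeholders, and any roles would still have to be recreated on the new mount before clients can use it.

```shell
#!/usr/bin/env bash

# Sketch only: enable a fresh PKI mount and give it a new intermediate CA
# signed by the existing root mount.
#   $1 root CA mount (e.g. pki)
#   $2 new intermediate mount (hypothetical name, e.g. pki_int_v2)
#   $3 common name for the new intermediate
new_intermediate_mount() {
    local root="$1" new="$2" cn="$3" dir rc
    dir="$(mktemp -d)" || return 1
    vault secrets enable -path="$new" pki &&
    vault write -field=csr "$new/intermediate/generate/internal" \
        common_name="$cn" > "$dir/int.csr" &&
    vault write -field=certificate "$root/root/sign-intermediate" \
        csr=@"$dir/int.csr" format=pem_bundle ttl=43800h > "$dir/int.pem" &&
    vault write "$new/intermediate/set-signed" certificate=@"$dir/int.pem"
    rc=$?
    rm -rf "$dir"   # the private key never leaves Vault; only the CSR and cert touch disk
    return "$rc"
}
```

Once every consumer has been moved over, the old intermediate can be revoked on the root mount with `vault write <root mount>/revoke serial_number=<old intermediate serial>`, and the old mount removed with `vault secrets disable <old mount>`.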