Issue with CRL too big

jsfrerot1 · October 6, 2021, 6:58pm

Hi, I’ve been having an issue with the number of leases I have with vault. I’m currently getting this warning:

expiration: lease count exceeds warning lease threshold: have=1216354 threshold=256000

After a long investigation, I found out that some of my servers had been stuck in a loop, generating PKI certificates every minute for a few months.
I have found a way to revoke those useless certificates, but I’m now running into the following issue.

When trying to revoke a cert (which was working for some time) I’m now getting this error message:

Error writing data to pki_int/revoke: Error making API request.

URL: PUT https://ord-vault01-001.ludia.me:8200/v1/pki_int/revoke
Code: 500. Errors:

* 1 error occurred:
        * error encountered during CRL building: error storing CRL: put failed due to value being too large; got 1106063 bytes, max: 1048576 bytes
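
For scale, the two numbers in the error give the gap directly: 1106063 - 1048576 = 57487 bytes, so the CRL is roughly 5% over the 1 MiB limit. A minimal sketch for pulling that figure out of an error line; `crl_overage` is a hypothetical helper name, not part of Vault.

```shell
# Hypothetical helper (not a Vault command): read Vault error lines on stdin
# and print how many bytes each rejected value exceeds the storage limit by.
crl_overage() {
  sed -n 's/.*got \([0-9][0-9]*\) bytes, max: \([0-9][0-9]*\) bytes.*/\1 \2/p' |
  while read -r got max; do
    echo "$((got - max))"
  done
}

# Usage:
#   echo 'error storing CRL: ... got 1106063 bytes, max: 1048576 bytes' | crl_overage
```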

I have tried to rotate the CRL, but I get a similar error:

vault read /pki_int/crl/rotate
Error reading pki_int/crl/rotate: Error making API request.

URL: GET https://ord-vault01-001.ludia.me:8200/v1/pki_int/crl/rotate
Code: 500. Errors:

* 1 error occurred:
	* error encountered during CRL building: error storing CRL: put failed due to value being too large; got 1106258 bytes, max: 1048576 bytes

Also, running a tidy doesn’t help. I use the following command:

vault write pki_int/tidy tidy_cert_store=true tidy_revoked_certs=true safety_buffer=1s

I see a lot of I/O happening for a few minutes, then it stops, and the problem isn’t resolved.

Any idea how I can continue my cleanup and restore CRL rotation?

Thank you

configuration:

vault version
Vault v1.8.2 (aca76f63357041a43b49f3e8c11d67358496959f)

  "storage": {
    "raft": {
      "path": "/opt/data/vault/storage",
      "node_id": "ord-vault01-001",
      "retry_join": {
        "leader_api_addr": "https://vault.ludia.me",
        "leader_ca_cer_file": "/etc/pki/tls/private/web_ord-vault01.ludia.me.ca-bundle",
        "leader_client_cert_file": "/etc/pki/tls/private/web_ord-vault01.ludia.me.crt",
        "leader_client_key_file": "/etc/pki/tls/private/web_ord-vault01.ludia.me.key"
      }
    }
  },

aram · October 6, 2021, 8:45pm

Try running a tidy to actually remove them.

jsfrerot1 · October 7, 2021, 11:26am

Oh, I’m sorry, I failed to mention that tidy doesn’t help. Updating my post accordingly.

mikegreen · October 7, 2021, 2:36pm

How long does the tidy run before it stops? There should be log messages.

jsfrerot1 · October 7, 2021, 5:33pm

I don’t see anything in the logs when the tidy starts. However, I see the following error in the logs:

Oct 07 11:31:44 ord-vault01-003 vault[2255]: 2021-10-07T11:31:44.086Z [ERROR] secrets.pki.pki_c031df45.tidy: error running tidy: error="error storing CRL: put failed due to value being too large; got 1106219 bytes, max: 1048576 bytes"
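
To time a tidy run, it can help to isolate the tidy lines from the rest of the server log. A small sketch, assuming Vault logs to the systemd journal under a unit named `vault` (the syslog-style prefix above suggests this, but it is an assumption); `tidy_log_lines` is a hypothetical helper name.

```shell
# Hypothetical helper: keep only PKI tidy log lines from Vault server logs on stdin.
# Matches logger names such as "secrets.pki.pki_c031df45.tidy".
tidy_log_lines() {
  grep -E 'secrets\.pki\.[^ ]*tidy'
}

# Usage (unit name "vault" is an assumption):
#   journalctl -u vault --since "24 hours ago" | tidy_log_lines
```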

aram · October 7, 2021, 11:21pm

Look at this issue; it has some suggestions on how to get around the size limit:

github.com/hashicorp/vault

PKI tidy appears to noop

opened 03:08PM - 09 Jul 18 UTC

closed 05:43PM - 10 Jul 18 UTC

dmicanzerofox

Greetings, we are running into a critical production issue where we are trying to tidy the CRL and it doesn't appear to be doing anything. In our investigation, we also noticed that the raw consul value of the CRL is over the consul limit of 512kB.

```
# vault version
Vault v0.10.2 ('3ee0802ed08cb7f4046c2151ec4671a076b76166')
```

```
# consul version
Consul v1.0.6
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)
```

### Unable to parse the CRL using openssl

We were able to pull down the pem initially.

```
curl --header "X-Vault-Token: $VAULT_TOKEN" https://X.X.X.X:8200/v1/pki/aws-us-west-2/crl/pem -k > vault-crls.pem

$ openssl crl -inform PEM -in vault-crls.pem
unable to load CRL
139994909750936:error:0906D06C:PEM routines:PEM_read_bio:no start line:pem_lib.c:701:Expecting: X509 CRL
```

But after the tidy we are now unable to pull it down at all.

```
$ curl --header "X-Vault-Token: XX" https://X.X.X.X:8200/v1/pki/aws-us-west-2/crl
{"errors":["internal error"]}
```

### Unable to rotate

```
$ VAULT_CLIENT_TIMEOUT=600 vault read pki/aws-us-west-2/crl/rotate
Error reading pki/aws-us-west-2/crl/rotate: Error making API request.

URL: GET https://X.X.X.X:8200/v1/pki/aws-us-west-2/crl/rotate
Code: 500. Errors:

* 1 error occurred:
	* error encountered during CRL building: error storing CRL: Unexpected response code: 413 (Value exceeds 524288 byte limit)
```

### Unable to reissue intermediary

```
vault write pki/aws-us-west-2/intermediate/set-signed \
certificate="-----BEGIN CERTIFICATE-----
MIIDFzCCAp6gAwIBAgIUSO9c8KtwNzQovitzqwDHV3HvC+cwCgYIKoZIzj0EAwIw
HTEbMBkGA1UEAxMSWmVyb0ZveCBSb290IENBIFFBMB4XDTE4MDcwOTEzMDkyOFoX
DTE4MDkwNzEzMDk1OFowJzElMCMGA1UEAxMcWmVyb0ZveCBJQ0EgUUEgYXdzLXVz
LXdlc3QtMjB2MBAGByqGSM49AgEGBSuBBAAiA2IABBPHwFpV9TkvJ7Xi4lWjI0NK
c9mKOQ5GsVvRGZF2VfPZ5Xceqzqs8V2F8ncy3QsBk3f7N4PYjWMIpko9KiDBg5dn
ubR9BBcdf2/EaCSJxhSfSH/n/Q1yWKiVPnz6Pc7viaOCAZMwggGPMA4GA1UdDwEB
/wQEAwIBBjASBgNVHRMBAf8ECDAGAQH/AgEAMB0GA1UdDgQWBBRTxycb92bgKOyq
6SBMxWrNJUcw4zAfBgNVHSMEGDAWgBTKK8N8DgvdjFDmCnQVq3XmuKAOJTCBmgYI
KwYBBQUHAQEEgY0wgYowSgYIKwYBBQUHMAKGPmh0dHBzOi8vdmF1bHQuc2Vydmlj
ZS5hd3MtdXMtd2VzdC0yLmNvbnN1bDo4MjAwL3YxL3BraS9yb290L2NhMDwGCCsG
AQUFBzAChjBodHRwczovL3ZhdWx0LnNlcnZpY2UuY29uc3VsOjgyMDAvdjEvcGtp
L3Jvb3QvY2EwgYsGA1UdHwSBgzCBgDBFoEOgQYY/aHR0cHM6Ly92YXVsdC5zZXJ2
aWNlLmF3cy11cy13ZXN0LTIuY29uc3VsOjgyMDAvdjEvcGtpL3Jvb3QvY3JsMDeg
NaAzhjFodHRwczovL3ZhdWx0LnNlcnZpY2UuY29uc3VsOjgyMDAvdjEvcGtpL3Jv
b3QvY3JsMAoGCCqGSM49BAMCA2cAMGQCMGbG0QGEz8B/ITNs/IApRki26cwnGF8s
83ZnEK41xKAwSffSsPTcJ+VQCQrEa7f03QIwQB8Hzb5QLy98iKdsokkm9ChFsnM4
lxbL9uupG6K3DtMkh/WnJzG0HPwAlVJlP8FL
-----END CERTIFICATE-----"
Error writing data to pki/aws-us-west-2/intermediate/set-signed: Error making API request.

URL: PUT https://X.X.X.X:8200/v1/pki/aws-us-west-2/intermediate/set-signed
Code: 500. Errors:

* 1 error occurred:
	* error storing CRL: Unexpected response code: 413 (Value exceeds 524288 byte limit)
```

### Dumping consul backend:

We dumped the vault consul tree in order to see the size of the key. We loaded the json values in python in order to get the length of the value associated with the crl: `'key': 'vault/logical/06515fe5-b2a2-ee9a-5cc3-fb54d0bf68a1/crl',`

```
In [12]: len(crls[0]['value'])
Out[12]: 699036
```

### CRL TIDY

We've executed tidy last week and multiple times today.

```
$ VAULT_CLIENT_TIMEOUT=600 vault write pki/aws-us-west-2/tidy tidy_revocation_list=true
Success! Data written to: pki/aws-us-west-2/tidy
```

After issuing this command we observed vault doing work, but the CRL is still the same size in consul.

```
$ consul monitor -log-level=TRACE
```

### Resources

- Consul/vault CPU and IOPS look healthy

I'm hoping that you could help us:

- How might this CRL value be larger than the maximum allowed by consul?
- What would you do in this situation in order to reissue the intermediary?

We appreciate your time and are able to get any other information that may be helpful. Thank you

jsfrerot1 · October 8, 2021, 12:13pm

I’ve been able to temporarily get rid of the “value being too large” error by adding this to the raft storage section of my Vault config:

"max_entry_size": 2097152,

The rotate now works, but the tidy doesn’t seem to complete: I haven’t seen anything about tidy in the logs for 24h.
Also, revoking certificates now takes over 3 seconds per revoke (I was able to revoke 3/s when I started the process). This seems to indicate that the CRL has not been cleaned. I did the CRL rotate, but it doesn’t help.

Am I hitting a bug here?

I can’t afford to put my Vault cluster in maintenance to fiddle with /sys/raw, since it serves a production environment.
And yes, I need to use the CRL; I can’t afford to disable it.

aram · October 8, 2021, 1:03pm

What parameters are you passing to tidy?

mikegreen · October 8, 2021, 10:31pm

Can you calc the time from when you manually fire it to the error?
I am thinking you can bump up the timeout on the listener and it’ll run…

jsfrerot1 · October 9, 2021, 9:35pm

It’s in my initial message.
vault write pki_int/tidy tidy_cert_store=true tidy_revoked_certs=true safety_buffer=1s

aram · October 10, 2021, 9:44am

Have you tried to run the tidy against the CA and not the INT?

Also 1 second may not be a valid parameter, try going with what’s in the documentation “24h”.

I created 100 certs about a week ago (some app deployment went off the rails).
I used this and it worked, removed all of them from the store:

curl \
    --header "X-Vault-Token: <token>" -X POST -d '{ "safety_buffer": "24h" }'  http://vault:8200/v1/pki/tidy

response:

{"request_id":"","lease_id":"","renewable":false,"lease_duration":0,"data":null,"wrap_info":null,"warnings":["Tidy operation successfully started. Any information from the operation will be printed to Vault's server logs."],"auth":null}

jsfrerot1 · October 12, 2021, 11:57am

I did try on the CA, but it’s no use. There are only a handful of certs, which are my “intermediates”.
Running the tidy with “24h” has the same result. I also get the same output as you when running with “1s”.

My main concern at this point is the CRL. I can now rotate it since I set “max_entry_size”: 2097152, but revoking more certs is very slow: one revoke every 3s now.

I think the CRL is still big and doesn’t get cleaned because my certs were valid for a year. For the CRL to work, I guess it needs to keep each revoked cert until it expires (up to a year), so it can still tell that the cert is revoked.
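
That guess can be written down as a simple condition: as far as I understand tidy, a revoked cert is only dropped once the current time is past its NotAfter plus the safety buffer. A sketch of that check, with `tidy_eligible` as a hypothetical helper name and all times in Unix epoch seconds (this mirrors the assumed rule, it does not query Vault):

```shell
# Hypothetical helper, not a Vault command. Succeeds (exit 0) if a revoked
# certificate would be old enough for tidy to drop it under the assumed rule:
#   now > NotAfter + safety_buffer
# Arguments: now (epoch s), not_after (epoch s), buffer (seconds).
tidy_eligible() {
  now=$1; not_after=$2; buffer=$3
  [ "$now" -gt $((not_after + buffer)) ]
}

# Usage (GNU date assumed for -d; cert.pem is a placeholder file name):
#   end=$(date -d "$(openssl x509 -enddate -noout -in cert.pem | cut -d= -f2)" +%s)
#   tidy_eligible "$(date +%s)" "$end" 86400 && echo "tidy can remove it"
```

Under this rule, certs issued with a one-year TTL stay on the CRL for up to a year after issuance, no matter how short the safety buffer is.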

Is there a non disruptive way to empty the CRL, knowing the risks of not being able to tell a cert is revoked?
Also how can I see the size of the CRL and the number of entries in it?

mikegreen · October 14, 2021, 2:49pm

Correct - tidy only applies to invalid and expired certificates.

You can hit the pki/crl API and pull down the CRL… then parse and count from there.
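
A minimal sketch of that, assuming `openssl` is installed and `VAULT_ADDR` is set; `pki_int` is the mount from this thread and `count_crl_entries` is a hypothetical helper name. The PEM file size is only a rough indication of what Vault stores internally.

```shell
# Hypothetical helper: count revoked-certificate entries in the text output
# of `openssl crl -text` read from stdin (one "Serial Number:" line per entry).
count_crl_entries() {
  grep -c 'Serial Number:'
}

# Usage:
#   curl -s "$VAULT_ADDR/v1/pki_int/crl/pem" -o crl.pem
#   wc -c < crl.pem                                        # PEM size in bytes
#   openssl crl -in crl.pem -noout -text | count_crl_entries
```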

For others following along - now would be a good time to implement rate limits/quotas to avoid getting into a position like this: Resource Quotas | Vault | HashiCorp Developer
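
As a rough illustration, a rate-limit quota on a single mount is created by writing to `sys/quotas/rate-limit/<name>`. The sketch below only prints the command; the quota name `pki-int-limit` and the rate of 5 requests per second are made-up values, and `quota_cmd` is a hypothetical helper.

```shell
# Hypothetical wrapper: print (not run) the command that would create a
# rate-limit quota on one mount. Arguments: quota name, mount path, rate.
quota_cmd() {
  name=$1; mount=$2; rate=$3
  printf 'vault write sys/quotas/rate-limit/%s path=%s rate=%s\n' "$name" "$mount" "$rate"
}

# Usage (values are examples only; pipe to sh to actually run it):
#   quota_cmd pki-int-limit pki_int/ 5
```

Check the quota documentation for how the limit is counted before choosing a rate, since a limit that is too low will also throttle legitimate clients.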

jsfrerot1 · October 14, 2021, 3:12pm

For my current situation, is there a way to force delete (not revoke) PKI certificates? I have a lot of useless/lost certificates because of my badly behaving clients that were generating a lot of PKI requests. I would then run a tidy to clean the CRL.
I’m able to identify which serials to be deleted, just need a way to delete them without starting from scratch.

mikegreen · October 14, 2021, 10:30pm

Without DB surgery I don’t think so. I think making a new PKI mount, rotating all your valid cert consumers to that, then revoking this PKI mount will be your only option.
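
One possible reading of that plan, as a sketch that only prints commands. Everything here is an assumption to adapt: the new mount name, a Vault-managed root mounted at `pki` (your root may live elsewhere), and the placeholders in angle brackets. `migration_plan` is a hypothetical helper.

```shell
# Hypothetical sketch: print (not run) one possible migration sequence.
# Arguments: old mount, new mount, root CA mount (all assumed names).
migration_plan() {
  old=$1; new=$2; root=$3
  cat <<EOF
vault secrets enable -path=$new pki
vault write -field=csr $new/intermediate/generate/internal common_name="<new intermediate CN>" > $new.csr
vault write -field=certificate $root/root/sign-intermediate csr=@$new.csr > $new.pem
vault write $new/intermediate/set-signed certificate=@$new.pem
# recreate roles on $new and move all clients over before continuing
vault write $root/revoke serial_number="<serial of old intermediate>"
vault secrets disable $old
EOF
}

# Usage:
#   migration_plan pki_int pki_int_v2 pki
```

Note that disabling a mount deletes its stored data, so the last two steps should wait until no client still depends on certificates from the old intermediate.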
