Suggested Strategy for Run-Once Kubernetes Jobs with Terraform?

Hey all,

A while back, our team went all in on Terraform. One of our most used providers is the Kubernetes provider. The ability to interpolate values from GCP and other sources was solid enough that it eliminated our need for self-made Helm charts :), and there’s basically no YAML in our infrastructure code base anymore. All deployments, cronjobs, services, etc. are defined in TF HCL.

But there’s one thing I keep running into: because TF is declarative, it keeps trying to re-create Jobs that we’ve defined but only needed to run once. As a workaround, the jobs are written so that they do no harm if run multiple times, but this doesn’t seem ideal.

I found this thread on the provider’s GitHub repo:
kubernetes_job · Issue #86 · hashicorp/terraform-provider-kubernetes · GitHub

There, it seems like the maintainer was arguing that TF just isn’t the right tool for the job. Fair enough. So what is?

I started looking through our collection of HCL-defined jobs and they’re not exactly straightforward: lots of env vars and volumes being mounted, rather long scripts, etc. Some other objects depend on the jobs having run (via depends_on), and other resources pull values from the job (referencing labels, etc.).

If we wanted to keep interpolating strings from HCL, would the recommended play be to start with a local object, pass it through jsonencode, then use provisioner scripts to call kubectl apply with the encoded JSON?
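To make that idea concrete, here is a rough sketch of what I have in mind. All names here are made up for illustration (the `migrate_job` local, the `db-migrate` job name, the image, the namespace). It assumes the hashicorp/null provider is available and that `kubectl` on the machine running Terraform is already pointed at the right cluster.

```hcl
# Hypothetical example: every name, image, and namespace below is illustrative.
locals {
  # The Job manifest as a plain HCL object, so values can still be interpolated.
  migrate_job = {
    apiVersion = "batch/v1"
    kind       = "Job"
    metadata = {
      name      = "db-migrate"
      namespace = "default"
      labels    = { app = "db-migrate" }
    }
    spec = {
      backoffLimit = 0
      template = {
        spec = {
          restartPolicy = "Never"
          containers = [{
            name    = "migrate"
            image   = "example.com/migrate:1.0.0"
            command = ["/bin/sh", "-c", "echo running migration"]
          }]
        }
      }
    }
  }
}

resource "null_resource" "migrate_job" {
  # The provisioner only runs when this resource is created or replaced,
  # and it is replaced only when the rendered manifest changes.
  triggers = {
    manifest_sha = sha256(jsonencode(local.migrate_job))
  }

  provisioner "local-exec" {
    # Pass the JSON through an environment variable to avoid shell quoting issues.
    command = "printf '%s' \"$MANIFEST\" | kubectl apply -f -"

    environment = {
      MANIFEST = jsonencode(local.migrate_job)
    }
  }
}
```

Since Terraform only tracks the null_resource, it would not notice (or re-create) the Job if it later disappears from the cluster. Other resources could still use depends_on against the null_resource, but anything reading labels would have to read them from the local rather than from the Job. One catch I can see: a Job's pod template is immutable, so if the manifest changes, the apply would probably fail unless the job name changes too or the old Job is deleted first.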

How do you all handle this?

Thanks!

I don’t quite understand. The issue you linked to seems to have been closed following the successful implementation of a job resource in the Kubernetes provider?

As Terraform is declarative, you will be stuck with needing to leave the completed jobs defined in the Kubernetes API server forever … But provided nothing is removing them, this workflow should just work?

If you have a requirement to have the jobs disappear from the Kubernetes API server after a while, and not be recreated, then I would propose that, since you’re apparently heavily invested in Terraform and HCL, you might want to look into modifying terraform-provider-kubernetes.

In theory, it could support optional behaviour for jobs, where it modifies its “refresh” behaviour to just pretend that a completed job still exists when it disappears from the Kubernetes API server. It would mean some more up front work, to either fork the provider, or contribute upstream the necessary change, but it is likely to be a much better solution in the long term, rather than moving just jobs to a different way of definition.


That’s the main issue here – for some of our clusters, the default behavior when TTL isn’t set seems to be to delete the job immediately.

In other cases, the default when no TTL is set is to keep the job (so no deletion), but developers will clear out jobs during a debug session to get rid of clutter. Subsequent terraform apply runs then end up recreating the job, and failing if the script isn’t written to fail gracefully, so every apply tries to create the job again until the script is fixed.

But I agree with you: the current setup is the most maintainable. Was just curious to see if anyone had come up with a solution that’s both creative and maintainable :)

Would using a variable-based condition help? E.g.

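variable "jobs_enabled" {
  type    = bool
  default = false
}

resource "kubernetes_job" "myjob" {
  # Create the job only while jobs_enabled is true
  count = var.jobs_enabled ? 1 : 0
  # ...
}

# Run jobs
terraform apply -var='jobs_enabled=true'

# Cleanup jobs (with the default of false, the job is destroyed again)
terraform apply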
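One thing to watch with the count approach: anything that reads values from the job has to cope with the job not existing. A small sketch of how that could look, assuming the `kubernetes_job.myjob` resource above and a made-up output name (`myjob_name`):

```hcl
# Hypothetical output: null while jobs_enabled is false,
# the job's name while the job exists.
output "myjob_name" {
  value = one(kubernetes_job.myjob[*].metadata[0].name)
}
```

The `one` function returns null for an empty list and the single element otherwise, so dependents don't fail when the count is 0.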