Hey
We’re using SageMaker to run some ML jobs. One thing we noticed when using the aws_sagemaker_endpoint and aws_sagemaker_endpoint_configuration resources is that the endpoint does not update when we change the underlying model in the configuration.
Our code is pretty straightforward:
resource "aws_sagemaker_endpoint_configuration" "endpoint_configuration" {
  name = "${var.name}-endpoint-config"

  production_variants {
    variant_name           = "variant-1"
    model_name             = "${var.model_name}"
    initial_instance_count = "${var.instance_count}"
    instance_type          = "${var.instance_type}"
  }
}

resource "aws_sagemaker_endpoint" "endpoint" {
  name                 = "${var.name}-endpoint"
  endpoint_config_name = "${aws_sagemaker_endpoint_configuration.endpoint_configuration.name}"
}
When we create a new model and pass its name to this module, Terraform destroys and recreates the aws_sagemaker_endpoint_configuration, but the endpoint itself does not update.
This seems quite natural and on the surface looks like everything is okay. However, when testing the endpoint it is evident from the response that the underlying model has not updated.
Furthermore, in the AWS console the SageMaker endpoint's Last Updated At value has not changed. It’s somewhat confusing, as the configuration was destroyed and the console clearly links to the newly created configuration (I suspect because the name and ARN are the same). But as mentioned above, the running SageMaker endpoint does not update.
We’ve overcome the issue by introducing the following changes:
# random_id is used to force the SageMaker endpoint to update.
# It is only regenerated if the model_name changes from the previous state.
# Without a new config name, the SageMaker endpoint will continue to
# use the old configuration (even though the old configuration is destroyed).
resource "random_id" "force_endpoint_update" {
  keepers = {
    model_name = "${var.model_name}"
  }

  byte_length = 8
}

resource "aws_sagemaker_endpoint_configuration" "endpoint_configuration" {
  name = "${var.name}-endpoint-config-${random_id.force_endpoint_update.dec}"

  production_variants {
    variant_name           = "variant-1"
    model_name             = "${var.model_name}"
    initial_instance_count = "${var.instance_count}"
    instance_type          = "${var.instance_type}"
  }

  # By default Terraform destroys a resource before creating its replacement.
  # In this case we want to force Terraform to create the new resource first.
  # If we do not enforce the order
  #   create new endpoint config -> update SageMaker endpoint -> destroy old endpoint config
  # SageMaker will error when it tries to update from the old (destroyed) config to the new one.
  # This has no impact on runtime or uptime: SageMaker endpoints can function
  # even if you destroy a config and do not give them a new one.
  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_sagemaker_endpoint" "endpoint" {
  name                 = "${var.name}-endpoint"
  endpoint_config_name = "${aws_sagemaker_endpoint_configuration.endpoint_configuration.name}"
}
Changing the name forces the SageMaker endpoint to update, and everything seemingly works as expected. When calling the endpoint now, it is evident that the model has changed, and the Last Updated At time is updated.
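One caveat we can see with the workaround, sketched here rather than tested against a live account: the random suffix only changes when model_name changes. As far as we can tell, any change to the production variant (for example instance_type) also replaces the endpoint configuration, and with create_before_destroy Terraform would then try to create the replacement under a name that still exists, which SageMaker is likely to reject because endpoint configuration names must be unique. The variant below is a minimal sketch in Terraform 0.12+ syntax that adds the other variant settings to keepers so the name changes whenever the configuration does. The extra keepers entries and the variable declarations are our assumptions, added only to make the example self-contained.

```hcl
# Variable declarations are assumed here; the original module does not show them.
variable "name" {
  type = string
}

variable "model_name" {
  type = string
}

variable "instance_count" {
  type = number
}

variable "instance_type" {
  type = string
}

# Regenerated whenever any input that replaces the endpoint configuration changes,
# not only model_name (assumption: all three inputs force replacement).
resource "random_id" "force_endpoint_update" {
  keepers = {
    model_name     = var.model_name
    instance_count = tostring(var.instance_count)
    instance_type  = var.instance_type
  }

  byte_length = 8
}

resource "aws_sagemaker_endpoint_configuration" "endpoint_configuration" {
  name = "${var.name}-endpoint-config-${random_id.force_endpoint_update.dec}"

  production_variants {
    variant_name           = "variant-1"
    model_name             = var.model_name
    initial_instance_count = var.instance_count
    instance_type          = var.instance_type
  }

  # Create the new configuration first so the endpoint can switch to it
  # before the old configuration is destroyed.
  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_sagemaker_endpoint" "endpoint" {
  name                 = "${var.name}-endpoint"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.endpoint_configuration.name
}
```

With this shape, a change to any of the three inputs should produce a new configuration name, which in turn shows up as an in-place change to endpoint_config_name on the endpoint in the plan.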
So, finally, the questions (thanks for your patience):
- Does our updated code pose any issue when managing a SageMaker endpoint?
- Is this a bug in the Terraform resource (should the endpoint also update if/when the configuration updates)?
- Is there a potential downside of controlling our SageMaker endpoint deployments in this manner?
- Should this be included in the documentation?
Thanks for all your help!