Iterate over two lists

Hi team,

Please suggest how I can create a helper list that contains some control flow, and elegantly reference it in other locals to reduce repetition.

Sample:

locals {
  glue_jobs_helper = [for job in local.spark_jobs :
    {
      full_name = "${local.name}-${job.name}-${var.default_glue_job_suffix}-${var.environment}-0000000"
    }
  ]

  glue_jobs = [for job in local.spark_jobs :
    {
      _current_index = index(local.spark_jobs, job)
      #      full_name           = "${local.name}-${job.name}-${var.default_glue_job_suffix}-${var.environment}-0000000"
      full_name           = element(local.glue_jobs_helper, index(local.spark_jobs, job)).full_name
      role_arn            = "arn"
      connections         = job.connection == "" ? [] : [job.connection]
      command_type        = job.type
      python_version      = contains(keys(job), "python_version") && job.type == "pythonshell" ? job.python_version : var.default_glue_python_version
      script_location     = "s3://${var.artifact_bucket}/${job.script_location}"
      max_concurrent_runs = try(job.args.max_concurrent_runs, "3")
      glue_version        = try(job.glue_version, try(job.sfn_args["--GlueVersion"], var.default_glue_version))
      timeout             = job.timeout
      default_arguments   = merge(var.default_glue_job_args, job.args)
      # Only jobs with AWS Glue version 3.0 and above and command type glueetl are allowed to set ExecutionClass to FLEX.
      # For pythonshell, null is implicitly cast to STANDARD. Left for output compatibility.
      execution_class = (job.type == "glueetl" && try(job.glue_version, try(job.sfn_args["--GlueVersion"], var.default_glue_version)) >= "3.0" ?
      try(job.execution_class, var.dataproduct_glue_execution_class) : "STANDARD")
      # Do not set MaxCapacity if using WorkerType and NumberOfWorkers
      # Required when pythonshell is set.
      # For pythonshell null implicitly casts to 1. For glueetl null works as expected
      max_capacity = job.type == "pythonshell" ? try(job.max_capacity, 1) : null
      # For Glue version 2.0 or later jobs, you cannot specify a Maximum capacity.
      # Instead, you should specify a Worker type and the Number of workers.
      # Error: InvalidInputException: Worker Type is not supported for Job Command pythonshell
      worker_type = job.type == "pythonshell" ? null : try(job.worker_type, "G.1X")
      # Error: InvalidInputException: Please set both Worker Type and Number of Workers
      number_of_workers = job.type == "pythonshell" ? null : try(job.number_of_workers, 2)
      tags = merge(
        local.effective_tags,
        { "GlueJobName" = "${local.name}-${job.name}-${var.default_glue_job_suffix}-${var.environment}" },
        contains(keys(job), "max_job_price") ? { MaxJobPrice = job.max_job_price } : {}
      )
    }
  ]

  glue_jobs2 = [for idx in range(length(local.spark_jobs)) :
    {
      full_name = element(local.glue_jobs_helper, idx).full_name
      tags = merge(
        local.effective_tags,
        { "GlueJobName" = element(local.glue_jobs_helper, idx).full_name },
        contains(keys(local.spark_jobs[idx]), "max_job_price") ? { MaxJobPrice = local.spark_jobs[idx].max_job_price } : {}
      )
    }
  ]

}

resource "aws_glue_job" "this" {
  count = length(local.glue_jobs)

  name              = local.glue_jobs[count.index].full_name
  role_arn          = local.glue_jobs[count.index].role_arn
  connections       = local.glue_jobs[count.index].connections
  glue_version      = local.glue_jobs[count.index].glue_version
  max_capacity      = local.glue_jobs[count.index].max_capacity
  timeout           = local.glue_jobs[count.index].timeout
  default_arguments = local.glue_jobs[count.index].default_arguments
  execution_class   = local.glue_jobs[count.index].execution_class
  worker_type       = local.glue_jobs[count.index].worker_type
  number_of_workers = local.glue_jobs[count.index].number_of_workers
  tags              = local.glue_jobs[count.index].tags

  command {
    name            = local.glue_jobs[count.index].command_type
    python_version  = local.glue_jobs[count.index].python_version
    script_location = local.glue_jobs[count.index].script_location
  }

  execution_property {
    max_concurrent_runs = local.glue_jobs[count.index].max_concurrent_runs
  }
}

In the first iteration I extracted all computations from the resources into locals but kept the list data structure.

In the second iteration I'm trying to extract the duplicated parts from local.glue_jobs into local.glue_jobs_helper to reduce repetition.

I'm not ready to switch from a list to a map due to backward compatibility, and I don't think the element function improves readability.

Thanks.

Hi @chell0veck,

Can you say more about what parts of this configuration you are hoping to improve, and specifically what you dislike about the current approach? There’s a lot going on here so I’m not sure what parts to focus on.

One detail I noticed quickly was in glue_jobs where you use the index function to populate _current_index. Since local.spark_jobs is also the source collection for the for expression, you can assume that the current index of the for expression matches the index into that list:

glue_jobs = [
  for index, job in local.spark_jobs : {
    _current_index = index
  }
]

That avoids scanning the source list again to find the index, but even that is redundant because glue_jobs elements have a one-to-one relationship with spark_jobs elements – there is no if clause in the for expression filtering anything out – so the indices of glue_jobs correlate with the indices of spark_jobs.
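
As an aside, the difference is easy to see with a hypothetical list containing duplicate values (this example is not part of the original configuration): index scans the list for the first matching element, while the for expression's own index always identifies the current position:

```hcl
locals {
  # hypothetical list with a duplicate element
  jobs = ["a", "b", "a"]

  # index() returns the first occurrence, so both "a" elements map to 0
  via_index_fn = [for j in local.jobs : index(local.jobs, j)] # [0, 1, 0]

  # the for expression's own index is always the current position
  via_for_idx = [for i, j in local.jobs : i] # [0, 1, 2]
}
```

With distinct job objects the two forms happen to agree, but the `for idx, job` form is both cheaper and unambiguous.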

I don’t see any other reference to _current_index in the configuration you shared though, so I suspect I’m commenting on a part of this that isn’t relevant to your question.

Hi @apparentlymart,

Thanks for looking.

Here I’m trying to achieve the following goals:

  1. Extract full_name to a separate collection to increase readability. Originally full_name is a compound value that is also used in tags. Using index and element seems to solve the first question, and the order of elements is guaranteed, but it definitely does not improve readability.

  2. Some nesting, higher-order locals, or access to an enclosing scope would help decompose computations separately from composing the input variables. This becomes more critical over time, as new features need to be activated for enterprise or legacy apps.

  3. Another example might look like:

          max_capacity = (try(job.glue_version, try(job.sfn_args["--GlueVersion"], var.default_glue_version)) < "2.0" ?
          try(job.max_capacity, 1) : null)
          worker_type = (try(job.glue_version, try(job.sfn_args["--GlueVersion"], var.default_glue_version)) >= "2.0" ?
          try(job.worker_type, "G.1X") : null)
          number_of_workers = (try(job.glue_version, try(job.sfn_args["--GlueVersion"], var.default_glue_version)) >= "2.0" ?
          try(job.number_of_workers, 2) : null)
  4. Pseudocode might look like:
locals {
  glue_jobs_helper = [for job in local.spark_jobs :
    {
      full_name   = "${local.name}-${job.name}-${var.default_glue_job_suffix}-${var.environment}-0000000"
      legacy_type = condition ? true : false # pseudocode: some condition on the job
    }
  ]

  glue_jobs = [for job, helper in zip(local.spark_jobs, local.glue_jobs_helper) :
    {
      full_name           = helper.full_name
      connections         = job.connection == "" ? [] : [job.connection]
      command_type        = job.type
      python_version      = contains(keys(job), "python_version") && job.type == "pythonshell" ? job.python_version : var.default_glue_python_version
      script_location     = "s3://${var.artifact_bucket}/${job.script_location}"
      max_concurrent_runs = try(job.args.max_concurrent_runs, "3")
      glue_version        = try(job.glue_version, try(job.sfn_args["--GlueVersion"], var.default_glue_version))
      timeout             = job.timeout
      execution_class     = (job.type == "glueetl" && try(job.glue_version, try(job.sfn_args["--GlueVersion"], var.default_glue_version)) >= "3.0" ?
      try(job.execution_class, var.dataproduct_glue_execution_class) : "STANDARD")
      max_capacity        = helper.legacy_type ? try(job.max_capacity, 1) : null
      worker_type         = helper.legacy_type ? null : try(job.worker_type, "G.1X")
      number_of_workers   = helper.legacy_type ? null : try(job.number_of_workers, 2)
      tags                = merge(
        local.effective_tags,
        { "GlueJobName" = helper.full_name },
        contains(keys(job), "max_job_price") ? { MaxJobPrice = job.max_job_price } : {}
      )
    }
  ]

}

Thank you

Hi @chell0veck,

For the first part of this, related to the full_name from the “helper” value, I might write that like this:

locals {
  glue_jobs_helper = [
    for job in local.spark_jobs : {
      full_name = "${local.name}-${job.name}-${var.default_glue_job_suffix}-${var.environment}-0000000"
    }
  ]

  glue_jobs = [
    for idx, job in local.spark_jobs : {
      full_name = local.glue_jobs_helper[idx].full_name
      # ...
    }
  ]
}

Notice that the second for expression now specifies idx, job instead of just job, which means that in the value expression idx is set to the current element index. Because this for expression has the same source collection as local.glue_jobs_helper, we can assume that the indices will always match and so it’s valid to look up the helper object using the element index from the original list.

I’m not sure I follow the other parts of your message fully, but I think you are asking for ideas for how to write the definition of the attributes that have complicated rules. I’m going to use execution_class as an example since it seems like the most complicated example, and first I’m going to slightly reformat what you wrote because it’s currently very hard to understand the expression nesting:

      execution_class = (
        job.type == "glueetl" && try(job.glue_version, try(job.sfn_args["--GlueVersion"], var.default_glue_version)) >= "3.0" ?
        try(job.execution_class, var.dataproduct_glue_execution_class) :
        "STANDARD"
      )

I think my first step here would be to write an intermediate expression that normalizes all of the local.spark_jobs objects to be of the same object type – factoring out all of this inline try noise – and substitute in default values for unspecified attributes as necessary. I won’t write out the whole thing but here’s a taste using just a subset of your attributes:

locals {
  spark_jobs_norm = tolist([
    for job in local.spark_jobs : {
      max_capacity = try(job.max_capacity, 1),
      worker_type = try(job.worker_type, "G.1X"),
      glue_version = try(
        job.glue_version,
        job.sfn_args["--GlueVersion"],
        var.default_glue_version,
      )
      execution_class = try(
        job.execution_class,
        var.dataproduct_glue_execution_class,
      )
      # ...
    }
  ])
}

The value of local.spark_jobs_norm will then have all of the missing attributes filled in with suitable default values, so you can assume that all elements are of the same type. Use null for any attributes that are optional but don’t have any fallback default value. I used tolist to be explicit that the result ought to be a list of objects rather than a tuple of objects, which will cause Terraform to verify that all of the elements do indeed have the same object type.
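
For example, the max_job_price attribute used in your tags expression is optional with no default, so the normalization step could set it to null (a sketch, not part of your original config):

```hcl
locals {
  spark_jobs_norm = tolist([
    for job in local.spark_jobs : {
      # null explicitly marks "not set" for optional attributes
      max_job_price = try(job.max_job_price, null)
      # ...
    }
  ])
}
```

A later expression can then test `job.max_job_price != null ? { MaxJobPrice = job.max_job_price } : {}` instead of `contains(keys(job), "max_job_price")`.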

Once you’ve dealt with all of the type inconsistencies, you can write a separate expression to deal with the various other rules, such as selecting the execution_type differently depending on the job type and glue version:

      # (in this I'm assuming "job" is an element of
      # local.spark_jobs_norm, not of local.spark_jobs)
      execution_class = (
        job.type == "glueetl" && job.glue_version >= "3.0" ?
        job.execution_class :
        "STANDARD"
      )

Of course this is subjective, but I personally find it easier to read this and understand the rule: the execution class is “STANDARD” except for a glueetl job whose glue version is 3.0 or greater.

(Note also that job.glue_version >= "3.0" is risky because it converts both job.glue_version and "3.0" into numbers and then compares them numerically, which assumes that the glue versions follow typical decimal notation. That might be okay, but e.g. it would not deal with version strings like “3.0-beta1” or “3.1.2”, which cannot convert to a number at all, and it would treat "3.10" the same as "3.1".)
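
If that edge case matters, one option is to compare only the leading numeric component of the version string. This is just a sketch, assuming every glue version string starts with an integer major version:

```hcl
      # regex returns the leading digits ("3" for "3.0", "3.1.2", or "3.0-beta1"),
      # which tonumber then compares as an integer
      execution_class = (
        job.type == "glueetl" && tonumber(regex("^\\d+", job.glue_version)) >= 3 ?
        job.execution_class :
        "STANDARD"
      )
```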

I expect I probably haven’t fully answered your question here, but I hope these ideas are useful nonetheless. I can’t give more detailed advice here because you are asking a question with a very wide scope about a system I don’t know anything about.
