Discussion of AWS Security Group Rules for absolute management while avoiding cyclical dependencies

jakauppila · June 5, 2020, 6:26pm

This post can serve as a point of discussion for #9032 Add aws_security_group_rules resource on terraform-provider-aws.

I’ll begin by excerpting a portion of @bflad’s very in-depth response, which summarizes the issue.

Summary

To begin, here is a summary of this issue in a Terraform configuration, from my understanding. Please let me know if this is incorrect. While the below only shows ingress for brevity, egress has the same issue.

resource "aws_security_group" "a" {
  name = "a"

  ingress {
    from_port       = 22
    to_port         = 22
    protocol        = "tcp"
    security_groups = [aws_security_group.b.id] # a depends on b
  }
}

resource "aws_security_group" "b" {
  name = "b"

  ingress {
    from_port       = 22
    to_port         = 22
    protocol        = "tcp"
    security_groups = [aws_security_group.a.id] # b depends on a, closing the cycle
  }
}

Effectively, the desire is to allow each of the EC2 Security Groups to cross-communicate. However, when this configuration is applied, Terraform will return a cycle error since both resources reference each other.

The current recommended guidance on this situation is to switch from using ingress/egress configuration blocks in the aws_security_group resource as shown above, to the below usage of only defining ingress and/or egress rules via the aws_security_group_rule resource (no ingress/egress configuration blocks in the aws_security_group resource):

resource "aws_security_group" "a" {
  name = "a"
}

resource "aws_security_group_rule" "a_from_b" {
  security_group_id        = aws_security_group.a.id
  type                     = "ingress"
  from_port                = 22
  to_port                  = 22
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.b.id
}

resource "aws_security_group" "b" {
  name = "b"
}

resource "aws_security_group_rule" "b_from_a" {
  security_group_id        = aws_security_group.b.id
  type                     = "ingress"
  from_port                = 22
  to_port                  = 22
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.a.id
}

The Problem

By splitting individual rules out into their own aws_security_group_rule resource, we lose the ability to remove any rules applied outside of the Terraform configuration.
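
To make the trade-off concrete, here is a minimal sketch of the authoritative behavior that is given up. It assumes a group with no cyclic reference, so inline blocks can still be used; the resource name `bastion` and the CIDR ranges are hypothetical and not part of the original discussion. With inline ingress and egress blocks declared, the declared rules are treated as the group's complete rule set, so a rule added outside of Terraform shows up as a removal in the next plan.

```hcl
# Sketch only: "bastion" and the CIDR ranges are hypothetical.
# No vpc_id is set, so this assumes the account's default VPC.
resource "aws_security_group" "bastion" {
  name = "bastion"

  # Declared inline, so this is the full set of ingress rules.
  # An ingress rule added by hand would be planned for removal.
  ingress {
    description = "SSH from the internal network"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8"]
  }

  # Same for egress: only this rule is expected to exist.
  egress {
    description = "Allow all outbound traffic"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

The separate aws_security_group_rule resources shown earlier only track the rules they declare, which is why the same hand-added rule goes unnoticed there.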

Design Decisions

Again, snipped from @bflad’s response.

Given that background, we can hopefully lay out some of the design decisions we need to consider:

Terraform and its provider ecosystem have been generally designed with the goal of usable and reliable infrastructure provisioning being top priority. Drift detection and a subset of this problem being exclusive management of child resources is a secondary priority to the first. The pragmatic decision to previously introduce a separate aws_security_group_rule resource satisfies the “usable” goal in this situation.
There is no real precedent for how to handle this particular situation in the Terraform ecosystem, given the equally frustrating combination of atypical cyclic references and increased desire for drift detection of this particular configuration.
The confusing behavior when attempting to manage child components between multiple of these parent-child resources is a constant source of bug reports and practitioner confusion. Even with documentation warnings, it is not a good user experience that the provider developers here have much control over.
Adding a second parent resource to the mix here, while not existing anywhere else in the Terraform ecosystem (that we are familiar with), could further increase this practitioner confusion. In particular, this new resource would not provide warnings/errors if attempting to use multiple of this resource to manage the same EC2 Security Group even ignoring the original cycle problem:

# This example would introduce perpetual differences
# without Terraform providing any user interface warnings.
# Practitioners would be required to do one of the following to learn it's not supported:
#  * (Re-)Read resource documentation
#  * Ask colleagues or in a forum
#  * Report a GitHub issue
resource "aws_security_group" "a" {
  name = "a"
}

resource "aws_security_group_rules" "a-ingress-ssh" {
  security_group_id = aws_security_group.a.id

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8"]
  }

  # ... potentially others ...
}

# Potentially in another Terraform configuration, managed by some other team
resource "aws_security_group_rules" "a-ingress-https" {
  security_group_id = aws_security_group.a.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # ... potentially others ...
}

# This example would introduce perpetual differences
# without Terraform providing any user interface warnings.
# The ingress/egress attributes do not have Computed: true
resource "aws_security_group" "a" {
  name = "a"
}

resource "aws_security_group_rules" "a-ingress" {
  security_group_id = aws_security_group.a.id

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8"]
  }

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # ... potentially others ...
}

# Potentially in another Terraform configuration, managed by some other team
# aws_security_group_rules.a-ingress will remove egress
# aws_security_group_rules.a-egress will try to re-add
resource "aws_security_group_rules" "a-egress" {
  security_group_id = aws_security_group.a.id

  egress {
    from_port   = 0
    to_port     = 65535
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # ... potentially others ...
}

Some additional questions may also arise: How do we tell the community about this new resource? Why is there a new resource? Which resource is correct or better? Do I have to migrate? Should I migrate? Can this be combined with existing resources? e.g.

# This example would introduce perpetual differences
# without Terraform providing any user interface warnings.
# The ingress/egress attributes do not have Computed: true
# aws_security_group_rules.a-ingress will always try to remove this rule, while this tries to add it
resource "aws_security_group" "a" {
  name = "a"

  egress {
    from_port   = 0
    to_port     = 65535
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group_rules" "a-ingress" {
  security_group_id = aws_security_group.a.id

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["10.0.0.0/8"]
  }

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # ... potentially others ...
}

The existing aws_security_group resource has a large usage footprint. We would be very hesitant to make breaking changes to that resource, including the deprecation/removal of its ingress and egress attributes so there remains one canonical parent resource, unless there is no other option since it would be an equally large burden on the community to change configurations.

All these put us in a rough position with the current proposal, since there is additional burden somewhere. We would prefer to not have a single resource that operates differently than the majority of other resources. While the above configurations may seem obvious when the resources are declared next to each other, varying team structures lead to varying configuration layouts and ownership.

User Stories

I’ll break down the problems that I would love to see solved:

User Story #1

As a Terraform Practitioner
I want to holistically manage two security groups that reference each other on either their ingress or egress rules
So that any rules introduced outside of the Infrastructure as Code definition are removed upon execution

User Story #2

As a Governance, Risk Management, and Compliance Auditor
I want to know that Infrastructure as Code configurations contain and enforce the desired state of security group definitions
So that compliance is maintained

shantanugadgil · June 6, 2020, 9:19am

I have had my run-in with this. There are some clean and some hacky ways of doing this.

Do not allow human operators to modify SG rules (so everything comes from TF).
If you do want to allow humans to edit the SG rules, configure an alert for if/when such a change is made.
To work around the question “are there any rules outside of the TF definitions?”, my solution is to keep the SG, but delete all of its rules and immediately apply the TF configuration. (extreeeemly hacky, I know, but it achieves the end goal in an idempotent manner)
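
The first two suggestions could be sketched in Terraform roughly as follows. This is one possible approach under stated assumptions, not something taken from the thread: every resource name here (`deny_sg_rule_changes`, `sg_rule_changes`) is hypothetical, CloudTrail is assumed to be recording management events in the account, and the SNS topic policy that lets EventBridge publish to the topic is left out.

```hcl
# Hypothetical names throughout; adjust to your own conventions.

# Suggestion 1: a deny policy that could be attached to human-operator
# roles so that rule changes only come from the Terraform pipeline.
resource "aws_iam_policy" "deny_sg_rule_changes" {
  name = "deny-sg-rule-changes"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Deny"
      Action = [
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:AuthorizeSecurityGroupEgress",
        "ec2:RevokeSecurityGroupIngress",
        "ec2:RevokeSecurityGroupEgress",
      ]
      Resource = "*"
    }]
  })
}

# Suggestion 2: alert when one of those API calls is recorded.
resource "aws_sns_topic" "sg_rule_changes" {
  name = "sg-rule-changes"
}

resource "aws_cloudwatch_event_rule" "sg_rule_changes" {
  name = "sg-rule-changes"

  # Matches CloudTrail-delivered EC2 API calls that add or remove rules.
  event_pattern = jsonencode({
    source        = ["aws.ec2"]
    "detail-type" = ["AWS API Call via CloudTrail"]
    detail = {
      eventSource = ["ec2.amazonaws.com"]
      eventName = [
        "AuthorizeSecurityGroupIngress",
        "AuthorizeSecurityGroupEgress",
        "RevokeSecurityGroupIngress",
        "RevokeSecurityGroupEgress",
      ]
    }
  })
}

# Send matching events to the SNS topic.
resource "aws_cloudwatch_event_target" "sg_rule_changes" {
  rule = aws_cloudwatch_event_rule.sg_rule_changes.name
  arn  = aws_sns_topic.sg_rule_changes.arn
}
```

Note that the alert also fires for changes made by Terraform itself, so in practice the pattern or the subscriber would likely need to filter on the calling identity.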
