Boundary-worker.service not found after deploying boundary-reference-architecture to AWS

I’m trying to run the boundary-reference-architecture deployment for AWS, and I’ve been struggling with it for days.

  • I guess I was supposed to know how to configure my ~/.aws/credentials file, but I didn’t. I work with multiple AWS accounts, and Terraform wasn’t hitting the one I wanted. If there is documentation about getting that right, I haven’t seen it. I eventually got it working, but wasted a lot of time getting there (see the credentials sketch after this list).
  • I had a problem with line endings when I cloned the repo to my Windows 10 machine (detailed here). Again, it’s good now but took a while.
  • I was using the Windows boundary.exe file instead of the Linux binary (that same issue actually came up in the thread mentioned above, but of course I only found that thread after solving it myself).
  • When terraform apply -target module.aws completes, I get an error saying the ACM certificate is valid in the future. It also fails to create the load balancer, but I assume that is a knock-on effect of the certificate failure. Re-running terraform apply resolves both.
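
A minimal sketch of the named-profile setup that addresses the first bullet (the profile name and key values here are placeholders):

# ~/.aws/credentials: one named profile per account
[work-test]
aws_access_key_id = AKIA...PLACEHOLDER
aws_secret_access_key = PLACEHOLDER

# Tell Terraform (via the AWS SDK) which profile to use before running it
export AWS_PROFILE=work-test
terraform plan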

So now it completes successfully, but it doesn’t look like everything worked. The boundary-controller service is up and running (although the output from the install.sh script reported an “Unable to capture a lock on the database” error; does that matter?). The boundary-worker service does not exist, even though I can see the output from it installing and there are no errors in it. If I manually run the install script over SSH, I get an error, but it at least creates the service:

ubuntu@ip-x-x-x-x:~$ sudo systemctl status boundary-worker
Unit boundary-worker.service could not be found.
ubuntu@ip-x-x-x-x:~$ sudo ~/./install.sh worker
The system user `boundary' already exists. Exiting.
chown: cannot access '/etc/boundary-worker.hcl': No such file or directory
Created symlink /etc/systemd/system/multi-user.target.wants/boundary-worker.service → /etc/systemd/system/boundary-worker.service.
ubuntu@ip-x-x-x-x:~$ sudo systemctl status boundary-worker
● boundary-worker.service - boundary worker
     Loaded: loaded (/etc/systemd/system/boundary-worker.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Mon 2021-09-13 18:01:07 UTC; 16s ago
    Process: 2795 ExecStart=/usr/local/bin/boundary server -config /etc/boundary-worker.hcl (code=exited, status=3)
   Main PID: 2795 (code=exited, status=3)

Sep 13 18:01:07 ip-x-x-x-x systemd[1]: Started boundary worker.
Sep 13 18:01:07 ip-x-x-x-x boundary[2795]: Error parsing config file: open /etc/boundary-worker.hcl: no such file or directory
Sep 13 18:01:07 ip-x-x-x-x systemd[1]: boundary-worker.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Sep 13 18:01:07 ip-x-x-x-x systemd[1]: boundary-worker.service: Failed with result 'exit-code'.

I’m all out of ideas here. Can anyone help me?

A couple of things; I’ll try to take these in order:

  • This repo uses the official Terraform AWS provider to provision its AWS resources; to pass it credentials, you can put them in ~/.aws/credentials, put them in environment variables, or pass them explicitly in the provider config (though this last option is not recommended).
  • As you noticed, some other folks have hit the line-endings issue too. It looks like an issue with how git itself handles line endings between Windows and *nix when cloning; I think I found a solution using .gitattributes that I’ll be testing tonight (a sketch follows this list). Just to confirm: you are git clone'ing, and not downloading the files from the repo as a ZIP file?
  • I’m looking at the idea of direct-downloading the boundary binary to the hosts in question rather than uploading it from the user in a provisioner block. I still need to think about the details of the right way to do this though.
  • I’m not sure what’s up with the ACM certificate validity period – could be clock drift between your local system and AWS?
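
A sketch of the .gitattributes approach mentioned in the second bullet (the exact patterns the repo ends up using may differ):

# .gitattributes: force LF on files that will execute or be parsed on Linux,
# regardless of the cloning machine's core.autocrlf setting
*.sh text eol=lf
*.hcl text eol=lf
*.tpl text eol=lf

A client-side workaround in the meantime is cloning with git clone -c core.autocrlf=input <repo-url>, which stops git from converting LF to CRLF on checkout.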

Given the errors you hit at the start, I think it’s likely that various processes did not complete successfully, leaving you with no working database (boundary database init will fail if the controllers already hold a lock on the DB, but the controllers will never start successfully if the DB hasn’t been initialized yet) and no provisioned config files on your Boundary hosts.
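
If the database never initialized, the init can be run manually from a controller host; a sketch, assuming the config path this repo provisions:

# Stop the controller first: database init fails if a controller
# already holds the DB lock
sudo systemctl stop boundary-controller
sudo /usr/local/bin/boundary database init -config /etc/boundary-controller.hcl
sudo systemctl start boundary-controller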

We need to give our Windows users a little love with some fixes in the ref-arch repo, I think, and I’m working on that tonight; in the meantime, provisioning a Linux VM somewhere just to run Terraform on may be a workaround you can use to get started until those fixes get merged.

Yes, I did a git clone. I confirmed that downloading the install.sh manually gets the correct line endings, so I assume downloading the full repo as a zip would also work.

I think it should. Also, you will want to download and unzip the 64-bit Linux binary somewhere and point to it with -var boundary_bin=[the path to the folder containing the Boundary Linux binary].
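
For example (the version number here is illustrative; grab whatever is current from releases.hashicorp.com):

# Download and unzip the 64-bit Linux build somewhere local
wget https://releases.hashicorp.com/boundary/0.6.2/boundary_0.6.2_linux_amd64.zip
unzip boundary_0.6.2_linux_amd64.zip -d $HOME/boundary-bin

# Point the deployment at the folder containing the binary
terraform apply -target module.aws -var boundary_bin=$HOME/boundary-bin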

I think between those two and setting your TF provider environment variables you’ll be able to get up and running with a fresh install.

I got sucked onto another project so this went to the back burner for a while, but I’m finally back to it. I set up a Linux VM as suggested to run Terraform. Most of it works, but either it still doesn’t get everything or I don’t quite understand the directions.

  • terraform apply -target module.aws looks like it runs cleanly, although controller setup still reports Unable to capture a lock on the database.
  • I can SSH to the AWS instances and see the controller service is up and running, but Unit boundary-worker.service could not be found. There is no /etc/boundary-worker.hcl file present.

I can then run terraform apply to configure Boundary, and it seems to work. The admin console link only works via HTTP, not HTTPS, but I found this issue that explains it, so that’s not a big deal.

Should I be concerned about those errors from the controller and worker? It looks like Boundary is working for me, so I’m not sure what might be missing.

num_workers defaults to 1 and num_controllers to 2, so you should get a single worker and two controllers, all on different instances, with the controller instances fronted by an AWS LB. The error message about the lock on the database is normal if a controller tries to come up before the database init has finished; while DB init is running, controllers will not get the database lock they need to operate, but if you can reach the admin console UI, it means they eventually came up after the DB init succeeded.

What does terraform state list in the main terraform directory tell you about module.aws.aws_instance.worker?
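
For example, from the directory where you ran the applies:

# List what Terraform thinks it created, filtered to the worker
terraform state list | grep aws_instance.worker
# If it shows up, inspect its attributes (the index may vary)
terraform state show 'module.aws.aws_instance.worker[0]'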

I am no longer employed in the position where I was using this, so I do not have access to the system to check that for you. Thanks for your help!

I am still facing the issue you mentioned above with install.sh, getting these errors on both controller and worker. Any workaround?
module.aws.aws_instance.controller[1] (remote-exec): User: ubuntu
module.aws.aws_instance.controller[1] (remote-exec): Password: false
module.aws.aws_instance.controller[1] (remote-exec): Private key: true
module.aws.aws_instance.controller[1] (remote-exec): Certificate: false
module.aws.aws_instance.controller[1] (remote-exec): SSH Agent: false
module.aws.aws_instance.controller[1] (remote-exec): Checking Host Key: false
module.aws.aws_instance.controller[1] (remote-exec): Target Platform: unix
module.aws.aws_instance.controller[0] (remote-exec): /home/ubuntu/install.sh: 3:
module.aws.aws_instance.controller[0] (remote-exec): : not found
module.aws.aws_instance.controller[0] (remote-exec): /home/ubuntu/install.sh: 7:
module.aws.aws_instance.controller[0] (remote-exec): : not found
module.aws.aws_instance.controller[1]: Still creating... [1m20s elapsed]
module.aws.aws_instance.controller[0]: Still creating... [1m20s elapsed]
module.aws.aws_instance.controller[1] (remote-exec): Connected!
module.aws.aws_instance.controller[0]: Creation complete after 1m20s [id=i-0eeb73989d4f5bca7]
module.aws.aws_instance.controller[1] (remote-exec): /home/ubuntu/install.sh: 3:
module.aws.aws_instance.controller[1] (remote-exec): : not found
module.aws.aws_instance.controller[1] (remote-exec): /home/ubuntu/install.sh: 7:
module.aws.aws_instance.controller[1] (remote-exec): : not found

Directly copying install.sh from the repo and pasting it into Visual Studio Code actually worked; that removed some extra characters that were showing up earlier. But I still haven’t been able to start my controller service:

ubuntu@ip-10-0-0-5:~$ sudo systemctl status boundary-controller
● boundary-controller.service - boundary controller
Loaded: loaded (/etc/systemd/system/boundary-controller.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sun 2023-09-17 15:35:45 UTC; 3h 1min ago
Process: 2750 ExecStart=/usr/local/bin/boundary server -config /etc/boundary-controller.hcl (code=exited, status=203/E>
Main PID: 2750 (code=exited, status=203/EXEC)

Sep 17 15:35:45 ip-10-0-0-5 systemd[1]: Started boundary controller.
Sep 17 15:35:45 ip-10-0-0-5 systemd[2750]: boundary-controller.service: Failed to execute command: Permission denied
Sep 17 15:35:45 ip-10-0-0-5 systemd[2750]: boundary-controller.service: Failed at step EXEC spawning /usr/local/bin/bounda>
Sep 17 15:35:45 ip-10-0-0-5 systemd[1]: boundary-controller.service: Main process exited, code=exited, status=203/EXEC
Sep 17 15:35:45 ip-10-0-0-5 systemd[1]: boundary-controller.service: Failed with result 'exit-code'.
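
If copy/pasting through an editor is what fixed the script, the root cause is almost certainly CRLF line endings again; a quicker in-place fix on the host (assuming the script landed in the home directory) would be:

# Strip the carriage returns that the Windows checkout introduced
sed -i 's/\r$//' ~/install.sh
# or, where the package is available:
dos2unix ~/install.sh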

What are the permissions on the Boundary binary?
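
A status of 203/EXEC means systemd could not execute the binary at all; a quick check and fix, assuming the path from the ExecStart line above:

# Inspect ownership and mode of the binary the unit tries to exec
ls -l /usr/local/bin/boundary
# If the execute bit is missing, add it and retry
sudo chmod 0755 /usr/local/bin/boundary
sudo systemctl restart boundary-controller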

Also, the install script in the boundary-reference-architectures repo is very old and there’s no need for most of what’s in it any more. A better way to install Boundary currently is to just use apt to install it from the official HashiCorp Linux package repos.
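
A minimal sketch of the apt route on Ubuntu, following HashiCorp’s published repo instructions (the key-handling details can vary by distro release):

# Add HashiCorp's signing key and apt repository
curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list

# Install Boundary from the official repo (note: it lands in /usr/bin)
sudo apt-get update && sudo apt-get install -y boundary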

I’ve been slowly working with this setup for some POC testing I’m carrying out where I work.
I initially grabbed the repo and wanted to deploy it to our test account but had the stoppers put on that by my colleague who handles most of our AWS bits as the “out of the box” setup would’ve trashed our Security Hub score :joy:

After a few revisions, I’ve managed to lock it down well enough to ensure it’s not open to everyone and have been testing with it.

Some little blocks I’d hit:

  • The need to transfer the boundary binary to each node was a PITA, and really slowed deployment when I didn’t have a connection with decent upload speed. I had a look at the provisioner options and tried to grab the binary via wget, but then there’s the fuss of extracting etc. In the end I just copied the instructions for installing Boundary via apt into a new “apt-install.sh” file and replicated the Terraform code block used to copy over and run “install.sh” (see the sketch after this list), and this has worked fine, with the caveat that you DO need to update the controller and worker config HCL template files, as they expect the binary in /usr/local/bin and not /usr/bin where apt installs it.

  • The self-signed cert that gets created isn’t valid for HTTPS connections, which REQUIRE a proper cert. This is where I’m currently ironing out some kinks, as I’m hoping to use OIDC auth against our OneLogin SSO provider (I previously had this working fine using the OIDC guide and Dev mode locally :))

  • There are also some warnings about the “password” option being set when applying the Boundary part of the config. This doesn’t seem to stop it applying, however.

  • On that note, the Boundary TF provider version specified is VERY old: I think it was asking for 1.0.4 when 1.1.10 is the current version, which includes numerous fixes and improvements. I don’t think I needed to update much code for this, but I’d been using 1.1.9 in the local dev test I was running, and had copy/pasted the code from the flat main.tf file the guide has you create into the various nicely named .tf files in the example infra.
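
A sketch of the apt-install.sh idea from the first bullet above; the apt repo setup is the same as shown earlier in the thread, and the symlink is my assumption about one way to handle the path difference (the poster edited the HCL templates instead):

#!/bin/sh
# apt-install.sh (hypothetical name): install Boundary from HashiCorp's
# apt repo instead of uploading the binary from the workstation
# (repo/key setup as shown earlier in the thread)
sudo apt-get update && sudo apt-get install -y boundary

# apt installs to /usr/bin/boundary, while the ref-arch templates expect
# /usr/local/bin; a symlink avoids editing each template
sudo ln -sf /usr/bin/boundary /usr/local/bin/boundary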

Think that was about it atm; if I remember any other bits, I’ll pop them in. It has been a fun period, especially as I’m not overly familiar with TF to start with, but I do love some “hands-on” testing and tinkering!