This isn't strictly a packer issue, but I'll ask anyway in case someone here has come across it.
I'm using vSphere with OVA Ubuntu templates to provision virtual machines (Ubuntu 22.04 LTS (Jammy Jellyfish) daily [20230615]). With these templates the virtual machine reboots after the cloud-init changes have been applied, so there is an intermediary stage where the VM boots up with (I'm guessing) all the normal services, the ssh server among them.
This is, of course, picked up by packer, which connects to the VM over SSH and starts running ansible before the VM reboots, which leads to:
Failed to connect to the host via ssh: ssh: connect to host 10.0.0.1 port 22: Connection refused
I'm not sure what has changed in this latest version of the Ubuntu cloud image (2023-06-13); maybe the ssh server is now starting in this intermediary stage when it shouldn't. I'll have to check that.
I was wondering if you had any ideas about how I could go about solving this issue.
I'm using packer version 1.7.9 and ansible version 2.14.6.
The discussion around the bug has gone stale in the meantime. At this point I have no idea how one is supposed to run packer with the Ubuntu OVA template anymore.
Maybe you're supposed to add a timeout, but any value you pick depends entirely on the vagaries of the hypervisor or whatever other weird circumstances you happen to be running under.
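To make that concrete, these are the kinds of knobs I mean on the packer side. A rough sketch only; the source type and the values are placeholders for my setup:

```hcl
source "vsphere-clone" "ubuntu" {
  # ... vCenter / template settings omitted ...

  communicator            = "ssh"
  ssh_username            = "ubuntu"
  ssh_timeout             = "20m" # keep retrying while the template does its cloud-init reboot
  ssh_handshake_attempts  = 100   # tolerate "connection refused" during the intermediary boot
  pause_before_connecting = "2m"  # guesswork: wait after the first successful connection before provisioning
}
```

None of these values are deterministic, though; you're just hoping the reboot falls inside whatever window you picked.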
I've tried using bootcmd: systemctl stop sshd and runcmd: systemctl start sshd (cloud-init directives), but this adds a huge delay to the deployment, because ssh starts very late; cloud-init itself seems to be held up precisely because ssh isn't starting.
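The trick is to guard the stop command so that it only runs on the very first boot. Something along these lines should do it (a sketch; I'm using cloud-init-per for the run-once part, and the unit name may be ssh or sshd depending on the image):

```yaml
#cloud-config
bootcmd:
  # cloud-init-per drops a marker file under /var/lib/cloud, so the guarded
  # command is skipped on every boot after the first one
  - [ cloud-init-per, once, stop-sshd, systemctl, stop, sshd ]
```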
This ensures that the command runs only once; the virtual machine then reboots and ssh starts normally afterwards.
Before that I had just tried stopping and starting ssh with bootcmd and runcmd (systemctl stop/start sshd), but that didn't work: bootcmd runs at every boot, and because some service depended on ssh itself, or was just waiting for it to start (at least that's my take on it), runcmd would only run really late, after 5 minutes or so. So that wasn't very practical.
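For reference, that naive variant looked roughly like this (sketch):

```yaml
#cloud-config
bootcmd:
  - systemctl stop sshd    # bootcmd runs on every boot, not just the first one
runcmd:
  - systemctl start sshd   # runcmd runs once per instance, but only near the very end of cloud-init
```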
In case someone else comes across this, maybe it will help. I do wonder a little why this isn't a bigger issue, given that Canonical decided to change the template like that, but I guess for most people this layer is covered by the public clouds and such.