Hours of bug hunting, only to find you’ve forgotten the obvious: something that talking to a rubber duckie could have solved. A few months ago, through unfortunate personal experience, I was reminded of this classic programming blunder and decided it deserved a write-up.
I spent January of 2021 setting up my homelab. The pandemic had accelerated my plans and I was more than happy to jump into the deep end. This included configuring a Proxmox hypervisor cluster on some Dell PowerEdge servers. Part of my planned configuration was GPU passthrough to one of the VMs. This would allow me to use and develop GPU-accelerated programs remotely, with possible applications in machine learning and data science. It proved to be one of the larger stumbling blocks in the configuration. I had little experience with hypervisors and Proxmox, so I turned to YouTube to be my guide.
I followed my selected guide verbatim and was left disappointed. Convinced I had done something wrong, I tried again. After failing for the third time, I nuked the Proxmox install and started from square one. Yet again I was left disappointed. I found a few more guides on Proxmox’s forums and Reddit, all of which failed. At this point, I started getting frustrated. I had done everything the guides said. My GRUB config was exactly what it should have been. All the kernel modules were loading. The dmesg output wasn’t showing any related errors. I decided I’d forsake the guides and troubleshoot my way through the issue. After finding no obvious problems, I started drafting a thread for Proxmox’s forum. I spent hours downloading packages and diagnostic tools, skimming and referencing documentation, updating firmware, and configuring GRUB. In the end, I narrowed down the problem: Proxmox wouldn’t detect the IOMMU. In my drafted forum post, I documented everything that helped me come to this conclusion as well as all my attempted solutions. I knew many of the questions and diagnostics that respondents to the thread would ask for, so I recorded them and appended them to the draft in advance.
At this point, I’d been working on the issue for over ten hours and resolved to make one more attempt before going to sleep. For my last attempt, I decided I’d run through the install and configuration in person rather than remotely. I walked down to the server rack and plugged in a monitor. I wanted to make sure I covered all my bases. I plugged in a mouse and keyboard and hit the power button. The sound of a jet engine erupted in the small room and the bootloader flashed on the screen. I froze. The bootloader the monitor was showing didn’t look like GRUB. Indeed, I knew this bootloader very well. It was systemd-boot. The entire time, I had been editing the kernel parameters assuming my Proxmox install booted off GRUB. All the guides and documentation I had read showed GRUB as the bootloader, but here I was looking at a systemd-boot screen. Throughout my troubleshooting process, I had skimmed and referenced the official documentation many times but had yet to read it cover to cover. I pulled it up, and there, almost like a footnote, was the configuration option for systemd-boot. All of GRUB’s config files were present on the system but were not being used. For whatever reason, Proxmox had installed with systemd-boot (I’m still not sure why). For posterity, I finished my forum post with the solution and posted it, marking it as solved.
During this experience I (re)learned three things:
Read official documentation first
If I had read the official documentation cover to cover early on, I would have known to check for a systemd-boot bootloader.
Don’t trust third-party documentation
I spent a lot of time assuming the bootloader was GRUB because I found no information to the contrary. Every resource I looked at used GRUB. It wasn’t until I looked at the official documentation that I knew Proxmox even had systemd-boot installed.
RTFM, don’t just skim
During my troubleshooting process, I was often referencing the documentation. I’m sure the answer was in front of my eyes several times, but because I was so focused on troubleshooting steps from the forum or guides, I failed to read the documentation cover to cover.
During this experience I found myself practicing:
Rubber ducking
I was rubber ducking, both with the forum post I was drafting and with my small houseplant (now deceased). I found myself talking through my problem, and that helped me narrow it down to the IOMMU.
Documenting your solution
I don’t think anyone will have quite the same journey I had, but I still think it proper to record my solution on Proxmox’s forum, including all the steps I took to troubleshoot.
Solution: Check which bootloader you’re using.
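One way to catch this kind of mismatch early is to look at what the running kernel actually received, rather than what the config files say. The sketch below is a minimal Python example and not part of the original troubleshooting. It reads `/proc/cmdline` (the command line the bootloader really passed to the kernel), counts the entries under `/sys/kernel/iommu_groups`, and checks for `/sys/firmware/efi` to see whether the machine booted via UEFI. The parameter `intel_iommu=on` is an assumption for illustration (it applies to Intel hosts, and the post doesn't say which CPU vendor the servers use), and the helper names are hypothetical.

```python
from pathlib import Path


def has_param(cmdline: str, param: str) -> bool:
    """Return True if `param` appears verbatim among the kernel parameters."""
    return param in cmdline.split()


def count_iommu_groups(groups_dir: Path) -> int:
    """Count entries under the IOMMU groups directory (0 if it is missing)."""
    if not groups_dir.is_dir():
        return 0
    return sum(1 for _ in groups_dir.iterdir())


def main() -> None:
    proc_cmdline = Path("/proc/cmdline")
    # Fall back to an empty string on systems without /proc (non-Linux).
    cmdline = proc_cmdline.read_text() if proc_cmdline.exists() else ""

    # Assumed parameter: intel_iommu=on (Intel hosts only).
    print("running kernel cmdline:", cmdline.strip())
    print("intel_iommu=on present:", has_param(cmdline, "intel_iommu=on"))
    print("IOMMU groups found:", count_iommu_groups(Path("/sys/kernel/iommu_groups")))
    print("booted via UEFI:", Path("/sys/firmware/efi").exists())


if __name__ == "__main__":
    main()
```

If the parameter you added never shows up in the running command line after a reboot, the file you edited probably isn't the one your bootloader reads, which is exactly the situation described above.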