Homelab overhaul: standing up the new server
Standing it up, and walking into EVEN MORE PROBLEMS. You might notice a recurring theme.
This is a continuation of my woes in overhauling my homelab.
The server, and requirements
My server is an HP DL380P Gen9 that I acquired shortly after its end-of-support date, likely from a Bosch datacenter[1] that wanted to get rid of them. Provisionally, I will designate this machine with the hostname boop, because I don’t have any better names for it.
I had the following requirements for this server:
- RAID 1 NVME root partition
- Link aggregation
- Fully-encrypted drives
- Decryption by logging in during initrd phase
The NVME drives
To have RAID 1, you need at least 2 drives. So I purchased a pair of Silicon Power 1TB NVME drives. I thought, “hey, in order to save PCIE slots on the server, what if I just did bifurcation?”
I looked it up, and it turns out my server does support bifurcation! So I got a bifurcation adapter card to put in the server.
I plugged it in, and after all the other crap I did on 3/23, I went into the server’s BIOS settings… and it turns out it only supports turning the x16 PCIE into a pair of x8’s.
Well, guess that didn’t work. I’ll have to occupy two whole PCIE slots 😭
Those arrived on 3/24, and I installed them and proceeded with my NixOS install. I will probably find a use for the bifurcation card at some point, just on a different machine.
Bonding over pain
boop was connected to the Brocade switch like so:
- iLO + eno1 went to VLAN 69, the management VLAN. This single ethernet is the management point for the server.
- the rest of the ethernets were connected to non-69 ports on the switch. These will be aggregated.
After booting up boop via USB, I thought to try setting up link aggregation for it. To experiment, instead of setting it up via NixOS configs, I set it up via the ip command.
I had the switch configured to send untagged packets and to run LACP on the three ports (25, 26, 27), something like this:
vlan 100 name prod by port
tagged ethernet 1/1/1
untagged ethernet 1/1/25 to 1/1/27
exit
interface ethernet 1/1/25 to 1/1/27
link-aggregate configure timeout short
link-aggregate configure key 10001
link-aggregate active
exit
Then, I ran a series of commands that looked something like this:
ip l add bond007 type bond mode 802.3ad lacp_rate fast lacp_active on
for i in eno2 eno3 eno4; do ip l set $i master bond007; done
ip l set bond007 up
This got DHCP all fine! But then when I tried pinging the firewall, it didn’t work. I was getting timeouts.
Capturing packets on the firewall, I saw packets, and it was sending replies back. Even running tcpdump into wireshark on the server, I saw those packets coming back. So I had no idea why ping wasn’t seeing them, especially since I somehow got DHCP working over it!
I eventually ended up flipping the order – instead of doing LACP over untagged packets, I did VLAN tagging over LACP.
vlan 100 name prod by port
tagged ethernet 1/1/1
tagged ethernet 1/1/25 to 1/1/27
exit
interface ethernet 1/1/25 to 1/1/27
link-aggregate configure timeout short
link-aggregate configure key 10001
link-aggregate active
exit
ip l add bond007 type bond mode 802.3ad lacp_rate fast lacp_active on
for i in eno2 eno3 eno4; do ip l set $i master bond007; ip l set $i up; done
ip l add link bond007 name bond007.100 type vlan id 100
Pinging on interface bond007.100 actually worked. I guess this switch really does not like LACP with untagged ports, though perhaps I just set it up wrong. This is fine, though; it’s probably for the better, since the hypervisor should be able to do this anyways.
In my NixOS configuration, I will set this up using systemd-networkd, though I haven’t done that yet.
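As a rough, untested sketch of what that systemd-networkd setup might look like: the interface names, bond mode, LACP rate, and VLAN ID come from the ip commands above, while the unit names (such as 10-bond007) and the choice to run DHCP on the VLAN interface are assumptions for illustration.

```nix
{
  systemd.network = {
    enable = true;

    # The bond itself: 802.3ad (LACP) with the fast LACP rate.
    netdevs."10-bond007" = {
      netdevConfig = {
        Kind = "bond";
        Name = "bond007";
      };
      bondConfig = {
        Mode = "802.3ad";
        LACPTransmitRate = "fast";
      };
    };

    # VLAN 100 stacked on top of the bond.
    netdevs."20-bond007-vlan100" = {
      netdevConfig = {
        Kind = "vlan";
        Name = "bond007.100";
      };
      vlanConfig.Id = 100;
    };

    # Enslave the three non-management NICs to the bond.
    networks."30-bond007-members" = {
      matchConfig.Name = "eno2 eno3 eno4";
      bond = [ "bond007" ];
    };

    # The bond only carries tagged traffic, so it gets no address itself.
    networks."40-bond007" = {
      matchConfig.Name = "bond007";
      vlan = [ "bond007.100" ];
      networkConfig.LinkLocalAddressing = "no";
    };

    # Assumption: addresses on VLAN 100 come from DHCP.
    networks."50-bond007-vlan100" = {
      matchConfig.Name = "bond007.100";
      networkConfig.DHCP = "yes";
    };
  };
}
```

This mirrors the second attempt above (VLAN tagging over LACP) rather than the untagged setup that failed.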
Decryption in initrd using SSH
I set up the root partition to be ZFS as I usually do. But, since this is a server, I can’t just walk up to the machine and type in my password; that would be inconvenient. So, I needed to set up SSH during initrd.
NixOS does have that option. However, getting it to work was quite annoying. Mostly this was because the documentation says that the host keys “are stored insecurely in the global Nix store”, so I thought that it would be fine to just have private keys in the repo if they’re going to be public anyways, something like this:
{
  boot.initrd.network.ssh = {
    enable = true;
    port = 2222; # because we are using a different host key
    hostKeys = [
      ./initrd/ssh_host_rsa_key
      ./initrd/ssh_host_ed25519_key
    ];
    authorizedKeys = inputs.self.lib.sshKeyDatabase.users.astrid;
  };
}
Apparently not, because those are actually filepaths, rather than derivations. This would work, though:
{
  boot.initrd.network.ssh = {
    enable = true;
    port = 2222; # because we are using a different host key
    hostKeys = [
      (pkgs.writeText "ssh_host_rsa_key"
        (builtins.readFile ./initrd/ssh_host_rsa_key))
      (pkgs.writeText "ssh_host_ed25519_key"
        (builtins.readFile ./initrd/ssh_host_ed25519_key))
    ];
    authorizedKeys = inputs.self.lib.sshKeyDatabase.users.astrid;
  };
}
You would not believe how much pain I had to deal with because I thought nixos-install worked successfully, but it actually just failed because it couldn’t find /nix/store/blablabla-source/whatever/initrd/ssh_host_rsa_key and kept going. It was late at night and I was not reading the logs very carefully.
Of course, this is not ideal – now the SSH keys are not merely in plaintext, but now in my public git repo. Well, it also turns out that I misread the documentation in my sleepiness again – that “stored insecurely in global Nix store” part was qualified by “Unless your bootloader supports initrd secrets,” and my bootloader appears to indeed support initrd secrets.
Probably I should just use what they use as an example and generate machine-local keys:
[
"/etc/secrets/initrd/ssh_host_rsa_key"
"/etc/secrets/initrd/ssh_host_ed25519_key"
]
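Put together, a sketch of that variant could look like the following. The important detail is that the entries are strings rather than Nix path literals, so the keys are read from the machine's own filesystem and are not copied into the store (or the repo). This assumes the key files already exist at those paths on the machine, for example generated beforehand with ssh-keygen; the rest is carried over from the configuration above.

```nix
{
  boot.initrd.network.ssh = {
    enable = true;
    port = 2222; # because we are using a different host key
    hostKeys = [
      # Strings, not ./relative paths: these stay on the machine.
      "/etc/secrets/initrd/ssh_host_rsa_key"
      "/etc/secrets/initrd/ssh_host_ed25519_key"
    ];
    authorizedKeys = inputs.self.lib.sshKeyDatabase.users.astrid;
  };
}
```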
When I rebooted the machine… nothing happened. This is because I had to enable not only boot.initrd.network.ssh.enable, but a couple of other things:
{
  boot.initrd.network = {
    enable = true;
    udhcpc.enable = true;
  };
}
After I rebooted, udhcpc was not able to find any network interfaces. Turns out, that was because I needed to add the tg3 kernel module to the initrd, since that is the driver the onboard NICs in this HP server need.
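The exact option used isn't shown here, so the following is only one way to do it: boot.initrd.kernelModules force-loads the module in the initrd, whereas boot.initrd.availableKernelModules would merely include it and leave loading to hardware detection.

```nix
{
  # Load the tg3 NIC driver in the initrd so that udhcpc has an
  # interface to configure before the root filesystem is unlocked.
  boot.initrd.kernelModules = [ "tg3" ];
}
```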
Rebooting again, I actually did get IP addresses and a shell!
> ssh root@192.168.69.206 -p 2222
~ # zfs list
NAME USED AVAIL REFER MOUNTPOINT
rpool 2.94G 920G 25K none
rpool/enc 12.5M 920G 165K none
rpool/enc/etc 668K 920G 668K legacy
rpool/enc/home 854K 920G 854K legacy
rpool/enc/tmp 145K 920G 145K legacy
rpool/enc/var 10.7M 920G 10.7M legacy
rpool/nix 2.92G 920G 2.92G legacy
I just had to run zfs load-key.
~ # zfs load-key rpool/enc
Enter passphrase for 'rpool/enc':
~ #
… and nothing happened. Now what’s happening?
The answer is, after running zfs load-key myself, the boot process was still blocked, because it was calling zfs load-key too, just on the graphical output.
~ # ps | grep zfs
962 root 0:00 zfs load-key -a
983 root 0:00 grep zfs
The solution to this is extremely easy :)
~ # kill 962
After this, I now have a blank system, open for me to set up whatever deranged things I want!
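The manual kill could probably be automated. This is a sketch of a commonly used pattern, not what I actually ran here: it assumes the scripted (non-systemd) initrd, where boot.initrd.network.postCommands is available, and that killall exists in the initrd shell.

```nix
{
  boot.initrd.network.postCommands = ''
    # On SSH login, prompt for the passphrase, then stop the console's
    # pending `zfs load-key -a` so that the boot process can continue.
    echo "zfs load-key -a && killall zfs" >> /root/.profile
  '';
}
```

With something like this in place, logging in on port 2222 should go straight to the passphrase prompt, and boot should resume once the key is loaded.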
Conclusion
- reading comprehension degrades when you are sleep-deprived
[1] Based on the original iLO configurations, which had Bosch written all over them.