RAID with ZFS and NixOS #
Lack of redundancy on the root filesystem has been bugging me for a while. I migrated my main machine to ZFS almost 1.5 years ago (at the time of writing this post) but didn't set up any RAID. Partially that was because back then I didn't think there was much to lose if the SSD died, but also because the original ZFS setup was intimidating enough as it was.
The key difficulty with ZFS back then was that you couldn't use native ZFS-level encryption; instead you had to partition the drive, encrypt the partitions with LUKS, and set up ZFS on top of them. I remember that as part of that process it was also necessary to set up a master key partition, which was then used to unlock everything else. This led to so many steps that I probably wasted at least a couple of hours porting everything to ZFS on my desktop and laptop (including taking and restoring a backup).
Some time ago, a bug in ZFS native encryption that could lead to data corruption was fixed, which prompted me to attempt the RAID setup again.
Hardware #
I contemplated for a bit whether or not I should get hard drives for increased capacity, but eventually decided against it. I already had a 2TB SSD installed, and the motherboard had slots for 4 M.2 SSDs, so a total of 8TB with 6TB usable space (RAIDz) seemed reasonable.
For storage I got 3x additional Crucial T710 2TB drives, because they seem to have good overall consumer reviews. The existing SSD is Samsung-branded and is probably somewhat slower (Gen4 vs Gen5), but it will do for now.
Some background on redundant boot #
One problem that I always knew existed, but had no idea how to solve, is protecting against the failure of the boot partition. Hardware RAIDs have a certain edge here: they present the array as if it were a single physical storage device, so if any drive fails, you can still boot your system normally in a degraded state.
With software RAID, the BIOS/EFI doesn't see the drives as one device and treats them as completely independent. This means that the boot partition, which contains the kernels and initrds, needs to be replicated through other means (not RAID).
Looks like there are at least two solutions:
Either set up a periodic rsync between multiple boot partitions (sketched below)
Or use boot.loader.grub.mirroredBoots = [...], which is natively supported in NixOS
By default NixOS uses systemd-boot rather than GRUB, but it supports both well enough, so switching away from the default is worth it to reduce the amount of headache.
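For reference, the first option amounts to something like the following (a hypothetical sketch, not what I ended up using; /boot-2 is an assumed mount point for a second boot partition, and the command would be run periodically, e.g. from a systemd timer):
# Keep a secondary boot partition in sync with the primary one
rsync -a --delete /boot/ /boot-2/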
Partitioning the drives #
For all the SSDs, I repeated the same partitioning commands: a 1GB boot partition, and the rest for the ZFS volume. I don't use swap, because it can't be placed on ZFS and I don't want to use LUKS just to encrypt a swap partition. Plus I have plenty of RAM, so it's not an issue.
# 1GB boot partition
sudo sgdisk -n 1:0:+1G -t 1:EF00 -c 1:"EFI System" /dev/nvme1n1
# Rest of the space for ZFS
sudo sgdisk -n 2:0:0 -t 2:BF01 -c 2:"ZFS" /dev/nvme1n1
# Formatting the boot partition
sudo mkfs.vfat -F32 -n EFI /dev/nvme1n1p1
Repeat the same with /dev/nvme2n1 and /dev/nvme3n1 (or use a small loop, shown below).
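For convenience, the repetition can be expressed as a loop; this is just equivalent to running the commands above for each remaining drive:
# Partition and format the remaining drives the same way
for disk in /dev/nvme2n1 /dev/nvme3n1; do
  sudo sgdisk -n 1:0:+1G -t 1:EF00 -c 1:"EFI System" "$disk"
  sudo sgdisk -n 2:0:0 -t 2:BF01 -c 2:"ZFS" "$disk"
  sudo mkfs.vfat -F32 -n EFI "${disk}p1"
done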
Setting up the ZFS pool #
This is where I create the actual RAID and allocate all the space to one large pool. The most important flags here are:
ashift[1]: sets the block size. SSDs typically have larger block sizes, so this should be at least 12, which results in 4KB blocks (2^12=4096).
compression: I find that the cost of compressing/decompressing the data is not too high, and the benefit is pretty material on NixOS.
keyformat: setting this to passphrase will allow unlocking the pool with a passphrase on boot.
zpool create \
  -o ashift=12 \
  -o autotrim=on \
  -O acltype=posixacl \
  -O dnodesize=auto \
  -O compression=lz4 \
  -O encryption=aes-256-gcm \
  -O keyformat=passphrase \
  -O xattr=sa \
  -O normalization=formD \
  -O mountpoint=none \
  zmain raidz /dev/nvme1n1p2 /dev/nvme2n1p2 /dev/nvme3n1p2
Creating datasets #
Datasets in ZFS aren't separate "filesystems" in the traditional sense, but you can think of them as logical subdivisions of the same storage. It is convenient to keep them separate, as you can mount datasets just like you would normally mount separate partitions, and you can also snapshot them independently.
sudo zfs create -o mountpoint=legacy zmain/root
sudo zfs create -o mountpoint=legacy zmain/home
sudo zfs create -o mountpoint=legacy zmain/nix
sudo zfs create -o mountpoint=legacy zmain/var
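Since the datasets use mountpoint=legacy, they get mounted through the regular NixOS fileSystems options rather than by ZFS itself. A rough sketch of what that can look like (the mount points below are an assumption based on the dataset names):
{
  # Legacy-mountpoint datasets are declared like ordinary filesystems
  fileSystems."/" = { device = "zmain/root"; fsType = "zfs"; };
  fileSystems."/home" = { device = "zmain/home"; fsType = "zfs"; };
  fileSystems."/nix" = { device = "zmain/nix"; fsType = "zfs"; };
  fileSystems."/var" = { device = "zmain/var"; fsType = "zfs"; };
}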
Migrating existing ZFS datasets to the new pool #
As I already had an existing ZFS setup, it was possible to migrate the datasets directly using zfs send and zfs receive. First, I snapshotted the existing datasets:
# zroot is my old pool, and zmain is the new one
sudo zfs snapshot zroot/root@migrate
sudo zfs snapshot zroot/home@migrate
sudo zfs snapshot zroot/nix@migrate
sudo zfs snapshot zroot/var@migrate
And then migrated them one by one (repeat for all datasets):
sudo zfs send zroot/root@migrate | sudo zfs receive -o mountpoint=legacy -x encryption -x keylocation -x keyformat zmain/root
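Since the receive flags are the same for every dataset, this step can also be written as a small loop over the dataset names (a convenience sketch equivalent to repeating the command above):
# Migrate all datasets from the old pool to the new one
for ds in root home nix var; do
  sudo zfs send zroot/"$ds"@migrate | sudo zfs receive -o mountpoint=legacy -x encryption -x keylocation -x keyformat zmain/"$ds"
done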
Enabling mirrored boot #
As I said above, setting up mirrored boot is pretty simple in NixOS if you use GRUB:
{
boot.loader.grub = {
enable = true;
efiSupport = true;
efiInstallAsRemovable = false;
devices = [ "nodev" ];
# These are all the drives that would be mirrored. As far as I understand
# it, the "mirroring" is done during the installation of new kernels
# or configuration changes.
mirroredBoots = [
{
devices = [ "/dev/nvme1n1" ];
path = "/boot";
}
{
devices = [ "/dev/nvme2n1" ];
path = "/boot-2";
}
{
devices = [ "/dev/nvme3n1" ];
path = "/boot-3";
}
];
};
}
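Each of these boot partitions also has to be mounted at its respective path. A sketch of what the corresponding mounts might look like (the device paths are an assumption matching the partitioning above; in practice stable identifiers such as /dev/disk/by-uuid are preferable):
{
  # One ESP mount per mirrored boot path
  fileSystems."/boot" = { device = "/dev/nvme1n1p1"; fsType = "vfat"; };
  fileSystems."/boot-2" = { device = "/dev/nvme2n1p1"; fsType = "vfat"; };
  fileSystems."/boot-3" = { device = "/dev/nvme3n1p1"; fsType = "vfat"; };
}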
Rebuilding #
I was logged in to my NixOS system while migrating the data, and wanted to do it in a non-destructive way, so that if anything went wrong with the new setup, I could safely boot back into the old drive I was migrating from.
So in order to apply the new configuration, I used chroot to "enter" the newly copied system and ran the NixOS configuration switch from there, so that the bootloader gets installed to the new volumes.
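Roughly, that looks something like the following. This is a sketch rather than the exact commands I ran: it assumes the new datasets and boot partitions are mounted under /mnt, that the pool's key is already loaded in the current session, and it uses nixos-enter, NixOS's chroot helper:
# Mount the new system under /mnt
sudo mount -t zfs zmain/root /mnt
sudo mkdir -p /mnt/{home,nix,var,boot,boot-2,boot-3}
sudo mount -t zfs zmain/home /mnt/home
sudo mount -t zfs zmain/nix /mnt/nix
sudo mount -t zfs zmain/var /mnt/var
sudo mount /dev/nvme1n1p1 /mnt/boot
sudo mount /dev/nvme2n1p1 /mnt/boot-2
sudo mount /dev/nvme3n1p1 /mnt/boot-3
# Enter the copied system and rebuild, so that GRUB gets installed
# to all mirrored boot partitions
sudo nixos-enter --root /mnt -c "nixos-rebuild boot"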
Wiping the old drive and adding it to the pool #
After verifying that I could boot from the new pool, I needed to wipe the old drive, repartition it according to the new scheme, and add it to the pool:
sudo sgdisk --zap-all /dev/nvme0n1
# Same as with the rest of the disks
sudo sgdisk -n 1:0:+1G -t 1:EF00 -c 1:"EFI System" /dev/nvme0n1
sudo sgdisk -n 2:0:0 -t 2:BF01 -c 2:"ZFS" /dev/nvme0n1
sudo mkfs.vfat -F32 -n EFI /dev/nvme0n1p1
# And attach to the existing raidz
sudo zpool attach zmain raidz1-0 /dev/nvme0n1p2
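The progress of the expansion/resilver can be checked with zpool status:
# Check expansion/resilver progress and pool health
sudo zpool status zmain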
And after about 5 minutes of waiting, the RAID was resilvered without any issues.
Further work #
With redundancy set up, I feel much better about the safety of my data. However, there are a few more things to do to put the cherry on top: