Two years ago, I wrote about how RAID doesn’t work, as it’s unable to detect silent data corruption. We then tried to see what happens when we inject data corruption, and unfortunately Linux 4.16.6 wasn’t able to differentiate between hardware failures and soft failures coming from dm-integrity.
As those bugs are fixed now, let’s see how to configure Fedora or RHEL (with dracut) and Archlinux (with mkinitcpio) to automatically assemble the MD-RAID on dm-integrity volumes so that the root file system can reside on them.
Word of caution
As we’re modifying disk partitions and completely overwriting them, modifying boot-time configuration and in general performing rather complex and advanced configuration, make sure you have good and current backups of the data you care about!
First, we need to prepare partitions for the volumes that will be used by dm-integrity. There are no special requirements for them, but it’s a good idea to keep them aligned to 1MiB boundaries. That’s especially important when working with SSDs or with HDDs that have 4KiB native sectors (so called Advanced Format disks). Current versions of tools like parted do that automatically.
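If you want to double-check the alignment of an existing partition, you can test the start offset reported by the kernel against the 1MiB boundary. This is a minimal sketch; it assumes the start is given in 512-byte units, as exposed in /sys/block/<disk>/<partition>/start, and the function name is_mib_aligned is only illustrative.

```shell
# Succeeds when a partition start, given in 512-byte units,
# falls on a 1 MiB boundary.
is_mib_aligned() {
    start_sectors=$1
    # 1 MiB = 2048 units of 512 bytes
    [ $(( start_sectors % 2048 )) -eq 0 ]
}

# Example use (sda and sda1 are placeholders for your own disk and partition):
#   is_mib_aligned "$(cat /sys/block/sda/sda1/start)" && echo aligned
```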
As most hard disks do come in 4KiB sector sizes (and future replacements are only more and more likely to be like this), we need to format the partition with a 4096B sector size.
The main feature of dm-integrity is the calculation and verification of checksums, so we need a checksum that is fast enough. To do a quick benchmark, you may use OpenSSL:
$ openssl speed md5 sha1 sha256 sha512
...
type          16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
md5         165603.90k   373152.90k   664617.35k   816564.66k   871069.01k   882553.30k
sha1        185971.19k   427639.36k   846229.23k  1108107.61k  1228931.51k  1233747.97k
sha256       99550.05k   219782.83k   413438.72k   517541.21k   556638.21k   559961.43k
sha512       64932.77k   258782.93k   459379.29k   693776.04k   799465.47k   810766.90k
(Technically, we should benchmark the kernel hash performance as that’s what we’ll use, but I didn’t find an easy way to do that. Drop me a line if you know how to do it.)
As I wrote previously, dm-integrity uses a crc32 checksum by default. It’s small and possibly also used by the disks themselves to check for read errors, so we want to use something different. I’ll use SHA-1 in the following examples. You may also have good experience with BLAKE2b-256.
Finally, as the purpose of the checksums is just to detect accidental errors, not malicious changes, we don’t need the full output of the hash function. I’ve selected 8 bytes per sector (see the previous article for reasons why).
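To get a feeling for what 8 bytes per sector costs in disk space, the arithmetic can be sketched as below. This is only a rough estimate: it counts the tags alone and ignores the journal and any padding in the on-disk layout. The function name tag_space_mib is made up for this example.

```shell
# Approximate space used by the integrity tags, in MiB.
# Arguments: device size in bytes, sector size in bytes, tag size in bytes.
tag_space_mib() {
    size_bytes=$1
    sector_size=$2
    tag_size=$3
    echo $(( size_bytes / sector_size * tag_size / 1024 / 1024 ))
}

# A 1 TiB partition with 4096 B sectors and 8 B tags: prints 2048
tag_space_mib 1099511627776 4096 8
```

So the tags themselves take about 2GiB per 1TiB of partition, or roughly 0.2%.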
Let’s format the partitions (in this example a 4-disk RAID 6 array; switch the sda1, sdb1, sdc1 and sdd1 to the partitions you are actually using).
integritysetup format /dev/sda1 --sector-size 4096 --tag-size 8 --integrity sha1
integritysetup format /dev/sdb1 --sector-size 4096 --tag-size 8 --integrity sha1
integritysetup format /dev/sdc1 --sector-size 4096 --tag-size 8 --integrity sha1
integritysetup format /dev/sdd1 --sector-size 4096 --tag-size 8 --integrity sha1
This process will run for a few hours on a modern multi-terabyte hard drive, so I suggest running the commands in parallel.
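One possible way to run the formats in parallel is to start each one as a background job and wait for all of them. The sketch below is not what I ran originally; the format_all helper and the INTEGRITYSETUP variable are hypothetical, and I’m assuming that --batch-mode suppresses the confirmation prompt, which a background job could not answer.

```shell
# Assumption: --batch-mode skips the interactive confirmation.
# Check `integritysetup --help` on your system before relying on it.
INTEGRITYSETUP=${INTEGRITYSETUP:-integritysetup}

format_all() {
    for part in "$@"; do
        "$INTEGRITYSETUP" format "$part" --batch-mode \
            --sector-size 4096 --tag-size 8 --integrity sha1 &
    done
    # Block until every background format has finished.
    wait
}

# Example use (destroys all data on the listed partitions):
#   format_all /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
```

The progress output of the individual jobs will be interleaved on the terminal, so running each command in its own terminal may be easier to follow.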
Once the devices are formatted we need to open them. Note: as the integritysetup superblock doesn’t save the algorithm used to calculate the checksums, you need to specify the --integrity option every time you open the device!
integritysetup open /dev/sda1 int1 -I sha1
integritysetup open /dev/sdb1 int2 -I sha1
integritysetup open /dev/sdc1 int3 -I sha1
integritysetup open /dev/sdd1 int4 -I sha1
This will create 4 block devices in the /dev/mapper directory named int1, int2, int3 and int4 (you can use other names too, like int-sda1 or such).
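A quick way to confirm that all four devices showed up is to look for their nodes. This is a small sketch with a made-up helper name (missing_mappings); it takes the directory as an argument only so that it can be tried out on a scratch directory.

```shell
# Print the names of expected device-mapper nodes that are missing.
# First argument: directory to look in; remaining arguments: device names.
missing_mappings() {
    dir=$1
    shift
    for name in "$@"; do
        [ -e "$dir/$name" ] || echo "$name"
    done
}

# Example use (prints nothing when all four devices are open):
#   missing_mappings /dev/mapper int1 int2 int3 int4
```

For details about a single opened device, integritysetup status int1 should also help.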
Opening the devices during boot
There are no ready-to-use standards for mounting dm-integrity volumes on boot, so we need to modify the initramfs used by the kernel to mount them ourselves.
Fedora has been using the dracut initramfs for a long time now. It’s also used in RHEL-8 and CentOS 8, so the instructions for them are the same. I’ve tested it with Fedora 31 and CentOS 8.3.
Dracut uses a system of modules which automatically detect if they need to be included in the initramfs or not. After that it uses udev to detect when new devices show up and what to do with them. As such, we need to create a system of files that will automatically detect when the dm-integrity block devices show up and what to do with them (how to name the integrity device and what hash to use).
I’ve created the necessary files in the dracut-dm-integrity project on github.
As the instructions in the README.md state, you need to copy the files from the scripts directory to the /usr/lib/dracut/modules.d/90integrity directory. After that, edit the integrity-mount.sh to make it mount your integrity volumes. Finally, run dracut -f to include this new module in the initramfs.
Archlinux uses a fairly simple system for constructing its initramfs. The modules live in the /usr/lib/initcpio/ directory and are enabled using the /etc/mkinitcpio.conf file.
You can find ready to use scripts in the mkinitcpio-dm-integrity repository.
Don’t forget to run mkinitcpio -P every time you edit /etc/mkinitcpio.conf!
After restarting the system to verify that the devices are automatically opened on startup, we can create an MD RAID array on top of them.
The one special option that benefits arrays built on top of dm-integrity is the chunk size. When writing data to a dm-integrity device, both the sector with data and the sector with checksums need to be updated. If a write is large enough to update all checksums in a sector, the sector with checksums doesn’t have to be read first in order to modify just part of it. In our example, the checksums are 8 bytes large, which means that 512 checksums fit in a 4096 B sector. So, to avoid a read-modify-write operation, a write needs to be a multiple of 512 (checksums in a sector) * 4096 (sector size) = 2097152 B = 2MiB. The --chunk option is specified in KiB, and its default is 512KiB.
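If you picked a different sector or tag size, the same calculation can be sketched in shell arithmetic. The function name chunk_kib is just an illustration; its result is the value to pass to --chunk.

```shell
# Chunk size in KiB that covers one full sector of checksums.
# Arguments: sector size in bytes, tag (checksum) size in bytes.
chunk_kib() {
    sector_size=$1
    tag_size=$2
    tags_per_sector=$(( sector_size / tag_size ))
    echo $(( tags_per_sector * sector_size / 1024 ))
}

# 4096 B sectors with 8 B tags: prints 2048
chunk_kib 4096 8
```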
To create the MD-RAID use the following command:
mdadm -C /dev/md0 -n 4 -l 6 --chunk 2048 --assume-clean /dev/mapper/int
(We can use --assume-clean as integritysetup format creates volumes initialised to all zero.)
Such an MD-RAID will require the usual setup of adding its settings to /etc/mdadm.conf and regenerating the initramfs again:
mdadm --detail --scan >> /etc/mdadm.conf
The reduced write speed caused by journaling at the dm-integrity level makes it even more important that stripe_cache_size (in this case /sys/block/md0/md/stripe_cache_size) is tuned properly. In my experience, setting it to at least 4096 is a good idea, and 8192 is where I see diminishing returns.
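Keep in mind that a bigger stripe cache costs memory. As far as I know, the kernel documentation gives the consumption as page size * number of disks * stripe_cache_size, so the cost can be estimated as sketched below. The function name stripe_cache_mib and the 4096 B default page size are my assumptions for this example.

```shell
# Approximate memory used by the MD stripe cache, in MiB.
# Arguments: stripe_cache_size, number of disks, page size (default 4096 B).
stripe_cache_mib() {
    entries=$1
    disks=$2
    page_size=${3:-4096}
    echo $(( entries * disks * page_size / 1024 / 1024 ))
}

stripe_cache_mib 4096 4   # prints 64
stripe_cache_mib 8192 4   # prints 128

# To apply the setting (as root; it does not survive a reboot):
#   echo 4096 > /sys/block/md0/md/stripe_cache_size
```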
Such an MD-RAID can be used as usual, as a backing device for a file system or as the Physical Volume of an LVM2 system. One thing to note: because we created the dm-integrity volumes with 4096B sectors, the array presents a block device with 4096B native sectors. From what I’ve noticed, ext4 and btrfs don’t mind getting migrated from a 512B sector disk to a 4096B sector disk. The one file system that did mind is XFS: it refused to mount. If you still need to mount such a file system stored on such an array, you may want to use the dm-ebs module to emulate 512B sectors. I haven’t tried it though.
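To see which sector size a block device reports before migrating a file system onto it, you can read the value from sysfs. This is a small sketch; the helper name logical_block_size and its optional second argument (an alternative sysfs root, handy for trying it out) are made up for this example.

```shell
# Print the logical sector size of a block device, as reported by sysfs.
# Arguments: device name (e.g. md0), optional sysfs root (default /sys).
logical_block_size() {
    dev=$1
    sysfs_root=${2:-/sys}
    cat "$sysfs_root/block/$dev/queue/logical_block_size"
}

# Example use (should print 4096 for the array described here):
#   logical_block_size md0
```

Running blockdev --getss /dev/md0 should report the same value.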
If you want to create an XFS that uses such an array (4 disks in RAID 6, with 4KiB sectors and 8 byte checksums), you can use mkfs.xfs -d su=2M,sw=2 /dev/md0 to do so. That being said, on a standard MD-RAID, mkfs.xfs will detect those settings automatically.
Update 2020-10-03: there was an error in the above mkfs.xfs command. The sw option indicates the number of data disks in a stripe. As we’re using RAID 6, we lose two disks to parity; we’re using 4 disks total, so we’re left with two data disks in a stripe.
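For other array shapes, the su and sw values can be derived the same way: su is the chunk size and sw is the number of disks minus the parity disks. A minimal sketch, with a made-up helper name (xfs_stripe_opts):

```shell
# Build the -d argument for mkfs.xfs from the array geometry.
# Arguments: total disks, parity disks (2 for RAID 6), chunk size in KiB.
xfs_stripe_opts() {
    total_disks=$1
    parity_disks=$2
    chunk_kib=$3
    echo "su=${chunk_kib}k,sw=$(( total_disks - parity_disks ))"
}

# 4-disk RAID 6 with 2048 KiB chunks: prints su=2048k,sw=2
xfs_stripe_opts 4 2 2048

# Example use:
#   mkfs.xfs -d "$(xfs_stripe_opts 4 2 2048)" /dev/md0
```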