Live migrating Btrfs from RAID 5/6 to RAID 10

Recently it was discovered that the RAID 5/6 implementation in Btrfs is broken, because it can miscalculate parity (which is rather important in RAID 5 and RAID 6).

So what to do with an existing setup that’s running native Btrfs RAID 5/6?

Well, fortunately, this issue doesn’t affect non-parity based RAID levels such as 1 and 0 (and combinations thereof) and it also doesn’t affect a Btrfs filesystem that’s sitting on top of a standard Linux Software RAID (md) device.

So if downtime isn’t a problem, we could re-create the RAID 5/6 array using md, put Btrfs back on top and restore our data… or, thanks to Btrfs itself, we can live migrate it to RAID 10!

A few caveats though. When using RAID 10, usable space is reduced to 50% of the raw capacity of your drives, no matter how many you have (this is because everything is mirrored). By comparison, with RAID 5 you lose a single drive’s worth of space and with RAID 6 two drives’ worth, no matter how many drives you have.

This is important to note, because a 4-drive RAID 5 setup (three drives’ worth of usable space) that is more than two-thirds full will be too big to fit on RAID 10 (two drives’ worth of usable space). Btrfs also needs space for System, Metadata and Reserves, so I can’t say for sure how much headroom the migration needs, but I expect your data will have to take up considerably less than 50% of the raw capacity. If it doesn’t, you may need to add more drives to the Btrfs array first, before the migration begins.
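
To make that arithmetic concrete, here is a rough sketch that estimates usable capacity per RAID level. It assumes all drives are the same size and ignores the System, Metadata and Reserve overhead mentioned above; the function name usable_capacity is made up for illustration.

```shell
# Hypothetical helper: rough usable capacity for N equal-sized drives.
# Ignores Btrfs System/Metadata/GlobalReserve overhead.
# Usage: usable_capacity LEVEL DRIVES SIZE_PER_DRIVE
usable_capacity() {
    level=$1 drives=$2 size=$3
    case "$level" in
        raid5)  echo $(( (drives - 1) * size )) ;;  # one drive's worth goes to parity
        raid6)  echo $(( (drives - 2) * size )) ;;  # two drives' worth go to parity
        raid10) echo $(( drives * size / 2 )) ;;    # every block is stored twice
        *)      echo "unknown level: $level" >&2; return 1 ;;
    esac
}

usable_capacity raid5 4 1000    # prints 3000
usable_capacity raid10 4 1000   # prints 2000
```

With four 1000 GB drives, RAID 10 offers 2000 GB against 3000 GB for RAID 5, which is where the two-thirds figure comes from.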

So, you will need:

  • At least 4 drives
  • An even number of drives (unless you keep one as a spare)
  • Data in use that is much less than 50% of the raw capacity of all drives (that is, much less than the capacity of number of disks / 2)
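
The checklist above can be sketched as a small pre-flight function. This is only an illustration: can_migrate is a hypothetical name, the caller supplies the numbers by hand, and the hard 50% limit does not encode the extra margin that "much less" implies.

```shell
# Hypothetical pre-flight check mirroring the list above.
# Usage: can_migrate DRIVES USED TOTAL_RAW   (USED and TOTAL_RAW in the same unit)
can_migrate() {
    drives=$1 used=$2 total_raw=$3
    if [ "$drives" -lt 4 ]; then
        echo "need at least 4 drives" >&2
        return 1
    fi
    if [ $(( drives % 2 )) -ne 0 ]; then
        echo "odd number of drives: keep one as a spare or add another" >&2
        return 1
    fi
    # RAID 10 keeps two copies, so the data must fit in half the raw space.
    if [ $(( used * 2 )) -ge "$total_raw" ]; then
        echo "data in use is not below 50% of raw capacity" >&2
        return 1
    fi
    return 0
}
```

For example, `can_migrate 4 1500 4000` succeeds, while `can_migrate 4 2500 4000` fails because 2500 is more than half of 4000.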

Of course, you’ll have a good, tested, reliable backup or two before you start this. Right? Good.

Plug any new disks in and partition or luksFormat them if necessary. We will assume your new drive is /dev/sdg, you’re using dm-crypt and that Btrfs is mounted at /mnt. Substitute these for your actual settings.
cryptsetup luksFormat /dev/sdg
UUID="$(cryptsetup luksUUID /dev/sdg)"
echo "luks-${UUID} UUID=${UUID} none" >> /etc/crypttab
cryptsetup luksOpen /dev/sdg luks-${UUID}
btrfs device add /dev/mapper/luks-${UUID} /mnt
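
Before starting the migration it is worth confirming that the new device really joined the filesystem. A minimal sketch, assuming each device line in the output of btrfs filesystem show starts with the word devid (true of the btrfs-progs versions I have seen, but treat it as an assumption); count_btrfs_devices is a made-up helper name.

```shell
# Hypothetical helper: count the device lines in `btrfs filesystem show` output.
# Assumes each device is listed on its own line beginning with "devid".
count_btrfs_devices() {
    grep -c '^[[:space:]]*devid[[:space:]]'
}

# Intended use (needs root and a mounted Btrfs filesystem):
#   btrfs filesystem show /mnt | count_btrfs_devices
```

The count should be one higher than it was before the btrfs device add.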

The migration is going to take a long time, so best to run this in a tmux or screen session.

time btrfs balance start --full-balance /mnt
time btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt

After this completes, check that everything has been migrated to RAID 10.
btrfs fi df /mnt
Data, RAID10: total=2.19TiB, used=2.18TiB
System, RAID10: total=96.00MiB, used=240.00KiB
Metadata, RAID10: total=7.22GiB, used=5.40GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

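If you still see some RAID 5/6 entries, run the same conversion command again (appending the soft filter, as in -dconvert=raid10,soft, skips chunks that already have the target profile) and then check that everything has migrated successfully.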
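
That check can be scripted. The sketch below assumes the btrfs fi df output format shown above and skips GlobalReserve, which is reported as single even after a successful conversion; unconverted_profiles is a hypothetical name.

```shell
# Hypothetical helper: print `btrfs fi df` lines whose profile is not RAID10.
# Exits 0 if any such line exists, 1 if everything is already RAID10.
unconverted_profiles() {
    awk -F'[,:] *' '
        # $1 is the block group type (Data, System, ...), $2 is the profile.
        $1 != "GlobalReserve" && NF > 1 && $2 != "RAID10" { print; found = 1 }
        END { exit !found }
    '
}

# Intended use (needs a mounted Btrfs filesystem):
#   if btrfs fi df /mnt | unconverted_profiles; then
#       echo "some block groups still need converting"
#   fi
```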

Now while we’re at it, let’s defragment everything.
time btrfs filesystem defragment /mnt # this defrags the metadata
time btrfs filesystem defragment -r /mnt # this defrags data

For good measure, let’s rebalance again without the migration (this will also take a while).
time btrfs balance start --full-balance /mnt

4 Responses to “Live migrating Btrfs from RAID 5/6 to RAID 10”

  • It’s not completely broken; it doesn’t always miscalculate parity. For the problem to happen, the parity has to be correct to start out with and a data strip has to contain a problem (corruption or a read error). The Btrfs code then correctly reconstructs the data from parity and fixes the bad data strip, but goes on to wrongly (and unnecessarily) recompute parity and write that as well. At this point the user hasn’t actually experienced any direct form of corruption, but they have lost redundancy because that parity strip is now wrong. If there’s another loss in that stripe, the bad parity will produce a bad reconstruction, which fortunately isn’t silent on Btrfs: it will warn that the resulting reconstruction is wrong.

    Also, the 2nd balance doesn’t seem to be necessary, but if you really want to do it, you’re better off efficiency-wise doing the file defragmentation before the balance. Balance has the effect of packing block groups 100% full of extents, whereas defragment consolidates extents into fewer and larger extents.

  • Thanks for the comment, I’ll make some changes. Yeah, I agree it’s not completely broken in the sense of not working at all, but I don’t expect my RAID array to have silent parity corruption. Even though Btrfs might warn during a rebuild, I expect that the corruption means I can’t actually reconstruct the array and would have to re-create it and restore the data manually. That’s what I mean by the implementation being broken.

  • What happened in my testing is the scrub reports the file(s) affected by the corruption, and continues on with the rest of the rebuild.

    But I tested only the corruption of a data strip, not a metadata strip. In the metadata case, it could be fatal depending on what sort of metadata is missing. The loss of a 64KiB strip could mean the loss of 4 default sized nodes or leaves, each of which could contain references for dozens of files and we wouldn’t even know what or where they were.

    Anyway it’s a bad bug, but I’d argue it’s less bad because it’d be far less common than this long standing bug which affects a lot more users (not just Btrfs, and not just raid):

  • Thanks Chris, I expect that even though Btrfs continues on with the rebuild, you’ve still got some corrupt file(s) there (could be lots, could be few)? That doesn’t seem ideal, although I guess one could replace those files from a known good backup. That other issue is interesting, thanks for posting the link.
