Recovering Linux RAID5 with mdadm

If a Linux box has hardware trouble and you temporarily lose a disk or two from a RAID5 array, you can end up in a state where mdadm --assemble no longer works. This can happen after a controller failure, or if you have faulty cabling.
You're seeing stuff like:

mdadm: failed to run array /dev/md7: Input/output error
md: pers->run() failed

Don't panic yet!
First step: make sure you have good backups, and use dd or another tool to clone the hard disks before you change anything on them.
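As a rough sketch of the cloning step, a small wrapper around dd might look like the following. The function name clone_disk and the paths in the usage comment are made up for illustration; only the dd options themselves are standard.

```shell
# clone_disk SRC DST: copy SRC to DST block by block (hypothetical helper).
# conv=noerror keeps going after a read error; conv=sync pads a short or
# failed read with zeros so the data that follows stays at the right offset.
clone_disk() {
    dd if="$1" of="$2" bs=65536 conv=noerror,sync
}

# Example call (device and image paths are assumptions, adjust to your box):
#   clone_disk /dev/hda /mnt/spare/hda.img
```

Run the recovery attempts against the clones if you can, so the original disks stay untouched.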

What you need to do is recreate the RAID. This will work in most cases to get your data back, but it needs to be done carefully to ensure you don't destroy the data in the process.

This is much less of a worry for RAID1, as each disk holds a complete copy of your data - you can put one disk in and resync everything from that to any other disk.

For RAID5, the data is held across all disks. The thing to realise here is that the order of the disks in the array really matters. The trick to recreating a RAID5 and having it work is to get the order right.

Problems:
1. What if the order is not obvious?
2. Resyncing.

If you add the disks in the wrong order and start the array with all of its disks present, it will perform an initial sync of the array. This will destroy your data, as RAID5 starts to write parity data across it.
There may be a trick with mdadm to determine the correct order, but I do not know it (yet).

You must create the array in a degraded state, with one disk marked as missing. This will allow you to mount the file system, but will not cause a resync attempt.

So, here's the scenario. There are three disks in an array, hda, hdd, hdg. One failed completely (hdg) a while ago and you had to wait for new disks to be delivered. While waiting, there was an IDE failure and another disk was lost temporarily. Oh dear, we've got a broken array.

You bring the disk back online but the array won't auto-reassemble and mdadm --assemble isn't working. So we move on to recreating.

What you will do is attempt to create the array using two disks. The other will be marked as missing (even though we now have the replacement sitting on the workbench ready).
But we don't know what order the disks belong in the array - maybe we'll get lucky and they are alphabetical, maybe we won't.

This is what we'll do:

$ mdadm --create /dev/md7 --level=5 --raid-devices=3 -f /dev/hda1 /dev/hdd1 missing
$ cat /proc/mdstat
md7 : active raid5 hda1[2] hdd1[1]
240121472 blocks level 5, 64k chunk, algorithm 2 [3/2] [_UU]

So far so good: the RAID5 is running degraded and no resync has touched the data. So try to mount it read-only and see what happens.
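If you'd rather not eyeball /proc/mdstat every time, something like this can check it for you. It is only a sketch: mdstat_summary is a hypothetical helper, and it assumes the usual [total/active] counter (the [3/2] in the output above) and that a running rebuild shows up as a line containing "resync" or "recovery".

```shell
# mdstat_summary NAME: read /proc/mdstat-style text on stdin and report how
# many devices of array NAME are active and whether a resync is running.
mdstat_summary() {
    awk -v name="$1" '
        BEGIN { sync = "no" }
        # Header line of the array we care about, e.g. "md7 : active raid5 ..."
        $1 == name && $2 == ":" { inarr = 1; next }
        # A blank line or the next array header ends the section
        inarr && (NF == 0 || $2 == ":") { inarr = 0 }
        # "[3/2]" means 3 devices expected, 2 currently active
        inarr && match($0, /\[[0-9]+\/[0-9]+\]/) {
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            total = n[1]; active = n[2]; seen = 1
        }
        inarr && /resync|recovery/ { sync = "yes" }
        END {
            if (!seen) { print name ": not found"; exit 1 }
            printf "%s: %d of %d devices active, resync running: %s\n",
                   name, active, total, sync
        }
    '
}

# Example calls (the mount point /mnt/recovery is an assumption):
#   mdstat_summary md7 < /proc/mdstat
#   mount -o ro /dev/md7 /mnt/recovery
```

For the three-disk example you want to see 2 of 3 devices active and no resync running before you go anywhere near mount.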

If it worked, great. Back up your data and carry on with your life. If not, stop the array and try another order. Treat 'missing' like any other disk and move it around too. Perhaps get out a bit of paper and work out all the possible combinations to try.
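Instead of a bit of paper, a few lines of shell can write the list out. This is a sketch only: the helper names orders and print_candidates are invented here, and it just prints the candidate commands without running any of them. It also assumes mdadm's defaults for chunk size, layout and metadata version match the original array; if they don't, --chunk, --layout and --metadata would have to be added to every line.

```shell
ARRAY=/dev/md7

# orders PREFIX ITEM...: print every ordering of the ITEMs, one per line.
# The body is a subshell so each recursion level keeps its own variables.
orders() (
    prefix=$1
    shift
    if [ "$#" -eq 0 ]; then
        echo "$prefix"
    else
        i=1
        for item in "$@"; do
            # rest = every item except the i-th one
            rest=
            j=1
            for other in "$@"; do
                if [ "$j" -ne "$i" ]; then rest="$rest $other"; fi
                j=$((j + 1))
            done
            orders "$prefix $item" $rest
            i=$((i + 1))
        done
    fi
)

# print_candidates DEVICE...: print one mdadm --create line per ordering.
# Nothing is executed; the output is only a checklist to work through.
print_candidates() {
    orders "" "$@" | while read -r devs; do
        echo "mdadm --create $ARRAY --level=5 --raid-devices=$# -f $devs"
    done
}

print_candidates /dev/hda1 /dev/hdd1 missing
```

With two disks plus 'missing' that is six lines. Tick each one off as you try it, and remember to stop the array between attempts.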

$ mdadm -S /dev/md7
$ mdadm --create /dev/md7 --level=5 --raid-devices=3 -f /dev/hda1 missing /dev/hdd1

Keep repeating this until the array successfully mounts your file system.

When it has finally worked, and you've backed up, you can add your new disk back in. The array will resync your data across all three disks (or whatever number you have) and everything will be back to normal.
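For completeness, adding the replacement could look something like this. Treat it as a sketch: readd_disk is a hypothetical wrapper, /dev/hdg1 is only my guess at what the new disk's partition will be called, and by default it just prints the command so nothing happens by accident.

```shell
# readd_disk ARRAY DEVICE: add DEVICE to ARRAY so md rebuilds onto it.
# Prints the command by default; set APPLY=1 to really run it (needs root).
readd_disk() {
    if [ "${APPLY:-0}" = 1 ]; then
        mdadm "$1" --add "$2"
    else
        echo "mdadm $1 --add $2"
    fi
}

# /dev/hdg1 is an assumed name for the replacement partition
readd_disk /dev/md7 /dev/hdg1
```

Once the disk is really added, cat /proc/mdstat shows the rebuild progress.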