Linux soft RAID hanging on boot at Mounting Root

I have a Linux (Gentoo) server which has been somewhat unreliable, and suffers from frequent lockups[1]. Today, it started to hang at boot on "Mounting Root Filesystem".

I booted a recovery CD and took a look at the RAID arrays, all of which use Linux's MD software RAID1. They all assembled fine, and the ext3 and reiser3 filesystems on them mounted without trouble. So I started to look in more detail:

On doing a query of one of the components of the root RAID, I found:

# mdadm -Q /dev/hda2
/dev/hda2: is not an md array
/dev/hda2: device 0 in 2 device mismatch raid1 /dev/md3. Use mdadm --examine for more detail.

"Mismatch!" All the others show "active" or "inactive". I look closer and note "md3" - my root is md1, /boot is md3!
What is happening is that each RAID component records in its superblock which md device node it is assigned to. When booting, Linux is looking for /dev/md1 to mount the root. Knowing this to be an MD RAID, it examines the devices and starts those that match.

In this case, I've probably made a mistake during a previous recovery and assembled / as md3, which it has remembered. So on bootup, I have two arrays claiming to be /dev/md3 and none claiming to be the root device, which is set as /dev/md1 in the LILO boot loader.
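
To see at a glance which array each component thinks it belongs to, the second line of the query output can be boiled down to "array state". This is only a minimal sketch: the helper name summarise_query is hypothetical, and it assumes the query line has the same shape as the output quoted above.

```shell
#!/bin/sh
# summarise_query (hypothetical helper): read `mdadm -Q <component>` output on
# stdin and print "<array> <state>" for the line describing array membership.
# Assumes that line looks like the one quoted above:
#   /dev/hda2: device 0 in 2 device mismatch raid1 /dev/md3. Use mdadm ...
summarise_query() {
    sed -n 's|.* device \([a-z]*\) raid[0-9]* \(/dev/md[0-9]*\)\..*|\2 \1|p'
}

# Usage (as root, component name from this machine):
#   mdadm -Q /dev/hda2 | summarise_query
```

Any component that reports a state of "mismatch", or an array other than the one you expect, is worth a closer look with mdadm --examine.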

To fix this, you need to update the minor number recorded in the superblock. mdadm only does this while assembling the array, so do it from a fresh boot off your recovery disk, before the array has been started.

This is what I did:

# mdadm --assemble /dev/md1 --update=super-minor /dev/hda2 /dev/hdd2

Once done, a query shows:
# mdadm -Q /dev/hda2
/dev/hda2: is not an md array
/dev/hda2: device 0 in 2 device active raid1 /dev/md1. Use mdadm --examine for more detail.
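
The repair and the check afterwards can be wrapped up so the commands are printed for review before anything touches the disks. This is a sketch under assumptions, not a tested tool: the names run and fix_minor and the DRY_RUN switch are hypothetical, the device names are the ones from this machine, and as far as I know --update=super-minor only applies to the old 0.90 superblock format.

```shell
#!/bin/sh
# run (hypothetical helper): print the command when DRY_RUN is 1 (the default),
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

# fix_minor (hypothetical helper):
#   $1 = md node the array should be known as (e.g. /dev/md1)
#   remaining arguments = the component devices of that array
fix_minor() {
    node=$1; shift
    # Assemble under the wanted node, rewriting the minor in each superblock.
    run mdadm --assemble "$node" --update=super-minor "$@" || return 1
    # Query every component again to confirm what it now reports.
    for part in "$@"; do
        run mdadm -Q "$part"
    done
}

# Dry run: only prints the commands that would be executed.
fix_minor /dev/md1 /dev/hda2 /dev/hdd2
```

Run it with DRY_RUN=0 as root, from the recovery environment, once the printed commands look right.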

Rebooting, the root is mounted instantly and everything works. Huzzah!

[1] Once every couple of days, and almost certainly temperature related, as the environment has been getting very hot and humid at the same time. It has a hardware-based watchdog which brings it back up - I do like real server hardware. I pulled the heatsinks off the CPUs and noticed a lot of thermal transfer compound (which would be my fault) - I've wiped them down, left just a very thin film, and will see how well it works now.

========

Update:
I noticed that the machine is running the disks on mdma2, rather than udma5.
So I played with the kernel options (2.6.22-r9) to try to fix that, and on rebooting got the same hang at mounting root again. Going back to kernel 2.6.21-r5 solved both the mounting root and UDMA issues. So I suspect the real reason behind all this is a broken kernel revision, at least with Broadcom CSB5 (Intel SDS2 board).
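
One way to check which transfer mode a disk is actually using is to look for the mode that hdparm -i marks with an asterisk. hdparm is not mentioned above, so treat this as an assumption about the tooling: the helper name current_dma_mode is hypothetical, and the sketch assumes the usual hdparm -i layout, where the active mode appears as something like "*udma5".

```shell
#!/bin/sh
# current_dma_mode (hypothetical helper): read `hdparm -i <disk>` output on
# stdin and print the DMA mode marked with '*', e.g. "mdma2" or "udma5".
current_dma_mode() {
    # Put each word on its own line, then keep only a starred mdma/udma mode.
    tr -s ' \t' '\n\n' | sed -n 's/^\*\([um]dma[0-9]*\)$/\1/p'
}

# Usage (as root, disk name from this machine):
#   hdparm -i /dev/hda | current_dma_mode
```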