Replace disk in mdadm before it fails

The server has two disks, sda and sdd. Each carries two partitions that are mirrored as RAID 1 arrays: md0 for / and md1 for swap.

# lsblk
NAME      MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda         8:0    0 136.8G  0 disk
├─sda1      8:1    0 130.4G  0 part
│ └─md0     9:0    0 130.3G  0 raid1 /
└─sda2      8:2    0   6.4G  0 part
  └─md1     9:1    0   6.4G  0 raid1 [SWAP]
sdd         8:48   0 136.8G  0 disk
├─sdd1      8:49   0 130.4G  0 part
│ └─md0     9:0    0 130.3G  0 raid1 /
└─sdd2      8:50   0   6.4G  0 part
  └─md1     9:1    0   6.4G  0 raid1 [SWAP]

One disk started to act strangely: utilization spiked and latency began to climb. After some investigation one thing stood out: the disk's Non-medium error count was rising continuously, by hundreds per minute. The general opinion on the internet is that if this number keeps growing, you should start looking for a replacement disk. People usually see a few hundred of these errors in total; we were getting that many per minute, and the count had reached the millions.

# smartctl --all /dev/sdd | grep "Non-medium error count"
Non-medium error count: 7564451

Compare that to the other disk, and you will see the difference.

# smartctl --all /dev/sda | grep "Non-medium error count"
Non-medium error count: 35
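Since the growth rate matters more than the absolute number, it can help to sample the counter twice and compute errors per minute. The following is a minimal sketch, not a monitoring tool: the helper names nonmedium_count and errors_per_minute are made up for this example, and it assumes smartctl prints the counter in the "Non-medium error count: N" form shown above.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of smartmontools.

# Read smartctl output on stdin and print the Non-medium error count value.
nonmedium_count() {
    awk -F: '/Non-medium error count/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Print the growth rate in errors per minute (integer division).
#   $1 = first sample, $2 = second sample, $3 = seconds between samples
errors_per_minute() {
    echo $(( ($2 - $1) * 60 / $3 ))
}

# Example (needs root and a disk that reports this counter):
#   first=$(smartctl --all /dev/sdd | nonmedium_count)
#   sleep 60
#   second=$(smartctl --all /dev/sdd | nonmedium_count)
#   errors_per_minute "$first" "$second" 60
```

A healthy disk should report a rate at or near zero; a rate in the hundreds matches the behaviour described above.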

Mark the disk as failed.

# mdadm --manage /dev/md0 --fail /dev/sdd1
# mdadm --manage /dev/md1 --fail /dev/sdd2

Remove the failed partitions from the arrays.

# mdadm --manage /dev/md0 --remove /dev/sdd1
# mdadm --manage /dev/md1 --remove /dev/sdd2
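After the removal, both arrays should be running degraded on sda alone, which is worth confirming before pulling any hardware. Below is a minimal sketch that reads /proc/mdstat-style text. The function names md_member_status and md_is_degraded are hypothetical, and the parsing assumes the usual layout where the line containing "blocks" ends with a status string such as [UU] or [U_].

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of mdadm.

# Print the member-status string (e.g. "[UU]" or "[U_]") for one array.
# Reads /proc/mdstat-formatted text on stdin; $1 is the array name, e.g. md0.
md_member_status() {
    awk -v md="$1" '
        $1 == md && $2 == ":" { found = 1; next }
        found && /blocks/     { print $NF; exit }
    '
}

# Exit status: 0 = degraded, 1 = all members up, 2 = array not found.
md_is_degraded() {
    status=$(md_member_status "$1")
    case "$status" in
        "")  return 2 ;;   # array name did not appear in the input
        *_*) return 0 ;;   # at least one member slot is empty
        *)   return 1 ;;   # every member slot shows U
    esac
}

# Example:
#   md_member_status md0 < /proc/mdstat
#   md_is_degraded md1 < /proc/mdstat && echo "md1 is running on one disk"
```

With sdd1 and sdd2 removed, both arrays would be expected to report one empty slot until the new partitions are added.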

To be sure we pull the right disk, we need to locate it physically. Turn the locate LED on, and turn it off again once the drive has been found.

# ledctl locate=/dev/sdd
# ledctl locate_off=/dev/sdd

Remove the physical disk and replace it with the new one. Check lsblk to confirm the new disk shows up.

Copy the partition table from sda to sdd.

# sfdisk -d /dev/sda | sfdisk /dev/sdd

Partition table is copied to the new disk.
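Because the sfdisk pipeline overwrites whatever is on the target, a size check beforehand is a cheap safeguard. This is only a sketch: disk_fits is a hypothetical helper, and comparing whole-disk sizes is a deliberately conservative test (a slightly smaller disk could still hold the partitions if they end early enough, but this check would reject it).

```shell
#!/bin/sh
# Hypothetical helper for illustration.

# Succeed if the target disk is at least as large as the source disk.
#   $1 = source size in bytes, $2 = target size in bytes
disk_fits() {
    [ "$2" -ge "$1" ]
}

# Example (needs root; blockdev --getsize64 prints the device size in bytes):
#   src=$(blockdev --getsize64 /dev/sda)
#   dst=$(blockdev --getsize64 /dev/sdd)
#   disk_fits "$src" "$dst" && sfdisk -d /dev/sda | sfdisk /dev/sdd
```

Saving the output of sfdisk -d /dev/sda to a file first also leaves a copy of the layout to fall back on.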

Add the new disk to the RAID arrays, swap first because it's smaller and will rebuild quickly.

# mdadm --manage /dev/md1 --add /dev/sdd2

After that, add / and let it rebuild while you finish the rest of the checkups.

# mdadm --manage /dev/md0 --add /dev/sdd1
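While the arrays rebuild, progress is visible in /proc/mdstat. The sketch below pulls out the recovery percentage for one array; md_recovery_percent is a hypothetical helper, and it assumes the progress line has the usual "recovery = N%" form.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of mdadm.

# Print the rebuild percentage (e.g. "8.5%") for one array, or nothing if
# no recovery or resync is in progress. Reads /proc/mdstat-formatted text
# on stdin; $1 is the array name, e.g. md0.
md_recovery_percent() {
    awk -v md="$1" '
        $1 == md && $2 == ":" { found = 1; next }
        found && NF == 0      { exit }    # blank line ends this array block
        found && /recovery|resync/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { sub(/^=/, "", $i); print $i; exit }
        }
    '
}

# Example:
#   md_recovery_percent md0 < /proc/mdstat
#
# To simply block until the rebuild is done, mdadm can wait by itself:
#   mdadm --wait /dev/md0
```

An empty result after the rebuild, together with a status string made up only of U characters, indicates the array is back to full redundancy.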

We can see the final state is the same as when we started.

The difference is that we now have a functioning disk with the same utilization and latency as sda, and no increase in Non-medium error count.

All of this was done with zero downtime, because all the disks were in hot-swap drive bays. Even if the disk had failed completely, the server would have kept running, because swap was also on RAID 1, which was the intended configuration here.
