There are two disks, each holding a partition for / and a partition for swap, mirrored as the RAID1 arrays md0 and md1.
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 136.8G 0 disk
├─sda1 8:1 0 130.4G 0 part
│ └─md0 9:0 0 130.3G 0 raid1 /
└─sda2 8:2 0 6.4G 0 part
  └─md1 9:1 0 6.4G 0 raid1 [SWAP]
sdd 8:48 0 136.8G 0 disk
├─sdd1 8:49 0 130.4G 0 part
│ └─md0 9:0 0 130.3G 0 raid1 /
└─sdd2 8:50 0 6.4G 0 part
  └─md1 9:1 0 6.4G 0 raid1 [SWAP]
One disk started to act strangely: utilization spiked and latency began to increase. After some investigation one thing stood out: the disk's Non-medium error count was increasing continuously, by hundreds a minute. The general opinion on the internet is that if this number keeps growing, you should start looking for a replacement disk. People usually see a few hundred of these errors in total, but we were getting that many per minute, and the count had reached the millions.
# smartctl --all /dev/sdd | grep "Non-medium error count"
Non-medium error count: 7564451
Compare that to the other disk and you will see the difference.
# smartctl --all /dev/sda | grep "Non-medium error count"
Non-medium error count: 35
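The absolute count matters less than how fast it grows, so it can help to sample it twice. The following is a minimal sketch, assuming smartmontools is installed and the commands run as root; the function names `nonmedium_count` and `nonmedium_growth` and the default 60-second interval are made up for illustration.

```shell
#!/bin/sh
# Extract the "Non-medium error count" value from smartctl output on stdin.
nonmedium_count() {
    awk -F: '/Non-medium error count/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

# Report how much the counter grew on one device over an interval.
# $1 = device, $2 = interval in seconds (assumed default: 60).
nonmedium_growth() {
    dev=$1
    interval=${2:-60}
    before=$(smartctl --all "$dev" | nonmedium_count)
    sleep "$interval"
    after=$(smartctl --all "$dev" | nonmedium_count)
    # Bail out if the drive does not report this counter at all.
    if [ -z "$before" ] || [ -z "$after" ]; then
        echo "$dev: no Non-medium error count reported" >&2
        return 1
    fi
    echo "$dev: +$((after - before)) non-medium errors in ${interval}s"
}
```

Calling `nonmedium_growth /dev/sdd` and then `nonmedium_growth /dev/sda` gives a per-minute comparison of the suspect disk and the healthy one.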
Mark the disk's partitions as failed in both arrays.
# mdadm --manage /dev/md0 --fail /dev/sdd1
# mdadm --manage /dev/md1 --fail /dev/sdd2
Remove the disk from the array configuration.
# mdadm --manage /dev/md0 --remove /dev/sdd1
# mdadm --manage /dev/md1 --remove /dev/sdd2
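Before pulling anything, it is worth confirming that both arrays are now running degraded on sda alone. This sketch reads /proc/mdstat and prints every array whose member status (for example `[U_]`) shows a missing member; the function name `degraded_arrays` is hypothetical, and passing `-` as the argument makes it read from standard input instead.

```shell
#!/bin/sh
# List md arrays whose status field contains "_" (a missing member).
# $1 = mdstat file to read (defaults to /proc/mdstat, "-" means stdin).
degraded_arrays() {
    awk '/^md[0-9]+ :/     { name = $1 }      # remember the current array
         /\[U*_[U_]*\]/    { print name }     # status like [U_] or [_U]
        ' "${1:-/proc/mdstat}"
}
```

After the two `--remove` commands, `degraded_arrays` would be expected to print both md0 and md1. `mdadm --detail /dev/md0` shows the same information in more detail.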
To be sure we pull the right disk, we need to locate it. Turn the locate LED on, and turn it off again once the drive has been found.
# ledctl locate=/dev/sdd
# ledctl locate_off=/dev/sdd
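As a second check alongside the LED, the serial number printed on the drive label can be compared with what the system reports. This is a sketch under the assumption that `smartctl -i` prints a "Serial number:" line for the drive (the capitalization differs between drive types, so the match is case-insensitive); `disk_serial` is an illustrative name.

```shell
#!/bin/sh
# Print the serial number found in "smartctl -i" output on stdin.
disk_serial() {
    awk 'tolower($0) ~ /^serial number:/ { print $NF }'
}

# Usage (as root, requires smartmontools):
#   smartctl -i /dev/sdd | disk_serial
```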
Remove the physical disk and replace it with the new one. Check lsblk to see the new disk.
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 136.8G 0 disk
├─sda1 8:1 0 130.4G 0 part
│ └─md0 9:0 0 130.3G 0 raid1 /
└─sda2 8:2 0 6.4G 0 part
  └─md1 9:1 0 6.4G 0 raid1 [SWAP]
sdd 8:48 0 136.8G 0 disk
Copy the partition table from sda to sdd.
# sfdisk -d /dev/sda | sfdisk /dev/sdd
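Getting source and target the wrong way round here would overwrite the partition table of the only healthy disk. A slightly more defensive variant, sketched below, keeps the dump in a file and refuses to write to a disk that still has partitions in an md array. The function names and the backup path under /root are assumptions for illustration, not part of the original procedure.

```shell
#!/bin/sh
# Succeed if any partition of the given disk is listed as an md member.
# $1 = disk (e.g. /dev/sdd), $2 = mdstat file (defaults to /proc/mdstat).
disk_in_md() {
    disk=$(basename "$1")
    grep -Eq "(^|[[:space:]])${disk}[0-9]*\[" "${2:-/proc/mdstat}"
}

# Copy the partition table from $1 to $2, keeping the dump as a backup.
copy_partition_table() {
    src=$1
    dst=$2
    if disk_in_md "$dst"; then
        echo "refusing: $dst still has partitions in an md array" >&2
        return 1
    fi
    dump="/root/$(basename "$src").sfdisk"   # hypothetical backup location
    sfdisk -d "$src" > "$dump" && sfdisk "$dst" < "$dump"
}
```

With this, `copy_partition_table /dev/sda /dev/sdd` does the same job as the one-liner, while the reversed call fails because sda is still an active member of md0 and md1.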
The partition table is now copied to the new disk.
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 136.8G 0 disk
├─sda1 8:1 0 130.4G 0 part
│ └─md0 9:0 0 130.3G 0 raid1 /
└─sda2 8:2 0 6.4G 0 part
  └─md1 9:1 0 6.4G 0 raid1 [SWAP]
sdd 8:48 0 136.8G 0 disk
├─sdd1 8:49 0 130.4G 0 part
└─sdd2 8:50 0 6.4G 0 part
Add the new disk to the RAID arrays, swap first because it is smaller and will rebuild quickly.
# mdadm --manage /dev/md1 --add /dev/sdd2
After that, add / and let it rebuild while you finish the rest of the checks.
# mdadm --manage /dev/md0 --add /dev/sdd1
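The rebuild can be followed in /proc/mdstat, which shows a recovery percentage for each array that is resyncing. The sketch below pulls out just the array name and percentage; `rebuild_progress` is an illustrative name, and passing `-` makes it read from standard input.

```shell
#!/bin/sh
# Print "<array> <percent>" for every array that is currently rebuilding.
# $1 = mdstat file to read (defaults to /proc/mdstat, "-" means stdin).
rebuild_progress() {
    awk '/^md[0-9]+ :/ { name = $1 }
         /(recovery|resync) *=/ {
             # The percentage is the only field on this line ending in "%".
             for (i = 1; i <= NF; i++)
                 if ($i ~ /%$/) print name, $i
         }' "${1:-/proc/mdstat}"
}

# Usage:
#   rebuild_progress            # one-off check
#   mdadm --wait /dev/md0       # block until the rebuild of md0 has finished
```

No output from `rebuild_progress` means that no array is rebuilding at the moment.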
We can see that the final state is the same as when we started.
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 136.8G 0 disk
├─sda1 8:1 0 130.4G 0 part
│ └─md0 9:0 0 130.3G 0 raid1 /
└─sda2 8:2 0 6.4G 0 part
  └─md1 9:1 0 6.4G 0 raid1 [SWAP]
sdd 8:48 0 136.8G 0 disk
├─sdd1 8:49 0 130.4G 0 part
│ └─md0 9:0 0 130.3G 0 raid1 /
└─sdd2 8:50 0 6.4G 0 part
  └─md1 9:1 0 6.4G 0 raid1 [SWAP]
The difference is that we now have a functioning disk with the same utilization and latency as sda, and no increase in Non-medium error count.
All of this was done with zero downtime because all the disks were in hot-swap drive bays. Even if the disk had failed completely, the server would have kept running, because swap was on RAID1 as well, which was the intended configuration here.