Wednesday, January 25, 2012

Nerding Out: File system issues

I've been chasing down an issue where my backup software (Vembu Storegrid, whose support is terrific) would hang and become unresponsive. Their support team logged in and helped me figure out that part of the disk appears to become unresponsive under heavy I/O (reading from and writing to the disk), such as during backups. They suggested running a filesystem repair.

So I logged in, planning to shut things down and unmount the drive to prevent corruption.

(First I check the name of the device I want to unmount: /data is on /dev/sdb1)

[root@sys-util-1 /]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              68G   45G   21G  69% /
tmpfs                 2.0G     0  2.0G   0% /dev/shm
/dev/sdb1             2.8T  2.6T  195G  94% /data
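
If you would rather not pick the device out of the df output by eye, the lookup can be scripted. This is only a sketch: the function name device_for_mountpoint is made up for this post, and it assumes a Linux box where /proc/mounts lists one mount per line as "device mountpoint fstype options ...".

```shell
# Print the device mounted at a given mount point.
# $1 = mount point, $2 = mounts table to read (defaults to /proc/mounts).
# device_for_mountpoint is a hypothetical helper name, not a standard tool.
device_for_mountpoint() {
    mountpoint=$1
    table=${2:-/proc/mounts}
    # If several entries share the mount point, the last one listed wins,
    # since later mounts cover earlier ones.
    awk -v mp="$mountpoint" '
        $2 == mp { dev = $1 }
        END { if (dev != "") print dev; else exit 1 }
    ' "$table"
}

# Example (on the server from this post): device_for_mountpoint /data
```

It returns a non-zero status when nothing is mounted at that path, so it can be used in an if statement before going any further.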


(Now I try and unmount it, but it's showing as busy)
[root@sys-util-1 /]# umount /data
umount: /data: device is busy
umount: /data: device is busy

(So now I try and force unmount it with no luck)
[root@sys-util-1 /]# umount -f /data
umount2: Device or resource busy
umount: /data: device is busy
umount2: Device or resource busy
umount: /data: device is busy
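
"Device is busy" means some process still has a file open on the filesystem or is sitting in one of its directories. A forgotten shell that has cd'd into the mount is a common culprit. Here is a rough sketch for spotting that particular case. It assumes Linux with /proc mounted, the function name pids_with_cwd_under is made up for this post, and it only looks at working directories, not open files.

```shell
# List PIDs whose current working directory is at or below the given directory.
# Linux-only sketch: relies on the /proc/<pid>/cwd symlinks.
# pids_with_cwd_under is a hypothetical helper name.
pids_with_cwd_under() {
    # Resolve the directory to its physical path (runs in a subshell,
    # so the caller's working directory does not change).
    target=$(cd "$1" && pwd -P) || return 1
    for link in /proc/[0-9]*/cwd; do
        # Processes we are not allowed to inspect are silently skipped.
        cwd=$(readlink "$link" 2>/dev/null) || continue
        case "$cwd" in
            "$target" | "$target"/*)
                pid=${link#/proc/}
                echo "${pid%/cwd}"
                ;;
        esac
    done
}

# Example (on the server from this post): pids_with_cwd_under /data
```

Run it as root, otherwise other users' processes are skipped.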

(Next I ran a "lazy" unmount, which detaches the mount point right away and finishes the unmount once nothing is using the filesystem any more)
[root@sys-util-1 /]# umount -l /data
[root@sys-util-1 /]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2              68G   45G   21G  69% /
tmpfs                 2.0G     0  2.0G   0% /dev/shm

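(Now df no longer lists the device, so I wanted to run the repair. It fails with the error below: the lazy unmount only detached the mount point, and the filesystem is still in use by the processes that were holding it open)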
[root@sys-util-1 /]# xfs_repair /dev/sdb1
xfs_repair: /dev/sdb1 contains a mounted filesystem
fatal error -- couldn't initialize XFS library

(I decided to try a basic check first instead, but that failed as well)
[root@sys-util-1 /]# xfs_check /dev/sdb1
xfs_check: /dev/sdb1 contains a mounted and writable filesystem
fatal error -- couldn't initialize XFS library

(So I mounted /data again and ran fuser to find out which processes were holding files or directories open on the drive. Then I killed them and ran fuser again to confirm that they went peacefully)
[root@sys-util-1 /]# mount /data
[root@sys-util-1 /]# fuser -vm /dev/sdb1
                     USER        PID ACCESS COMMAND
/dev/sdb1:           root       4567 f.... nautilus
                     root       4590 f.... trashapplet
                     root       4890 ..c.. bash
[root@sys-util-1 /]# kill 4567
[root@sys-util-1 /]# kill 4590
[root@sys-util-1 /]# kill 4890
[root@sys-util-1 /]# fuser -vm /dev/sdb1
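
The kill-and-confirm step above can be wrapped up in a small function if you have more than a handful of processes to deal with. This is a sketch, not what I actually ran: the name terminate_pids and the TIMEOUT variable are made up for this post. It sends a plain SIGTERM (the same thing kill does by default) and then polls until the processes are gone or the timeout runs out.

```shell
# Send SIGTERM to each PID, then wait up to $TIMEOUT seconds (default 10)
# for all of them to exit. Returns non-zero and names the stragglers on
# stderr if any are still alive. terminate_pids is a hypothetical helper name.
terminate_pids() {
    remaining=${TIMEOUT:-10}
    for pid in "$@"; do
        kill -TERM "$pid" 2>/dev/null || true
    done
    while :; do
        alive=""
        for pid in "$@"; do
            # kill -0 sends no signal; it only tests whether we can signal the PID.
            if kill -0 "$pid" 2>/dev/null; then
                alive="$alive $pid"
            fi
        done
        if [ -z "$alive" ]; then
            return 0
        fi
        if [ "$remaining" -le 0 ]; then
            break
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    echo "still running:$alive" >&2
    return 1
}

# Example (as root): terminate_pids 4567 4590 4890
```

Since fuser prints just the PIDs on standard output, something like terminate_pids $(fuser -m /data 2>/dev/null) should also work, but be careful: if your own shell is sitting in /data, it is on that list too.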

(Next I unmounted the drive again and ran the repair. We are in business.)
[root@sys-util-1 /]# umount /data
[root@sys-util-1 /]# xfs_repair /dev/sdb1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6

I won't bore you with the rest. After googling around, I couldn't find anyone who had clearly laid out how to deal with these errors, so I wanted to put something good out into the universe and hopefully help someone else.