Yeah, I’m now dealing with IBM GPFS (General Parallel File System). One of the important skills I need to acquire is performing filesystem repair with the ‘mmfsck’ utility (broadly similar to the normal Unix fsck).
Q: What does mmfsck do?
A: Checks and repairs a GPFS file system.
Here are the steps given by IBM GPFS support to check and repair a GPFS filesystem.
- Determine which node is the cluster manager – in this case, it’s sgbmedprod05:
    [root@sgbmedprod05 ~]# mmlsmgr -c
    Cluster manager node: 10.201.51.114 (sgbmedprod05)
- Unmount the filesystem on all nodes,
    [root@sgbmedprod05 ~]# df -h /dcs/data31
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/dcsdata31  1.0T  717G  308G  70% /dcs/data31
    [root@sgbmedprod05 ~]# mmumount /dcs/data31 -a
    Thu Sep 8 14:59:11 MYT 2011: mmumount: Unmounting file systems ...
- Run mmfsck with the -v (verbose output) and -n (answer no to all prompts) parameters,
    [root@sgbmedprod05 ~]# mmfsck /dev/dcsdata31 -v -n > mmfsck-dcsdata31.out 2>&1
- How to determine that the filesystem is not clean?
- Look for “InodeProblemList” in the mmfsck output,
    [root@sgbmedprod05 ~]# grep InodeProblemList mmfsck-dcsdata31.out
    InodeProblemList: 19 entries
    InodeProblemList: 19 entries
    InodeProblemList: 19 entries
    InodeProblemList: 19 entries
- Look for “Lost blocks” in the mmfsck output,
    [root@sgbmedprod05 ~]# grep "Lost blocks" mmfsck-dcsdata31.out
    Lost blocks were found.
    Lost blocks were found.
    Lost blocks were found.
    Lost blocks were found.
- Details by examining the file mmfsck-dcsdata31.out,
    InodeProblemList: 19 entries
    iNum           snapId     status keep delete noScan new error
    -------------- ---------- ------ ---- ------ ------ --- ------------------
            200274          0      1    0      0      0   1 0x00000008 AddrDuplicate
             28893          0      1    0      0      0   1 0x00000008 AddrDuplicate
             28985          0      1    0      0      0   1 0x00000008 AddrDuplicate
             28988          0      1    0      0      0   1 0x00000008 AddrDuplicate
             28906          0      1    0      0      0   1 0x00000008 AddrDuplicate
             28908          0      1    0      0      0   1 0x00000008 AddrDuplicate
             28909          0      1    0      0      0   1 0x00000008 AddrDuplicate
             28990          0      1    0      0      0   1 0x00000008 AddrDuplicate
             28996          0      1    0      0      0   1 0x00000008 AddrDuplicate
             28999          0      1    0      0      0   1 0x00000008 AddrDuplicate
            388503          0      1    0      0      0   1 0x00000008 AddrDuplicate
            389399          0      1    0      0      0   1 0x00000008 AddrDuplicate
            389400          0      1    0      0      0   1 0x00000008 AddrDuplicate
            389401          0      1    0      0      0   1 0x00000008 AddrDuplicate
            389402          0      1    0      0      0   1 0x00000008 AddrDuplicate
            389403          0      1    0      0      0   1 0x00000008 AddrDuplicate
            389404          0      1    0      0      0   1 0x00000008 AddrDuplicate
            389405          0      1    0      0      0   1 0x00000008 AddrDuplicate
            389406          0      1    0      0      0   1 0x00000008 AddrDuplicate

    Lost blocks were found.
    Correct the allocation map? no
    Lost blocks were found.
    Correct the allocation map? no
    Corrections are needed in the block allocation map.
    Correct the allocation map? no
    Lost blocks were found.
    Correct the allocation map? no
    Lost blocks were found.
    Correct the allocation map? no
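The "is it clean or not?" check above can be scripted instead of grepping by hand. Below is a rough sketch, assuming the saved report contains the same marker strings seen in this post. The function name `mmfsck_report_status` is my own, and real reports may well contain other indicators that this does not look for.

```shell
# Classify a saved mmfsck report using only the marker strings seen in this post.
# Assumption: these markers are enough; other problem messages are not detected.
mmfsck_report_status() {
    local report=$1
    if grep -q -e 'InodeProblemList:' \
               -e 'Lost blocks were found' \
               -e 'Corrections are needed' "$report"; then
        echo "needs-repair"
    elif grep -q 'File system is clean' "$report"; then
        echo "clean"
    else
        echo "unknown"
    fi
}

# Example (hypothetical invocation):
#   mmfsck_report_status mmfsck-dcsdata31.out
```

An "unknown" result just means none of the known strings matched, so the report still needs to be read by a human.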
- Mount the filesystem on cluster manager node,
    [root@sgbmedprod05 ~]# mmmount /dcs/data31
    Thu Sep 8 15:08:56 MYT 2011: mmmount: Mounting file systems ...
- Based on the mmfsck output file (mmfsck-dcsdata31.out), I need to create a file that contains the problematic inode numbers from the InodeProblemList info. In this case, I created a file named InodeProblemList-dcsdata31 with the contents below,
    [root@sgbmedprod05 ~]# cat InodeProblemList-dcsdata31
    200274
    28893
    28985
    28988
    28906
    28908
    28909
    28990
    28996
    28999
    388503
    389399
    389400
    389401
    389402
    389403
    389404
    389405
    389406
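Typing the inode numbers by hand is error-prone, so the list could be pulled out of the report with awk. This is only a sketch: it assumes the table rows look exactly like the ones in my report (nine columns, inode number first, a 0x... code in column eight), and that one inode number per line is an acceptable format for the list file. `extract_problem_inodes` is a made-up helper name.

```shell
# Print the unique inode numbers from the InodeProblemList rows of a saved
# mmfsck report. Assumes the 9-column row layout shown in this post.
extract_problem_inodes() {
    awk 'NF >= 9 && $1 ~ /^[0-9]+$/ && $8 ~ /^0x[0-9A-Fa-f]+$/ { print $1 }' "$1" |
        sort -un   # the list is repeated in the report, so de-duplicate
}

# Example (hypothetical invocation):
#   extract_problem_inodes mmfsck-dcsdata31.out > InodeProblemList-dcsdata31
```

The numbers come out sorted numerically, which is a different order from the report but should not matter for a lookup list.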
- Use tsfindinode to list the files that correspond to the problematic inodes,
    [root@sgbmedprod05 ~]# tsfindinode -i InodeProblemList-dcsdata31 /dcs/data31
    28893 /dcs/data31/TEST/InPortal/NSN_IN/20110818/cat.S1KEP01.20110818005538.20110818010038.0201.old
    389404 /dcs/data31/TEST/InPortal/NSN_IN/20110822/cft.S3NIL02.20110822052145.20110822052516.0648.old
    388503 /dcs/data31/TEST/InPortal/NSN_IN/20110822/cft.S3NIL01.20110822081330.20110822081510.0862.old
    28985 /dcs/data31/TEST/InPortal/NSN_IN/20110818/cat.S1KEP01.20110818052538.20110818053038.0595.old
    28999 /dcs/data31/TEST/InPortal/NSN_IN/20110817/cat.S1KEP01.20110817120603.20110817121054.1046.Z
    28906 /dcs/data31/TEST/InPortal/NSN_IN/20110817/cat.S1KEP01.20110817092609.20110817093109.0805.Z
    28996 /dcs/data31/TEST/InPortal/NSN_IN/20110817/cat.S1KEP01.20110817120109.20110817120603.1044.Z
    28909 /dcs/data31/TEST/InPortal/NSN_IN/20110817/cat.S1KEP01.20110817093609.20110817094109.0809.Z
    389400 /dcs/data31/TEST/InPortal/NSN_IN/20110822/cft.S3NIL02.20110822050648.20110822051044.0634.old
    389403 /dcs/data31/TEST/InPortal/NSN_IN/20110822/cft.S3NIL02.20110822051817.20110822052145.0646.old
    389399 /dcs/data31/TEST/InPortal/NSN_IN/20110822/cft.S3NIL02.20110822050232.20110822050648.0632.old
    389406 /dcs/data31/TEST/InPortal/NSN_IN/20110822/cft.S3NIL02.20110822052853.20110822053235.0659.old
    28908 /dcs/data31/TEST/InPortal/NSN_IN/20110817/cat.S1KEP01.20110817093109.20110817093609.0807.Z
    28990 /dcs/data31/TEST/InPortal/NSN_IN/20110818/cat.S1KEP01.20110818054038.20110818054538.0608.old
    389401 /dcs/data31/TEST/InPortal/NSN_IN/20110822/cft.S3NIL02.20110822051044.20110822051429.0636.old
    28988 /dcs/data31/TEST/InPortal/NSN_IN/20110818/cat.S1KEP01.20110818053538.20110818054038.0599.old
    389405 /dcs/data31/TEST/InPortal/NSN_IN/20110822/cft.S3NIL02.20110822052516.20110822052853.0650.old
    389402 /dcs/data31/TEST/InPortal/NSN_IN/20110822/cft.S3NIL02.20110822051429.20110822051817.0644.old
    200274 /dcs/data31/TEST/InPortal/NSN_IN/20110705/cat.S1KEP01.20110705015314.20110705015814.0413.old
According to IBM GPFS support, the above files are no longer usable; in simple words, they can be considered corrupted. He advised me to copy those files to another place, then delete them. But since they are no longer usable and must not be restored, why should I bother copying them?
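For anyone who does want to follow support's advice (copy elsewhere, then delete), here is a minimal sketch that works from saved tsfindinode output. The helper names `tsfindinode_paths` and `backup_then_delete` and the backup directory are hypothetical. It assumes every output row is an inode number followed by an absolute path, and it only deletes a file once the copy has succeeded, so a file that fails with an I/O error is left alone.

```shell
# Extract the path column from saved tsfindinode output (stdin),
# skipping rows such as "(notfound)" that do not carry an absolute path.
tsfindinode_paths() {
    sed -n -E 's#^[[:space:]]*[0-9]+[[:space:]]+(/.*)$#\1#p'
}

# Read paths on stdin, copy each file into a backup directory, then delete
# the original. The original is removed only if the copy succeeded.
backup_then_delete() {
    local backup_dir=$1 path
    mkdir -p -- "$backup_dir" || return 1
    while IFS= read -r path; do
        if cp -p -- "$path" "$backup_dir"/; then
            rm -f -- "$path"
        else
            echo "copy failed, left in place: $path" >&2
        fi
    done
}

# Example (hypothetical file and directory names):
#   tsfindinode_paths < tsfindinode.out | backup_then_delete /root/gpfs-backup
```

Files with the same base name would overwrite each other in the backup directory; that is fine for the list above, where every name is unique, but it is a limitation of this sketch.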
- To double-check that I have deleted all those files, just run tsfindinode again,
    [root@sgbmedprod05 ~]# tsfindinode -i InodeProblemList-dcsdata31 /dcs/data31
    28893 (notfound)
    28906 (notfound)
    28908 (notfound)
    28909 (notfound)
    28985 (notfound)
    28988 (notfound)
    28990 (notfound)
    28996 (notfound)
    28999 (notfound)
    200274 (notfound)
    388503 (notfound)
    389399 (notfound)
    389400 (notfound)
    389401 (notfound)
    389402 (notfound)
    389403 (notfound)
    389404 (notfound)
    389405 (notfound)
    389406 (notfound)
- Unmount the filesystem again
    [root@sgbmedprod05 ~]# mmumount /dcs/data31 -a
    Thu Sep 8 15:17:42 MYT 2011: mmumount: Unmounting file systems ...
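The "(notfound)" check that comes before this unmount is easy to get wrong by eye when the list is long. A small sketch, assuming the tsfindinode output was saved to a file and that every deleted inode is reported with a trailing "(notfound)" as in my run. `remaining_inodes` is a made-up name.

```shell
# Read saved tsfindinode output on stdin. Print the inode number of every
# non-empty row that does not end in "(notfound)"; the exit status is
# non-zero if any such row exists.
remaining_inodes() {
    awk 'NF && $NF != "(notfound)" { print $1; left = 1 } END { exit left + 0 }'
}

# Example (hypothetical file name):
#   remaining_inodes < tsfindinode-after-delete.out && echo "all gone"
```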
- Perform mmfsck again, this time with the -y (answer yes to all prompts) parameter,
    [root@sgbmedprod05 ~]# mmfsck /dev/dcsdata31 -v -y > mmfsck-dcsdata31-repair.out 2>&1
- Examine the mmfsck output file, and I get this,
    File system is clean.
Yay, it means the filesystem is now healthy and I can mount it again,
    [root@sgbmedprod05 ~]# mmmount /dcs/data31 -a
    Thu Sep 8 15:22:58 MYT 2011: mmmount: Mounting file systems ...
Some interesting experiences to share,
- Can’t delete the files that correspond to problematic inodes,
    [root@sgbmedprod05 ~]# tsfindinode -i InodeProblemList-dcsdata02 /dcs/data02
    350584 /dcs/data02/DATA_6.0/dcs/opr/N/NSN_ASC_TRN/ASC_OUT/09
    256133 /dcs/data02/DATA_6.0/dcs/opr/N/NSN_ASC_TRN/ASC_OUT/03
    454704 /dcs/data02/DATA_6.0/dcs/opr/N/NSN_DEC_TRN/DEC_OUT/05
    [root@sgbmedprod05 ~]# rm /dcs/data02/DATA_6.0/dcs/opr/N/NSN_ASC_TRN/ASC_OUT/09
    rm: cannot remove directory `/dcs/data02/DATA_6.0/dcs/opr/N/NSN_ASC_TRN/ASC_OUT/09': Is a directory
    [root@sgbmedprod05 ~]# rm -rf /dcs/data02/DATA_6.0/dcs/opr/N/NSN_ASC_TRN/ASC_OUT/09
    rm: cannot lstat `/dcs/data02/DATA_6.0/dcs/opr/N/NSN_ASC_TRN/ASC_OUT/09/NSN_IN_IN3CFT_347911.0102': Input/output error
    [root@sgbmedprod05 ~]# rm -rf /dcs/data02/DATA_6.0/dcs/opr/N/NSN_ASC_TRN/ASC_OUT/03
    rm: cannot lstat `/dcs/data02/DATA_6.0/dcs/opr/N/NSN_ASC_TRN/ASC_OUT/03/NSN_IN_IN1CFT_327423.0102': Input/output error
    [root@sgbmedprod05 ~]# rm -rf /dcs/data02/DATA_6.0/dcs/opr/N/NSN_DEC_TRN/DEC_OUT/05
    rm: cannot lstat `/dcs/data02/DATA_6.0/dcs/opr/N/NSN_DEC_TRN/DEC_OUT/05/NSN_IN_IN3CAT_192556.0103': Input/output error
The InodeProblemList info from the mmfsck output states the error as DirEntryBad (instead of AddrDuplicate as in the previous example),
    InodeProblemList: 3 entries
    iNum           snapId     status keep delete noScan new error
    -------------- ---------- ------ ---- ------ ------ --- ------------------
            256133          0      1    0      0      0   1 0x00010000 DirEntryBad
            350584          0      1    0      0      0   1 0x00010000 DirEntryBad
            454704          0      1    0      0      0   1 0x00010000 DirEntryBad
Solution: Run mmfsck -v -y, after which it should be possible to delete the files. Run mmfsck -v -y again so that the filesystem is really clean.
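Since AddrDuplicate and DirEntryBad needed different handling, it can help to tally the error types in a report before deciding what to do. A rough sketch, assuming the same 9-column InodeProblemList rows as in my reports, with the error name in the last column. `summarize_inode_errors` is a hypothetical helper name.

```shell
# Count distinct problem inodes per error name in a saved mmfsck report.
# Rows repeated across the report are counted once per (inode, error) pair.
summarize_inode_errors() {
    awk 'NF >= 9 && $1 ~ /^[0-9]+$/ && $8 ~ /^0x/ { seen[$1 " " $9] = 1 }
         END {
             for (k in seen) { split(k, part, " "); count[part[2]]++ }
             for (e in count) print e, count[e]
         }' "$1" | sort
}

# Example (hypothetical invocation):
#   summarize_inode_errors mmfsck-dcsdata02.out
```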
- What is the symptom of a problematic filesystem?
I once experienced an incident where a filesystem was suddenly unmounted on one node (the rest of the nodes were OK). This was what I got from syslog,
    Aug 22 12:53:54 sgbmedprod08 mmfs: Error=MMFS_SYSTEM_UNMOUNT, ID=0xC954F85D, Tag=4765838: Unrecoverable file system operation error. Status code 16. Volume dcsdata04
    Aug 22 13:51:16 sgbmedprod08 mmfs: Error=MMFS_SYSTEM_UNMOUNT, ID=0xC954F85D, Tag=4765839: Unrecoverable file system operation error. Status code 16. Volume dcsappl01
————-
    Aug 24 11:40:25 sgbmedprod08 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=5534377: Invalid disk data structure. Error code 107. Volume dcsdata02 . Sense Data
    Aug 24 11:40:25 sgbmedprod08 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=5534377: 6B 00 10 00 00 00 01 00
    Aug 24 11:40:25 sgbmedprod08 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=5534377: 00 00 00 64 85 02 00 00
    Aug 24 11:40:25 sgbmedprod08 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=5534377: 00 00 20 00 00 00 43 01
    Aug 24 11:40:25 sgbmedprod08 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=5534377: 00 00 00 00 00 00 60 01
    Aug 24 11:40:25 sgbmedprod08 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=5534377: 00 00 00 00 00 00 00 00
    Aug 24 11:40:25 sgbmedprod08 last message repeated 14 times
And the GPFS log gave more clues,
    Mon Aug 22 12:53:54.479 2011: File System dcsdata04 unmounted by the system with return code 16 reason code 0
    Mon Aug 22 12:53:54.480 2011: Device or resource busy
    Mon Aug 22 12:53:54 MYT 2011: mmcommon preunmount invoked. File system: dcsdata04 Reason: SGPanic
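To see quickly which filesystems are affected, the syslog entries can be reduced to "error type, volume" pairs. This is just a sketch that assumes the syslog lines have the same shape as the ones above (an `Error=MMFS_...` field and a `Volume <name>` field on the same line). `gpfs_error_volumes` is my own helper name, and the syslog path in the example is a guess that depends on the distribution.

```shell
# List unique "ERROR_TYPE volume" pairs from syslog lines that carry both an
# Error=MMFS_... field and a Volume field. Sense-data hex lines are skipped
# because they have no Volume field.
gpfs_error_volumes() {
    sed -n -E 's/.*Error=(MMFS_[A-Z_]+),.*Volume ([^ .]+).*/\1 \2/p' "$1" |
        LC_ALL=C sort -u
}

# Example (hypothetical log path):
#   gpfs_error_volumes /var/log/messages
```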
After I logged a case with IBM GPFS support, they advised me to perform mmfsck as per the example above.