
IBM GPFS: Performing File System Consistency Check

Yeah, I’m now dealing with IBM GPFS (General Parallel File System). One of the important skills I need to acquire is repairing a filesystem with the ‘mmfsck’ utility (similar to the normal Unix fsck).

Q: What does mmfsck do?
A: Checks and repairs a GPFS file system.

Here are the steps given by IBM GPFS support to check and repair a GPFS filesystem.

  • Determine which node is the cluster manager – in this case, it’s sgbmedprod05:
    [root@sgbmedprod05 ~]# mmlsmgr -c
    Cluster manager node: 10.201.51.114 (sgbmedprod05)
  • Unmount the filesystem on all nodes,
    [root@sgbmedprod05 ~]# df -h /dcs/data31
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/dcsdata31        1.0T  717G  308G  70% /dcs/data31
    [root@sgbmedprod05 ~]# mmumount /dcs/data31 -a
    Thu Sep  8 14:59:11 MYT 2011: mmumount: Unmounting file systems ...
  • Run mmfsck with the -v (verbose output) and -n (respond no to all prompts) parameters,
    [root@sgbmedprod05 ~]# mmfsck /dev/dcsdata31 -v -n > mmfsck-dcsdata31.out 2>&1
  • How to tell that the filesystem is not clean?
    • Look for “InodeProblemList” from the mmfsck output,
      [root@sgbmedprod05 ~]# grep InodeProblemList mmfsck-dcsdata31.out
      InodeProblemList: 19 entries
      InodeProblemList: 19 entries
      InodeProblemList: 19 entries
      InodeProblemList: 19 entries
    • Look for “Lost blocks” from the mmfsck output,
      [root@sgbmedprod05 ~]# grep "Lost blocks" mmfsck-dcsdata31.out
      Lost blocks were found.
      Lost blocks were found.
      Lost blocks were found.
      Lost blocks were found.
  • Details from examining the file mmfsck-dcsdata31.out,
    InodeProblemList: 19 entries
    iNum           snapId     status keep delete noScan new error
    -------------- ---------- ------ ---- ------ ------ --- ------------------
           200274          0      1    0      0      0   1 0x00000008 AddrDuplicate
            28893          0      1    0      0      0   1 0x00000008 AddrDuplicate
            28985          0      1    0      0      0   1 0x00000008 AddrDuplicate
            28988          0      1    0      0      0   1 0x00000008 AddrDuplicate
            28906          0      1    0      0      0   1 0x00000008 AddrDuplicate
            28908          0      1    0      0      0   1 0x00000008 AddrDuplicate
            28909          0      1    0      0      0   1 0x00000008 AddrDuplicate
            28990          0      1    0      0      0   1 0x00000008 AddrDuplicate
            28996          0      1    0      0      0   1 0x00000008 AddrDuplicate
            28999          0      1    0      0      0   1 0x00000008 AddrDuplicate
           388503          0      1    0      0      0   1 0x00000008 AddrDuplicate
           389399          0      1    0      0      0   1 0x00000008 AddrDuplicate
           389400          0      1    0      0      0   1 0x00000008 AddrDuplicate
           389401          0      1    0      0      0   1 0x00000008 AddrDuplicate
           389402          0      1    0      0      0   1 0x00000008 AddrDuplicate
           389403          0      1    0      0      0   1 0x00000008 AddrDuplicate
           389404          0      1    0      0      0   1 0x00000008 AddrDuplicate
           389405          0      1    0      0      0   1 0x00000008 AddrDuplicate
           389406          0      1    0      0      0   1 0x00000008 AddrDuplicate
    Lost blocks were found.
    Correct the allocation map? no
    
    Lost blocks were found.
    Correct the allocation map? no
    
    Corrections are needed in the block allocation map.
    Correct the allocation map? no
    
    Lost blocks were found.
    Correct the allocation map? no
    
    Lost blocks were found.
    Correct the allocation map? no
  • Mount the filesystem on cluster manager node,
    [root@sgbmedprod05 ~]# mmmount /dcs/data31
    Thu Sep  8 15:08:56 MYT 2011: mmmount: Mounting file systems ...
  • Based on the mmfsck output file (mmfsck-dcsdata31.out), I need to create a file that contains the problematic inode numbers from the InodeProblemList info. In this case, I created a file named InodeProblemList-dcsdata31 with the following contents,
    [root@sgbmedprod05 ~]# cat InodeProblemList-dcsdata31
    200274
    28893
    28985
    28988
    28906
    28908
    28909
    28990
    28996
    28999
    388503
    389399
    389400
    389401
    389402
    389403
    389404
    389405
    389406
  • Use tsfindinode to list the files that correspond to the problematic inodes,
    [root@sgbmedprod05 ~]# tsfindinode -i InodeProblemList-dcsdata31 /dcs/data31
         28893      /dcs/data31/TEST/InPortal/NSN_IN/20110818/cat.S1KEP01.20110818005538.20110818010038.0201.old
        389404      /dcs/data31/TEST/InPortal/NSN_IN/20110822/cft.S3NIL02.20110822052145.20110822052516.0648.old
        388503      /dcs/data31/TEST/InPortal/NSN_IN/20110822/cft.S3NIL01.20110822081330.20110822081510.0862.old
         28985      /dcs/data31/TEST/InPortal/NSN_IN/20110818/cat.S1KEP01.20110818052538.20110818053038.0595.old
         28999      /dcs/data31/TEST/InPortal/NSN_IN/20110817/cat.S1KEP01.20110817120603.20110817121054.1046.Z
         28906      /dcs/data31/TEST/InPortal/NSN_IN/20110817/cat.S1KEP01.20110817092609.20110817093109.0805.Z
         28996      /dcs/data31/TEST/InPortal/NSN_IN/20110817/cat.S1KEP01.20110817120109.20110817120603.1044.Z
         28909      /dcs/data31/TEST/InPortal/NSN_IN/20110817/cat.S1KEP01.20110817093609.20110817094109.0809.Z
        389400      /dcs/data31/TEST/InPortal/NSN_IN/20110822/cft.S3NIL02.20110822050648.20110822051044.0634.old
        389403      /dcs/data31/TEST/InPortal/NSN_IN/20110822/cft.S3NIL02.20110822051817.20110822052145.0646.old
        389399      /dcs/data31/TEST/InPortal/NSN_IN/20110822/cft.S3NIL02.20110822050232.20110822050648.0632.old
        389406      /dcs/data31/TEST/InPortal/NSN_IN/20110822/cft.S3NIL02.20110822052853.20110822053235.0659.old
         28908      /dcs/data31/TEST/InPortal/NSN_IN/20110817/cat.S1KEP01.20110817093109.20110817093609.0807.Z
         28990      /dcs/data31/TEST/InPortal/NSN_IN/20110818/cat.S1KEP01.20110818054038.20110818054538.0608.old
        389401      /dcs/data31/TEST/InPortal/NSN_IN/20110822/cft.S3NIL02.20110822051044.20110822051429.0636.old
         28988      /dcs/data31/TEST/InPortal/NSN_IN/20110818/cat.S1KEP01.20110818053538.20110818054038.0599.old
        389405      /dcs/data31/TEST/InPortal/NSN_IN/20110822/cft.S3NIL02.20110822052516.20110822052853.0650.old
        389402      /dcs/data31/TEST/InPortal/NSN_IN/20110822/cft.S3NIL02.20110822051429.20110822051817.0644.old
        200274      /dcs/data31/TEST/InPortal/NSN_IN/20110705/cat.S1KEP01.20110705015314.20110705015814.0413.old

    According to IBM GPFS support, the files above are no longer usable; in simple words, they can be considered corrupted. He advised me to copy those files to another place, then delete them. But since they are no longer usable and must not be restored, why should I bother copying them?

  • To double-check that all those files have been deleted, run tsfindinode again,
    [root@sgbmedprod05 ~]# tsfindinode -i InodeProblemList-dcsdata31 /dcs/data31
         28893      (notfound)
         28906      (notfound)
         28908      (notfound)
         28909      (notfound)
         28985      (notfound)
         28988      (notfound)
         28990      (notfound)
         28996      (notfound)
         28999      (notfound)
        200274      (notfound)
        388503      (notfound)
        389399      (notfound)
        389400      (notfound)
        389401      (notfound)
        389402      (notfound)
        389403      (notfound)
        389404      (notfound)
        389405      (notfound)
        389406      (notfound)
  • Unmount the filesystem again
    [root@sgbmedprod05 ~]# mmumount /dcs/data31 -a
    Thu Sep  8 15:17:42 MYT 2011: mmumount: Unmounting file systems ...
  • Perform mmfsck again, this time with the -y (respond yes to all prompts) parameter,
    [root@sgbmedprod05 ~]# mmfsck /dev/dcsdata31 -v -y > mmfsck-dcsdata31-repair.out 2>&1
  • Examine the mmfsck output file; this is what I got,
    File system is clean.

    Yay, it means the filesystem is now healthy and I can mount it again on all nodes,

    [root@sgbmedprod05 ~]# mmmount /dcs/data31 -a
    Thu Sep  8 15:22:58 MYT 2011: mmmount: Mounting file systems ...
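The inode list file that was built by hand in the procedure above could also be generated from the saved mmfsck report. Below is a minimal sketch, assuming the report rows look exactly like the InodeProblemList table shown earlier (inode number in the first column, a 0x... error code in the eighth). The function name extract_problem_inodes is hypothetical and not part of GPFS. Since the list appears several times in the report, duplicates are dropped.

```shell
#!/bin/sh
# Hypothetical helper (not a GPFS command): print the unique inode numbers
# found in InodeProblemList rows of a saved mmfsck report.
# Assumption: a data row has a numeric first field (iNum) and a hex error
# code such as 0x00000008 in the eighth field, as in the sample output.
extract_problem_inodes() {
    awk '$1 ~ /^[0-9]+$/ && $8 ~ /^0x[0-9A-Fa-f]+$/ && !seen[$1]++ { print $1 }' "$1"
}

# Example use, with the file names from this post:
#   extract_problem_inodes mmfsck-dcsdata31.out > InodeProblemList-dcsdata31
```

The resulting file can then be passed to tsfindinode with -i, as in the steps above. It is worth comparing the line count with the "N entries" figure in the report before trusting it.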

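The grep checks used in the procedure above to judge whether a filesystem is clean can be wrapped into one small helper. This is only a sketch: the function name mmfsck_status is hypothetical, and the marker strings are simply the ones that appear in the reports quoted in this post, so other GPFS versions may word them differently. It is meant for reports from a check-only (-n) run; a repair (-y) report may mention problems that were already fixed.

```shell
#!/bin/sh
# Hypothetical helper (not a GPFS command): classify a saved mmfsck report.
# Prints "needs-repair", "clean", or "unknown".
# Assumption: the marker strings below match the report wording seen here.
mmfsck_status() {
    report=$1
    if grep -q -e 'InodeProblemList' \
               -e 'Lost blocks were found' \
               -e 'Corrections are needed' "$report"; then
        echo "needs-repair"
    elif grep -q 'File system is clean' "$report"; then
        echo "clean"
    else
        echo "unknown"
    fi
}

# Example use:
#   mmfsck_status mmfsck-dcsdata31.out
```

An "unknown" result means none of the markers were found, in which case the report should be read by hand.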
Some interesting experiences to share,

  • Can’t delete the files that correspond to the problematic inodes,

    [root@sgbmedprod05 ~]# tsfindinode -i InodeProblemList-dcsdata02 /dcs/data02
        350584      /dcs/data02/DATA_6.0/dcs/opr/N/NSN_ASC_TRN/ASC_OUT/09
        256133      /dcs/data02/DATA_6.0/dcs/opr/N/NSN_ASC_TRN/ASC_OUT/03
        454704      /dcs/data02/DATA_6.0/dcs/opr/N/NSN_DEC_TRN/DEC_OUT/05
    [root@sgbmedprod05 ~]# rm /dcs/data02/DATA_6.0/dcs/opr/N/NSN_ASC_TRN/ASC_OUT/09
    rm: cannot remove directory `/dcs/data02/DATA_6.0/dcs/opr/N/NSN_ASC_TRN/ASC_OUT/09': Is a directory
    [root@sgbmedprod05 ~]# rm -rf /dcs/data02/DATA_6.0/dcs/opr/N/NSN_ASC_TRN/ASC_OUT/09
    rm: cannot lstat `/dcs/data02/DATA_6.0/dcs/opr/N/NSN_ASC_TRN/ASC_OUT/09/NSN_IN_IN3CFT_347911.0102': Input/output error
    [root@sgbmedprod05 ~]# rm -rf /dcs/data02/DATA_6.0/dcs/opr/N/NSN_ASC_TRN/ASC_OUT/03
    rm: cannot lstat `/dcs/data02/DATA_6.0/dcs/opr/N/NSN_ASC_TRN/ASC_OUT/03/NSN_IN_IN1CFT_327423.0102': Input/output error
    [root@sgbmedprod05 ~]# rm -rf /dcs/data02/DATA_6.0/dcs/opr/N/NSN_DEC_TRN/DEC_OUT/05
    rm: cannot lstat `/dcs/data02/DATA_6.0/dcs/opr/N/NSN_DEC_TRN/DEC_OUT/05/NSN_IN_IN3CAT_192556.0103': Input/output error

    The InodeProblemList info from the mmfsck output states the error as DirEntryBad (instead of AddrDuplicate as in the previous example),

    InodeProblemList: 3 entries
    iNum           snapId     status keep delete noScan new error
    -------------- ---------- ------ ---- ------ ------ --- ------------------
           256133          0      1    0      0      0   1 0x00010000 DirEntryBad
           350584          0      1    0      0      0   1 0x00010000 DirEntryBad
           454704          0      1    0      0      0   1 0x00010000 DirEntryBad

    Solution:
    Run mmfsck -v -y, after which the files can be deleted. Then run mmfsck -v -y again to make sure the filesystem is really clean.

  • What are the symptoms of a problematic filesystem?
    I once experienced an incident in which a filesystem was suddenly unmounted on one node (the rest of the nodes were OK). This is what I got from syslog,

    Aug 22 12:53:54 sgbmedprod08 mmfs: Error=MMFS_SYSTEM_UNMOUNT, ID=0xC954F85D, Tag=4765838:   Unrecoverable file system operation error.  Status code 16.   Volume dcsdata04
    Aug 22 13:51:16 sgbmedprod08 mmfs: Error=MMFS_SYSTEM_UNMOUNT, ID=0xC954F85D, Tag=4765839:   Unrecoverable file system operation error.  Status code 16.   Volume dcsappl01

    ————-

    Aug 24 11:40:25 sgbmedprod08 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=5534377:   Invalid disk data structure.  Error code 107.   Volume dcsdata02
                                       . Sense Data
    Aug 24 11:40:25 sgbmedprod08 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=5534377:    6B 00 10 00 00 00 01 00
    Aug 24 11:40:25 sgbmedprod08 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=5534377:    00 00 00 64 85 02 00 00
    Aug 24 11:40:25 sgbmedprod08 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=5534377:    00 00 20 00 00 00 43 01
    Aug 24 11:40:25 sgbmedprod08 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=5534377:    00 00 00 00 00 00 60 01
    Aug 24 11:40:25 sgbmedprod08 mmfs: Error=MMFS_FSSTRUCT, ID=0x94B1F045, Tag=5534377:    00 00 00 00 00 00 00 00
    Aug 24 11:40:25 sgbmedprod08 last message repeated 14 times

    And the GPFS log gave more clues,

    Mon Aug 22 12:53:54.479 2011: File System dcsdata04 unmounted by the system with return code 16 reason code 0
    Mon Aug 22 12:53:54.480 2011: Device or resource busy
    Mon Aug 22 12:53:54 MYT 2011: mmcommon preunmount invoked.  File system: dcsdata04  Reason: SGPanic

    After I logged a case with IBM GPFS support, they advised me to perform mmfsck as per the example above.
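The syslog symptoms above can be searched for mechanically. Here is a rough sketch that lists which volumes were hit by MMFS_SYSTEM_UNMOUNT or MMFS_FSSTRUCT errors. The function name gpfs_affected_volumes is hypothetical, the log path in the usage comment is an assumption about where syslog writes on the node, and the parsing relies on the "Volume <name>" wording seen in the log lines quoted here.

```shell
#!/bin/sh
# Hypothetical helper (not a GPFS command): print "ERROR_TYPE VOLUME" once
# for each distinct pair found in a syslog file.
# Assumption: error lines contain "Error=MMFS_...," and "Volume <name>",
# as in the samples above. Sense Data lines carry no volume and are skipped.
gpfs_affected_volumes() {
    awk '
        /mmfs: Error=MMFS_(SYSTEM_UNMOUNT|FSSTRUCT),/ {
            err = ""; vol = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^Error=/) {
                    err = $i
                    sub(/^Error=/, "", err)
                    sub(/,$/, "", err)
                }
                if ($i == "Volume" && i < NF) vol = $(i + 1)
            }
            if (vol != "" && !seen[err " " vol]++) print err, vol
        }
    ' "$1"
}

# Example use (the path is an assumption, adjust to the local syslog file):
#   gpfs_affected_volumes /var/log/messages
```

Any volume reported this way is a candidate for the mmfsck procedure described in this post, once IBM support has confirmed it.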


{ 2 } Comments

  1. mr.fakap | September 10, 2011 at 12:09 am | Permalink
    Using Google Chrome 13.0.782.220 on Windows 7

    a g33k post after quite a while….

  2. Irwan | September 11, 2011 at 6:20 pm | Permalink
    Using Safari 7534.48.3 on iOS 5.0

    Just earning a living, bro.

