I was given the task of determining the root cause of a server outage. According to the information I received, the server suddenly hung. It runs RHEL 5.2 on an HP ProLiant blade.
Operating System Info
[root@localhost ~]# uname -a
Linux localhost 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:15 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
[root@localhost ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.2 (Tikanga)
hpasmcli> show server
System        : ProLiant BL460c G1
Serial No.    : [deleted]
ROM version   : I15 02/29/2008
iLo present   : Yes
Embedded NICs : 2
    NIC1 MAC: 00:21:5a:48:7b:cc
    NIC2 MAC: 00:21:5a:48:7b:ca
Processor: 0
    Name         : Intel Xeon
    Stepping     : 11
    Speed        : 3000 MHz
    Bus          : 1333 MHz
    Core         : 4
    Thread       : 4
    Socket       : 1
    Level2 Cache : 8192 KBytes
    Status       : Ok
Processor total  : 1
Memory installed : 16384 MBytes
ECC supported    : Yes
Error message from syslog
May 17 07:31:06 localhost kernel: ipmi_si(SI_CHECK_BMC): Failed to get Global Enables 0xc6.
May 17 07:31:16 localhost hpasmxld: OsKcsExecCmd: IPMI NetFN 0x36 CMD: 0x2 has timed out!
May 17 07:31:46 localhost last message repeated 3 times
May 17 07:31:46 localhost hpasmxld: iLO 2 Communications Error - Attempting synchronization!
May 17 07:32:31 localhost hpasmxld: iLO 2 has responded to reset request . . .
May 17 07:32:31 localhost hpasmxld: Stopping the Watchdog Timer . . .
May 17 07:32:31 localhost hpasmxld: Resetting Internal Data structures . . .
May 17 07:32:31 localhost hpasmxld: Initializing Internal Data structures from iLO 2. . .
May 17 07:32:31 localhost hpasmxld: The iLO 2 reset / synchronization has completed successfully
May 17 07:32:31 localhost kernel: hpasmxld: segfault at 0000000000000031 rip 0000000000000031 rsp 00007fff20cc7808 error 4
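On a busy server these messages are easy to miss, so a small filter helps to pull them out of syslog. The following is only a sketch: the function name scan_ilo_errors is made up for illustration, the two patterns are simply the tags seen in the messages above, and /var/log/messages is assumed to be the syslog file.

```shell
#!/bin/sh
# Hypothetical helper: print syslog lines tagged with ipmi_si or hpasmxld.
# The default path is an assumption (the usual RHEL syslog file);
# pass another file as the first argument to override it.
scan_ilo_errors() {
    logfile="${1:-/var/log/messages}"
    grep -E 'ipmi_si|hpasmxld' "$logfile"
}
```

Running `scan_ilo_errors /var/log/messages` would print only the matching lines. Note that "last message repeated" lines carry no tag, so they are not matched by this filter.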
It seems iLO unexpectedly initiated Automatic Server Recovery (ASR) on the server, a behaviour that HP acknowledges in a support document. Nevertheless, I was really disappointed, because I could not find any ASR event in the log (based on hplog -v) for the time the error occurred. Hmm…
According to the solution given in that support document, HP advises customers to uninstall the hp-OpenIPMI package.
[root@localhost ~]# rpm -qi hp-OpenIPMI
Name        : hp-OpenIPMI                  Relocations: (not relocatable)
Version     : 8.0.0                        Vendor: Hewlett-Packard Company
Release     : 113.rhel5                    Build Date: Sat 24 Nov 2007 02:07:02 AM SGT
Install Date: Fri 13 Jun 2008 01:31:16 PM SGT   Build Host: rhel5e
Group       : System Environment/Kernel    Source RPM: hp-OpenIPMI-8.0.0-113.rhel5.src.rpm
Size        : 6860802                      License: GNU Public License
Signature   : (none)
Packager    : Hewlett-Packard Company
URL         : http://www.hp.com/linux
Summary     : OpenIPMI +HP
Description :
This is an upgraded version of the Open IPMI device driver that is shipped as part of the standard Linux kernel. This release is for Linux 2.6.18+ kernels. This provides support for PCI Based Base Management Controllers that are truly interrupt driven. This package will NOT activate on it's own. The drivers for this release are place in the /opt/hp/hp-OpenIPMI/bin with a script that can be used to launch the IPMI drivers. This has been done as the changes made to the IPMI drivers are expected to be included in future Linux kernels. The hp-OpenIPMI driver can be built for any kernel like any other GPL Open Source application. You need to load the appropriate kernel-devel (for Red Hat releases) package to do this.
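If I go with the uninstall, the removal could look roughly like this. It is a sketch and not the procedure from the support document: the wrapper name remove_hp_openipmi and the APPLY switch are made up for illustration. By default it only prints the commands; with APPLY=1 it first runs rpm -e --test, which goes through the motions without removing anything, so that any dependency complaints from other installed packages show up before the real erase.

```shell
#!/bin/sh
# Hypothetical wrapper around the uninstall step.
# Prints the rpm commands unless APPLY=1 is set in the environment.
remove_hp_openipmi() {
    pkg="hp-OpenIPMI"
    if [ "${APPLY:-0}" != "1" ]; then
        # Dry run: show what would be executed.
        echo "rpm -e --test $pkg"
        echo "rpm -e $pkg"
        return 0
    fi
    # Nothing to do if the package is not installed.
    rpm -q "$pkg" >/dev/null 2>&1 || { echo "$pkg is not installed"; return 0; }
    # Check for dependency problems first, then erase for real.
    rpm -e --test "$pkg" && rpm -e "$pkg"
}
```

Calling `remove_hp_openipmi` shows the two commands, and `APPLY=1 remove_hp_openipmi` (as root) performs the removal. Whether the HP agents need a restart afterwards is something I have not verified.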
Or should I disable ASR too? According to a discussion on ITRC, disabling ASR can also prevent this problem from recurring.