Last weekend a Sun E4500 database server went down: a hardware fault made it overheat. So how hot did it get?
[ID 110001 kern.warning] WARNING: SBus FFB SOC+ IO board 1 is very hot (temperature: 68C)
[ID 516145 kern.warning] WARNING: System shutdown scheduled in 20 seconds due to
 over-temperature condition on SBus FFB SOC+ IO board 1
[ID 350302 kern.notice] NOTICE: SBus FFB SOC+ IO board 1 is cooling (temperature: 67C)
[ID 538492 kern.notice] NOTICE: System shutdown due to over-temperature condition cancelled
[ID 110001 kern.warning] WARNING: SBus FFB SOC+ IO board 1 is very hot (temperature: 68C)
[ID 516145 kern.warning] WARNING: System shutdown scheduled in 20 seconds due to
 over-temperature condition on SBus FFB SOC+ IO board 1
[ID 350302 kern.notice] NOTICE: SBus FFB SOC+ IO board 1 is cooling (temperature: 67C)
[ID 538492 kern.notice] NOTICE: System shutdown due to over-temperature condition cancelled
[ID 110001 kern.warning] WARNING: SBus FFB SOC+ IO board 1 is very hot (temperature: 68C)
[ID 516145 kern.warning] WARNING: System shutdown scheduled in 20 seconds due to
 over-temperature condition on SBus FFB SOC+ IO board 1
[ID 350302 kern.notice] NOTICE: SBus FFB SOC+ IO board 1 is cooling (temperature: 67C)
[ID 538492 kern.notice] NOTICE: System shutdown due to over-temperature condition cancelled
[ID 110001 kern.warning] WARNING: SBus FFB SOC+ IO board 1 is very hot (temperature: 68C)
[ID 516145 kern.warning] WARNING: System shutdown scheduled in 20 seconds due to
 over-temperature condition on SBus FFB SOC+ IO board 1
[ID 350302 kern.notice] NOTICE: SBus FFB SOC+ IO board 1 is cooling (temperature: 67C)
[ID 538492 kern.notice] NOTICE: System shutdown due to over-temperature condition cancelled
[ID 110001 kern.warning] WARNING: SBus FFB SOC+ IO board 1 is very hot (temperature: 68C)
[ID 516145 kern.warning] WARNING: System shutdown scheduled in 20 seconds due to
 over-temperature condition on SBus FFB SOC+ IO board 1
[ID 470940 kern.warning] WARNING: SBus FFB SOC+ IO board 1 still too hot (temperature: 68C).
 Overtemp shutdown started
When the system shut down, the temperature had reached 68°C. On a cold winter day like this, that is really quite cozy.
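To answer "how hot did it get" without reading every line by eye, the readings can be pulled out of the messages. The following is only a minimal sketch in Python: the regular expression is inferred from the three message shapes quoted above ("is very hot", "is cooling", "still too hot"), and the function name `parse_overtemp` is made up for illustration, not part of any Solaris tool.

```python
import re

# Assumption: pattern inferred from the kernel messages quoted in this post.
TEMP_RE = re.compile(
    r"(WARNING|NOTICE): (.+?) (is very hot|is cooling|still too hot) "
    r"\(temperature: (\d+)C\)"
)

def parse_overtemp(lines):
    """Return (level, board, state, temp_c) tuples for matching log lines."""
    events = []
    for line in lines:
        m = TEMP_RE.search(line)
        if m:
            level, board, state, temp = m.groups()
            events.append((level, board, state, int(temp)))
    return events

# Three lines copied from the log above.
sample = [
    "[ID 110001 kern.warning] WARNING: SBus FFB SOC+ IO board 1 is very hot (temperature: 68C)",
    "[ID 350302 kern.notice] NOTICE: SBus FFB SOC+ IO board 1 is cooling (temperature: 67C)",
    "[ID 470940 kern.warning] WARNING: SBus FFB SOC+ IO board 1 still too hot (temperature: 68C).",
]

events = parse_overtemp(sample)
peak = max(temp for _, _, _, temp in events)  # highest reading seen
print(peak)
```

Lines that carry no temperature reading, such as the "System shutdown scheduled" messages, simply do not match and are skipped.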
A check after the reboot showed that one of the IO boards had failed:
bash-2.03# /usr/platform/sun4u/sbin/prtdiag -v
System Configuration:  Sun Microsystems  sun4u 8-slot Sun Enterprise E4500/E5500
System clock frequency: 100 MHz
Memory size: 2048Mb
========================= CPUs =========================
                    Run   Ecache   CPU    CPU
Brd  CPU   Module   MHz     MB    Impl.   Mask
---  ---  -------  -----  ------  ------  ----
 0     0     0      400     8.0   US-II    10.0
 0     1     1      400     8.0   US-II    10.0
 2     4     0      400     8.0   US-II    10.0
 2     5     1      400     8.0   US-II    10.0
 4     8     0      400     8.0   US-II    10.0
 4     9     1      400     8.0   US-II    10.0
========================= Memory =========================
                                              Intrlv.  Intrlv.
Brd   Bank   MB    Status   Condition  Speed   Factor   With
---  -----  ----  -------  ----------  -----  -------  -------
 0     0    1024   Active      OK       60ns    2-way     A
 2     0    1024   Active      OK       60ns    2-way     A
========================= IO Cards =========================
     Bus   Freq
Brd  Type  MHz   Slot        Name                          Model
---  ----  ----  ----------  ----------------------------  --------------------
 1   SBus   25            0  SUNW,socal/sf (scsi-3)        501-5266           
 1   SBus   25            3  SUNW,hme                                         
 1   SBus   25            3  SUNW,fas/sd (block)                              
 1   SBus   25           13  SUNW,socal/sf (scsi-3)        501-3060           
 1   UPA   100            2  FFB, Double Buffered          SUNW,501-4790      
Detached Boards
===============
  Slot  State       Type           Info
  ----  ---------   ------         -----------------------------------------
    3      failed   disk           Disk 0: no disk      Disk 1: no disk      
Failed Field Replaceable Units (FRU) in System:
==============================================
disk-board unavailable on IO Board #3
 PROM fault string: fail
 Failed Field Replaceable Unit is IO Board 3
Detected System Faults
======================
Board 1 fault: Overtemp
        Detected Sat Dec 16 02:24:21 2006
Unit 2 Core Power Supply failure
        Detected Fri Dec 15 23:24:23 2006
Unit 1 Core Power Supply failure
        Detected Fri Dec 15 23:24:23 2006
PROM detected failure
        Detected Fri Dec 15 23:24:23 2006
Most recent AC Power Failure:
=============================
Fri May 27 14:53:06 2005
========================= Environmental Status =========================
Keyswitch position is in Normal Mode
System Power Status: Minimum Available
System LED Status:    GREEN     YELLOW     GREEN
WARNING                ON        ON        BLINKING
Fans:
-----
Unit   Status
----   ------
Rack    OK
Key     OK
AC      OK
System Temperatures (Celsius):
------------------------------
Brd   State   Current  Min  Max  Trend
---  -------  -------  ---  ---  -----
 0      OK       39     36   43  stable
 1   WARNING     66     46   67  stable
 2      OK       39     36   43  stable
 4      OK       53     50   55  stable
CLK     OK       38     37   40  stable
Power Supplies:
---------------
Supply                        Status
---------                     ------
0                                OK
1                                FAIL
2                                FAIL
3                                OK
PPS                              OK
    System 3.3v                  OK
    System 5.0v                  OK
    Peripheral 5.0v              OK
    Peripheral 12v               OK
    Auxilary 5.0v                OK
    Peripheral 5.0v precharge    OK
    Peripheral 12v precharge     OK
    System 3.3v precharge        OK
    System 5.0v precharge        OK
AC Power                         OK
========================= HW Revisions =========================
ASIC Revisions:
---------------
Brd  FHC  AC  SBus0  SBus1  PCI0  PCI1  FEPS  Board Type      Attributes
---  ---  --  -----  -----  ----  ----  ----  ----------      ----------
 0    1    5                                  CPU             100MHz Capable
 1    1    5           1                 22   UPA-SBus-SOC+   100MHz Capable
 2    1    5                                  CPU             100MHz Capable
 3                                            Unknown         100MHz Capable
 4    1    5                                  CPU             100MHz Capable
Board 1 FFB Hardware Configuration:
-----------------------------------
        Board rev: 2
        FBC version: 0x3241906d
        DAC: Brooktree 9070, version 1
        3DRAM: Mitsubishi 130b, version 2
System Board PROM revisions:
----------------------------
Board  0:   OBP   3.2.29 2001/06/18 17:28   POST  3.9.29 2001/06/18 17:50
Board  1:   FCODE 1.8.29 2001/06/18 17:26   iPOST 3.4.29 2001/06/18 17:49
Board  2:   OBP   3.2.29 2001/06/18 17:28   POST  3.9.29 2001/06/18 17:50
Board  4:   OBP   3.2.29 2001/06/18 17:28   POST  3.9.29 2001/06/18 17:50
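The part of this output that matters most here is the "System Temperatures" table, where board 1 is the only row not reporting OK. As a rough sketch, assuming the six-column layout shown above (Brd, State, Current, Min, Max, Trend), a few lines of Python can pick out such rows from saved `prtdiag -v` output. The function name `hot_boards` is hypothetical, and other platforms or Solaris releases may lay the table out differently.

```python
def hot_boards(prtdiag_text):
    """Return (board, state, current_c) for temperature rows whose State is not OK."""
    rows = []
    in_table = False
    for line in prtdiag_text.splitlines():
        if line.startswith("System Temperatures"):
            in_table = True
            continue
        if not in_table:
            continue
        fields = line.split()
        # Data rows have six columns with a numeric Current value.
        if len(fields) == 6 and fields[2].isdigit():
            if fields[1] != "OK":
                rows.append((fields[0], fields[1], int(fields[2])))
        elif fields and not line.startswith(("-", "Brd")):
            break  # any other non-empty line means the table has ended
    return rows

# Excerpt copied from the prtdiag output above.
sample = """\
System Temperatures (Celsius):
------------------------------
Brd   State   Current  Min  Max  Trend
---  -------  -------  ---  ---  -----
 0      OK       39     36   43  stable
 1   WARNING     66     46   67  stable
 2      OK       39     36   43  stable
 4      OK       53     50   55  stable
CLK     OK       38     37   40  stable
Power Supplies:
"""

flagged = hot_boards(sample)
print(flagged)
```

On this excerpt only board 1 is flagged, in the WARNING state at 66°C.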
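What makes it worse is that this server is in a critical production period right now, so it cannot be restarted to replace the hardware.
All I can do is wait for the next time it goes down.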
-The End-