存储Cache 丢失导致数据库无法open的案例

本站文章除注明转载外，均为本站原创： 转载自love wife & love life —Roger的Oracle&MySQL技术博客

最近某客户的一套核心数据库由于存储问题导致清掉Cache之后无法启动。首先我们来看看数据库在启动的时候报什么错误：

Thu Sep 21 19:35:55 2017
WARNING: Write Failed. group:1 disk:3 AU:53436 offset:95744 size:1024
Errors in file /u01/app/oracle/diag/rdbms/ods/xxx2/trace/xxx2_lgwr_14636.trc:
ORA-15080: synchronous I/O operation to a disk failed
WARNING: failed to write mirror side 1 of virtual extent 43 logical extent 0 of file 265 in group 1 on disk 3 allocation unit 53436
Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx2/trace/xxx2_lgwr_14636.trc:
ORA-00345: redo log write error block 88251 count 2
ORA-00312: online log 3 thread 2: '+DATA/xxx/onlinelog/group_3.265.816035881'
ORA-15081: failed to submit an I/O operation to a disk
ORA-15081: failed to submit an I/O operation to a disk
Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx2/trace/xxx2_lgwr_14636.trc:
ORA-00340: IO error processing online log 3 of thread 2
ORA-00345: redo log write error block 88251 count 2
ORA-00312: online log 3 thread 2: '+DATA/xxx/onlinelog/group_3.265.816035881'
ORA-15081: failed to submit an I/O operation to a disk
ORA-15081: failed to submit an I/O operation to a disk
LGWR (ospid: 14636): terminating the instance due to error 340

错误并不复杂。可以看到Oracle这里已经无法正常写Redo logfile了。由于这套数据库是非归档，只有逻辑备份，因此即使恢复成功也面临数据丢失的可能性。首先我在尝试进行恢复时，发现居然无法mount数据库，在mount过程中实例被直接终止了，感觉非常奇怪。也没有报非常明显的错误。mount过程出错，那么无疑是controlfile存在异常；由于没有controlfile备份，因此这里先手工重建控制文件，如下是脚本:

CREATE CONTROLFILE REUSE DATABASE "XXX" RESETLOGS NOARCHIVELOG
MAXLOGFILES 16
MAXLOGMEMBERS 3
MAXDATAFILES 100
MAXINSTANCES 8
MAXLOGHISTORY 584
LOGFILE
GROUP 1 '+data/ods/ONLINELOG/group_1.257.816033845' SIZE 500M BLOCKSIZE 512,
GROUP 2 '+data/xxx/ONLINELOG/group_2.258.816033845' SIZE 500M BLOCKSIZE 512,
GROUP 3 '+data/xxx/ONLINELOG/group_3.265.816035881' SIZE 500M BLOCKSIZE 512,
GROUP 4 '+data/xxx/ONLINELOG/group_4.266.816035883' SIZE 500M BLOCKSIZE 512,
GROUP 5 '+data/xxx/ONLINELOG/group_5.275.816036347' SIZE 500M BLOCKSIZE 512,
GROUP 6 '+data/xxx/ONLINELOG/group_6.277.816036359' SIZE 500M BLOCKSIZE 512
DATAFILE
'+DATA/xxx/datafile/system.259.816033847',
'+DATA/xxx/datafile/sysaux.260.816033849',
'+DATA/xxx/datafile/undotbs1.261.816033851',
'+DATA/xxx/datafile/undotbs2.263.816033859',
'+DATA/xxx/datafile/users.264.816033859',
'+DATA/xxx/datafile/tbs_tbdata.278.816036381',
'+DATA/xxx/datafile/tbs_omdata.283.816036779',
'+DATA/xxx/datafile/tbs_cmdata.284.816036813',
'+DATA/xxx/datafile/tbs_dmdata.285.816036857',
'+DATA/xxx/datafile/tbs_dbetl.286.816036893',
'+DATA/xxx/datafile/tbs_schedule.287.816036909',
'+DATA/xxx/datafile/tbs_meast.288.816036915',
'+DATA/xxx/datafile/tbs_m1104.289.816036939',
'+DATA/xxx/datafile/tbs_mpisa.293.842192725',
'+DATA/xxx/datafile/tbs_mpfsc',
'+DATA/xxx/datafile/tbs_msafe',
'+DATA/xxx/datafile/tbs_mecsp',
'+DATA/xxx/datafile/tbs_mpbss',
'+DATA/xxx/datafile/tbs_mpbfc',
'+DATA/xxx/datafile/idx_cmdata'
CHARACTER SET ZHS16GBK;

重建完毕后。其实这里我首先尝试了进行noresetlogs创建，但是发现报错：

ORA-00600: internal error code, arguments: [2762], [4294967295], [1024000], [+DATA/xxx/onlinelog/group_3.265.816035881], [], [], [], [], [], [], [],
[]

很明显，Redo logfile有问题；看来还是只能Resetlogs方式创建。创建完毕之后，尝试进行了recover database using backup controlfile until cancel恢复操作；然后通过隐含参数强制open发现还是有如下错误：

Fri Sep 22 13:00:10 2017
SMON: enabling cache recovery
Instance recovery: looking for dead threads
Instance recovery: lock domain invalid but no dead threads
Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_ora_1593.trc (incident=120288):
ORA-00600: internal error code, arguments: [2662], [3], [3158008565], [3], [3159337219], [12582976], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/xxx/xxx1/incident/incdir_120288/xxx1_ora_1593_i120288.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_ora_1593.trc:
ORA-00600: internal error code, arguments: [2662], [3], [3158008565], [3], [3159337219], [12582976], [], [], [], [], [], []
Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_ora_1593.trc:
ORA-00600: internal error code, arguments: [2662], [3], [3158008565], [3], [3159337219], [12582976], [], [], [], [], [], []
Error 600 happened during db open, shutting down database
USER (ospid: 1593): terminating the instance due to error 600
Instance terminated by USER, pid = 1593
ORA-1092 signalled during: alter database open resetlogs...

这是非常经典的错误了；由于这是scn的问题；而且数据库版本为11.2.0.3.0，未安装任何psu。因此这里是可以直接推进scn的。

直接通过10015 event来推进数据库的scn；另外由于是异常关机，那么这里Undo必然也无法进行正常恢复；因此同时设置undo_management参数为manual，并同时设置10015 event：

alter session set events ‘10015 trace name adjust_scn level 2’;

顺利打开了数据库。打开数据库之后立刻重建数据库Undo和temp，如下：

create undo tablespace undo1 datafile '+data' size 2048m;
create undo tablespace undo2 datafile '+data' size 2048m;
drop tablespace undotbs1 including contents and datafiles;
drop tablespace undotbs2 including contents and datafiles;
alter tablespace temp add tempfile '+data/rep/tempfile/TEMP.276.816036349' reuse;
alter tablespace temp add tempfile '+data/ods/tempfile/temp1607202' reuse;
alter tablespace temp add tempfile '+data/ods/tempfile/temp1607203' reuse;

再次重启数据库之后，发现alert log仍然有一些错误；这是在所难免的。如下所示：

ORA-00604: error occurred at recursive SQL level 1
ORA-08102: index key not found, obj# 290, file 1, block 1029 (2)
ORA-12012: error on auto execute of job 4001
ORA-08102: index key not found, obj# 290, file 1, block 1029 (2)
Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_j003_8160.trc:
ORA-00604: error occurred at recursive SQL level 1
ORA-08102: index key not found, obj# 290, file 1, block 1029 (2)
ORA-12012: error on auto execute of job 4002
ORA-08102: index key not found, obj# 290, file 1, block 1029 (2)
Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_ora_8043.trc:
Fri Sep 22 13:30:40 2017
Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_ora_8043.trc:
Fri Sep 22 13:30:42 2017
Dumping diagnostic data in directory=[cdmp_20170922133042], requested by (instance=1, osid=8043), summary=[abnormal process termination].
Fri Sep 22 13:31:59 2017
Starting background process SMCO
Fri Sep 22 13:31:59 2017
SMCO started with pid=35, OS id=9375
Fri Sep 22 13:37:54 2017
Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_m000_10623.trc (incident=144379):
ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/xxx/xxx1/incident/incdir_144379/xxx1_m000_10623_i144379.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Fri Sep 22 13:37:55 2017
Dumping diagnostic data in directory=[cdmp_20170922133755], requested by (instance=1, osid=10623 (M000)), summary=[incident=144379].
Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_m000_10623.trc (incident=144380):
ORA-00600: internal error code, arguments: [kewrose_1], [600], [ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []
], [], [], [], [], [], [], [], [], []
Incident details in: /u01/app/oracle/diag/rdbms/xxx/xxx1/incident/incdir_144380/xxx1_m000_10623_i144380.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /u01/app/oracle/diag/rdbms/xxx/xxx1/trace/xxx1_m000_10623.trc:
ORA-00600: internal error code, arguments: [kewrose_1], [600], [ORA-00600: internal error code, arguments: [kdsgrp1], [], [], [], [], [], [], [], [], [], [], []
], [], [], [], [], [], [], [], [], []
Dumping diagnostic data in directory=[cdmp_20170922133757], requested by (instance=1, osid=10623 (M000)), summary=[incident=144380].
Fri Sep 22 13:37:58 2017
Sweep [inc][144380]: completed

实际上当时在进行恢复时，我手工处理掉了obj＃ 290；但是进一步检查发现obj$,col_usage$ ,i_obj4# 都存在问题。而且不一致的记录还比较多：

select /*+ index(t i_obj4) */ DATAOBJ#,type#,owner# from obj$ t
minus
select /*+ full(t1) */ DATAOBJ#,type#,owner# from obj$ t1;
DATAOBJ# TYPE# OWNER#
---------- ---------- ----------
1451154 2 90
1589557 1 92
1589558 2 92
1589573 2 100
1589574 2 100
1589575 2 100
1589576 2 100
1589577 2 100
1589578 2 100
1589579 2 100
1589580 2 100
1589581 2 100
1589582 2 100
1589583 2 100
1589584 2 100
1589585 2 100
1589586 2 100
1589587 2 100
1589588 2 100
1589589 2 100
1589590 2 100
1589591 2 100
1589592 2 100
1589593 2 100
1589594 2 100
1589595 2 100
1589596 2 100
1589597 2 100
1589598 2 100
1589599 2 100
1589600 2 100
1589601 2 100
1589602 2 100
1589603 2 100
1589604 2 100
1589607 2 100
1589612 0 0



select /*+ full(t1) */ DATAOBJ#,type#,owner# from obj$ t1
minus
select /*+ index(t i_obj4) */ DATAOBJ#,type#,owner# from obj$ t ；

DATAOBJ# TYPE# OWNER#
---------- ---------- ----------
1587659 2 100
1587660 2 100
1587661 2 100
1587662 2 100
1587663 2 100
1587664 2 100
1587665 2 100
1587666 2 100
1587667 2 100
1587668 2 100
1587669 2 100
1587670 2 100
1587671 2 100
1587672 2 100
1587673 2 100
1587674 2 100
1587675 2 100
1587676 2 100
1587677 2 100
1587678 2 100
1587679 2 100
1587680 2 100
1587681 2 100
1587682 2 100
1587683 2 100
1587684 2 100
1587685 2 100
1587687 2 100
1587688 2 100
1587689 2 100
1587690 2 100
1587691 2 100
1587692 2 100
1587695 1 92
1587696 2 92
1587716 2 90
1587717 2 90
1587718 2 90
1587719 2 90
1587720 2 90
1587721 2 90
1587722 2 90
1587723 2 90
1587724 2 90
1587725 2 90
1587726 2 90
1587727 2 90
1587728 2 90
1587729 2 90
1587730 2 90
1587732 2 90
1587733 2 90
1589527 0 0

最开始我还尝试通过bbed修复了2个Block；最后发现依然难以处理这个ora-08102错误；后续通过上述sql比较发现居然有如此多的记录不一致。修改起来太过麻烦了。

这里其实本来想尝试通过重建obj$,i_obj4$,col_usage$ 来解决的。但是担心有较大的风险,因此这里建议可以进行了数据库重建。由于obj$这里有问题，expdp操作都报错，无法执行任何ddl操作。因此最好通过exp拆分脚本来进行重建处理。整个数据库恢复重建过程将近20小时左右(2tb左右的库).

由于客户存储环境io较差，因此导致整个重建过程比较复杂，比较耗时。我们在开玩笑讲到，如果可能的数据库运行在我们的Zdata环境上，那么数据库重建过程在2小时内即可完成；而且也不会出现类似故障；因此Zdata的io操作上直接落盘或者写到Pcie上；不存在数据丢失的风险。

最后补充一点：

1) 由于数据库很多事务无法正常恢复，导致SMON在不断尝试进行事务恢复时报错，达到一定次数之后会crash实例；进而影响数据库的重建工作。可通过设置_smon_internal_errlimit 参数来避免该问题。

2) 为了加快exp和imp速度，这里我们利用了管道技术，脚本如下：

mknod /backup/omdata10
exp   \'/ as sysdba\' parfile=omdata.par file=/backup/omdata10 rows=y indexes=n compress=n direct=y recordlength=65536 buffer=52488000 feedback=100000 volsize=0 log=omdata_other.log;
imp  \'/ as sysdba\' file=/backup/omdata10 fromuser=omdata touser=omdata buffer=52488000

3D Cloud

QQ咨询

最近发表

Contributors