联系:QQ(5163721)
标题:OEL6.2 EXT4 filesystemio_options=SETALL造成archivelog坏块
作者:Lunar©版权所有[文章允许转载,但必须以链接方式注明源地址,否则追究法律责任.]
又踩了个坑……
今天同事告诉我前天调整一个数据库的参数,重启后,备库总报错:
Errors in file /u01/oracle/app/diag/rdbms/lunar1db3/lunar1db3/trace/lunar1db3_rfs_26241.trc: ORA-00272: error writing archive log /u01/oradata/arch/1_775047_804791836.arc ORA-00354: corrupt redo log block header RFS[3]: Possible network disconnect with primary database Errors in file /u01/oracle/app/diag/rdbms/lunar1db3/lunar1db3/trace/lunar1db3_rfs_26237.trc: ORA-00272: error writing archive log /u01/oradata/arch/1_775048_804791836.arc ORA-00354: corrupt redo log block header Errors in file /u01/oracle/app/diag/rdbms/lunar1db3/lunar1db3/trace/lunar1db3_rfs_26239.trc: ORA-00272: error writing archive log /u01/oradata/arch/1_775046_804791836.arc ORA-00354: corrupt redo log block header RFS[1]: Possible network disconnect with primary database RFS[2]: Possible network disconnect with primary database Tue Apr 14 20:29:08 2015 RFS[4]: Assigned to RFS process 26243 RFS[4]: Identified database type as 'physical standby': Client is ARCH pid 21348 RFS[4]: Opened log for thread 1 sequence 775051 dbid 148164954 branch 804791836 Tue Apr 14 20:29:09 2015 RFS[5]: Assigned to RFS process 26245 RFS[5]: Identified database type as 'physical standby': Client is ARCH pid 21354 Tue Apr 14 20:29:09 2015 RFS[6]: Assigned to RFS process 26247 RFS[6]: Identified database type as 'physical standby': Client is ARCH pid 21356 CORRUPTION DETECTED: In redo blocks starting at block 1count 2048 for thread 1 sequence 775051 RFS[5]: Opened log for thread 1 sequence 775049 dbid 148164954 branch 804791836 RFS[6]: Opened log for thread 1 sequence 775050 dbid 148164954 branch 804791836 CORRUPTION DETECTED: In redo blocks starting at block 1count 2048 for thread 1 sequence 775049 CORRUPTION DETECTED: In redo blocks starting at block 1count 2048 for thread 1 sequence 775050 Errors in file /u01/oracle/app/diag/rdbms/lunar1db3/lunar1db3/trace/lunar1db3_rfs_26243.trc: ORA-00272: error writing archive log /u01/oradata/arch/1_775051_804791836.arc ORA-00354: corrupt redo log block header Errors in file /u01/oracle/app/diag/rdbms/lunar1db3/lunar1db3/trace/lunar1db3_rfs_26245.trc: ORA-00272: error writing archive log /u01/oradata/arch/1_775049_804791836.arc ORA-00354: corrupt redo log block header Errors in file /u01/oracle/app/diag/rdbms/lunar1db3/lunar1db3/trace/lunar1db3_rfs_26247.trc: ORA-00272: error writing archive log /u01/oradata/arch/1_775050_804791836.arc ORA-00354: corrupt redo log block header RFS[4]: Possible network disconnect with primary database RFS[5]: Possible network disconnect with primary database RFS[6]: Possible network disconnect with primary database Tue Apr 14 20:29:12 2015 RFS[7]: Assigned to RFS process 26249 RFS[7]: Identified database type as 'physical standby': Client is ARCH pid 21348 RFS[7]: Opened log for thread 1 sequence 775052 dbid 148164954 branch 804791836 CORRUPTION DETECTED: In redo blocks starting at block 1count 2048 for thread 1 sequence 775052 Errors in file /u01/oracle/app/diag/rdbms/lunar1db3/lunar1db3/trace/lunar1db3_rfs_26249.trc: ORA-00272: error writing archive log /u01/oradata/arch/1_775052_804791836.arc ORA-00354: corrupt redo log block header RFS[7]: Possible network disconnect with primary database
初分析,这个错误有点怪异,有redo 头损坏,有“Possible network disconnect with primary database”
尝试clear online redo log和standby redo log,没用
尝试在主库重建控制文件,然后再直接重启备库,还是上面的错误。
冷静下来,感觉不对劲,检查主库和备库,发现主库归档日志是按照sequence顺序生成
备库则是断断续续的,有的可以从主库传过来,有的传不过来
手工传过来,APPLY还是报错:
SYS@ lunar1db3> recover standby database; ORA-00279: change 6194107977528 generated at 04/13/2015 12:28:54 needed for thread 1 ORA-00289: suggestion : /u01/oradata/arch/1_775046_804791836.arc ORA-00280: change 6194107977528 for thread 1 is in sequence #775046 20:49:02 Specify log: {<RET>=suggested | filename | AUTO | CANCEL} /u01/oradata/arch/1_775046_804791836.arc ORA-00283: recovery session canceled due to errors ORA-00354: corrupt redo log block header ORA-00353: log corruption near block 2048 change 6194107977662 time 04/13/2015 12:28:54 ORA-00334: archived log: '/u01/oradata/arch/1_775046_804791836.arc' ORA-01112: media recovery not started SYS@ lunar1db3>
感觉是arch异常了,而且貌似所有报错的,都是没有从主库传过来的
而所有没传过来的,都是损坏的,因此手工传过来也没用,因此抽取一个arch进行校验:
SYS@ lunar1db3> alter system dump logfile '/u01/oradata/arch/1_775046_804791836.arc' VALIDATE; alter system dump logfile '/u01/oradata/arch/1_775046_804791836.arc' VALIDATE * ERROR at line 1: ORA-00354: corrupt redo log block header ORA-00353: log corruption near block 2048 change 6194107977662 time 04/13/2015 12:28:54 ORA-00334: archived log: '/u01/oradata/arch/1_775046_804791836.arc' Elapsed: 00:00:01.08 SYS@ lunar1db3>
结果显示,确实arch损坏了。
.
根据问题发生的时间,想起来那天调整数据库参数的时候,有一个filesystemio_options=SETALL
也就是调整文件系统AIO方式的,去年在公司还给大家发邮件说起来“ext4上不要使用 filesystemio_options=SETALL”
否则会造成数据库坏块,没想到今年自己被坑了……(哇哇大哭啊……)
具体可以参见Oracle 文档:
ORA-1578 ORA-353 ORA-19599 Corrupt blocks with zeros when filesystemio_options=SETALL on ext4 file system using Linux (Doc ID 1487957.1)
.
大概是说,在下面的版本上,如果使用ext4,那么设置了filesystemio_options=SETALL可能会造成数据库坏块:
- for RHEL5 kernel-2.6.18-238.el5 - RHEL5.6 Errata RHSA-2011-0017 or later ( [fs] ext4: move aio completion after unwritten extent con (Eric Sandeen) [617690] ) - for RHEL6 you need to have kernel-2.6.32-71 and later ( [fs] ext4: move aio completion after unwritten extent conversion (Christoph Hellwig) [589985] )
检查了一下我们的系统,果然命中:
[oracle@lunardb1 ~]$ uname -a Linux lunardb1.800best.com 2.6.32-300.3.1.el6uek.x86_64 #1 SMP Fri Dec 9 18:57:35 EST 2011 x86_64 x86_64 x86_64 GNU/Linux [oracle@lunardb1 ~]$ cat /etc/issue Oracle Linux Server release 6.2 Kernel \r on an \m [oracle@lunardb1 ~]$ mount /dev/mapper/vg_lunar1adg01-lv_root on / type ext4 (rw) proc on /proc type proc (rw) sysfs on /sys type sysfs (rw) devpts on /dev/pts type devpts (rw,gid=5,mode=620) tmpfs on /dev/shm type tmpfs (rw) /dev/sda1 on /boot type ext4 (rw) /dev/mapper/vg_lunar1adg01-LogVol02 on /u01 type ext4 (rw) /dev/mapper/vg_lunar1data01-oradata on /u01/oradata type ext4 (rw) /dev/mapper/vg_lunar1data01-tmpdata on /tmpdata type ext4 (rw) none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw) sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw) [oracle@lunardb1 ~]$
后悔呀,上周类似的一个系统同样的一套参数调整,怎么没问题呢?
查看了一下,因为是ext3…………
[oracle@lunardb2 ~]$ uname -a Linux lunardb2.800best.com 2.6.18-308.el5 #1 SMP Fri Jan 27 17:17:51 EST 2012 x86_64 x86_64 x86_64 GNU/Linux [oracle@lunardb2 ~]$ cat /etc/issue Red Hat Enterprise Linux Server release 5.8 (Tikanga) Kernel \r on an \m [oracle@lunardb2 ~]$ [oracle@lunardb2 ~]$ mount /dev/sda1 on / type ext3 (rw) proc on /proc type proc (rw) sysfs on /sys type sysfs (rw) devpts on /dev/pts type devpts (rw,gid=5,mode=620) tmpfs on /dev/shm type tmpfs (rw) /dev/mapper/vg00-lv00 on /tmpdata type ext3 (rw) /dev/mapper/vg00-lv01 on /hddata type ext3 (rw) /dev/sda3 on /u01 type ext3 (rw) none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw) sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw) [oracle@lunardb2 ~]$
没办法,申请业务,短期内,只能修改数据库参数回到filesystemio_options=none,然后重启主库
至于备库,只能重建了……
.
dd了两个好的归档日志和1个损坏的归档日志,对比了一下,发现有一个关键标示位,确实写错了:
顺便说一下,这个bug在下面的版本已经被fixed了:
Updated kernel to version kernel-uek-2.6.39-200.29.3.el6uek