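Problem description:
On February 26, the project manager at a customer site phoned to report that the site's Oracle RAC database had gone down. The database had already been restored to normal operation, but we were asked to analyze the failure and find its cause.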
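Log analysis:
After arriving on site and coordinating with the local staff, I logged in to the Oracle database server that had failed. It is an Oracle 11.2.0.4 RAC cluster. I first used the adrci log analyzer to locate the grid failure information, ORA-15335. The outage occurred at 02-24 17:16:00.
ORA-15335 indicates that ASM detected metadata corruption in a disk group, which interrupted access to ASM storage. The alert log shows that the ASM disk group "DATA" was forcibly dismounted.
The accompanying message, invalid ASM block header, means that the ASM disk header was corrupted.
The Oracle alert log shows the same thing:
According to the alert log, the database completed recovery and started normally at 02-24 23:30: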
Starting background process RSMN
Wed Feb 24 23:30:12
RSMN started with pid=54, OS id=28694
ORACLE_BASE not set in environment. It is recommended
that ORACLE_BASE be set in the environment
Reusing ORACLE_BASE from an earlier startup = /u01/app/oracle
Wed Feb 24 23:30:13
ALTER DATABASE MOUNT /* db agent *//* {2:26963:2} */
NOTE: Loaded library: System
SUCCESS: diskgroup DATA was mounted
NOTE: dependency between database orcl and diskgroup resource ora.DATA.dg is established
Wed Feb 24 23:30:25
Successful mount of redo thread 2, with mount id 65798379
Wed Feb 24 23:30:25
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
Lost write protection disabled
Completed: ALTER DATABASE MOUNT /* db agent *//* {2:26963:2} */
ALTER DATABASE OPEN /* db agent *//* {2:26963:2} */
Picked broadcast on commit scheme to generate SCNs
SUCCESS: diskgroup REDO was mounted
Wed Feb 24 23:30:27
NOTE: dependency between database orcl and diskgroup resource ora.REDO.dg is established
Thread 2 opened at log sequence 62641
Current log# 14 seq# 62641 mem# 0: +REDO/redo10.log
Successful open of redo thread 2
MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set
Wed Feb 24 23:30:27
SMON: enabling cache recovery
[28735] Successfully onlined Undo Tablespace 5.
Undo initialization finished serial:0 start:10506384 end:10
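The recovery timeline can be read out of an excerpt like the one above mechanically. The following is a minimal Python sketch, not an Oracle tool. It assumes 11g-style timestamp lines without a year (for example `Wed Feb 24 23:30:13`) and that both the outage and the recovery fall on the same day, as reported here. The helper names `diskgroup_mounts` and `seconds_between` are hypothetical.

```python
import re
from datetime import datetime

# 11g-style alert log timestamp line, e.g. "Wed Feb 24 23:30:13" (no year).
TS_RE = re.compile(
    r"^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) [A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}$"
)
MOUNT_RE = re.compile(r"^SUCCESS: diskgroup (\S+) was mounted$")


def diskgroup_mounts(lines):
    """Pair each successful disk group mount with the latest timestamp line."""
    current_ts = None
    events = []
    for raw in lines:
        line = raw.strip()
        if TS_RE.match(line):
            current_ts = line
            continue
        match = MOUNT_RE.match(line)
        if match:
            events.append((current_ts, match.group(1)))
    return events


def seconds_between(start_hms, end_hms):
    """Elapsed seconds between two HH:MM:SS times on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end_hms, fmt) - datetime.strptime(start_hms, fmt)
    return int(delta.total_seconds())


# A few lines taken from the excerpt above.
SAMPLE = """\
Wed Feb 24 23:30:13
NOTE: Loaded library: System
SUCCESS: diskgroup DATA was mounted
Wed Feb 24 23:30:25
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE)
SUCCESS: diskgroup REDO was mounted
Wed Feb 24 23:30:27
"""

for ts, group in diskgroup_mounts(SAMPLE.splitlines()):
    print(ts, group)

# Time from the reported outage (17:16:00) to the last timestamp in the excerpt.
print(seconds_between("17:16:00", "23:30:27"), "seconds")
```

By this measure, the interval from the reported outage time to the last timestamp in the excerpt is 6 hours, 14 minutes and 27 seconds.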
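How exactly the ASM disk header came to be corrupted could not be determined.
However, staff from the company that carried out the repair said that the attributes of the ASM disk files on the Oracle server had been changed to PVS attributes, so Oracle could no longer read the ASM header and treated the disk group as corrupted.
The following analysis is based on screenshots provided by the DBA of the third-party operations company:
Checking the ASM disk group information showed that someone had changed the attributes of the physical files backing the ASM disk group to PVS file attributes. ASM uses raw devices, whereas PVS is a file system format, so Oracle could not recognize the disk.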
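A check of this kind can be sketched in a few lines of Python. This is only an illustration under a stated assumption: it treats "PVS attributes" as meaning that an LVM physical-volume label was written onto the device, which the report does not confirm. It relies on two on-disk signatures: an ASM disk header carries the string ORCLDISK at byte offset 32, and an LVM2 label starts with LABELONE at the beginning of one of the first four 512-byte sectors. The function names are hypothetical, and reading a real device requires sufficient privileges.

```python
ASM_MAGIC = b"ORCLDISK"   # provisioning string in an ASM disk header
ASM_MAGIC_OFFSET = 32     # it follows the 32-byte block header
LVM_MAGIC = b"LABELONE"   # LVM2 label signature (assumed meaning of "PVS")
SECTOR = 512


def classify_header(buf):
    """Return 'lvm', 'asm' or 'unknown' for the first bytes of a device."""
    # Look for an LVM2 label at the start of each of the first four sectors.
    for sector in range(4):
        start = sector * SECTOR
        if buf[start:start + len(LVM_MAGIC)] == LVM_MAGIC:
            return "lvm"
    if buf[ASM_MAGIC_OFFSET:ASM_MAGIC_OFFSET + len(ASM_MAGIC)] == ASM_MAGIC:
        return "asm"
    return "unknown"


def classify_device(path):
    """Read the first four sectors of a device or image file, read-only."""
    with open(path, "rb") as f:
        return classify_header(f.read(4 * SECTOR))
```

For example, `classify_device("/dev/asm-datab")` would return "asm" for a healthy ASM disk and "lvm" if an LVM label had been written over it. The check reads only and never modifies the device.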
The ASM disk header was repaired with kfed repair /dev/asm-datab, after which the database started successfully.
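Because a header repair writes to the disk, it is prudent to save a copy of the header region first. The report does not say whether this was done. The sketch below is one hedged way to do it in Python. The function name `backup_header` and the default size of 1 MiB are assumptions chosen for illustration, and the demonstration runs against a scratch file, not a real device.

```python
import os
import tempfile


def backup_header(device_path, backup_path, size=1024 * 1024):
    """Copy the first `size` bytes of device_path to backup_path.

    Returns the number of bytes copied. The source is opened read-only.
    """
    with open(device_path, "rb") as src, open(backup_path, "wb") as dst:
        data = src.read(size)
        dst.write(data)
    return len(data)


# Demonstration on a scratch file standing in for the device.
with tempfile.TemporaryDirectory() as tmp:
    fake_disk = os.path.join(tmp, "fake-disk.img")
    with open(fake_disk, "wb") as f:
        f.write(b"\x00" * 32 + b"ORCLDISK" + b"\x00" * 4056)
    copied = backup_header(fake_disk, os.path.join(tmp, "header.bak"), size=4096)
    print(copied)
```

On a real system the first argument would be the device path, such as /dev/asm-datab, and the backup should be stored on a different disk.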
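Summary and recommendations:
This outage was caused either by a change to the attributes of Oracle's storage files or by corruption of an ASM disk header. The Oracle RAC database holds roughly 12 TB of data, which is a large volume, so regular physical backups are recommended. I also recommend adding an Oracle Data Guard standby so that two copies of the data are available. Without one, a catastrophic storage failure would be unrecoverable. In addition, according to the data center staff, the server uses RAID 6 with 7200 rpm disks, and I/O performance on this storage is very poor. Upgrading to 10,000 to 15,000 rpm disks is recommended. Otherwise database reads and writes will remain slow and affect the business.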
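The effect of spindle speed and RAID level can be made concrete with some back-of-the-envelope arithmetic. The Python sketch below is a rough model, not a measurement of this system. Average rotational latency is half a revolution, and RAID 6 is commonly modeled with a write penalty of 6 (read data and both parity blocks, then write all three). The raw IOPS figure of 1000 and the 30% write mix are hypothetical values, since the report gives neither the disk count nor the workload.

```python
RAID6_WRITE_PENALTY = 6  # read data + P + Q, then write data + P + Q


def avg_rotational_latency_ms(rpm):
    """Average rotational latency: the time of half a revolution, in ms."""
    return 0.5 * 60_000.0 / rpm


def effective_iops(raw_iops, write_fraction, write_penalty=RAID6_WRITE_PENALTY):
    """Host-visible IOPS for a raw back-end IOPS budget and a write mix."""
    return raw_iops / ((1.0 - write_fraction) + write_fraction * write_penalty)


for rpm in (7200, 10000, 15000):
    print(rpm, "rpm:", round(avg_rotational_latency_ms(rpm), 2), "ms")

# Hypothetical array: 1000 raw back-end IOPS with 30% writes.
print(round(effective_iops(1000, 0.3)), "host IOPS")
```

Under this model, moving from 7200 rpm to 15,000 rpm roughly halves the rotational latency (about 4.17 ms to 2 ms), while the RAID 6 write penalty reduces the hypothetical 1000 raw IOPS to about 400 host-visible IOPS at a 30% write mix.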