Tuesday 22 November 2011

Instance crash recovery, Why,When,How in Oracle 10g

Whenever, Accidently power failure causes server shutdown, Shutdown abort and any other reason that cause database to go down in inconsistent way requires instance recovery. Before we go deep into this we need to understand working of SMON and SCN.


What is this SMON:- SMON is one of the MUST HAVE back ground process. And this is the process that recovers database from instance crash.  SMON checks the SCN in all the data files headers and compare this with SCN stored in Control file. If these two matches means database is in consistent state and can be opened.

What is System Change Number (SCN):- This is the number  (always increasing) incremented whenever a Commit has happened or  AQ, SMON,PMON has completed their work. To get current SCN we need to use DBMS_FLASHBACK.Get_System_Change_Number. SCN is also used in Flashback Query

Eg Select * from scott.emp as of scn 1232097
SCN is checked whenever Oracle starts up the database. SMON compares the SCN of Checkpoint_Change#(v$database.Chechpoint_Change# with v$datafile.checkpoint_change# and v$datafile_header.checkpoint_change#) and if all the SCN are same then database is in consistent state and can be opened.  
SCN plays a vital role for providing consistent reads. Basically, it works as follows: The query reads a db block. Beside other information block  also contains the information regarding when last time block was changed.. If this SCN is greater than the SCN that was in place when (our) query began, it means that the block was changed after we have started our query. So we have to find an older version of the block. If this block is found in the rollback segments, we use it for our query.

The v$ views use change# to report SCN values. Hence, it can be argued that SCN means System Change Number, not System Commit Number.
When exporting, an SCN can be specified with the exp_flashback_scn export parameter.

More on SCN
There are 3 SCNs basically in control file
1.            Checkpoint SCN
2.            Stop SCN
3.            Thread checkpoint SCN

Checkpoint SCN is the datafile checkpoint SCN when checkpoint happens for datafile. This checkpoint SCN is recorded in datafile header as well.

Stop SCN is the SCN which gets recoreded in control file when datafile is taken in begin backup mode or when datafile is taken offline. This is the checkpoint at a point when datafile header is freezed.

Thread Checkpoint SCN is the one related to online redo log files. This SCN gets generated when ever transaction get recoreded in online redo log file. When we shut down database with normal or immediate option, all these SCN are synchronized and made equal.
/

Eg.
SQL> select banner from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Prod
PL/SQL Release 10.2.0.1.0 - Production
CORE    10.2.0.1.0      Production
TNS for 32-bit Windows: Version 10.2.0.1.0 - Production
NLSRTL Version 10.2.0.1.0 – Production



Now! Check the Database checkpoint  change number using V$DATABASE. With this number (CHECKPOINT_CHAGNE#) , all the number (CHECKPOINT_CHANGE#) of datafile header should match.



SQL> select name,checkpoint_change# from v$database;




NAME      CHECKPOINT_CHANGE#
---------         ------------------
SEED                 1151635


SQL> select name,checkpoint_change# from v$datafile;
Name                                                                                                    Checkpoint_Change#        Last_Change#
-------------------                                                                           --------------------------------        --------------------------
D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\SYSTEM01.DBF                   1151635  1151635
D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\UNDOTBS01.DBF                               1151635  1151635
D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\SYSAUX01.DBF                   1151635  1151635
D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\USERS01.DBF                      1151635  1151635
D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\EXAMPLE01.DBF                                1151635  1151635
D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\RMAN_TBS01.DBF            1151635  1151635

Now, The CHECKPOINT_CHANGE# of Database and Datafile Header are same and more importantly Start SCN (Checkpoint_change#) and Stop SCN (Last_Change#) are same means START and STOP SCN are same and this confirms that our database is consistent state and does not requires any kind of recovery.
So when you use shutdown immediate or Shutdown normal, Database goes down with consistent state. However, this is not the case with “Shutdown Abort” statement.

To understand fully, We need to understand actual working of “Shutdown” process,  SCN ,Checkpoint_Change#, Last_Change#.

Shutdown Process:
Shutdown Immediate/Shutdown Normal: This is the most clean method to take database down. Using this statement, DB goes down in consistent state. Following are the stages:

Shutdown Immediate:

SQL> select open_mode from v$database;

OPEN_MODE
----------
READ WRITE




SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.

Internally, First Oracle closes all the Background (SMON, PMON, CHKPT, DBWR, CJQ0,MMNL,MMON) processes . Secondly if Archival is enabled then it stops. Thirdly, All  uncommitted transaction are rolled back, further logons stops. Checkpoint  happens, All the blocks ( Changed, Committed)  are flushed to Hard disk.  If any trigger (System Trigger) is written on BEFORE SHUTDOWN system event than it fires.
Instances withers away, All Locks, Enaques  (TX) are released. “ALTER DATABASE CLOSE NORMAL” and “ALTER DATABASE DISMOUNT” statements are executed. After all these happenings our database is now SHUT.
In this scenario, Before closing down Oracle does CHECKPOINT that makes database state consistent. Means no Online REDO logs will be required to open database. Once you open the database Checkpoint_Change#  of Database (Controlfile),Datafile(s) and Datafile_Headers will be same.

Checkpoint change number of controlfile.

SQL> Select Checkpoint_Change#,name from v$database;
CHECKPOINT_CHANGE#                                               NAME
----------------------------                                 -------------
1202269                                                                SEED

Checkpoint change number of Datafile(s).

SQL>Select Checkpoint_Change#,name,last_change# from v$datafile;
CHECKPOINT_CHAGNE#   NAME                                                                                                                               Last_Change#  
---------------------                             ---------------------------------                                                        -------------------------------
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\SYSTEM01.DBF   1151635
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\UNDOTBS01.DBF               1151635
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\SYSAUX01.DBF   1151635
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\USERS01.DBF        1151635
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\EXAMPLE01.DBF                  1151635
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\RMAN_TBS01.DBF1151635

Checkpoint Change number in Header of Datafile(s).

SQL>Select dh.checkpoint_change#,df.name from v$datafile_header dh,v$datafile df
where df.file#=dh.file#





CHECKPOINT_CHAGNE#   NAME                                                               CHECKPOINT_CHAGNE#                                                                              
---------------------                             ---------------------------------      ----------------------------------------
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\SYSTEM01.DBF  
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\UNDOTBS01.DBF              
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\SYSAUX01.DBF  
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\USERS01.DBF     
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\EXAMPLE01.DBF               
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\RMAN_TBS01.DBF           

In this process, Every data file has the same CHECKPOINT_CHANGE# number as DATABASE has. So database does not need any kind of recovery (INSTANCE) recovery.

Now! See how SHUTDOWN ABORT just like a crash and why it needs recovery.


SQL> shutdown abort;
ORACLE instance shut down.

In  this scenario, Database has not done anything to make things consistent instead it is just crashed.
However, unlike Media recovery where we need not to apply Archived REDO logs all the changes stored in Online REDO Log files will be re-applied and instance will recover itself.
In Other words START SCN != STOP SCN so recovery is required.



SQL> startup mount;
ORACLE instance started.

Total System Global Area  612368384 bytes
Fixed Size                  1250428 bytes
Variable Size             230689668 bytes
Database Buffers          373293056 bytes
Redo Buffers                7135232 bytes
Database mounted.

SQL> Select name,Checkpoint_Change# from v$database;

NAME      CHECKPOINT_CHANGE#
---------                    ------------------
SEED                 1260508
SQL>Select Checkpoint_Change#,name,last_change# from v$datafile;
CHECKPOINT_CHAGNE#   NAME                                                                                                                               Last_Change#  
---------------------                             ---------------------------------                                                        -------------------------------
1260508                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\SYSTEM01.DBF                   Null
1260508                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\UNDOTBS01.DBF                              Null
1260508                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\SYSAUX01.DBF                   Null
1260508                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\USERS01.DBF                      Null
1260508                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\EXAMPLE01.DBF                               Null
1260508                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\RMAN_TBS01.DBF            Null


Next time DBA starts up the database, it with out asking any recovery will automatically apply contents of Online REDO log files (ROLLFORD) and will open the database.
SQL> startup
ORACLE instance started.

Total System Global Area  612368384 bytes
Fixed Size                  1250428 bytes
Variable Size             234883972 bytes
Database Buffers          369098752 bytes
Redo Buffers                7135232 bytes
Database mounted.
Database opened.


I have tried to put the instance recovery process in clear and precise words, but still if anything is missing then any correction, suggestion are most welcome.

Tuesday 15 November 2011

System datafile recovery

System datafile lost:-- In this scenario,Hardware failure (Media failure) causes loss of System01.dbf file. And when DBA tries to start the database gets following error:

SQL> startup
ORACLE instance started.
Total System Global Area  612368384 bytes
Fixed Size                  1250428 bytes
Variable Size             176163716 bytes
Database Buffers          427819008 bytes
Redo Buffers                7135232 bytes
Database mounted.
ORA-01157: cannot identify/lock data file 1 - see DBWR trace file
ORA-01110: data file 1: 'D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\SYSTEM01.DBF'

Now! To see any message/note regarding this we can check DBWR trace file located at
D:\oracle\product\10.2.0\admin\seed\bdump location. The naming convention of trace file is like
SID_DBWR_<srno>.trc. This trace file contains following message:
Dump file d:\oracle\product\10.2.0\admin\seed\bdump\seed_dbw0_1100.trc
Tue Nov 15 15:43:10 2011
ORACLE V10.2.0.1.0 - Production vsnsta=0
vsnsql=14 vsnxtr=3
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production
With the Partitioning, OLAP and Data Mining options
Windows Server 2003 Version V5.2 Service Pack 2
CPU                 : 2 - type 586, 1 Physical Cores
Process Affinity    : 0x00000000
Memory (Avail/Total): Ph:552M/2037M, Ph+PgF:1134M/3935M, VA:1302M/2047M
Instance name: seed

Redo thread mounted by this instance: 1

Oracle process number: 5

Windows thread id: 1100, image: ORACLE.EXE (DBW0)


*** SERVICE NAME:() 2011-11-15 15:43:10.613
*** SESSION ID:(167.1) 2011-11-15 15:43:10.613
ORA-01157: cannot identify/lock data file 1 - see DBWR trace file
ORA-01110: data file 1: 'D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\SYSTEM01.DBF'
ORA-27041: unable to open file
OSD-04002: unable to open file
O/S-Error: (OS 2) The system cannot find the file specified.

Recovery step in this scenario are:
1:- Take DBF offline;
2:- Restore DBF from Backup;
3:- Recover  datafile;
4:- Take DBF online.


Before we proceed with Recovery, We check the existence of backup of Datafile 1(System01.dbf).

C:\ rman target / catalog rman_user/rmRecovery Manager: Release 10.2.0.1.0 - Production on Tue Nov 15 16:56:22 2011

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

connected to target database: SEED (DBID=848131167, not open)
connected to recovery catalog databasean_user@seed_82

RMAN> list backup of datafile 1;
List of Backup Sets
===================

BS Key  Type LV Size       Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ ---------------
1248    Full    134.63M    DISK        00:00:41     15-NOV-11
        BP Key: 1249   Status: AVAILABLE  Compressed: YES  Tag: FULL_DB_BKUP
        Piece Name: D:\ORACLE\RMAN_BKUP\SEED\SEED_%R_01MRNTAO.BKP
  List of Datafiles in backup set 1248
  File LV Type Ckp SCN    Ckp Time  Name
  ---- -- ---- ---------- --------- ----
  1       Full 1150785    15-NOV-11 D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\SYSTEM01.DBF


Now! We can see that this particular backupset can be used for Recovery of System data file.

Step1:- Take System01.dbf (1) offline

RMAN> Sql "Alter Database Datafile 1 Offline";

sql statement: Alter Database Datafile 1 Offline

Step 2:- Restore Datafile 1 from backup

RMAN> Restore datafile 1;

Starting restore at 15-NOV-11
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=155 devtype=DISK

channel ORA_DISK_1: starting datafile backupset restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
restoring datafile 00001 to D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\SYSTEM01.DBF
channel ORA_DISK_1: reading from backup piece D:\ORACLE\RMAN_BKUP\SEED\SEED_%R_01MRNTAO.BKP
channel ORA_DISK_1: restored backup piece 1
piece handle=D:\ORACLE\RMAN_BKUP\SEED\SEED_%R_01MRNTAO.BKP tag=FULL_DB_BKUP
channel ORA_DISK_1: restore complete, elapsed time: 00:00:56
Finished restore at 15-NOV-11

Step 3:- Recover the datafile
RMAN> Recover datafile 1;

Starting recover at 15-NOV-11
using channel ORA_DISK_1

starting media recovery
media recovery complete, elapsed time: 00:00:02

Finished recover at 15-NOV-11
Step 4:- Take datafile online

RMAN> Sql "Alter database datafile 1 online";

sql statement: Alter database datafile 1 online

Note:- The timing of this recovery varies and depends upon the DB size,Number of Archive REDO log to apply.

Any comment/Suggestion will be welcome.