Tuesday, 22 November 2011

Instance crash recovery, Why,When,How in Oracle 10g

Whenever, Accidently power failure causes server shutdown, Shutdown abort and any other reason that cause database to go down in inconsistent way requires instance recovery. Before we go deep into this we need to understand working of SMON and SCN.


What is this SMON:- SMON is one of the MUST HAVE back ground process. And this is the process that recovers database from instance crash.  SMON checks the SCN in all the data files headers and compare this with SCN stored in Control file. If these two matches means database is in consistent state and can be opened.

What is System Change Number (SCN):- This is the number  (always increasing) incremented whenever a Commit has happened or  AQ, SMON,PMON has completed their work. To get current SCN we need to use DBMS_FLASHBACK.Get_System_Change_Number. SCN is also used in Flashback Query

Eg Select * from scott.emp as of scn 1232097
SCN is checked whenever Oracle starts up the database. SMON compares the SCN of Checkpoint_Change#(v$database.Chechpoint_Change# with v$datafile.checkpoint_change# and v$datafile_header.checkpoint_change#) and if all the SCN are same then database is in consistent state and can be opened.  
SCN plays a vital role for providing consistent reads. Basically, it works as follows: The query reads a db block. Beside other information block  also contains the information regarding when last time block was changed.. If this SCN is greater than the SCN that was in place when (our) query began, it means that the block was changed after we have started our query. So we have to find an older version of the block. If this block is found in the rollback segments, we use it for our query.

The v$ views use change# to report SCN values. Hence, it can be argued that SCN means System Change Number, not System Commit Number.
When exporting, an SCN can be specified with the exp_flashback_scn export parameter.

More on SCN
There are 3 SCNs basically in control file
1.            Checkpoint SCN
2.            Stop SCN
3.            Thread checkpoint SCN

Checkpoint SCN is the datafile checkpoint SCN when checkpoint happens for datafile. This checkpoint SCN is recorded in datafile header as well.

Stop SCN is the SCN which gets recoreded in control file when datafile is taken in begin backup mode or when datafile is taken offline. This is the checkpoint at a point when datafile header is freezed.

Thread Checkpoint SCN is the one related to online redo log files. This SCN gets generated when ever transaction get recoreded in online redo log file. When we shut down database with normal or immediate option, all these SCN are synchronized and made equal.
/

Eg.
SQL> select banner from v$version;

BANNER
----------------------------------------------------------------
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Prod
PL/SQL Release 10.2.0.1.0 - Production
CORE    10.2.0.1.0      Production
TNS for 32-bit Windows: Version 10.2.0.1.0 - Production
NLSRTL Version 10.2.0.1.0 – Production



Now! Check the Database checkpoint  change number using V$DATABASE. With this number (CHECKPOINT_CHAGNE#) , all the number (CHECKPOINT_CHANGE#) of datafile header should match.



SQL> select name,checkpoint_change# from v$database;




NAME      CHECKPOINT_CHANGE#
---------         ------------------
SEED                 1151635


SQL> select name,checkpoint_change# from v$datafile;
Name                                                                                                    Checkpoint_Change#        Last_Change#
-------------------                                                                           --------------------------------        --------------------------
D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\SYSTEM01.DBF                   1151635  1151635
D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\UNDOTBS01.DBF                               1151635  1151635
D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\SYSAUX01.DBF                   1151635  1151635
D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\USERS01.DBF                      1151635  1151635
D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\EXAMPLE01.DBF                                1151635  1151635
D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\RMAN_TBS01.DBF            1151635  1151635

Now, The CHECKPOINT_CHANGE# of Database and Datafile Header are same and more importantly Start SCN (Checkpoint_change#) and Stop SCN (Last_Change#) are same means START and STOP SCN are same and this confirms that our database is consistent state and does not requires any kind of recovery.
So when you use shutdown immediate or Shutdown normal, Database goes down with consistent state. However, this is not the case with “Shutdown Abort” statement.

To understand fully, We need to understand actual working of “Shutdown” process,  SCN ,Checkpoint_Change#, Last_Change#.

Shutdown Process:
Shutdown Immediate/Shutdown Normal: This is the most clean method to take database down. Using this statement, DB goes down in consistent state. Following are the stages:

Shutdown Immediate:

SQL> select open_mode from v$database;

OPEN_MODE
----------
READ WRITE




SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.

Internally, First Oracle closes all the Background (SMON, PMON, CHKPT, DBWR, CJQ0,MMNL,MMON) processes . Secondly if Archival is enabled then it stops. Thirdly, All  uncommitted transaction are rolled back, further logons stops. Checkpoint  happens, All the blocks ( Changed, Committed)  are flushed to Hard disk.  If any trigger (System Trigger) is written on BEFORE SHUTDOWN system event than it fires.
Instances withers away, All Locks, Enaques  (TX) are released. “ALTER DATABASE CLOSE NORMAL” and “ALTER DATABASE DISMOUNT” statements are executed. After all these happenings our database is now SHUT.
In this scenario, Before closing down Oracle does CHECKPOINT that makes database state consistent. Means no Online REDO logs will be required to open database. Once you open the database Checkpoint_Change#  of Database (Controlfile),Datafile(s) and Datafile_Headers will be same.

Checkpoint change number of controlfile.

SQL> Select Checkpoint_Change#,name from v$database;
CHECKPOINT_CHANGE#                                               NAME
----------------------------                                 -------------
1202269                                                                SEED

Checkpoint change number of Datafile(s).

SQL>Select Checkpoint_Change#,name,last_change# from v$datafile;
CHECKPOINT_CHAGNE#   NAME                                                                                                                               Last_Change#  
---------------------                             ---------------------------------                                                        -------------------------------
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\SYSTEM01.DBF   1151635
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\UNDOTBS01.DBF               1151635
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\SYSAUX01.DBF   1151635
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\USERS01.DBF        1151635
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\EXAMPLE01.DBF                  1151635
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\RMAN_TBS01.DBF1151635

Checkpoint Change number in Header of Datafile(s).

SQL>Select dh.checkpoint_change#,df.name from v$datafile_header dh,v$datafile df
where df.file#=dh.file#





CHECKPOINT_CHAGNE#   NAME                                                               CHECKPOINT_CHAGNE#                                                                              
---------------------                             ---------------------------------      ----------------------------------------
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\SYSTEM01.DBF  
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\UNDOTBS01.DBF              
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\SYSAUX01.DBF  
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\USERS01.DBF     
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\EXAMPLE01.DBF               
1202269                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\RMAN_TBS01.DBF           

In this process, Every data file has the same CHECKPOINT_CHANGE# number as DATABASE has. So database does not need any kind of recovery (INSTANCE) recovery.

Now! See how SHUTDOWN ABORT just like a crash and why it needs recovery.


SQL> shutdown abort;
ORACLE instance shut down.

In  this scenario, Database has not done anything to make things consistent instead it is just crashed.
However, unlike Media recovery where we need not to apply Archived REDO logs all the changes stored in Online REDO Log files will be re-applied and instance will recover itself.
In Other words START SCN != STOP SCN so recovery is required.



SQL> startup mount;
ORACLE instance started.

Total System Global Area  612368384 bytes
Fixed Size                  1250428 bytes
Variable Size             230689668 bytes
Database Buffers          373293056 bytes
Redo Buffers                7135232 bytes
Database mounted.

SQL> Select name,Checkpoint_Change# from v$database;

NAME      CHECKPOINT_CHANGE#
---------                    ------------------
SEED                 1260508
SQL>Select Checkpoint_Change#,name,last_change# from v$datafile;
CHECKPOINT_CHAGNE#   NAME                                                                                                                               Last_Change#  
---------------------                             ---------------------------------                                                        -------------------------------
1260508                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\SYSTEM01.DBF                   Null
1260508                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\UNDOTBS01.DBF                              Null
1260508                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\SYSAUX01.DBF                   Null
1260508                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\USERS01.DBF                      Null
1260508                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\EXAMPLE01.DBF                               Null
1260508                                D:\ORACLE\PRODUCT\10.2.0\ORADATA\SEED\RMAN_TBS01.DBF            Null


Next time DBA starts up the database, it with out asking any recovery will automatically apply contents of Online REDO log files (ROLLFORD) and will open the database.
SQL> startup
ORACLE instance started.

Total System Global Area  612368384 bytes
Fixed Size                  1250428 bytes
Variable Size             234883972 bytes
Database Buffers          369098752 bytes
Redo Buffers                7135232 bytes
Database mounted.
Database opened.


I have tried to put the instance recovery process in clear and precise words, but still if anything is missing then any correction, suggestion are most welcome.

3 comments:

  1. Hi,
    We have faced an issue where the database is throwing the below error.

    SQL> alter database open;
    alter database open
    *
    ERROR at line 1:
    ORA-01113: file 1 needs media recovery
    ORA-01110: data file 1: '/txb7/appl/oracle/isdorchr/system01.dbf'

    But my finding is the checkpoint scn is same for all the datafiles.

    SQL> select checkpoint_change# from v$database;

    CHECKPOINT_CHANGE#
    -------------------------
    11933366629687

    SQL> select distinct checkpoint_change# from v$datafile_header;

    CHECKPOINT_CHANGE#
    -------------------------
    11933366629687

    SQL> select controlfile_change# from v$database;

    CONTROLFILE_CHANGE#
    -------------------------
    11933366629687

    Please guide me why these are same as according to my understanding since the datafile is asking for recovery the checkpoint scn's should be different.

    ReplyDelete
  2. Select Checkpoint_Change#,name,last_change# from v$datafile

    what output is given by this query?

    ReplyDelete
  3. SQL> alter database datafile 'e:\app\dell\oradata\sample11\users02.dbf' offline;

    Database altered.


    SQL> insert into scott.emp(empno,ename) values (1211,'jaiddd');

    1 row created.

    SQL> commit;

    Commit complete.

    SQL> alter system switch logfile;

    System altered.

    SQL> alter system switch logfile;

    System altered.

    SQL> alter database datafile 'e:\app\dell\oradata\sample11\users02.dbf' online;
    alter database datafile 'e:\app\dell\oradata\sample11\users02.dbf' online
    *
    ERROR at line 1:
    ORA-01113: file 6 needs media recovery
    ORA-01110: data file 6: 'E:\APP\DELL\ORADATA\SAMPLE11\USERS02.DBF'


    at this point of time
    Select name,checkpoint_change# from v$database;

    Name Checkpoint_change#
    ------------ ------------------------
    SAMPLE11 1869152

    select name,checkpoint_change#,last_change# from v$datafile Where name like '%USERS02.DBF'

    NAME CHECKPOINT_CHANGE# LAST_CHANGE#
    -------------------- ---------------------- ---------------------
    E:\APP\DELL\ORADATA\SAMPLE11\USERS02.DBF 1869152 1869613 ----NOT SAME


    SQL> RECOVER DATAFILE 6;
    Media recovery complete.



    Select name,checkpoint_change# from v$database;

    Name Checkpoint_change#
    ------------ ------------------------
    SAMPLE11 1869648

    select name,checkpoint_change#,last_change# from v$datafile Where name like '%USERS02.DBF'

    NAME CHECKPOINT_CHANGE# LAST_CHANGE#
    -------------------- ---------------------- ---------------------
    E:\APP\DELL\ORADATA\SAMPLE11\USERS02.DBF 1869648 1869648 ---- SAME

    I hope it explains..

    please let me know if you need more information this.

    ReplyDelete