Build Your Own Oracle Extended RAC Cluster on Oracle VM and Oracle Enterprise Linux

by Jakub Wartak
Updated October 2008

The information in this guide is not validated by Oracle, is not supported by Oracle, and should only be used at your own risk; it is for educational purposes only.

9. Testing Failover Capabilities

Using orajdbcstat, a free, GPL'd Java application I wrote (in beta as of this writing), you can easily observe how clients would switch between instances in case of failure, as well as measure performance.

Instance Failure on ERAC1

First, run orajdbcstat on an idle database:

[vnull@xeno orajdbcstat]$ ./orajdbcstat.sh -d ERAC -i ERAC1 -i ERAC2 1
         ------ ------conntime------ -----------stmt---------------
         tns    rac# tcpping    jdbc rac#  ins  sel  upd  del plsql
15:35:22 ERAC      2       0      49    1    3    0        2    2     0
15:35:22 +ERAC1    1       0      45    1   19    0        2    2     0
15:35:22 +ERAC2    2       0      60    2    4    0        2   14     0
15:35:23 ERAC      1       0      44    1    3    0        2    3     0
15:35:23 +ERAC1    1       0      65    1    2    0        2    3     0
15:35:23 +ERAC2    2       0      46    2    3    0        3    2     0
                                         
[..utility still running, output trimmed for clarity..]

You can see from the output that the main thread (the “ERAC” TNS descriptor) is connected to instance number 1. The “conntime” columns show how long it takes to establish new connections to the individual RAC instances and to the main TNS descriptor (the one with failover capabilities; note that “tcpping” is currently unimplemented). Under “stmt” we can monitor the connection from the FCF (Fast Connection Failover) pool: the current RAC instance number and the timings of simple SQL statements. The final “1” on the command line makes orajdbcstat print statistics every second.

Perform the following command to simulate a crash of RAC instance #1:

[oracle@rac1 ~]$ srvctl stop instance -d erac -i erac1 -o abort
[oracle@rac1 ~]$

On the SSH session running orajdbcstat, you get the following output (explained below the excerpt):

15:36:55 ERAC      1       0      45        1    3    0        1    3     0
15:36:55 +ERAC1    1       0      42        1   43    0        2    2     0
15:36:55 +ERAC2    2       0      49        2    3    0        4    2     0
         ------ ------conntime------ -----------stmt---------------
         tns    rac# tcpping    jdbc rac#  ins  sel  upd  del plsql
15:36:56 ERAC      2       0      49 1->X [        E!17008            ]
15:36:56 +ERAC1   -1 E!01092 E!01092 1->X [            E!17008        ]
15:36:56 +ERAC2    2       0      67    2   17    0        4    2     0
15:36:57 ERAC      2       0      46 X->2 [        E!17008        ]
15:36:57 +ERAC1   -1 E!12521 E!12521    X [        E!17008        ]
15:36:57 +ERAC2    2       0      67    2   17    0        4    2     0
15:36:58 ERAC      2       0      46 X->2 [        E!17008        ]
15:36:58 +ERAC1   -1 E!12521 E!12521    X [        E!17008        ]
15:36:58 +ERAC2    2       0      67    2   17    0        4    2     0
15:36:59 ERAC      2       0      46 X->2 [        E!17008        ]
15:36:59 +ERAC1   -1 E!12521 E!12521    X [        E!17008        ]
15:36:59 +ERAC2    2       0      67    2   17    0        4    2     0
15:37:00 ERAC      2       0      46 X->2 [        E!17008        ]
15:37:00 +ERAC1   -1 E!12521 E!12521    X [        E!17008        ]
15:37:00 +ERAC2    2       0      67    2   17    0    4        2     0
15:37:01 ERAC      2       0      56    2   12    0        7    3     0
15:37:01 +ERAC1   -1 E!12521 E!12521    X [        E!17008        ]
15:37:01 +ERAC2    2       0      51    2  131    0        5    3     0
15:37:02 ERAC      2       0      59    2  178    0       17   29     0
15:37:02 +ERAC1   -1 E!12521 E!12521    X [        E!17008        ]
15:37:02 +ERAC2    2       0      73    2  121    0      203   36     0
         ------ ------conntime------ -----------stmt---------------
         tns    rac# tcpping    jdbc rac#  ins  sel  upd  del plsql
15:37:03 ERAC      2       0      68    2    2    0        3    2     0
15:37:03 +ERAC1   -1 E!12521 E!12521    X [        E!17008        ]
15:37:03 +ERAC2    2       0      45    2    3    0        2    3     0
15:37:04 ERAC      2       0      48    2    7    0        3    2     0
15:37:04 +ERAC1   -1 E!12521 E!12521    X [        E!17008        ]
15:37:04 +ERAC2    2       0      86    2    2    0        3    4     0
15:37:05 ERAC      2       0      47    2    2        0    3    2     0
15:37:05 +ERAC1   -1 E!12521 E!12521    X [        E!17008        ]
15:37:05 +ERAC2    2       0      53    2    3    0        3    3     0
15:37:06 ERAC      2       0      48    2    3    0        2    2     0
15:37:06 +ERAC1   -1 E!12521 E!12521    X [        E!17008        ]
15:37:06 +ERAC2    2       0      46    2   10    0        2   10     0
15:37:07 ERAC      2       0      48    2    2    0        3    3     0
15:37:07 +ERAC1   -1 E!12521 E!12521    X [        E!17008        ]
15:37:07 +ERAC2    2       0      83    2   10    0        3    2     0
15:37:08 ERAC      2       0      48    2    3    0        2    2     0
15:37:08 +ERAC1   -1 E!12521 E!12521    X [        E!17008        ]
15:37:08 +ERAC2    2       0      50    2    2    0        3    2     0
15:37:09 ERAC      2       0      48    2    2    0        2    2     0
15:37:09 +ERAC1   -1 E!12521 E!12521    X [        E!17008        ]
15:37:09 +ERAC2    2       0      44    2    3    0        2    3     0
[..utility still running, output trimmed for clarity..]

The following happened:

  • At 15:36:56 all new connections to ERAC1 start returning errors, and all connections in the FCF pool for ERAC1 halt their normal activity. The FCF pool for the main database (the ERAC TNS descriptor) detects the failure, and failover starts.
  • At 15:36:57 (failure + 1 second) the main FCF pool starts reconnecting to the RAC#2 instance (“X->2”).
  • Between 15:37:00 and 15:37:01 (failure + 4 to 5 seconds) the FCF pool connections are reestablished to the surviving ERAC2 instance.

The “E!NNNNN” string indicates that the SQL connection got error ORA-NNNNN. (To decode an error, use the oerr utility; see the example at the end of this subsection.) After recovery (starting instance ERAC1):

15:47:37 ERAC      2       0      64    2    3    0        2    3     0
15:47:37 +ERAC1   -1 E!01033 E!01033    X [         E!17008           ]
15:47:37 +ERAC2    2       0      57    2    8    0       13    1     0
15:47:38 ERAC      2       0      54    2    3    0        3    2     0
15:47:38 +ERAC1    1       0     481 X->1 [         E!17008           ]
15:47:38 +ERAC2    2       0      52    2    3    0        2    4     0
         ------ ------conntime------ -----------stmt---------------
         tns    rac# tcpping    jdbc rac#  ins  sel  upd  del plsql
15:47:39 ERAC      2       0      54    2    3    0        3    2     0
15:47:39 +ERAC1    1       0     167    1   14    0        4    2     0
15:47:39 +ERAC2    2       0      56    2    8    0       27    3     0

As you can see, the FCF pool for ERAC1 detects that the instance is up again and resumes work without manual intervention.
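
The E!12521 and E!01033 codes seen in the excerpts above map to ordinary ORA- errors and can be decoded on any node with the oerr utility from the Oracle home, for example (output trimmed):

[oracle@rac1 ~]$ oerr ora 12521
12521, 00000, "TNS:listener does not currently know of instance requested in connect descriptor"
[..]
[oracle@rac1 ~]$

JDBC-driver-specific codes such as 17008 ("Closed Connection") come from the thin driver itself and are documented in the JDBC documentation rather than in the server message files.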

Catastrophic Site Failure

The main purpose of building an Extended RAC cluster is the ability to survive a catastrophic site failure (the RAC node and SAN at one site crash). Again, launch orajdbcstat from the workstation:

         ------ ------conntime------ -----------stmt---------------
         tns    rac# tcpping    jdbc rac#  ins  sel  upd  del plsql
17:24:18 ERAC      1       0      46    2   11    0        2    2     0
17:24:18 +ERAC1    1       0      45    1    4    0        3    2     0
17:24:18 +ERAC2    2       0      46    2    4    0        3    3     0
17:24:19 ERAC      1       0      45    2    2    0        4    3     0
17:24:19 +ERAC1    1       0      44    1   15    0        3    2     0
17:24:19 +ERAC2    2       0      44    2    3    0        3    3     0

As you can see, the main FCF pool is using the RAC#2 instance, so you are going to crash the entire #2 site (the iscsi2 and rac2 nodes):

[root@quadovm ~]# xm list
Name                                        ID   Mem VCPUs      State   Time(s)
132_rac1                                     1  2048     1     -b----   1845.0
134_rac2                                     6  2048     1     r-----    152.2
Domain-0                                     0   512     4     r-----    665.3
iscsi1                                       3   768     1     -b----    194.0
iscsi2                                       5   768     1     -b----     44.3
[root@quadovm ~]# xm destroy 134_rac2 && date && xm destroy iscsi2 && date
Sat May 31 17:25:55 CEST 2008
Sat May 31 17:25:56 CEST 2008
[root@quadovm ~]#

Now back to the orajdbcstat session:

17:25:56 ERAC      2       0      44        2    3    0    3     2     0
17:25:56 +ERAC1    1       0      79        1    8    0    3     2     0
17:25:56 +ERAC2    2       0      44        2   11    0    3     2     0
17:25:57 ERAC      2       0      44        2    3    0    3     2     0
17:25:57 +ERAC1    1       0      79        1    8    0    3     2     0
17:25:57 +ERAC2    2       0      44        2   11    0    3     2     0
17:25:58 ERAC      2       0      44        2    3    0    3     2     0
17:25:58 +ERAC1    1       0      79        1    8    0    3     2     0
17:25:58 +ERAC2   -1 E!12170 E!12170        2   11    0    3     2     0
17:25:59 ERAC      2       0      44        2    3    0    3     2     0
17:25:59 +ERAC1    1       0      79        1    8    0    3     2     0
17:25:59 +ERAC2   -1 E!12170 E!12170        2   11    0    3     2     0
         ------ ------conntime------ -----------stmt---------------
         tns    rac# tcpping    jdbc rac#  ins  sel  upd  del plsql
17:26:00 ERAC      2       0      44        2    3    0    3     2     0
17:26:00 +ERAC1    1       0      79        1    8    0    3     2     0
17:26:00 +ERAC2   -1  E!12170 E!12170       2   11    0    3     2     0
17:26:01 ERAC      2       0      44        2    3    0    3     2     0
17:26:01 +ERAC1    1       0      79        1    8    0    3     2     0
17:26:01 +ERAC2   -1 E!12170 E!12170        2   11    0    3     2     0
17:26:02 ERAC      2       0      44        2    3    0    3     2     0
17:26:02 +ERAC1    1       0      79        1    8    0    3     2     0
17:26:02 +ERAC2   -1 E!12170 E!12170        2   11    0    3     2     0
17:26:03 ERAC      2       0      44        2    3    0    3     2     0
17:26:03 +ERAC1    1       0      79        1    8    0    3     2     0
17:26:03 +ERAC2   -1 E!12170 E!12170     2->X [         E!03113        ]
17:26:04 ERAC     -1 E!12170 E!12170        2    3    0    3     2     0
17:26:04 +ERAC1    1       0      43        X [         E!03113        ]
17:26:04 +ERAC2   -1 E!12170 E!12170        2->X [      E!03113        ]
17:26:05 ERAC     -1 E!12170 E!12170        2    3    0    3     2     0
17:26:05 +ERAC1    1       0      43     X->1 [         E!03113        ]
17:26:05 +ERAC2   -1 E!12170 E!12170     2->X [         E!03113        ]
17:26:06 ERAC     -1 E!12170 E!12170        2    3    0    3     2     0
17:26:06 +ERAC1    1       0      43     X->1 [         E!03113        ]
17:26:06 +ERAC2   -1 E!12170 E!12170     2->X [         E!03113        ]

You can clearly see that the whole Extended RAC cluster is blocked (even the ERAC1 instance). This is because the ERAC1 instance (mainly DBWR and LGWR) is still retrying writes to the iSCSI storage on iscsi2, which is no longer available. Once the iSCSI timeout expires on the TCP iSCSI connections, I/O errors are returned to the Oracle software (Clusterware, ASM, and the database).

As you saw earlier, we configured the iSCSI stack on rac1 and rac2 to retry for 120 seconds (the node.session.timeo.replacement_timeout parameter in /etc/iscsi/iscsid.conf). During these 120 seconds the kernel keeps retrying I/O operations, while all applications holding file descriptors (fds) on that storage appear to hang. More detailed information can be found in the output of the dmesg command and in the ASM alert log.
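
You can quickly confirm the configured timeout on each RAC node, for example:

[root@rac1 ~]# grep replacement_timeout /etc/iscsi/iscsid.conf
node.session.timeo.replacement_timeout = 120
[root@rac1 ~]#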

After some time:

17:28:17 ERAC     -1 E!12571 E!12571    X [        E!03113        ]
17:28:17 +ERAC1    1       0      43 X->1 [        E!03113        ]
17:28:17 +ERAC2   -1 E!12571 E!12571    X [        E!03113        ]
17:28:18 ERAC     -1 E!12571 E!12571    X [        E!03113        ]
17:28:18 +ERAC1    1       0      75    X [        E!03113        ]
17:28:18 +ERAC2   -1 E!12571 E!12571    X [        E!03113        ]
17:28:19 ERAC     -1 E!12571 E!12571    X [        E!03113        ]
17:28:19 +ERAC1    1       0      91 X->1    29    0   23  23     0
17:28:19 +ERAC2   -1 E!12571 E!12571    X [        E!03113        ]
17:28:20 ERAC     -1 E!12571 E!12571    X [        E!03113        ]
17:28:20 +ERAC1    1       0      43    1     8    0   4    1     0
17:28:20 +ERAC2   -1 E!12571 E!12571    X [        E!03113        ]
17:28:21 ERAC     -1 E!12571 E!12571    X [        E!03113        ]
17:28:21 +ERAC1    1       0      42    1     8    0   4    2     0
17:28:21 +ERAC2   -1 E!12571 E!12571    X [        E!03113        ]
17:28:22 ERAC     -1 E!12571 E!12571 X->1     4    0   8    9     0
17:28:22 +ERAC1    1       0      42    1     7    0   2    2     0
17:28:22 +ERAC2   -1 E!12571 E!12571    X [        E!03113        ]
         ------ ------conntime------ -----------stmt---------------
         tns    rac# tcpping    jdbc rac#  ins  sel  upd  del plsql
17:28:23 ERAC      1      0       45    1     3    0   3    3     0
17:28:23 +ERAC1    1      0       43    1     2    0   3    2     0
17:28:23 +ERAC2   -1 E!12571 E!12571    X [        E!03113        ]
17:28:24 ERAC      1       0      43    1     2    0   2    2     0
17:28:24 +ERAC1    1       0      43    1     2    0   2    2     0
17:28:24 +ERAC2   -1 E!12571 E!12571    X [        E!03113        ]
17:28:25 ERAC      1       0       44    1   19    0   3    2     0
17:28:25 +ERAC1    1       0       43    1    2    0   3    2     0
17:28:25 +ERAC2   -1 E!12571 E!12571    X [        E!03113        ]
17:28:26 ERAC      1       0       44    1    3    0   2    3     0
17:28:26 +ERAC1    1       0       43    1    2    0   1    2     0
17:28:26 +ERAC2   -1 E!12571 E!12571    X [        E!03113        ]
17:28:27 ERAC      1       0       43    1    3    0   1    2     0
17:28:27 +ERAC1    1       0       42    1    2    0   3    2     0
17:28:27 +ERAC2   -1 E!12571 E!12571    X [        E!03113        ]
17:28:28 ERAC      1       0       43    1    8    0    3   2     0
17:28:28 +ERAC1    1       0       43    1    4    0   3    2     0
17:28:28 +ERAC2   -1 E!12571 E!12571    X [        E!03113        ]

This is what happened here:

  • Between 17:28:17 and 17:28:19 (failure + 2 minutes 14 seconds) the ERAC1 TNS descriptor is able to establish new connections again.
  • At 17:28:19 (failure + 2 minutes 16 seconds), the ERAC1 FCF pool is working fine.
  • From 17:28:19 until 17:28:22 (failure + 2 minutes 16 seconds onward), the master ERAC FCF pool tries to fail over to ERAC1.
  • At 17:28:22 (failure + 2 minutes 19 seconds), the master ERAC FCF pool has successfully failed over to ERAC1. All connections are now working well. Failgroup DATA1_0001 changes its disk mount state from CACHED to MISSING, as the following query shows:
SQL> col path format a36
SQL> col diskgroup format a10
SQL> col failgroup format a10
SQL> SELECT path, dg.name diskgroup, failgroup, mount_status, mode_status FROM v$asm_disk d JOIN v$asm_diskgroup dg ON (d.group_number=dg.group_number) ORDER BY 1;
PATH                                 DISKGROUP  FAILGROUP  MOUNT_S MODE_ST
------------------------------------ ---------- ---------- ------- -------
/dev/iscsi/racdata1.asm1/lun0/part1       DATA1 DATA1_0000 CACHED  ONLINE
                                          DATA1 DATA1_0001 MISSING OFFLINE
SQL>

Latency Impact and Performance Degradation

Warning: These tests are designed for the purposes of this guide only; they are not true, real-world tests and should not be considered a true measure of Extended RAC cluster performance.

The critical factor for an Extended RAC configuration is the distance between the RAC nodes and the storage arrays. To simulate latencies without pricey hardware, we are going to use Linux's built-in network traffic shaping ability (often simply called QoS, for Quality of Service). As our SAN is based on iSCSI, we can introduce arbitrary latencies, packet reordering, and packet drops at the Ethernet layer. (The same approach also applies to NFS and potentially to the currently developed FCoE.) The same goes for the interconnect between the RAC nodes.

In this scenario, you are going to configure the internal Xen network devices in dom0 to introduce delays between VMs. The main utility for configuring Linux traffic shaping is called tc (Traffic Control). The problem arises when you start poking into the details: the standard Oracle VM Server dom0 kernel runs with HZ=250, which means that all timer-driven work in the dom0 kernel happens at 4 ms granularity. Latencies introduced by the Linux netem queueing discipline depend on the kernel's HZ, so in order to simulate 1 ms delays you have to use a kernel with a different HZ setting (at least 1000 to achieve an adequate simulation of 1 ms latencies).
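
You can check which HZ the currently running dom0 kernel was built with; a quick check, assuming the packaged kernel config file is installed in /boot (as it is for standard el5 kernel RPMs):

[root@quadovm ~]# grep 'CONFIG_HZ=' /boot/config-$(uname -r)
CONFIG_HZ=250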

As kernel compilations can be cumbersome, I advise you to download a ready-to-run dom0 kernel RPM from here. The procedure for building dom0 Xen kernels is briefly described later in the “Building the Kernel for Oracle VM Server 2.1 Xen dom0” section. The procedure for installing a new kernel is also described there.

A simple script that configures the Xen network interfaces to introduce latency, qosracinit.pl, is run in dom0.
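
Under the hood the script boils down to attaching a netem queueing discipline to the relevant Xen backend interfaces in dom0. A minimal hand-typed equivalent looks roughly like this (a sketch only; the vif names are illustrative and depend on your domain IDs, which the real script derives from your environment):

[root@quadovm ~]# # add 1 ms of egress delay on one Xen backend interface
[root@quadovm ~]# tc qdisc add dev vif1.1 root netem delay 1ms
[root@quadovm ~]# # verify the queueing discipline is attached
[root@quadovm ~]# tc qdisc show dev vif1.1
[root@quadovm ~]# # remove the delay again
[root@quadovm ~]# tc qdisc del dev vif1.1 root

The qosracinit.pl script simply generates such commands for every interface pair between the VMs.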

After changing the variables to reflect your environment, simply execute:

[root@quadovm ~]# ./qosracinit.pl > /tmp/qos && bash /tmp/qos
[root@quadovm ~]#

To clear the QoS rules and remove the artificial latency, simply execute:

[root@quadovm ~]# ./qosracinit.pl clear > /tmp/qos && bash /tmp/qos
[root@quadovm ~]#

Tests were performed using Dominic Giles' Swingbench 2.3 Order Entry benchmark. Swingbench is easy to set up in a RAC environment. “Min Think” and “Max Think” times were set to 0 to exploit the full transaction processing potential of this installation. (“Think” time is the delay introduced before the next transaction is submitted.)

Two load generators (minibench), each configured with 15 JDBC connections, stressed the individual RAC nodes. The load generators were run from the “xeno” workstation. The real database size reported for the SOE user was:

SQL> conn soe/soe
Connected.
SQL> select sum(bytes)/1024/1024 mb from user_segments;

        MB
----------
2250.72656

SQL>

Key parameters were set to the following artificial values:

  • sga_target = 1G
  • db_cache_size = 512M
  • pga_aggregate_target = 256M
  • memory_target = 0 (disabled)

The low db_cache_size setting was chosen to force the Oracle database to stress the interconnect and the SAN, exposing them to the latency impact.
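
If you want to reproduce these settings, one way is to write them into the spfile from a single node; a sketch, assuming a shared spfile (adjust sizes and scope to your environment):

[oracle@rac1 ~]$ sqlplus / as sysdba <<'EOF'
ALTER SYSTEM SET memory_target=0 SCOPE=spfile SID='*';
ALTER SYSTEM SET sga_target=1G SCOPE=spfile SID='*';
ALTER SYSTEM SET db_cache_size=512M SCOPE=spfile SID='*';
ALTER SYSTEM SET pga_aggregate_target=256M SCOPE=spfile SID='*';
EOF

A restart of the database (for example srvctl stop database -d erac followed by srvctl start database -d erac) is then needed for the spfile-only changes to take effect.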

Without Latency

First we measure the real latency between the VMs. (Note: all ping tests were collected under full Swingbench load.)

[oracle@rac2 ~]$ ping -c 10 -i 0.2 -q -s 1200 10.97.1.1; ping -c 10 -i 0.2 -q -s 1200 10.98.1.101; ping -c 10 -i 0.2 -q -s 1200 10.98.1.102
PING 10.97.1.1 (10.97.1.1) 1200(1228) bytes of data.

--- 10.97.1.1 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 1816ms
rtt min/avg/max/mdev = 0.067/0.091/0.113/0.018 ms
PING 10.98.1.101 (10.98.1.101) 1200(1228) bytes of data.

--- 10.98.1.101 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 1804ms
rtt min/avg/max/mdev = 0.083/0.106/0.132/0.020 ms
PING 10.98.1.102 (10.98.1.102) 1200(1228) bytes of data.

--- 10.98.1.102 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 1799ms
rtt min/avg/max/mdev = 0.079/0.108/0.193/0.034 ms
[oracle@rac2 ~]$

This test shows latency between the RAC2 node and:

  • 10.97.1.1 (RAC1 via interconnect)
  • 10.98.1.101 (iscsi1 openfiler via SAN)
  • 10.98.1.102 (iscsi2 openfiler via SAN)

using 1,200-byte ICMP echo packets. We are interested in the average round-trip time, as this is the critical factor for Extended RAC. As you can see below, the benchmark yields ~5460 TPM:

[oracle@rac1 ~]$ sqlplus -s / as sysdba @tpm
6356
5425
4924
5162
5430
Average = 5459.4

PL/SQL procedure successfully completed.

[oracle@rac1 ~]$

Next, introduce latency and observe its real impact on this training system. The tpm.sql script used to measure throughput (transactions per minute) is shown below:

set termout on
set echo off
set serveroutput on
-- Sample committed and rolled-back transactions across all instances (gv$sysstat)
-- in five 60-second windows; print each per-minute delta and the average TPM.
DECLARE
        val1 NUMBER;
        val2 NUMBER;
        diff NUMBER;
        average NUMBER := 0;
        runs NUMBER := 5;
BEGIN
        FOR v IN 1..runs LOOP
                SELECT SUM(value) INTO val1
                  FROM gv$sysstat WHERE name IN ('user commits', 'transaction rollbacks');

                DBMS_LOCK.SLEEP(60);

                SELECT SUM(value) INTO val2
                  FROM gv$sysstat WHERE name IN ('user commits', 'transaction rollbacks');

                diff := val2 - val1;
                average := average + diff;
                DBMS_OUTPUT.PUT_LINE(diff);
        END LOOP;
        DBMS_OUTPUT.PUT_LINE('Average = ' || average/runs);
END;
/

exit

1ms Artificial Latency

As before, first measure the real average round-trip time:

[oracle@rac2 ~]$ ping -c 10 -i 0.2 -q -s 1200 10.97.1.1; ping -c 10 -i 0.2 -q -s 1200 10.98.1.101; ping -c 10 -i 0.2 -q -s 1200 10.98.1.102
PING 10.97.1.1 (10.97.1.1) 1200(1228) bytes of data.

--- 10.97.1.1 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 1809ms
rtt min/avg/max/mdev = 2.693/3.482/3.863/0.389 ms
PING 10.98.1.101 (10.98.1.101) 1200(1228) bytes of data.

--- 10.98.1.101 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 1854ms
rtt min/avg/max/mdev = 2.481/3.850/6.621/1.026 ms
PING 10.98.1.102 (10.98.1.102) 1200(1228) bytes of data.

--- 10.98.1.102 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 1812ms
rtt min/avg/max/mdev = 0.080/0.135/0.233/0.051 ms
[oracle@rac2 ~]$

From the output above you can conclude that after adding 1 ms to the outgoing network queues in Linux, the real average round-trip time increased to ~3.5-3.8 ms. This is because the outgoing ICMP echo request packet is delayed by approximately 1 ms, and the reply is also delayed by 1 ms on the responding side. The remaining unwanted ~1.5-2 ms is probably due to Xen scheduling context switches on the overbooked system under full load. (Note: you are simulating four VMs, but the real I/O work is done by dom0, in effect a fifth VM, all running on a 4-CPU machine.)

[oracle@rac1 ~]$ sqlplus -s / as sysdba @tpm
5173
5610
5412
5094
5624
Average = 5382.6

PL/SQL procedure successfully completed.
[oracle@rac1 ~]$

3ms Artificial Latency

[oracle@rac2 ~]$ ping -c 10 -i 0.2 -q -s 1200 10.97.1.1; ping -c 10 -i 0.2 -q -s 1200 10.98.1.101; ping -c 10 -i 0.2 -q -s 1200 10.98.1.102
PING 10.97.1.1 (10.97.1.1) 1200(1228) bytes of data.

--- 10.97.1.1 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 1819ms
rtt min/avg/max/mdev = 6.326/7.631/9.839/0.881 ms
PING 10.98.1.101 (10.98.1.101) 1200(1228) bytes of data.

--- 10.98.1.101 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 1806ms
rtt min/avg/max/mdev = 6.837/7.643/8.544/0.426 ms
PING 10.98.1.102 (10.98.1.102) 1200(1228) bytes of data.

--- 10.98.1.102 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 1801ms
rtt min/avg/max/mdev = 0.076/0.149/0.666/0.172 ms
[oracle@rac2 ~]$

From the observation above, you can conclude that roughly 1.5-2 ms is added by Xen scheduling on the overbooked system (7.5 ms observed - 2 x 3 ms artificial = 1.5 ms).

[oracle@rac1 ~]$ sqlplus -s / as sysdba @tpm
5489
4883
5122
5512
4965
Average = 5194.2

PL/SQL procedure successfully completed.

[oracle@rac1 ~]$

In this artificial test we can see a performance degradation: ~5200 TPM versus ~5460 TPM without added latency.

10. Troubleshooting and Miscellaneous

This section covers various problems that can occur while implementing the Extended RAC architecture.

Avoiding ORA-12545 While Connecting to RAC

Sometimes, when connecting to a newly configured RAC, clients will get error ORA-12545 (“Connect failed because target host or object does not exist”). To solve this, alter the LOCAL_LISTENER parameter individually for each instance:

SQL> ALTER SYSTEM SET local_listener='(ADDRESS=(PROTOCOL=TCP)(HOST=10.99.1.91)(PORT=1521))' SID='erac1';

System altered.

SQL> ALTER SYSTEM SET local_listener='(ADDRESS=(PROTOCOL=TCP)(HOST=10.99.1.92)(PORT=1521))' SID='erac2';

System altered.

Alternatively, you could set up DNS and register the RAC nodes in it, or reconfigure the client to resolve the RAC hostnames into IP addresses -- e.g., by adding the following to /etc/hosts on UNIX-like JDBC clients:

10.99.1.191     vmrac1-vip vmrac1
10.99.1.192     vmrac2-vip vmrac2
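
A quick way to confirm that a UNIX-like client now resolves the VIP names (a sketch; run it on the client workstation and adjust the names to your environment):

[vnull@xeno ~]$ getent hosts vmrac1-vip vmrac2-vip
10.99.1.191     vmrac1-vip vmrac1
10.99.1.192     vmrac2-vip vmrac2
[vnull@xeno ~]$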

Building the Kernel for Oracle VM Server 2.1 Xen dom0

My experience indicates that in order to compile a dom0 kernel yourself, you should have the following RPMs uploaded to the Oracle VM server:

[vnull@xeno Downloads]$ scp -r RPMS_OVM21_kernel_compile root@10.99.1.2:.
root@10.99.1.2's password:
m4-1.4.5-3.el5.1.i386.rpm                                    100%  133KB 133.2KB/s   00:00
rpm-build-4.4.2-37.el5.0.1.i386.rpm                          100%  547KB 547.5KB/s   00:00
kernel-2.6.18-8.1.6.0.18.el5.src.rpm                         100%   48MB   9.6MB/s   00:05
kernel-headers-2.6.18-8.el5.i386.rpm                         100%  723KB 723.5KB/s   00:00
glibc-devel-2.5-12.i386.rpm                                  100% 2034KB   2.0MB/s   00:00
elfutils-0.125-3.el5.i386.rpm                                100%  164KB 163.7KB/s   00:00
glibc-headers-2.5-12.i386.rpm                                100%  605KB 604.6KB/s   00:00
patch-2.5.4-29.2.2.i386.rpm                                  100%   64KB  64.0KB/s   00:00
redhat-rpm-config-8.0.45-17.el5.0.1.noarch.rpm               100%   52KB  52.5KB/s   00:00
libgomp-4.1.1-52.el5.i386.rpm                                100%   69KB  69.3KB/s   00:00
cpp-4.1.1-52.el5.i386.rpm                                    100% 2673KB   2.6MB/s   00:01
gcc-4.1.1-52.el5.i386.rpm                                    100% 5067KB   5.0MB/s   00:00
elfutils-libs-0.125-3.el5.i386.rpm                           100%  105KB 105.2KB/s   00:00
[vnull@xeno Downloads]$

Next, install the copied packages on the Oracle VM Server:

[root@quadovm RPMS_OVM21_kernel_compile]# rpm -Uhv *.rpm
[..]
[root@quadovm ~]# cd /usr/src/redhat/SPECS/
[root@quadovm SPECS]# vi kernel-2.6.spec

In kernel-2.6.spec you should set the following variables:

  • %define buildboot 0
  • %define buildxenovs 1

[root@quadovm SPECS]# rpmbuild -bp --target=`uname -m` kernel-2.6.spec
Building target platforms: i686
Building for target i686
Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.29959
+ umask 022
[..]
[root@quadovm SPECS]# cd ../BUILD/kernel-2.6.18/linux-2.6.18.i686/
[root@quadovm linux-2.6.18.i686]# grep HZ .config
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
CONFIG_MACHZ_WDT=m
CONFIG_NO_IDLE_HZ=y
[root@quadovm linux-2.6.18.i686]# vi .config

Ensure that HZ is set to 1,000 Hz by editing .config so that it contains the following:

  • # CONFIG_HZ_100 is not set
  • # CONFIG_HZ_250 is not set
  • CONFIG_HZ_1000=y
  • CONFIG_HZ=1000

Edit the Makefile to differentiate your new kernel, then build it:

[root@quadovm linux-2.6.18.i686]# vi Makefile
[..change EXTRAVERSION to e.g. -8.1.6.0.18.el5xen_vnull03..]
[root@quadovm linux-2.6.18.i686]# make oldconfig
[..]
[root@quadovm linux-2.6.18.i686]# make config
[..disable KERNEL DEBUG!..]
[root@quadovm linux-2.6.18.i686]# make rpm
scripts/kconfig/conf -s arch/i386/Kconfig
[..]

The newly built kernel RPM should be ready in the /usr/src/redhat/RPMS/i386 directory.

Installing the New Kernel

Installing the new kernel is routine, but because our self-compiled kernel doesn't use the grubby installer, some installation steps have to be performed manually.

[root@quadovm ~]# rpm -ihv kernel-2.6.188.1.6.0.18.el5xen_vnull03-1.i386.rpm
Preparing...                ########################################### [100%]
   1:kernel                 ########################################### [100%]
[root@quadovm ~]# depmod -a 2.6.18-8.1.6.0.18.el5xen_vnull03
[root@quadovm ~]# mkinitrd /boot/initrd-2.6.18-8.1.6.0.18.el5xen_vnull03.img 2.6.18-8.1.6.0.18.el5xen_vnull03

Next, you have to alter the GRUB bootloader configuration to boot your new kernel:

[root@quadovm ~]# cd /boot/grub
[root@quadovm grub]# vi menu.lst

First, ensure that default is set to 0 in the menu.lst file (default=0); this makes GRUB boot the first kernel entry found in menu.lst. Then put the following GRUB kernel entry first (before any other):

title Oracle VM Server vnull03
        root (hd0,0)
        kernel /xen.gz console=ttyS0,57600n8 console=tty dom0_mem=512M
        module /vmlinuz-2.6.18-8.1.6.0.18.el5xen_vnull03 ro root=/dev/md0
        module /initrd-2.6.18-8.1.6.0.18.el5xen_vnull03.img
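
After rebooting dom0 with this entry, it's worth confirming that the expected kernel is actually running before you start the latency tests:

[root@quadovm ~]# uname -r
2.6.18-8.1.6.0.18.el5xen_vnull03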

Quick Recovery From Catastrophic Site Failure

This brief procedure explains how to make the iscsi2 array and the rac2 node fully functional again after the simulated disaster. Note that in this scenario you have lost the OCR mirror (it was located on an iscsi2 LUN).

  1. Power up the iscsi2 system (e.g., by running xm create vm.cfg in /OVS/running_pool/64_iscsi2).
  2. Let the rac1 node re-establish its iSCSI connections to iscsi2 (watch /var/log/messages for messages similar to “iscsid: connection4:0 is operational after recovery (314 attempts)”).
  3. Perform ALTER DISKGROUP DATA1 ONLINE ALL on the +ASM1 instance.
  4. The missing failgroup should now be synchronizing:

     PATH                                 DISKGROUP  FAILGROUP  MOUNT_S MODE_ST
     ------------------------------------ ---------- ---------- ------- -------
     /dev/iscsi/racdata1.asm1/lun0/part1  DATA1      DATA1_0000 CACHED  ONLINE
     /dev/iscsi/racdata2.asm1/lun0/part1  DATA1      DATA1_0001 CACHED  SYNCING

  5. After some time the ASM failgroups should be fully operational (MODE_STATUS=ONLINE).
  6. The voting disk should be brought online automatically by the CSS daemon.
  7. ocrcheck should indicate that one of the OCR mirrors is not synchronized:

     [root@rac1 bin]# ./ocrcheck
     Status of Oracle Cluster Registry is as follows :
              Version                  :          2
              Total space (kbytes)     :     327188
              Used space (kbytes)      :       3848
              Available space (kbytes) :     323340
              ID                       : 2120916034
              Device/File Name         : /dev/iscsi/racdata1.ocr/lun0/part1
                                         Device/File integrity check succeeded
              Device/File Name         : /dev/iscsi/racdata2.ocr/lun0/part1
                                         Device/File needs to be synchronized with the other device

              Cluster registry integrity check succeeded
     [root@rac1 bin]#

  8. To fix this, run ocrconfig as shown below; afterwards you can check /u01/app/crs/log/rac1/crsd/crsd.log for details of the actions taken during the OCR mirror replacement.

     [root@rac1 bin]# ./ocrconfig -replace ocrmirror /dev/iscsi/racdata2.ocr/lun0/part1
     [root@rac1 bin]#

  9. Power up the rac2 system to form a fully operational extended cluster again.

11. Next Steps

There are many ways to make a RAC cluster even more unbreakable. First, you could make the interconnect redundant using the Linux bonding driver (a configuration sketch follows below). The same applies to I/O multipathing with iSCSI. Performance and test results would also be much better if you used more disks, especially dedicated ones for the iSCSI OpenFiler systems.
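
As a starting point, an interconnect bond on Enterprise Linux 5 is typically declared roughly like this (a sketch only; device names, the IP address, and the bonding mode are illustrative and not part of this guide's tested configuration):

# /etc/modprobe.conf -- load the bonding driver for bond0
alias bond0 bonding
options bond0 miimon=100 mode=active-backup

# /etc/sysconfig/network-scripts/ifcfg-bond0 -- the bonded interconnect interface
DEVICE=bond0
IPADDR=10.97.1.1
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth1 -- first slave NIC (repeat for the second one)
DEVICE=eth1
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none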

12. Acknowledgements

I would like to thank the following people:

  • Radosław Mańkowski for mentoring and fascinating me with Oracle.
  • Mariusz Masewicz and Robert Wrembel for allowing me to write about Extended RAC as a project at Poznan University of Technology.
  • Jeffrey Hunter for the best-ever article about RAC installation, on which this document is based.
  • Kevin Closson (http://kevinclosson.wordpress.com/) and Dan Norris (http://www.dannorris.com/) for technical discussions about (Direct) NFS with Extended RAC.

Jakub Wartak [jakub.wartak@gmail.com] graduated in February 2008 from Poznan University of Technology, Poland, with a BSc in Computer Science and is currently pursuing an MSc in Computer Science. He is an Oracle Certified Associate for Oracle 10g (DBA), a Sun Certified System Administrator for Solaris 10, and a Sun Certified Network Administrator for Solaris 10. Previously he worked as a freelance UNIX/Linux administrator. Since September 2008 Jakub has worked as a junior DBA at GlaxoSmithKline.
