Dual-Site Failures

Various dual-site failure scenarios are possible. Each is characterized by a full system outage, that is, the application is no longer available. This section identifies the recovery mechanism for each scenario. In each scenario, Site A is the primary and Site B is the secondary before any failure.

In the following scenarios, the secondary site fails first. Once the secondary system fails, the primary continues to operate, but its replication Source Server is unable to send updates to the secondary system. As a result, when the primary later fails, it holds a queue of non-replicated transactions. The primary site fails before the secondary site recovers, which leads to a dual-site outage. Operating procedures differ according to which site recovers first.
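To make the queue of non-replicated transactions concrete, the following minimal Python sketch models a primary that keeps committing while the secondary is unreachable. The class and attribute names (`PrimarySite`, `seqno`, `replicated_through`) are hypothetical, and the model assumes that each committed transaction receives the next integer journal sequence number. It is an illustration only, not the replication software's actual interface.

```python
from dataclasses import dataclass


@dataclass
class PrimarySite:
    """Toy model of a primary site (all names are illustrative assumptions)."""
    seqno: int = 0                # sequence number of the last committed transaction
    replicated_through: int = 0   # last sequence number the secondary received
    secondary_up: bool = True

    def commit(self) -> int:
        """Commit one transaction; replicate it only if the secondary is reachable."""
        self.seqno += 1
        if self.secondary_up:
            self.replicated_through = self.seqno
        return self.seqno

    def backlog(self) -> range:
        """Sequence numbers committed locally but never sent to the secondary."""
        return range(self.replicated_through + 1, self.seqno + 1)


site_a = PrimarySite()
for _ in range(3):
    site_a.commit()               # transactions 1-3 reach Site B
site_a.secondary_up = False       # Site B fails
for _ in range(2):
    site_a.commit()               # transactions 4-5 exist only on Site A
print(list(site_a.backlog()))     # prints [4, 5]
```

In this model, the backlog is exactly the set of transactions that Site B has never seen if Site A then fails as well.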

Site A Recovers First

On Site A:

  • Roll back the database to the last committed transaction (last application-consistent state).

  • Create new journal files.

  • Start the Source Server.

  • Start the application servers. Application availability is now restored.

  • If the state of the database indicates that batch operations were in process, restart batch operations.

On Site B, when it recovers:

  • Roll back the database to the last committed transaction.

  • Create new journal files.

  • Start the Source Server in passive mode.

  • Start the Receiver Server. Dual-site operation is now restored.

  • Start the passive application servers, as appropriate.

Site B Recovers First

On Site B:

  • Roll back the database to the last committed transaction (last application-consistent state).

  • Create new journal files.

  • Start the Source Server.

  • Start the application servers. Application availability is now restored.

  • If the state of the database indicates that batch operations were in process, restart batch operations.

On Site A, when it recovers:

  • Query the primary as to the journal sequence number at which it became primary, and roll back the secondary database to this point. Transmit the transactions that were backed out of the database by the rollback to the primary for reconciliation/reapplication.

  • Create new journal files.

  • Start the Source Server in passive mode.

  • Start the Receiver Server to resume replication as the new secondary. Dual-site operation is now restored.

  • Start the passive application servers, as appropriate.
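The rollback performed by the rejoining site can be pictured as splitting its journal at the sequence number reported by the new primary. The sketch below is a simplified Python model, not the product's rollback utility. The function name `roll_back_to`, the `(seqno, update)` tuple layout, and the convention that the reported number is the first one the new primary assigned on its own are all assumptions made for illustration.

```python
from typing import List, Tuple

Transaction = Tuple[int, str]  # (journal sequence number, description of the update)


def roll_back_to(journal: List[Transaction],
                 became_primary_seqno: int) -> Tuple[List[Transaction], List[Transaction]]:
    """Split a journal into transactions to keep and transactions to back out.

    Assumed convention: became_primary_seqno is the first sequence number the
    new primary assigned itself, so everything below it is common to both sites.
    """
    kept = [t for t in journal if t[0] < became_primary_seqno]
    backed_out = [t for t in journal if t[0] >= became_primary_seqno]
    return kept, backed_out


# Site A committed 1-5, but Site B had received only 1-3 before it failed,
# so in this example Site B became primary at sequence number 4.
site_a_journal = [(1, "open acct"), (2, "deposit"), (3, "transfer"),
                  (4, "withdraw"), (5, "close acct")]
kept, lost = roll_back_to(site_a_journal, became_primary_seqno=4)

print([seqno for seqno, _ in kept])   # prints [1, 2, 3]
print([seqno for seqno, _ in lost])   # prints [4, 5]
```

Here the `lost` list stands in for the backed-out transactions that are transmitted to the primary for reconciliation/reapplication.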

In the following scenarios, the primary site fails first, causing a failover to Site B. Site B operates as the primary and then fails.

Site B Recovers First

On Site B:

  • Roll back the database to the last committed transaction (last application-consistent state).

  • Create new journal files.

  • Start the Source Server.

  • Start the application servers. Application availability is now restored.

  • If the state of the database indicates that batch operations were in process, restart batch operations.

On Site A, when it recovers:

  • Query the primary as to the journal sequence number at which it became primary, and roll back the secondary database to this point. Transmit the transactions that were backed out of the database by the rollback to the primary for reconciliation/reapplication.

  • Create new journal files.

  • Start the Source Server in passive mode.

  • Start the Receiver Server to resume replication as the new secondary. Dual-site operation is now restored.

  • Start the passive application servers, as appropriate.

Site A Recovers First

On Site A:

  • Roll back the database to the last committed transaction (last application-consistent state).

  • Create new journal files.

  • Start the Source Server.

  • Start the application servers. Application availability is now restored.

  • If the state of the database indicates that batch operations were in process, restart batch operations.

On Site B, when it recovers:

  • Roll back all transactions that were processed when it was the primary. Transmit the transactions backed out of the database by the rollback to the primary for reconciliation/reapplication.

  • Create new journal files.

  • Start the Source Server in passive mode.

  • Start the Receiver Server to resume replication as the new secondary. Dual-site operation is now restored.

  • Start the passive application servers, as appropriate.
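The four scenarios in this section differ only in which site ends up as primary and in the rollback that the later-recovering site performs. As a summary, the following Python sketch encodes that decision as a lookup. It is a planning aid under stated assumptions, not an automation of the procedure: the function names `later_site_rollback` and `recovery_plan` and the step strings are hypothetical, and the sites are identified simply as "A" (initial primary) and "B" (initial secondary).

```python
PRIMARY_STEPS = [
    "Roll back the database to the last committed transaction",
    "Create new journal files",
    "Start the Source Server",
    "Start the application servers",
    "Restart batch operations if they were in process",
]

SECONDARY_TAIL = [
    "Create new journal files",
    "Start the Source Server in passive mode",
    "Start the Receiver Server",
    "Start the passive application servers, as appropriate",
]


def _check(site: str) -> None:
    if site not in ("A", "B"):
        raise ValueError(f"unknown site: {site!r}")


def later_site_rollback(first_failed: str, first_recovered: str) -> str:
    """Return the rollback step for the site that recovers second."""
    _check(first_failed)
    _check(first_recovered)
    if first_recovered == "B":
        # Site A rejoins as secondary, whichever site failed first.
        return ("Roll back to the journal sequence number at which Site B "
                "became primary; transmit backed-out transactions to the primary")
    if first_failed == "B":
        # Site B was only ever the secondary and Site A is primary again.
        return "Roll back the database to the last committed transaction"
    # Site A failed first, Site B took over, then Site A recovered first.
    return ("Roll back all transactions processed while Site B was primary; "
            "transmit backed-out transactions to the primary")


def recovery_plan(first_failed: str, first_recovered: str) -> dict:
    """Build the ordered steps for both sites in one scenario."""
    rollback = later_site_rollback(first_failed, first_recovered)
    later = "B" if first_recovered == "A" else "A"
    return {
        "primary": first_recovered,
        "primary_steps": list(PRIMARY_STEPS),
        "secondary": later,
        "secondary_steps": [rollback] + SECONDARY_TAIL,
    }


plan = recovery_plan(first_failed="A", first_recovered="A")
print(plan["secondary"], "->", plan["secondary_steps"][0])
```

In every case the site that recovers first becomes the primary, so only the first step for the later site varies.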