Recovering and Validating Data After Unexpected SQL Server Failovers

By Mitch_van_Huuksloot
In SQL Server environments, whether on-premises, Azure SQL Database, or Azure SQL Managed Instance, unexpected failovers can leave Availability Group roles out of sync. When this happens, the new primary replica takes over while the original primary is down, and any transactions that were in flight or not yet replicated can be lost. To detect and recover that loss, validate the new primary against the old primary as soon as the old primary is back online, so that any missing committed transactions can be reconciled and merged back in to maintain database consistency. This blog outlines the steps to recover or validate any data loss using the SQL Server Database Compare Utility.



To simulate this situation with the boxed SQL Server product, two SQL Server VMs were set up on the same subnet, Always On high availability was enabled on both instances, and a simplified WideWorldImportersDW (WWIDW) database was restored and set to the full recovery model. After creating the certificates and logins/users required for Availability Groups (AGs) without a domain, an asynchronous-commit AG was created (it had to be asynchronous, since transaction loss cannot be simulated with a synchronous AG).
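The AG setup can be sketched in T-SQL roughly as follows. The AG name, endpoint ports, and failover mode here are illustrative assumptions, not the exact values used in this lab:

```sql
-- On sql1 (the initial primary): create an asynchronous-commit AG.
-- ASYNCHRONOUS_COMMIT is deliberate: a synchronous replica would not
-- allow the transaction loss this test needs to reproduce.
CREATE AVAILABILITY GROUP [WWIDW_AG]
FOR DATABASE [WWIDW]
REPLICA ON
    N'sql1' WITH (
        ENDPOINT_URL = N'TCP://sql1:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE = MANUAL),
    N'sql2' WITH (
        ENDPOINT_URL = N'TCP://sql2:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE = MANUAL);

-- On sql2: join the replica to the AG, then join the database to it.
ALTER AVAILABILITY GROUP [WWIDW_AG] JOIN;
ALTER DATABASE [WWIDW] SET HADR AVAILABILITY GROUP = [WWIDW_AG];
```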



[Image: Mitch_van_Huuksloot_0-1723479461596.png]



A transaction simulator then drove many transactions per second against the primary database.



[Image: Mitch_van_Huuksloot_1-1723479461601.png]



During this activity, the primary instance was stopped, which created a group of committed transactions that had not been replicated to the secondary. The goal of this post is to recover those missing transactions.



After stopping sql1, open the dashboard on sql2, since SSMS can no longer connect to sql1 to update the status:
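If you prefer T-SQL to the dashboard, the replica roles and synchronization health can also be queried on sql2 directly (a generic query, not one from this lab):

```sql
-- Run on the surviving replica (sql2): role and sync state per replica.
SELECT ar.replica_server_name,
       ars.role_desc,
       ars.synchronization_health_desc,
       ars.operational_state_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
    ON ars.replica_id = ar.replica_id;
```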



[Image: Mitch_van_Huuksloot_2-1723479461605.png]



Opening the Failover wizard, you see a warning about possible data loss:



[Image: Mitch_van_Huuksloot_3-1723479461608.png]



The wizard really wants to make sure you know there may be data loss:



[Image: Mitch_van_Huuksloot_4-1723479461612.png]



Click through to get the results:
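Under the covers, the wizard issues a forced failover. The equivalent T-SQL, run on the target secondary (AG name assumed from this lab's setup), is:

```sql
-- Run on sql2 while it is still a secondary.
-- FORCE_FAILOVER_ALLOW_DATA_LOSS acknowledges that committed
-- transactions not yet replicated from the old primary may be lost.
ALTER AVAILABILITY GROUP [WWIDW_AG] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```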



[Image: Mitch_van_Huuksloot_5-1723479461615.png]



The dashboard for sql2 has the current cluster status:



[Image: Mitch_van_Huuksloot_6-1723479461618.png]



To capture the state of the new primary, immediately create a database snapshot, ideally before opening it up to new application connections:
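A minimal snapshot command looks like this; the snapshot name, file path, and logical file name are assumptions to adapt to your environment:

```sql
-- Run on sql2 immediately after failover, before applications reconnect.
-- The snapshot freezes the new primary's state for later comparison.
CREATE DATABASE WWIDW_PostFailover
ON (NAME = WWIDW,  -- logical data file name of the source database
    FILENAME = 'D:\Snapshots\WWIDW_PostFailover.ss')
AS SNAPSHOT OF WWIDW;
```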



[Image: Mitch_van_Huuksloot_7-1723479461621.png]



(In this case the transaction simulator was pointed at sql1 directly rather than at the AG listener, so no new transactions will be written to sql2.)



Restart the sql1 instance. On refresh, the old primary now shows a state of Not Synchronizing / In Recovery.
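The same state can be confirmed without the dashboard by querying the catalog on sql1 (a generic check, not from the original walkthrough):

```sql
-- Run on sql1 after restart: confirm the database's current state.
SELECT name, state_desc
FROM sys.databases
WHERE name = N'WWIDW';
```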



[Image: Mitch_van_Huuksloot_8-1723479461622.png]



Create a database snapshot on the old primary (sql1) before making it the secondary; otherwise, when the old primary resynchronizes as a secondary, the missing transactions are rolled back and lost.



[Image: Mitch_van_Huuksloot_9-1723479461625.png]



At this point you can Resume Data Movement to make the old primary a secondary.
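Resume Data Movement corresponds to a single T-SQL statement:

```sql
-- Run on sql1 (the old primary) to rejoin the AG as a secondary.
-- This rolls back the unreplicated transactions, which is why the
-- snapshot had to be taken first.
ALTER DATABASE [WWIDW] SET HADR RESUME;
```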



[Image: Mitch_van_Huuksloot_10-1723479461627.png]



It takes a minute for the old primary to roll back the lost transactions and resynchronize with the new primary; the old primary should then appear as a synchronizing secondary:



[Image: Mitch_van_Huuksloot_11-1723479461629.png]



We can now use the SQL Server Database Compare (SSDBC) application to check whether any transactions were lost between the snapshot on the new primary (sql2, the source) and the snapshot on the old primary (sql1, the target). Refer to the SSDBC documentation for setup instructions.



[Image: Mitch_van_Huuksloot_12-1723479461630.png]



In this case there are 313 hash differences (updates) and 423 missing rows (inserts). There are no unreplicated deletes, due to the timing of the simulator's cascading deletes (deletes pause when the configured percentage is exceeded, until the inserts/updates catch up):



[Image: Mitch_van_Huuksloot_13-1723479461643.png]



Looking at the SSDBC folder in My Documents, we can see a SQL script file for each table in the database, prefixed with a number indicating the order in which to run them (based on foreign key references). If the database has declarative referential integrity (DRI) configured, you may need to combine scripts when a referenced table has an identity column and the identity value capture option is turned on. The same folder holds log files with more details about the comparison.



[Image: Mitch_van_Huuksloot_14-1723479461653.png]



Opening one of the files, you can see UPDATE and INSERT statements. The UPDATE statements are written to be as safe as possible: additional conditions in the WHERE clause check that the column value has not been changed again on the new primary since the failover.
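A guarded update of that shape looks roughly like this. The table, key, and values are invented for illustration; the real statements are generated by SSDBC from the snapshot comparison:

```sql
-- Reapply a lost update only if the new primary still holds the
-- pre-failover value, i.e. the row has not been changed again since.
UPDATE dbo.Orders
SET    Quantity = 7            -- value from the old-primary snapshot
WHERE  OrderKey = 12345
  AND  Quantity = 5;           -- guard: expected current value on sql2
```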



[Image: Mitch_van_Huuksloot_15-1723479461669.png]



After running all the change scripts against WWIDW on sql2 and then re-running SSDBC between WWIDW on sql2 (the live database, not the snapshot) and the sql1 snapshot, we see that the databases now match. Note that this approach only works while the new primary is static:



[Image: Mitch_van_Huuksloot_16-1723479461677.png]



Note that the tablediff utility or SSDT could also be used to generate a differences script.



The SSDBC download package includes a PowerShell script that makes running comparisons across many servers and databases easier by driving them from an Excel workbook: list your source and target servers and databases in the provided Servers.xlsx workbook, then run the BulkDatabaseRecovery.ps1 script.



Feedback and suggestions


If you have feedback or suggestions for improving this team asset, please contact the Azure Databases SQL Customer Success Engineering Team. Thank you for your support!
