Thursday, December 18, 2025

MSSQL vs Oracle on patching the binaries before installing them

 Slipstream in MSSQL: apply a CU to the SQL binaries even before setting up the instance


cd <Downloaded SQL binary path>\

setup.exe /Action=Upgrade /UpdateEnabled=True /UpdateSource="<path of downloaded CU>"


You do the same in Oracle using:


./runInstaller -applyRU <RUBinary> -applyOneOffs <oneoff1>,<oneoff2> etc...


All of these help apply the fix while the install or upgrade is attempted, so you don't need any workarounds :)
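Once the instance is up, a quick way to confirm that the CU really was slipstreamed is to query the server properties; a minimal sketch (these SERVERPROPERTY names are documented, and ProductUpdateLevel reports the CU level on recent versions):

-- Confirm the patched build after a slipstreamed install/upgrade
SELECT SERVERPROPERTY('ProductVersion')     AS ProductVersion,
       SERVERPROPERTY('ProductLevel')       AS ProductLevel,
       SERVERPROPERTY('ProductUpdateLevel') AS ProductUpdateLevel;

On the Oracle side, running opatch lspatches against the new home lists the RU and one-offs that were applied.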

ORA-46952 password is just OK, then why?

 This is expected in an Oracle Data Guard environment (mainly version 19c, but any version from 12.2 onwards can face this issue)


Have you ever wondered whether you would have to start MRP with an "ALTER DATABASE" command from SQL*Plus after having the broker configured?


When you downgrade the password file format from 12.2 to 12 and then try putting it back to 12.2, you will hit ORA-46952 "format mismatch for the password file on a standby database"

In this circumstance, broker-driven recovery will kick off MRP, and MRP will fail within the next few minutes.


The workaround, as per the Oracle KEDB, is to:


1. Remove the password file from the standby

2. Initiate recovery using "ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;"

3. Copy the password file from the primary to the standby

4. Now MRP should be stable

5. Send a couple of archives over and verify

6. Now stop/start MRP from the broker; you should be all set.


Note that any other attempts you make, like recreating the password file in format 12 or 12.2 again, copying the old password file, or creating a new one, will not work.

Only the method above works; no idea what the root cause is yet.
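Before recopying the password file, a quick sanity check on the standby helps confirm both pieces; a minimal sketch using standard views (v$managed_standby is the older view but still present, and v$passwordfile_info exposes the format in recent versions):

-- Is MRP up and applying?
SELECT process, status, sequence# FROM v$managed_standby WHERE process LIKE 'MRP%';

-- Which format is the password file currently in?
SELECT file_name, format FROM v$passwordfile_info;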


Yes, 12.2 introduced a new password file feature that now syncs password changes on the primary to the standby through the redo stream (which can also be controlled through a log ship setting).


Phew! As my son says, what a day :)

Sunday, December 14, 2025

MSSQL 2016+ Auto seeding feature on AG

 Fixing a broken AG..

Without auto seeding:

You will need to:

  1. Ensure the AG endpoints are properly configured

  2. Suspend HADR (primary)

ALTER DATABASE [DBName] SET HADR SUSPEND;

  3. Remove the db from AG (primary)

ALTER AVAILABILITY GROUP [AGName] REMOVE DATABASE [DBName];

  4. Drop the database (Secondary)

DROP DATABASE [DBName];

  5. Backup the database + TX log (Primary)

BACKUP DATABASE [DBName] TO DISK = 'C:\Backup\DBName_Full.bak' WITH COMPRESSION;

BACKUP LOG [DBName] TO DISK = 'C:\Backup\DBName_Log.trn' WITH COMPRESSION;

  6. Copy the backup

  7. Restore them in the target (Secondary)

RESTORE DATABASE [DBName] FROM DISK = 'C:\Backup\DBName_Full.bak' WITH NORECOVERY;

RESTORE LOG [DBName] FROM DISK = 'C:\Backup\DBName_Log.trn' WITH NORECOVERY;


  8. Add db back to AG (Primary)

ALTER AVAILABILITY GROUP [AGName] ADD DATABASE [DBName];

  9. Join AG (Secondary)

ALTER DATABASE [DBName] SET HADR AVAILABILITY GROUP = [AGName];

  10. Resume HADR (primary)

ALTER DATABASE [DBName] SET HADR RESUME;

Verify the DB error log at each step, in addition to the feedback message from each command. So it really does demand a lot of effort.
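To keep that per-step verification lighter than reading the error log every time, the HADR DMVs give the same picture; a minimal sketch using the documented DMV columns:

-- Synchronization state of each AG database on this replica
SELECT ag.name AS ag_name,
       DB_NAME(drs.database_id) AS db_name,
       drs.synchronization_state_desc,
       drs.synchronization_health_desc,
       drs.log_send_queue_size,
       drs.redo_queue_size
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_groups ag ON ag.group_id = drs.group_id;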

With Autoseeding: 

The manual db backup, the copy of the backup, and the restore are completely ruled out. The rest stays the same. Step sequence:

  1. Ensure the AG endpoints are properly configured

  2. Suspend HADR (primary)

ALTER DATABASE [DBName] SET HADR SUSPEND;

  3. Remove the db from AG (primary)

ALTER AVAILABILITY GROUP [AGName] REMOVE DATABASE [DBName];

  4. Drop the database (Secondary)

DROP DATABASE [DBName];

  5. Add db back to AG (Primary)

ALTER AVAILABILITY GROUP [AGName] ADD DATABASE [DBName];

  6. Once the db is added back to the AG, with the replica property set to SEEDING_MODE = AUTOMATIC, the AG starts auto-seeding the database from primary to secondary (see the sketch after this list). Just plan it properly so that we don’t halt business.

  7. Resume HADR (primary)

ALTER DATABASE [DBName] SET HADR RESUME;
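Seeding mode is a per-replica property, so the switch and the progress check look roughly like this minimal sketch (the replica name 'SQLNODE2' is only a placeholder):

-- On the primary: switch the secondary replica to automatic seeding
ALTER AVAILABILITY GROUP [AGName]
    MODIFY REPLICA ON N'SQLNODE2' WITH (SEEDING_MODE = AUTOMATIC);

-- On the secondary: allow the AG to create the seeded database
ALTER AVAILABILITY GROUP [AGName] GRANT CREATE ANY DATABASE;

-- Monitor seeding progress from the primary
SELECT start_time, completion_time, current_state, failure_state_desc
FROM sys.dm_hadr_automatic_seeding;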

Refer: https://learn.microsoft.com/en-us/sql/database-engine/availability-groups/windows/automatically-initialize-always-on-availability-group?view=sql-server-ver17

Additional Info:

  1. Oracle FSFO is comparable to MSSQL’s AG automatic failover. AG uses WSFC and quorum disks, whereas Oracle uses an observer node.

  2. Oracle’s flashback-based reinstatement of the old primary as a standby/secondary isn’t available in MSSQL; we will need to restore the whole db in case of disaster.

  3. Duplicate target db is an Oracle 11.2 feature, released at least 5 or 6 years before MSSQL offered comparable auto seeding. The duplicate db feature can be used to build a standby or create a clone of prod.

Version upgrade of MSSQL 2012 SP4 to 2022 upgrade xprepl.dll not found!!!


Objective: You were given an objective to upgrade your MSSQL from version 2012 SP4 to 2022 in place!! Scary at first, isn’t it 😊 Most MSSQL DBAs don’t prefer this approach; the easiest route is to build a parallel instance or parallel cluster and unplug/plug or restore the db directly and finish the job 😊

OK, so the challenge is taken. As a general practice for an MSSQL db upgrade, the following information is collected…

  1. DB design, including its File layout

  2. DB version, compatibility

  3. DB parameter settings

  4. Service account details

  5. Agent Job details

  6. Features usage details

  7. Version compatibility details (OS and DB) for the upgrade

  8. DBCC checkdb

Db config:

  1. AG (2 node replica)

  2. CDC job is configured

After a full backup of the db, setup.exe from the 2022 binaries was kicked off, choosing “Upgrade from a previous version of SQL Server”. On clicking Next, the screen rolls through the prechecks and then to the final confirmation message asking you to confirm the upgrade. Confirmed, and eyes glued to the screen hoping to see a successful run. Instead:

  1. Message #1 – asks for the MSODB17 binary; we can supply that by downloading it

  2. Message #2 – reports that your MSSQL upgrade failed (timed out waiting for SQL Server to come online)

Further examination of the error log reports the following…

Could not load the DLL xprepl.dll, or one of the DLLs it references. Reason: 2(The system cannot find the file specified.).


On clicking “OK”, the upgrade screen runs to completion with a request to perform a server reboot. You will notice a “Database engine upgrade failed” message in the upgrade summary.

You can read more info here https://learn.microsoft.com/en-us/answers/questions/2109340/issue-with-sql-server-updates-missing-xprepl-dll

On server reboot, you see your MSSQL Server coming online fine without any manual intervention. The system dbs are online with the compatibility level (160) matching 2022. No impact to any agent jobs, and the AG is replicating fine (you can’t read from it now though, even if you have allowed all connections).
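A quick check of what the reboot left behind, a minimal sketch:

-- Compatibility level of every database after the in-place upgrade (160 = SQL Server 2022)
SELECT name, compatibility_level FROM sys.databases ORDER BY name;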

So what really was impacted?

Agent didn’t get upgraded!!!

In such a situation, what should we do?

  1. First complete the CU application; the current CU level is 22 for 2022

  2. Repair the SQL Server installation by running setup.exe again; this should fix the Agent version upgrade (a minimal sketch follows)
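For step 2, the repair can also be driven unattended; a minimal command-line sketch, assuming the default instance name (verify the parameters against your setup media before running):

setup.exe /q /Action=Repair /InstanceName=MSSQLSERVER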

Cause: A bug which isn’t fixed until CU6 in 2022.

https://learn.microsoft.com/en-us/troubleshoot/sql/releases/sqlserver-2022/cumulativeupdate6#2442348


Note that xprepl.dll isn’t found in the 2012 SP4 binaries, nor in the 2022 CU22 or base 2022 binaries. It is a mystery why MSSQL reports this issue during an in-place upgrade.


MSSQL: When an Always On Availability Group Breaks After OS Patching: Tracing a Hidden NSG Port Change

Objective: One fine day on a non‑prod SQL Server environment, an Always On Availability Group breaks. The secondary database suddenly shows “data loss” and “not synchronizing” any more, and suspicion initially points to a recent Windows OS patching cycle and server reboot on the nodes involved in the AG.

Cause: Investigation proves that the database mirroring/AG endpoint port in Azure (default 5022) is blocked between the primary and secondary replicas. Port 5022 is vital for data synchronization between availability replicas, and Azure guidance explicitly calls out that Network Security Groups must allow this endpoint traffic between the servers hosting the AG. On further assessment, it becomes clear that the NSG port rule was changed at some point, but there is no direct evidence of when that change took effect.

AG configuration in this scenario is:

  • Failover: automatic

  • Seeding: manual

  • Sync mode: synchronous‑commit.
    This means the design expectation is minimal or no data loss and transparent failover when the secondary is healthy and reachable (the sketch below shows one way to confirm these settings).
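A minimal sketch for confirming these replica settings from the catalog views (seeding_mode_desc is exposed from SQL Server 2016 onwards):

-- Failover, availability and seeding mode per replica
SELECT ag.name AS ag_name,
       ar.replica_server_name,
       ar.failover_mode_desc,
       ar.availability_mode_desc,
       ar.seeding_mode_desc
FROM sys.availability_replicas ar
JOIN sys.availability_groups ag ON ag.group_id = ar.group_id;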

What are the instruments in SQL Server that can provide more details on the exact timeline of the issue from a DBA perspective? Two key sources stand out: the SQL Server error log and AG/WSFC events. These surfaces, combined, can reconstruct not just the failover sequence but also the last moment when the endpoint port was known to be working.

The SQL Server error log is easy to access and filter, and three patterns are especially useful for understanding the story.


Event 1 is the role change messages, where the availability group database reports transitions such as PRIMARY → RESOLVING → SECONDARY (and the reverse on the partner node) during failover. The actual log ordering is reverse chronological, so reading from bottom to top is important when following the chain of events.

Actual message:

The availability group database "<db_name>" is changing roles from "PRIMARY" to "RESOLVING" because the mirroring session or availability group is in the process of failover. This is an informational message only. No user action is required.​

The availability group database "<db_name>" is changing roles from "RESOLVING" to "SECONDARY" because the mirroring session or availability group has completed failover. This is an informational message only. No user action is required.

And vice versa on the partner node, as the old secondary transitions to the new primary.

If all four role‑change messages appear for the database on both nodes, a clean role transition has occurred; if the sequence stops at “PRIMARY to RESOLVING” and does not progress, it indicates a problem where the intended secondary cannot assume the primary role in an automatic failover configuration. This is a strong signal that something outside the database engine, such as cluster health or networking, is blocking the transition.

Event 2 is the Windows Server Failover Clustering (WSFC) messages, where the error log shows failures to create or use WSFC notification ports with error code 5073 or related cluster errors. These messages indicate that the WSFC service is not accessible in its current state or cannot bring a resource online, and Microsoft’s documentation explicitly associates such errors with WSFC or cluster configuration issues.

Actual message:

Failed to create a Windows Server Failover Clustering (WSFC) notification port with notification filter <notification_filter> and notification key 3 (Error code 5073). If this is a WSFC availability group, the WSFC service may not be running or may not be accessible in its current state, or the specified arguments are invalid. Otherwise, contact your primary support provider. For information about this error code, see “System Error Codes” in the Windows Development documentation.

Event 3 is the hardened LSN and commit time messages, which record information such as hardened LSN, commit LSN, and commit time for the database as seen by the secondary. These values help identify the exact point up to which the secondary successfully hardened log blocks, and articles on AG internals explain that hardened LSN represents the last transaction log block flushed to disk on the secondary.

Actual Message:

Information for Database <dbname> - Hardened Lsn: (xxx:yyy:n) commit LSN: (xxx:yyy:n) commit time: <mmm dd yyyy hh24:mi>.​

Putting these three event types together gives a structured way to rebuild the timeline:

1. Filter mainly for the "changing roles" messages for one AG database (see the sketch after these steps)

2. Go to the date on which you see all 4, or at least 2, successful role transition messages on one node

3. Filter that whole date to narrow down the following info:

A. If node A has both messages and, let us assume, it became primary from secondary, assess what happened to node B: was there a server reboot for OS patching or any other reason?

B. Examine the commit timeline to establish the time until which port 5022 is known to have been working.
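For step 1, a hedged sketch of that filter (xp_readerrorlog is undocumented but widely used; 'DBName' is only a placeholder):

-- Current error log (0), SQL Server log (1), filtered for role changes of one AG database, oldest first
EXEC master.dbo.xp_readerrorlog 0, 1, N'changing roles', N'DBName', NULL, NULL, N'ASC';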

With this approach, if node A shows a complete transition from SECONDARY to PRIMARY while node B logs a reboot around that time, it becomes possible to correlate OS patching or restart events with the AG failover behavior. The hardened LSN/commit time then sets an upper bound on when NSG and firewall rules still allowed traffic, strongly suggesting that a blocked AG endpoint port in Azure is behind the communication break.

Unlike Oracle in maximum protection mode, SQL Server availability groups do not enforce a hard halt of the primary when secondary replicas become unavailable, especially in synchronous‑commit deployments. This means the primary can continue serving traffic even while the secondary is disconnected by a blocked endpoint port, increasing the importance of proactive AG health monitoring and periodic review of NSG and firewall rules for ports such as 5022.
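One lightweight way to make that monitoring proactive is to poll the replica-state DMV from the primary and alert whenever a replica reports DISCONNECTED or NOT_HEALTHY; a minimal sketch:

-- Connection and health state of each availability replica as seen from this node
SELECT ar.replica_server_name,
       ars.role_desc,
       ars.connected_state_desc,
       ars.synchronization_health_desc,
       ars.last_connect_error_description
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar ON ar.replica_id = ars.replica_id;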

An open question remains: in environments where NSG rules are changed out of band, what can delay the effect of a port change until a server reboot or similar event, and how should platform teams design monitoring to detect these changes before they impact AG synchronization? That question sits at the intersection of SQL Server HADR behavior, WSFC diagnostics, NSG rule propagation, and operational practices on Azure virtual machines.
