Objective: One fine day in a non‑prod SQL Server environment, an Always On Availability Group breaks. The secondary database suddenly shows "Not Synchronizing" with potential data loss, and suspicion initially points to a recent Windows OS patching cycle and server reboot on the nodes involved in the AG.
Cause: Investigation proves that the database mirroring/AG endpoint port in Azure (default 5022) is blocked between the primary and secondary replicas. Port 5022 is vital for data synchronization between availability replicas, and Azure guidance explicitly calls out that Network Security Groups (NSGs) must allow this endpoint traffic between the servers hosting the AG. On further assessment, it becomes clear that the NSG port rule was changed at some point, but there is no direct evidence of when that change took effect.
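From the SQL Server side, the configured endpoint port and the current connection state of each replica can be confirmed with a couple of standard catalog/DMV queries. A minimal sketch (nothing environment‑specific is assumed beyond the default mirroring endpoint):

-- Which port does the database mirroring / AG endpoint listen on, and is it started?
SELECT dme.name, dme.state_desc, dme.role_desc, te.port
FROM sys.database_mirroring_endpoints AS dme
JOIN sys.tcp_endpoints AS te
  ON dme.endpoint_id = te.endpoint_id;

-- Is each replica currently connected over that endpoint, and what was the last connect error?
SELECT ar.replica_server_name,
       ars.role_desc,
       ars.connected_state_desc,
       ars.last_connect_error_description,
       ars.last_connect_error_timestamp
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
  ON ars.replica_id = ar.replica_id;

If the endpoint shows the expected port and state but the replicas report DISCONNECTED with connect errors, that points away from the engine and toward the network path.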
AG configuration in this scenario is:
Failover: automatic
Seeding: manual
Sync mode: synchronous‑commit.
This means the design expectation is minimal or no data loss and transparent failover when the secondary is healthy and reachable.
What are the instruments in SQL Server that can provide more detail on the exact timeline of the issue from a DBA perspective? Two key sources stand out: the SQL Server error log and AG/WSFC events. Combined, these sources can reconstruct not just the failover sequence but also the last moment when the endpoint port was known to be working.
The SQL Server error log is easy to access and filter, and three patterns are especially useful for understanding the story.
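The log can also be read and filtered straight from T-SQL. A minimal sketch using sp_readerrorlog, where the first parameter is the log file number (0 = current, 1 and higher = archived logs) and the two search strings are combined:

-- Search the current error log for AG role-change messages for one database
EXEC sp_readerrorlog 0, 1, N'changing roles', N'<db_name>';

-- If a restart has recycled the log, search the archived logs the same way
EXEC sp_readerrorlog 1, 1, N'changing roles', N'<db_name>';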
Event 1 is the role-change messages, where the availability group database reports transitions such as PRIMARY → RESOLVING → SECONDARY (and the reverse on the partner node) during failover. The log is typically displayed in reverse chronological order, so reading from bottom to top is important when following the chain of events.
Actual message:
The availability group database "<db_name>" is changing roles from "PRIMARY" to "RESOLVING" because the mirroring session or availability group is in the process of failover. This is an informational message only. No user action is required.
The availability group database "<db_name>" is changing roles from "RESOLVING" to "SECONDARY" because the mirroring session or availability group has completed failover. This is an informational message only. No user action is required.
The reverse pair of messages appears on the partner node as it moves from SECONDARY to PRIMARY.
If all four role‑change messages appear for the database across the two nodes, a clean role transition has occurred; if the sequence stops at "PRIMARY to RESOLVING" and does not progress, the intended secondary could not assume the primary role despite the automatic failover configuration. This is a strong signal that something outside the database engine, such as cluster health or networking, is blocking the transition.
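A related cross-check is whether the database copy on the intended target was considered failover-ready at all; the standard DMV for that is sys.dm_hadr_database_replica_cluster_states. A sketch:

-- Was the secondary copy synchronized enough to be an automatic failover target?
SELECT ar.replica_server_name,
       drcs.database_name,
       drcs.is_failover_ready,
       drcs.is_database_joined
FROM sys.dm_hadr_database_replica_cluster_states AS drcs
JOIN sys.availability_replicas AS ar
  ON drcs.replica_id = ar.replica_id;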
Event 2 is the Windows Server Failover Clustering (WSFC) messages, where the error log shows failures to create or use WSFC notification ports with error code 5073 or related cluster errors. These messages indicate that the WSFC service is not accessible in its current state or cannot bring a resource online, and Microsoft’s documentation explicitly associates such errors with WSFC or cluster configuration issues.
Actual message:
Failed to create a Windows Server Failover Clustering (WSFC) notification port with notification filter <notification_filter> and notification key 3 (Error code 5073). If this is a WSFC availability group, the WSFC service may not be running or may not be accessible in its current state, or the specified arguments are invalid. Otherwise, contact your primary support provider. For information about this error code, see “System Error Codes” in the Windows Development documentation.
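Alongside that message, the cluster and quorum view that the engine itself holds can be pulled from T-SQL, which helps decide whether the WSFC service or only the endpoint traffic was the problem. A sketch using the standard cluster DMVs:

-- Cluster name and quorum state as seen by this instance
SELECT cluster_name, quorum_type_desc, quorum_state_desc
FROM sys.dm_hadr_cluster;

-- State and vote of each cluster member (nodes, witness)
SELECT member_name, member_type_desc, member_state_desc, number_of_quorum_votes
FROM sys.dm_hadr_cluster_members;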
Event 3 is the hardened LSN and commit time messages, which record the hardened LSN, commit LSN, and commit time for the database as seen by the secondary. These values help identify the exact point up to which the secondary successfully hardened log blocks, and articles on AG internals explain that the hardened LSN represents the last transaction log block flushed to disk on the secondary.
Information for Database <dbname> - Hardened Lsn: (xxx:yyy:n) commit LSN: (xxx:yyy:n) commit time: <mmm dd yyyy hh24:mi>.
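The same hardened LSN and commit time information is also exposed live in sys.dm_hadr_database_replica_states, which is often easier to read than the raw error log entries when the instance is still up. A sketch:

-- Last hardened / committed point per database and replica
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.is_primary_replica,
       drs.synchronization_state_desc,
       drs.last_hardened_lsn,
       drs.last_hardened_time,
       drs.last_commit_lsn,
       drs.last_commit_time
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
  ON drs.replica_id = ar.replica_id;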
Putting these three event types together gives a structured way to rebuild the timeline:
1. Filter mainly for the "changing roles" messages for one AG database.
2. Go to the date on which you see all four, or at least the two successful, role-transition messages on one node.
3. Filter that whole date to narrow down the following (the sketch after this list shows one way to run these filters):
A. If node A has both messages and, say, it became primary from secondary, assess what happened to node B: was there a server reboot for OS patching or any other reason?
B. Examine the commit timeline to establish the time until which port 5022 is known to have been working.
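A minimal sketch of those filters with xp_readerrorlog (undocumented but widely used; the parameters are log number, log type, two search strings, start time, end time, and sort order). The dates here are placeholders bounding the suspected incident window:

-- Steps 1-2: role-change messages for one database within a date window
EXEC xp_readerrorlog 0, 1, N'changing roles', N'<db_name>', '20240101', '20240102', N'asc';

-- Step 3B: hardened LSN / commit time entries in the same window
EXEC xp_readerrorlog 0, 1, N'Hardened Lsn', N'<db_name>', '20240101', '20240102', N'asc';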
With this approach, if node A shows a complete transition from SECONDARY to PRIMARY while node B logs a reboot around that time, it becomes possible to correlate OS patching or restart events with the AG failover behavior. The hardened LSN/commit time then sets an upper bound on when NSG and firewall rules still allowed traffic, strongly suggesting that a blocked AG endpoint port in Azure is behind the communication break.
Unlike Oracle in maximum protection mode, SQL Server availability groups do not enforce a hard halt of the primary when secondary replicas become unavailable, even in synchronous‑commit deployments. This means the primary can continue serving traffic while the secondary is disconnected by a blocked endpoint port, which increases the importance of proactive AG health monitoring and periodic review of NSG and firewall rules for ports such as 5022.
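A lightweight proactive check along these lines could run on a schedule and alert when synchronization degrades; a sketch using the standard AG health DMVs:

-- Any availability group that is not fully healthy
SELECT ag.name AS ag_name,
       ags.primary_replica,
       ags.synchronization_health_desc
FROM sys.dm_hadr_availability_group_states AS ags
JOIN sys.availability_groups AS ag
  ON ags.group_id = ag.group_id
WHERE ags.synchronization_health_desc <> 'HEALTHY';

-- Any database whose data movement is suspended or not synchronizing
SELECT DB_NAME(drs.database_id) AS database_name,
       drs.synchronization_state_desc,
       drs.is_suspended,
       drs.suspend_reason_desc
FROM sys.dm_hadr_database_replica_states AS drs
WHERE drs.is_suspended = 1
   OR drs.synchronization_state_desc NOT IN ('SYNCHRONIZED', 'SYNCHRONIZING');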
An open question remains: in environments where NSG rules are changed out of band, what can delay the effect of a port change until a server reboot or similar event, and how should platform teams design monitoring to detect these changes before they impact AG synchronization? That question sits at the intersection of SQL Server HADR behavior, WSFC diagnostics, NSG rule propagation, and operational practices on Azure virtual machines.