Notes on SQL

Random articles from a puzzled DBA

Subtitle – Distributed AG, the Gift That Keeps on Giving

Overview

Unexpectedly, we had a connection timeout/failure between a couple of nodes of our Availability Group (AG). None of the documented causes of this (that we could find) were responsible for this, until somebody asked a very simple question….

The Problem

Having migrated across from an on-premises AG to an AG based on Azure Virtual Machines, by use of a Distributed AG (DAG), things were working quite merrily.

Then one day, one of the nodes failed, with what appeared to be connection issues.

Various error messages were displayed that indicated either connection timeouts or forced connection closures.

The variety of errors were:

An error occurred while receiving data. ‘10054(An existing connection was forcibly closed by the remote host )’.

A connection timeout has occurred while attempting to establish a connection to availability replica ‘<replica name>’ with id [<ID Number>]

Or

The target principal name is incorrect. Cannot generate SSPI context. (Microsoft SQL Server)

There are a wide range of reasons for these messages, and I’ll include some of them at the foot of this article, but none of them really pointed to the cause of the issue that we were experiencing.

While attempting to correct this issue, a second node also started to display the same errors. With the stability of the system now under serious threat several people were attempting to check the more usual causes for such issues.

Then somebody asked if we were using any certificates within the SQL environment.

Interestingly, one of our Operations staff then ran a scan on the server, looking for any expired certificates, which came up empty. All certificates that it could find were fine.

However, I knew that there was a certificate within SQL Server, that was not mentioned by this external scan.

When we created the DAG, a year earlier, it required a certificate as part of its process. We had removed the DAG itself, once we had successfully migrated across, but not the certificate. Surely enough, when we examined that certificate, it had expired a short time earlier. Although the time that the connection issues had started was a couple of hours after the certificate had expired, it was worth looking at. We would have expected the issues to happen as soon as the certificate expired, but it was the only issue that we had found within our environment.

The Solution

For the DAG purposes, the certificate is associated with the mirroring endpoint. The T-SQL code for this is:

Listing 1: The original code for the endpoint and certificate


USE [master]
GO

CREATE ENDPOINT [Hadr_endpoint] 
	STATE=STARTED
	AS TCP (LISTENER_PORT = 5022, LISTENER_IP = ALL)
	FOR DATA_MIRRORING (ROLE = ALL, AUTHENTICATION = CERTIFICATE [dag_certificate] WINDOWS NEGOTIATE
, ENCRYPTION = REQUIRED ALGORITHM AES)
GO

Looking at this code, we assumed that if there were any issues with the certificate, it would default to ‘WINDOWS NEGOTIATE’. The reason we used two authentication methods was to reduce downtime on the nodes, when ‘interfering’ with the connections during the DAG construction. Should

However, in the absence of anything else that looked suspicious, we updated the endpoint on each node thus:

Listing 2: The updated endpoint code

USE [master]
GO

DROP ENDPOINT [Hadr_endpoint]
GO

CREATE ENDPOINT [Hadr_endpoint] 
	STATE=STARTED
	AS TCP (LISTENER_PORT = 5022, LISTENER_IP = ALL)
	FOR DATA_MIRRORING (ROLE = ALL, AUTHENTICATION = WINDOWS NEGOTIATE
, ENCRYPTION = REQUIRED ALGORITHM AES)
GO

As soon as this new code was processed, replication resumed across the nodes.

Conclusion

The expired certificate was the cause of the problems, and we should have tracked the details of certificates within SQL Server. It wasn’t known, at the time, that the processes we had for monitoring such things did not check within the SQL instance itself.

References

Although none of the links below directly solved our issue, some of these may prove useful, if you are experiencing similar issues. Additionally, some of these contain interesting information related to the areas we were looking at. This is more of a brain-dump of the route that we took, when looking at this.

https://learn.microsoft.com/en-us/sql/database-engine/availability-groups/windows/availability-replica-is-disconnected?view=sql-server-ver16

https://learn.microsoft.com/en-us/troubleshoot/sql/database-engine/connect/cannot-generate-sspi-context-error

https://learn.microsoft.com/en-us/answers/questions/1036699/cant-connect-to-sql-server-anymore-due-to-sspi-con

https://learn.microsoft.com/en-us/sql/database-engine/configure-windows/register-a-service-principal-name-for-kerberos-connections?view=sql-server-ver16#Auto

https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-hadr-database-replica-states-transact-sql?view=azuresqldb-current

https://learn.microsoft.com/en-us/troubleshoot/sql/database-engine/availability-groups/troubleshooting-intermittent-connection-timeouts-availability-groups

https://learn.microsoft.com/en-us/sql/database-engine/database-mirroring/troubleshoot-database-mirroring-configuration-sql-server?view=sql-server-ver16#Endpoints

https://learn.microsoft.com/en-us/sql/t-sql/statements/alter-endpoint-transact-sql?view=sql-server-ver16

https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-exec-connections-transact-sql?view=sql-server-ver16

https://stackoverflow.com/questions/24207116/the-semaphore-timeout-period-has-expired-sql-azure

https://dba.stackexchange.com/questions/189278/error-976-severity-14-state-1-always-on-clustering

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.