Distributed AG – The Things That Can Trip You Up

Overview

In a previous article I demonstrated how to use a Distributed Availability Group (DAG) to move an Availability Group (AG) from one domain to another.

Of course, there are things to watch out for.

The Things to Watch Out For

This isn’t an exhaustive list – it’s just based on my experience with the implementation that I worked on.

Everybody’s Favourite

Firewalls. This is the thing that everybody will want you to check, whether it is your associates, any consultants or Microsoft staff – they will all ask that you check the firewall.

This is a fair point, because if there are connectivity issues, even the default error message returned from the DAG queries will mention the possibility of a firewall issue. Therefore you need to eliminate that possibility.

One thing that did catch us out, is the requirement for the default port 5022 – required for the synchronisation. It wasn’t particularly clear in the articles that we came across.

Log Block Sizes

“There have been <a ridiculously large number> misaligned IOs which required falling back to synchronous IO. The current IO is on file <source log file full path name>”

This is caused by a mismatch of the sector size, between the source AG log disk and the target AG log disk.

With Azure (the target we were migrating to), the sector size is 4KB. The sector size on our source AG was 512 bytes.

Replication across the DAG still works with this issue, but there is an obvious performance impact. The way to cater for this issue is to set trace flag 1800. This flag needs to be set prior to the creation of the DAG, so you may need to destroy the DAG and re-create it.

Failover Within an AG

Within my test environment, I could perform a failover within either AG and the replication across the DAG would continue without interruption. After all, it (theoretically) joins the Listeners, so should move when they move.

However, this behaved in an entirely different way when we used it in the live environments.

It just didn’t work.

In the live environment the failover the status of the DAG would change to ‘NOT HEALTHY’ and replication across the AG would fail.

In this situation, we had to go to the opposite AG and alter the DAG settings on that, to update the LISTENER_URL property.

Listing 1: Altering the DAG Properties

ALTER AVAILABILITY GROUP [<DAG Name>]  
MODIFY AVAILABILITY GROUP ON  
 '<AG Name>' WITH    
    (   
        LISTENER_URL = 'TCP://xx.xxx.xx.xx:5022' /* IP or FQDN of the node we now want to pair with */
    )

It is worth noting that as long as the DAG actually exists, the replication will just be suspended between the AGs. It appears to act just like a ‘normal’ AG, when a node or database is suspended from replication. As a test, we deliberately broke the DAG for a weekend, by triggering a failover on one the AGs. When we corrected it with this change to LISTENER_URL, it caught up quite quickly.

Summary

As stated earlier, this is not an exhaustive list of issues – merely the issues that we encountered. I would appreciate it if others gave details of issues that they encountered, as I still find this an interesting subject, even though I have (in theory) finished with it.

References

Different Sector Sizes Between Replicas

Notes on SQL

Random articles from a puzzled DBA