I quite often get calls logged asking for help understanding why the active copy of a DAG database moves from one server to another. There can be a number of reasons for this, not all of them particularly well recorded in the event logs – a favourite is the DAG networks not being collapsed when they span sites, and therefore different subnets, but that’s not what I wanted to write about.
Quite often, the best way to understand what happened is to go through the failover cluster log. If you’ve not looked at this log before, I urge you to try it, particularly if you suffer from insomnia. In Windows Server 2008 R2 you can generate it by running Get-ClusterLog -Destination <location> in PowerShell.
A normal cluster log would look something like this:
000016c0.0000162c::2014/03/12-12:15:15.892 INFO [GUM] Node 2: Processing RequestLock 4:689542
000016c0.00003dcc::2014/03/12-12:15:15.892 INFO [GUM] Node 2: Processing GrantLock to 4 (sent by 1 gumid: 6354235)
000016c0.000015b4::2014/03/12-12:15:23.192 INFO [GUM] Node 2: Processing RequestLock 2:144215
000016c0.0000162c::2014/03/12-12:15:23.192 INFO [GUM] Node 2: Processing GrantLock to 2 (sent by 4 gumid: 6354236)
That is a couple of events every few seconds. At this rate of generation, the default log size of 100MB is usually enough for about 24 hours’ worth of events. But say you have a problem (like DAG networks not being collapsed correctly, which I come back to further down). Then your log may look more like this:
000018bc.00001998::2014/02/13-11:53:54.854 DBG [NETFTAPI] Signaled NetftRemoteUnreachable event, local address xxx.xxx.41.xxx:003853 remote address xxx.xxx.141.xxx:003853
000018bc.0000199c::2014/02/13-11:53:54.854 INFO [IM] got event: Remote endpoint xxx.xxx.141.xxx:~3343~ unreachable from xxx.xxx.41.xxx:~3343~
000018bc.0000199c::2014/02/13-11:53:54.854 INFO [IM] Marking Route from xxx.xxx.41.xxx:~3343~ to xxx.xxx.141.xxx:~3343~ as down
000018bc.0000199c::2014/02/13-11:53:54.854 INFO [NDP] Checking to see if all routes for route (virtual) local fe80::b8ac:d730:1392:4e4d:~0~ to remote fe80::698d:34a4:a5c9:2e77:~0~ are down
000018bc.0000199c::2014/02/13-11:53:54.854 INFO [NDP] Route local xxx.xxx.201.xxx:~3343~ to remote xxx.xxx.202.xxx:~3343~ is up
000018bc.0000199c::2014/02/13-11:53:54.854 INFO [IM] Adding information for route Route from local xxx.xxx.41.xxx:~0~ to remote xxx.xxx.141.xxx:~0~, status: true, attributes: 0
000018bc.0000199c::2014/02/13-11:53:54.854 INFO [IM] Adding information for route Route from local xxx.xxx.41.xxx:~0~ to remote xxx.xxx.141.xxx:~0~, status: false, attributes: 0
000018bc.0000199c::2014/02/13-11:53:54.854 INFO [IM] Sending connectivity report to leader (node 2): <class mscs::InterfaceReport>
000018bc.0000199c::2014/02/13-11:53:54.854 INFO <fromInterface>d8430531-25e6-4749-8b1d-2bf5f06da430</fromInterface>
000018bc.0000199c::2014/02/13-11:53:54.854 INFO <upInterfaces><vector len='2'>
000018bc.0000199c::2014/02/13-11:53:54.854 INFO <item>d8430531-25e6-4749-8b1d-2bf5f06da430</item>
000018bc.0000199c::2014/02/13-11:53:54.854 INFO <item>62a2fefa-9b12-436d-a270-fec45ee86d23</item>
000018bc.0000199c::2014/02/13-11:53:54.854 INFO </vector>
000018bc.0000199c::2014/02/13-11:53:54.854 INFO </upInterfaces>
000018bc.0000199c::2014/02/13-11:53:54.854 INFO <downInterfaces><vector len='1'>
000018bc.0000199c::2014/02/13-11:53:54.854 INFO <item>c16aa803-1446-41d0-8b1f-338a6093ec37</item>
000018bc.0000199c::2014/02/13-11:53:54.854 INFO </vector>
000018bc.0000199c::2014/02/13-11:53:54.854 INFO </downInterfaces>
As you can see, the rate of entry generation has increased dramatically. In this particular example the default log size of 100MB covers approximately fifteen MINUTES. It would be a good idea, then, to increase the cluster log size from the default of 100MB to a larger number. 400MB is quoted in some of the literature, although not particularly strongly. The best article on this suggests keeping 72 hours of log data; in my experience, however, even the maximum log size of 1GB can sometimes hold only 12 hours of data. That article also contains instructions for setting the cluster log size in Windows Server 2008. For 2008 R2, use Set-ClusterLog -Size 1024 (the size is given in MB).
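If you want to check how much time your own log actually covers, a rough sketch along these lines may help. It only reads the timestamp that follows the "::" in each entry (the format shown in the excerpts above) and subtracts the first from the last. The function name Get-ClusterLogTimeSpan is my own invention, not a built-in cmdlet.

```powershell
# Sketch: estimate how much time a set of cluster log entries covers.
# Get-ClusterLogTimeSpan is a hypothetical helper, not part of the FailoverClusters module.
function Get-ClusterLogTimeSpan {
    param([string[]]$Line)

    # Timestamp format as seen in the log excerpts: 2014/03/12-12:15:15.892
    $pattern = '::(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s'
    $culture = [System.Globalization.CultureInfo]::InvariantCulture

    # Collect a DateTime for every line that carries a timestamp; other lines are skipped.
    $stamps = @(foreach ($entry in $Line) {
        if ($entry -match $pattern) {
            [datetime]::ParseExact($Matches[1], 'yyyy/MM/dd-HH:mm:ss.fff', $culture)
        }
    })

    if ($stamps.Count -eq 0) { return $null }

    # Entries are written in time order, so last minus first is the covered span.
    return ($stamps[-1] - $stamps[0])
}

# Two entries taken from the "normal" log excerpt above.
$sample = @(
    '000016c0.0000162c::2014/03/12-12:15:15.892 INFO [GUM] Node 2: Processing RequestLock 4:689542',
    '000016c0.0000162c::2014/03/12-12:15:23.192 INFO [GUM] Node 2: Processing GrantLock to 2 (sent by 4 gumid: 6354236)'
)
Get-ClusterLogTimeSpan -Line $sample
```

The two sample entries are 7.3 seconds apart, so that is the span returned. Against a real log you could pass in the output of Get-Content on the generated file. For a log approaching 1GB it is probably kinder to feed it only the first and last few lines.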
But Nick, I can’t run Get-ClusterLog?
You need to import the failover clustering module:
1. Start PowerShell as an administrator
2. Run Import-Module FailoverClusters
And Bob’s your uncle.
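Put together, the steps above might look something like this in an elevated PowerShell session on one of the DAG members. The destination folder is just an example path I have made up; use whatever location suits you.

```powershell
# Run in PowerShell started as an administrator on a cluster (DAG) node.
Import-Module FailoverClusters

# Hypothetical destination folder; create it if it does not exist yet.
$destination = 'C:\Temp\ClusterLogs'
New-Item -ItemType Directory -Path $destination -Force | Out-Null

# Generate the cluster log for each node and copy the files to the destination.
Get-ClusterLog -Destination $destination

# Optionally raise the log size (in MB) so the next problem is still in the log.
# Set-ClusterLog -Size 1024
```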
Oh, and how do I know that the DAG networks aren’t collapsed? Well, first of all I can see there is a problem replicating across the nominated replication network:
000018bc.0000199c::2014/02/13-11:53:54.854 INFO [IM] got event: Remote endpoint xxx.xxx.141.xxx:~3343~ unreachable from xxx.xxx.41.xxx:~3343~
000018bc.0000199c::2014/02/13-11:53:54.854 INFO [IM] Marking Route from xxx.xxx.41.xxx:~3343~ to xxx.xxx.141.xxx:~3343~ as down
The cluster then checks whether all possible paths are down:
000018bc.0000199c::2014/02/13-11:53:54.854 INFO [NDP] Checking to see if all routes for route (virtual) local fe80::b8ac:d730:1392:4e4d:~0~ to remote fe80::698d:34a4:a5c9:2e77:~0~ are down
It is thrilled to see it can get there along another network:
000018bc.0000199c::2014/02/13-11:53:54.854 INFO [NDP] Route local xxx.xxx.201.xxx:~3343~ to remote xxx.xxx.202.xxx:~3343~ is up
If we run Get-DatabaseAvailabilityGroupNetwork then we can see there are six networks for this DAG, which is four too many. The six networks are two MAPI networks (one for each subnet, one subnet per physical AD site), which need collapsing into one; two replication networks, which also need collapsing into one; and two backup networks, which need to be excluded from the DAG altogether. For more on sorting your DAG networks out, please see this article from Tim McMichael.
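As a rough illustration of what collapsing and excluding networks can look like from the Exchange Management Shell, here is a sketch. Everything specific in it is an assumption on my part: the DAG name, the network names and the subnets are placeholders (the subnets are documentation ranges, not the ones from the log above), and it assumes DAG networks can be edited manually as in Exchange 2010. The -WhatIf switches mean nothing is changed until you remove them.

```powershell
# Run from the Exchange Management Shell. All names and subnets below are placeholders.
$dag = 'DAG1'

# List the DAG networks with the properties that matter here.
Get-DatabaseAvailabilityGroupNetwork -Identity $dag |
    Format-List Identity, Subnets, ReplicationEnabled, IgnoreNetwork

# Collapse two per-site networks into one by assigning both subnets to a single DAG network.
Set-DatabaseAvailabilityGroupNetwork -Identity "$dag\DAGNetwork01" `
    -Subnets '192.0.2.0/24', '198.51.100.0/24' -WhatIf

# Exclude a backup network from the DAG's use altogether.
Set-DatabaseAvailabilityGroupNetwork -Identity "$dag\DAGNetwork05" `
    -IgnoreNetwork:$true -WhatIf
```

Once a network’s subnets have been moved to another network, the leftover empty network can be removed with Remove-DatabaseAvailabilityGroupNetwork. Tim McMichael’s article remains the place to go for the proper procedure.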