
Your cluster log is TINY, and the source of much amusement.

I quite often get calls logged asking for help understanding why the active copy of a DAG database moves from one server to another. There can be a number of reasons for this, not all of them particularly well recorded in the event logs – a favourite is the DAG networks not being collapsed when they span sites, and therefore different subnets, but that’s not what I wanted to write about.

Quite often, the best way to understand what happened is to go through the failover cluster log – if you’ve not looked at this log before, I urge you to try it, particularly if you suffer from insomnia. In Windows Server 2008 R2 you can generate it by running Get-ClusterLog -Destination <location> in PowerShell.
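For example, here’s a minimal sketch of pulling just the last few hours of log (the -TimeSpan value is in minutes, and C:\Temp is a placeholder – point it wherever you like):

Get-ClusterLog -Destination C:\Temp -TimeSpan 240

That should drop a cluster.log file for each node into the destination folder, which beats collecting them off every node by hand.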

A normal cluster log would look something like this:

000016c0.0000162c::2014/03/12-12:15:15.892 INFO  [GUM] Node 2: Processing RequestLock 4:689542
000016c0.00003dcc::2014/03/12-12:15:15.892 INFO  [GUM] Node 2: Processing GrantLock to 4 (sent by 1 gumid: 6354235)
000016c0.000015b4::2014/03/12-12:15:23.192 INFO  [GUM] Node 2: Processing RequestLock 2:144215
000016c0.0000162c::2014/03/12-12:15:23.192 INFO  [GUM] Node 2: Processing GrantLock to 2 (sent by 4 gumid: 6354236)

That’s a couple of events every few seconds. At this rate of generation, the default log size of 100MB is usually enough for about 24 hours’ worth of events. But say you have a problem (like DAG networks not being collapsed correctly, as below*)? Then your log may look more like this:

000018bc.00001998::2014/02/13-11:53:54.854 DBG   [NETFTAPI] Signaled NetftRemoteUnreachable  event, local address xxx.xxx.41.xxx:003853 remote address xxx.xxx.141.xxx:003853
000018bc.0000199c::2014/02/13-11:53:54.854 INFO  [IM] got event: Remote endpoint xxx.xxx.141.xxx:~3343~ unreachable from xxx.xxx.41.xxx:~3343~
000018bc.0000199c::2014/02/13-11:53:54.854 INFO  [IM] Marking Route from xxx.xxx.41.xxx:~3343~ to xxx.xxx.141.xxx:~3343~ as down
000018bc.0000199c::2014/02/13-11:53:54.854 INFO  [NDP] Checking to see if all routes for route (virtual) local fe80::b8ac:d730:1392:4e4d:~0~ to remote fe80::698d:34a4:a5c9:2e77:~0~ are down
000018bc.0000199c::2014/02/13-11:53:54.854 INFO  [NDP] Route local xxx.xxx.201.xxx:~3343~ to remote xxx.xxx.202.xxx:~3343~ is up
000018bc.0000199c::2014/02/13-11:53:54.854 INFO  [IM] Adding information for route Route from local xxx.xxx.41.xxx:~0~ to remote xxx.xxx.141.xxx:~0~, status: true, attributes: 0
000018bc.0000199c::2014/02/13-11:53:54.854 INFO  [IM] Adding information for route Route from local xxx.xxx.41.xxx:~0~ to remote xxx.xxx.141.xxx:~0~, status: false, attributes: 0
000018bc.0000199c::2014/02/13-11:53:54.854 INFO  [IM] Sending connectivity report to leader (node 2): <class mscs::InterfaceReport>
000018bc.0000199c::2014/02/13-11:53:54.854 INFO    <fromInterface>d8430531-25e6-4749-8b1d-2bf5f06da430</fromInterface>
000018bc.0000199c::2014/02/13-11:53:54.854 INFO    <upInterfaces><vector len='2'>
000018bc.0000199c::2014/02/13-11:53:54.854 INFO      <item>d8430531-25e6-4749-8b1d-2bf5f06da430</item>
000018bc.0000199c::2014/02/13-11:53:54.854 INFO      <item>62a2fefa-9b12-436d-a270-fec45ee86d23</item>
000018bc.0000199c::2014/02/13-11:53:54.854 INFO  </vector>
000018bc.0000199c::2014/02/13-11:53:54.854 INFO  </upInterfaces>
000018bc.0000199c::2014/02/13-11:53:54.854 INFO    <downInterfaces><vector len='1'>
000018bc.0000199c::2014/02/13-11:53:54.854 INFO      <item>c16aa803-1446-41d0-8b1f-338a6093ec37</item>
000018bc.0000199c::2014/02/13-11:53:54.854 INFO  </vector>
000018bc.0000199c::2014/02/13-11:53:54.854 INFO  </downInterfaces>

As you can see, the rate of entry generation has increased dramatically. In this particular example the default log size of 100MB covers approximately fifteen MINUTES. It would be a good idea, then, to increase the cluster log size from the default. 400MB is quoted in some of the literature, although not particularly strongly. The best article on the subject suggests keeping 72 hours of log data; in my experience, however, even the maximum log size of 1GB can sometimes hold only 12 hours. That article also contains instructions for setting the cluster log size in Windows 2008. For 2008 R2, use Set-ClusterLog -Size 1024.
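A minimal sketch of checking the current size and raising it to the maximum (the size is specified in MB):

(Get-Cluster).ClusterLogSize
Set-ClusterLog -Size 1024

At 1GB you at least stand a chance of the interesting fifteen minutes still being in the log by the time you come to read it.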

But Nick, I can’t run Get-ClusterLog?

You need to import the failover clustering module:

Start PowerShell as an administrator

Run Import-Module FailoverClusters

And Bob’s your uncle.
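Putting it all together, from that elevated PowerShell prompt (destination path is a placeholder again):

Import-Module FailoverClusters
Get-ClusterLog -Destination C:\Temp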

Oh, and how do I know that the DAG networks aren’t collapsed? Well, first of all I can see there is a problem replicating across the nominated replication network:

000018bc.0000199c::2014/02/13-11:53:54.854 INFO  [IM] got event: Remote endpoint xxx.xxx.141.xxx:~3343~ unreachable from xxx.xxx.41.xxx:~3343~
000018bc.0000199c::2014/02/13-11:53:54.854 INFO  [IM] Marking Route from xxx.xxx.41.xxx:~3343~ to xxx.xxx.141.xxx:~3343~ as down

The cluster then checks that all possible paths are down:

000018bc.0000199c::2014/02/13-11:53:54.854 INFO  [NDP] Checking to see if all routes for route (virtual) local fe80::b8ac:d730:1392:4e4d:~0~ to remote fe80::698d:34a4:a5c9:2e77:~0~ are down

It is thrilled to see it can get there along another network:

000018bc.0000199c::2014/02/13-11:53:54.854 INFO  [NDP] Route local xxx.xxx.201.xxx:~3343~ to remote xxx.xxx.202.xxx:~3343~ is up

If we run Get-DatabaseAvailabilityGroupNetwork then we can see there are six networks for this DAG, which is four too many. The six networks are: two MAPI networks (one for each subnet, one subnet per physical AD site), which need collapsing; two replication networks, which also need collapsing; and two backup networks, which need to be excluded from the DAG altogether. For more on sorting your DAG networks out, please see this article from Tim McMichael.
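The tidy-up looks something like this – a sketch only, with invented network names and subnets standing in for the real ones:

Set-DatabaseAvailabilityGroupNetwork -Identity DAG1\MAPI -Subnets 10.0.41.0/24,10.0.141.0/24 -ReplicationEnabled:$false
Set-DatabaseAvailabilityGroupNetwork -Identity DAG1\Replication -Subnets 10.0.201.0/24,10.0.202.0/24
Set-DatabaseAvailabilityGroupNetwork -Identity DAG1\Backup -IgnoreNetwork:$true
Remove-DatabaseAvailabilityGroupNetwork -Identity DAG1\DAGNetwork05

Giving one DAG network both sites’ subnets is what “collapsing” amounts to; the -IgnoreNetwork switch keeps the backup networks out of the DAG’s way, and the now-empty leftover networks can then be removed.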

Failing databases, sulking network manager

Interesting call here. After a hardware firewall change and a reboot, my customer’s DAG had a database copy in a failed state. The setup is a two-node DAG across two sites, with an FSW and an Alternate FSW preconfigured. It’s also IL3. If you don’t know what IL3 is, please stop reading this article. You don’t have clearance. Look out the window. See the guy with the dark glasses watching you? No? THAT’S how good we are.

So, just change the firewall back, dummy. Big deal. Sheesh. Except it didn’t fix the problem. Interestingly, this is the first time that cluster failover has been tested… the DAG itself has been tested a number of times.

So… he’s got one database copy mounted, one failed.
We ran Get-MailboxDatabaseCopyStatus and saw this error: “replication server encountered transient network error. Network manager not yet initialised”.
It’s been in this state for a while now, and through multiple reboots. Sitting watching it won’t help. It’s not really transient.
Oh right, so the FSW needs rebooting, right? I’ve seen this before… (http://port25guy.com/2012/12/10/witness-server-boot-time-getdagnetworkconfig-and-the-pain-of-exchange-2010-dr-tests/). No. The boottime cookies are the correct way around.
So we start checking things. The IP addresses show as down in the DAG (I can’t show you theirs – IL3, remember?).
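If you want to keep an eye on the copies while you dig, a quick sketch (server name borrowed from the lab event below):

Get-MailboxDatabaseCopyStatus -Server LABMBX-1 | Format-Table Name,Status,CopyQueueLength,ReplayQueueLength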

The cluster node shows as “down” in Failover Cluster Manager.
So, let’s see what happens when we try to start the node. Lots of errors in the event log (which I can’t see… IL3…), but one sticks out like a sore thumb – event ID 4123:

Log Name: Application
Source: MSExchangeRepl
Date: 2/26/2012 11:12:08 AM
Event ID: 4123
Task Category: Service
Level: Error
Keywords: Classic
User: N/A
Computer: LABMBX-1.exlab.mydomain.com
Description:
Failed to get the boot time of witness server ‘labcas-1.exlab.mydomain.com’. Error: The remote procedure call failed. (Exception from HRESULT: 0x800706BE)

There’s a great big clue right there: “the remote procedure call failed”. For some reason the endpoint mapper on the FSW isn’t responding. This is a resource domain which contains just a DC, the two Exchange boxes and a vCenter manager. (I did mention the VMware, yes?) And what is the FSW machine? Well, it’s the vCenter console machine in the domain.

And there is the problem.

When you install Exchange on a box, it adds a security group to the local admins group, and makes changes to the Windows Firewall (http://marksmith.netrends.com/Lists/Posts/Post.aspx?ID=83). When you put the FSW on a NON-Exchange box, you need to add the Exchange Trusted Subsystem group to the local admins manually – you’ve not installed Exchange, so setup won’t do it for you. It’s documented here: http://technet.microsoft.com/en-us/library/dd351172.aspx

If the witness server you specify isn’t an Exchange 2013 or Exchange 2010 server, you must add the Exchange Trusted Subsystem universal security group to the local Administrators group on the witness server. These security permissions are necessary to ensure that Exchange can create a directory and share on the witness server as needed. If the proper permissions aren’t configured, the following error is returned:
Error: An error occurred during discovery of the database availability group topology. Error: An error occurred while attempting a cluster operation. Error: Cluster API “AddClusterNode() (MaxPercentage=12) failed with 0x80070005. Error: Access is denied.”
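If you need to do that bit by hand, one line on the witness server itself will do it (the domain name here is a placeholder):

net localgroup Administrators "EXLAB\Exchange Trusted Subsystem" /add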

What it doesn’t say, but assumes, is that RPC will work. Why does it need RPC? It’s just a file share, yes? It doesn’t say anything about RPC here: http://technet.microsoft.com/en-us/library/bb331973.aspx

• The Clustering data path listed in the preceding table uses dynamic RPC over TCP to communicate cluster status and activity between the different cluster nodes. The Cluster service (ClusSvc.exe) also uses UDP/3343 and randomly allocated, high TCP ports to communicate between cluster nodes.
• For intra-node communications, cluster nodes communicate over User Datagram Protocol (UDP) port 3343. Each node in the cluster periodically exchanges sequenced, unicast UDP datagrams with every other node in the cluster. The purpose of this exchange is to determine whether all nodes are running correctly, and also to monitor the health of network links.
• Port 64327/TCP is the default port used for log shipping. Administrators can specify a different port for log shipping.
• For HTTP authentication in which Negotiate is listed, Kerberos is tried first, and then NTLM.

Well it does, but for the nodes, not the FSW. However, when the single remaining node checks that it has quorum, it needs to compare the current boot time of the FSW against the time stored in the boottime cookie. How does it get the current boot time? Remote registry, I reckoned – actually WMI, as it turns out (see the edit below) – and either way, that rides on RPC.
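You can reproduce the node’s check for yourself – a hedged sketch, reading the witness boot time over WMI (server name from the lab event above):

$os = Get-WmiObject Win32_OperatingSystem -ComputerName labcas-1.exlab.mydomain.com
[Management.ManagementDateTimeConverter]::ToDateTime($os.LastBootUpTime)

If that fails with an RPC error when run from the node, you’ve found your 4123.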

So… open the Windows Firewall for RPC, reboot the FSW and… bingo. Everything up, sweet as a nut.
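For reference, the quickest route I know is the built-in Remote Administration firewall rule group on the FSW, which covers the RPC endpoint mapper and the dynamic RPC ports – a sketch, so check it against your own lockdown policy first:

netsh advfirewall firewall set rule group="remote administration" new enable=yes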

We ran the cluster validator (http://technet.microsoft.com/en-us/library/bb676379(v=exchg.80).aspx) and Paul Cunningham’s DAG healthcheck script (http://exchangeserverpro.com/get-daghealth-ps1-database-availability-group-health-check-script/) and everything came back clean.
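If you want to repeat those checks, something along these lines should do it (the script switch is from memory, so treat this as a sketch):

Test-Cluster -Node LABMBX-1,LABMBX-2
.\Get-DAGHealth.ps1 -Detailed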

The moral of this story? Stop being clever.

A great takeaway for everyone is this:

unlike earlier versions of Microsoft Exchange where IT administrators had to perform multiple procedures to lock down their servers that were running Microsoft Exchange, Exchange 2010 requires no lock-down or hardening

From the Exchange 2010 Security Guide, here: http://technet.microsoft.com/en-us/library/bb691338(v=exchg.141).aspx


Edit: if you look at Scott Schnoll’s wonderful high availability deep dive, here, then you will find that the node gets the FSW boot time using WMI, not the remote registry.