Recovery Storage Groups are Making Your Life Hell

An interesting call this week: a high severity issue with a CCR cluster with geographically separated nodes. The customer was following http://technet.microsoft.com/en-us/library/bb676320%28EXCHG.80%29.aspx, which is the TechNet article on how to patch a CCR to SP1 or SP2 – it’s valid for SP3 as well, but MS haven’t updated the article to reflect this. The customer had got to step 9, but things then went wrong trying to move the cluster from the active node to the passive node (top tip: “active” and “passive” refer to the state of the nodes; don’t use those words to name your nodes, or we will fall out). The move would fail, and then fail to move back to the original node as well, leaving the cluster in a down state. The quick fix to restore service was to shut both nodes off and power up the SP2 machine. Once the SP2 machine was running, the SP3 machine was turned on. At this point we were called.

First things first was to get a worst case action plan sorted. SP3 cannot be uninstalled (http://technet.microsoft.com/en-us/library/ff607233%28EXCHG.80%29.aspx), so basically they would need to uninstall Exchange, evict the node from the cluster, reinstall Exchange and recluster. Henrik Walther has documented this process perfectly in his blog: http://www.msexchange.org/articles-tutorials/exchange-server-2007/high-availability-recovery/re-installing-cluster-nodes-exchange-2007-ccr-based-mailbox-server-setup-part1.html (the second part is linked from that page). You can do this with pretty much 100% availability, bar the failover itself, and reseeding the database copy can be done online.
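
For reference, reseeding a CCR copy from the Exchange Management Shell looks roughly like this – a minimal sketch, run on the passive node, with “EXCMS01” and “SG1” standing in for your own clustered mailbox server and storage group names:

Suspend-StorageGroupCopy -Identity "EXCMS01\SG1"
Update-StorageGroupCopy -Identity "EXCMS01\SG1" -DeleteExistingFiles
# Seeding streams the database from the active copy while it stays mounted;
# if replication doesn't resume on its own afterwards, Resume-StorageGroupCopy brings it back.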


Once the customer was happy that we had a backout plan, we collected some basic troubleshooting evidence: a BPA scan in healthcheck mode, plus MPS reports from both nodes with the cluster and Exchange options ticked.

With the collection under way, we started to look at the state of the cluster.

The “Clustered Mailbox Server” tab in the properties of each node showed everything OK – both nodes were listed on each machine, and the correct node was listed as operational.

Get-StorageGroupCopyStatus showed all storage groups as healthy, with copy and replay queue lengths of 0 and a timely last inspection timestamp. All storage groups except the recovery storage group, that is – continuous replication isn’t supported for RSGs, so no copy status there.
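
For anyone following along, that check is just this (server name is a placeholder):

Get-StorageGroupCopyStatus -Server EXCMS01 | Format-Table Name,SummaryCopyStatus,CopyQueueLength,ReplayQueueLength,LastInspectedLogTime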

Get-ClusteredMailboxServerStatus (http://technet.microsoft.com/en-us/library/aa998632%28EXCHG.80%29.aspx) and
Test-ReplicationHealth (http://technet.microsoft.com/en-us/library/bb691314%28EXCHG.80%29.aspx) likewise showed everything cool.
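
Both are quick to run from the shell (CMS name is a placeholder):

Get-ClusteredMailboxServerStatus -Identity EXCMS01
Test-ReplicationHealth
# Run Test-ReplicationHealth locally on each node in turn.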

In the system event logs I could see event ID 1069, source ClusSvc:
Cluster resource ‘rsg1db1/sg4db1 (<servername>)’ in Resource Group ‘<servername>’ failed.
More details here:
http://www.microsoft.com/technet/support/ee/transform.aspx?ProdName=Windows%20Operating%20System&ProdVer=5.2&EvtID=1069&EvtSrc=ClusSvc&LCID=1033
This implicates in the problem the recovery storage group they shouldn’t have been running in the first place. The recovery storage group that can’t take part in a clustered environment, yet is sitting there as a resource in the cluster. Hmmm.
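
If you want to see what your cluster thinks it is managing, cluster.exe will list every resource along with the group it lives in, so a rogue RSG database resource stands out straight away:

cluster res
# Look for a database resource named after the RSG sitting in the clustered
# mailbox server's resource group, like the one in the 1069 event above.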

A little digging got me to this:

Databases in an RSG cannot be set to mount automatically when the Exchange Information Store service is started. You must always start the databases manually. If mounted at the time of a cluster failover, databases will not mount automatically after failover is completed.
From:
http://technet.microsoft.com/en-us/library/bb124039%28EXCHG.80%29.aspx
The implication being that if the RSG being online is a dependency of the cluster group, then failover will not complete successfully in either direction.
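
For the record, mounting an RSG database is always a manual job, something along these lines (server, RSG and database names are hypothetical):

Mount-Database -Identity "EXCMS01\Recovery Storage Group\RSGDB1"
# RSG databases never mount themselves on store start or after a failover.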


Now, the literature all states that an RSG won’t mount automatically after a failover, but it doesn’t say that it will prevent failover. However, because it had been set up as a cluster resource (as shown in the 1069 error above), in this case it caused the failover to crash out when the RSG didn’t come online.


The agreed plan was that the customer would remove the RSG, as per the best practice article here: http://technet.microsoft.com/en-gb/library/aa995895%28EXCHG.80%29.aspx
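
In shell terms that boils down to something like the sketch below (placeholder names; the article itself is the authoritative version):

Dismount-Database -Identity "EXCMS01\Recovery Storage Group\RSGDB1"
Remove-MailboxDatabase -Identity "EXCMS01\Recovery Storage Group\RSGDB1"
Remove-StorageGroup -Identity "EXCMS01\Recovery Storage Group"
# The .edb and log files are left on disk afterwards and need tidying up by hand.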

Once this was complete, they would rerun the prerequisite tests, repatch the SP3 server to make sure there was no issue there, and fail the cluster over as per step 9 of the upgrade document. With no RSG to bugger things up, this went great. They successfully patched their now passive SP2 node the following day, and away they went…
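
For completeness, the step 9 hand-off is just a scheduled outage move from the shell, roughly (names are placeholders):

Move-ClusteredMailboxServer -Identity EXCMS01 -TargetMachine NODE2 -MoveComment "SP3 upgrade"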


So to summarise, recovery storage groups are making your life hell. If you’re not using them, get rid of them. Don’t have them hanging about on your box. Especially not if it’s a cluster.
