Tag Archives: troubleshooting

Whatever Happened to This Likely Lad?

So in August 2013, Jeff Dailey, the Director of Diagnostics at Microsoft Support, was talking about Microsoft Fix-It Center Pro on Channel 9. He was really excited about it too, and who can blame him? It was exciting stuff. Three months later, it was dead. I don’t know why. Like white dog poo, it’s a mystery. However, the automated diagnostic packages remain, and are still being added to.

What’s the point?

If you’ve never used them, you’re really missing out. You may recall the old MPS report of some time ago, which gathered a huge amount of evidence from a machine and then just dumped it into a cab file. This was great if you knew what to do with the data, but otherwise it was just 40MB of confusion. Some of it was obviously useful, like the event logs and cluster logs, some of it less so – do I care about symbols? No, not really. I’m shallow like that. There was the MPS report parser tool for the early implementations, which basically trawled through all the text files looking for the word “FAIL” (a trick you can reproduce yourself – see the sketch below), then later the MPS Report Viewer, or you could use the old manual method. But by and large, to get anything useful out of them, you had to have a pretty good idea what you were doing. Not any more. No longer do you have to know the meaning of:

00001df3.000016bd::2012/10/07-10:40:20.271 INFO  [GUM] Node 3: Processing GrantLock to 3 (sent by 2 gumid: 1249)


I exaggerate. Probably best if you know a little. These tools will sit on your poorly machine, run for a few minutes (maybe half an hour) and then tell you in short clear phrases *what is wrong*. Mostly. And if they don’t, hell, you’ve got all the evidence collected.
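Incidentally, reproducing that “search everything for FAIL” trick by hand is a one-liner these days. A sketch, assuming you’ve already extracted an MPS report cab into a folder (the folder name here is made up):

# Hunt for the literal word FAIL in every text and log file in the report
Get-ChildItem -Path .\MPSReport -Recurse -Include *.txt,*.log |
    Select-String -Pattern "FAIL" -CaseSensitive -SimpleMatch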


How to use them properly.

Go to the Support Diagnostics website. It still retains the ghost of Fix-It Center Pro in the URL (“ficp”). You will need to log in with your Live ID (a Hotmail account, in other words).

Select the relevant package from the large number available – I lost count at fifty. I’m rather fond of the “Exchange Server 2007 and Exchange Server 2010 Diagnostic”, but I’ll use the Windows Performance Diagnostic for this post. Click on the diagnostic tool and give it a name.


You’ll need to download it and follow the instructions:


You may see references to “Microsoft Automated Troubleshooting Services” and “Microsoft Fix-it”. You’ll probably want to be an administrator to run it as well.

Follow the steps as requested by the tool. It will ask if you want to install on the local machine or another machine – in the latter case it will create a portable diagnostic utility (more on this later). After a couple of minutes it will suggest you run a diagnostic tool to collect information from your computer. Do so. You can then wait up to an hour for the utility to do its thang.


Once it has finished, you’ll get the chance to check through the files it has collected and, if necessary, stop them being uploaded to Microsoft.

Once you click “next” it will compress the files, and then give you a chance to save a copy of the cab file before uploading it to Microsoft.

Click “send”, and then sit back and wait.

If you go back to the Support Diagnostics page and click on the “recent sessions” tab, after a while (five minutes or so) you will see that your upload has been received.

But not yet analysed. This usually takes a couple of hours, but keep checking back and you’ll eventually see “completed”.

Click on the link and see what the problem was.

If you’re stuck, it gives you the option to “Get Assisted Support”. This will possibly (probably?) cost you money.

If you open the cab file you saved earlier (you did save it, didn’t you?) then you will see a whole heap of files. Some of them are clearly recognisable, some of them less so. The file you are after is called “resultReport.xml” – open this up in Internet Explorer, and bask in its troubleshooting goodness.
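If you’d rather do that from a console, something like this works (a sketch – “DiagnosticResults.cab” stands in for whatever you named your saved cab):

# Extract the cab and open the result report (expand needs the destination folder to exist)
mkdir .\Extracted
expand .\DiagnosticResults.cab -F:* .\Extracted
start .\Extracted\resultReport.xml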

Look at the things it is checking for! Networked PST files. Dodgy versions of SEP (SEP 1 to SEP n, basically). Fantastic.

Click on the links for the issues that were found.


And better yet, here’s where you get to make sense of the files it collected. Scroll down and expand “detection details”.

And then below that, there are links to all the evidence files you gathered.


But if you want, you don’t need to upload them at all:

Go to https://wc.ficp.support.microsoft.com/SelfHelp?knowledgebaseArticleFilter=

Open the link for the directed report generator.

Click “run”.

Choose “save this file”.

Click “run” again.

Accept the agreement.

Select “a different computer” and tick “this machine has PowerShell in it”, if applicable.

Read the instructions and follow steps 1-3. Do not follow step 4 yet.

Save the tool to a local disk on the machine to be investigated, and run it (preferably as an administrator).


Accept the license agreement, and a screen will appear briefly, then disappear. Nothing will happen for 15 seconds or so.

You will then be asked to run the tool.

Click “start”. The tool can take 10-15 minutes to run in some instances.

When it finishes, you’ll see a completion screen.

Click “next”, and select a location to save the file. This can be a network drive.

When it finishes creating the cab file, you will see a confirmation screen.

Click “close” and browse to the location where you saved the evidence. Extract the cab file and enjoy resultReport.xml. I know I will.

I hope this is useful to you. I love these tools, and think they’re much ignored – outside of Microsoft, anyway…


Failing databases, sulking network manager

Interesting call here. After a hardware firewall change and a reboot, my customer’s DAG had a database copy in a failed state. The setup is a two-node DAG across two sites, with an FSW and an Alternate FSW preconfigured. It’s also IL3. If you don’t know what IL3 is, please stop reading this article. You don’t have clearance. Look out the window. See the guy with the dark glasses watching you? No? THAT’S how good we are.

So, just change the firewall back, dummy. Big deal. Sheesh. Except it didn’t fix the problem. Interestingly, this is the first time that cluster failover has been tested… the DAG has been tested a number of times.

So… he’s got one database copy mounted, one failed.
We ran Get-MailboxDatabaseCopyStatus and saw this error: “replication server encountered transient network error. Network manager not yet initialised”.
It’s been in this state for a while now, through multiple reboots. Sitting watching it won’t help. It’s not really transient.
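For reference, this is roughly how to pull the copy status and last error text yourself (a sketch – the database and server names are hypothetical):

# Show the health and most recent error for a single database copy
Get-MailboxDatabaseCopyStatus -Identity "DB01\MBX2" |
    Format-List Name,Status,CopyQueueLength,ReplayQueueLength,ErrorMessage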
Oh right, so the FSW needs rebooting, right? I’ve seen this before… (http://port25guy.com/2012/12/10/witness-server-boot-time-getdagnetworkconfig-and-the-pain-of-exchange-2010-dr-tests/). No. The boottime cookies are the correct way around.
So we start checking things. The IP addresses show as down in the DAG. This picture is not their DAG. IL3, remember?

The cluster node shows as “down” in the failover cluster manager.
So, let’s see what happens when we try to start the node. Lots of errors in the event log (which I can’t see… IL3…), but one sticks out like a sore thumb – Event ID 4123:

Log Name: Application
Source: MSExchangeRepl
Date: 2/26/2012 11:12:08 AM
Event ID: 4123
Task Category: Service
Level: Error
Keywords: Classic
User: N/A
Computer: LABMBX-1.exlab.mydomain.com
Description:
Failed to get the boot time of witness server ‘labcas-1.exlab.mydomain.com’. Error: The remote procedure call failed. (Exception from HRESULT: 0x800706BE)

There’s a great big clue right there: “The remote procedure call failed”. For some reason the endpoint mapper on the FSW isn’t responding. This is a resource domain which just contains a DC, the two Exchange boxes and a vCenter server. (I did mention the VMware, yes?) What is the FSW machine? Well, it’s the vCenter console machine in the domain.
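You can check that from the surviving node by probing TCP 135 on the FSW – Microsoft’s PortQry tool does the job (a sketch, reusing the lab server name from the event above):

# Query the RPC endpoint mapper; a healthy one answers with its endpoint list
portqry -n labcas-1.exlab.mydomain.com -e 135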

And there is the problem.

When you install Exchange on a box, it adds a security group to the local admins group, and makes changes to the Windows firewall (http://marksmith.netrends.com/Lists/Posts/Post.aspx?ID=83). When you put the FSW on a NON-Exchange box, you need to add the Exchange Trusted Subsystem group to the local admins manually – you’ve not installed Exchange, so setup won’t do it for you (there’s a one-liner for this after the quote below). It’s documented here: http://technet.microsoft.com/en-us/library/dd351172.aspx

If the witness server you specify isn’t an Exchange 2013 or Exchange 2010 server, you must add the Exchange Trusted Subsystem universal security group to the local Administrators group on the witness server. These security permissions are necessary to ensure that Exchange can create a directory and share on the witness server as needed. If the proper permissions aren’t configured, the following error is returned:
Error: An error occurred during discovery of the database availability group topology. Error: An error occurred while attempting a cluster operation. Error: Cluster API “AddClusterNode() (MaxPercentage=12) failed with 0x80070005. Error: Access is denied.”
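The fix itself is quick – something like this on the witness server (a sketch; “EXLAB” stands in for your domain’s NetBIOS name):

# Add the Exchange Trusted Subsystem group to the local Administrators group
net localgroup Administrators "EXLAB\Exchange Trusted Subsystem" /add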

What it doesn’t say, but assumes, is that RPC will work. Why does it need RPC? It’s just a fileshare, yes? It doesn’t say anything about RPC here: http://technet.microsoft.com/en-us/library/bb331973.aspx

• The Clustering data path listed in the preceding table uses dynamic RPC over TCP to communicate cluster status and activity between the different cluster nodes. The Cluster service (ClusSvc.exe) also uses UDP/3343 and randomly allocated, high TCP ports to communicate between cluster nodes.
• For intra-node communications, cluster nodes communicate over User Datagram Protocol (UDP) port 3343. Each node in the cluster periodically exchanges sequenced, unicast UDP datagrams with every other node in the cluster. The purpose of this exchange is to determine whether all nodes are running correctly, and also to monitor the health of network links.
• Port 64327/TCP is the default port used for log shipping. Administrators can specify a different port for log shipping.
• For HTTP authentication in which Negotiate is listed, Kerberos is tried first, and then NTLM.

Well it does, but for nodes, not the FSW. However, when the single remaining node checks that it has quorum, it needs to compare the current boot time of the FSW against the time stored in the boottime cookie. How does it get the current boot time? Remote registry, I reckoned – though it turns out to be WMI (see the edit at the end) – which requires RPC either way.
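You can run the same sort of query by hand to see what the node sees – a sketch, and note that Get-WmiObject talks DCOM/RPC, which is exactly what was being blocked:

# Ask the FSW for its last boot time over WMI (TCP 135 plus a dynamic RPC port)
$os = Get-WmiObject -Class Win32_OperatingSystem -ComputerName labcas-1.exlab.mydomain.com
[Management.ManagementDateTimeConverter]::ToDateTime($os.LastBootUpTime)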

So… open the Windows firewall for RPC, reboot the FSW and… bingo. Everything up, sweet as a nut.
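On the FSW itself, enabling the built-in WMI rule group is probably the tidiest way to do that (a sketch – this is the stock rule group name on Server 2008 R2 and later):

# Allow inbound WMI, which brings the RPC endpoint mapper and dynamic ports with it
netsh advfirewall firewall set rule group="windows management instrumentation (wmi)" new enable=yes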

We ran the cluster validator (http://technet.microsoft.com/en-us/library/bb676379(v=exchg.80).aspx) and Paul Cunningham’s DAG health check script (http://exchangeserverpro.com/get-daghealth-ps1-database-availability-group-health-check-script/) and everything came back clean.

The moral of this story? Stop being clever.

A great takeaway for everyone is this:

Unlike earlier versions of Microsoft Exchange, where IT administrators had to perform multiple procedures to lock down their servers that were running Microsoft Exchange, Exchange 2010 requires no lock-down or hardening.

From the Exchange 2010 Security Guide, here: http://technet.microsoft.com/en-us/library/bb691338(v=exchg.141).aspx


Edit: if you look at Scott Schnoll’s wonderful high availability deep dive, here, then you will find that the node gets the FSW boot time using WMI, not remote registry.