Monthly Archives: May 2013

Load balancing algorithms and Exchange 2010

Last week one of our customers had a service outage brought on by their load balancer. It wasn’t misbehaving; it was doing exactly what it was supposed to do, and in doing so it made sure that no users could connect to their mailboxes. Good work, load balancer. As part of the mop-up, I was asked what Microsoft recommend with regard to load balancers.

First – the problem. One of the Client Access servers (CAS) that the load balancer was ummm… balancing became unhappy – not so unhappy that it couldn’t respond to requests, but unhappy enough that it couldn’t service them, so users would disconnect. Unfortunately, this meant that as far as the load balancer was concerned, the server was available – it was still responding, right?

The load balancer looked at its farm, and each new request that came in got sent to the server with the least load. Load *balancer*, see. The clue is in the name. Guess which server has the least load: the unhappy one, because it keeps dropping its users. So every new and reconnecting request lands on it, and in fairly short order, everyone is disconnected.

So, how do you get around this problem? Well, it depends on the load balancer and how clever it is, but without relying on cleverness, the answer is to use round-robin load balancing. Is it perfect? No. Is it better than having all your users disconnected? Yes.
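
The difference between the two algorithms shows up in a toy simulation. This is a sketch under loud assumptions, not how any particular load balancer works: three hypothetical servers (`cas1` to `cas3`), one of which (`cas2`) accepts every connection and then drops it, so its open-connection count never rises above zero. The function name and server names are made up for illustration.

```python
from itertools import cycle


def simulate(policy, servers=("cas1", "cas2", "cas3"), broken="cas2", requests=300):
    """Return how many requests each (hypothetical) server was sent."""
    active = {s: 0 for s in servers}    # connections currently open per server
    received = {s: 0 for s in servers}  # total requests routed to each server
    rotation = cycle(servers)

    for _ in range(requests):
        if policy == "least_connections":
            # Pick the server with the fewest open connections.
            target = min(servers, key=lambda s: active[s])
        else:
            # Round robin: take the next server in turn, regardless of load.
            target = next(rotation)

        received[target] += 1
        if target != broken:
            active[target] += 1  # healthy server: the connection stays open
        # Broken server: it accepts the connection and then drops it,
        # so its open-connection count stays at zero.

    return received


print(simulate("least_connections"))  # {'cas1': 1, 'cas2': 299, 'cas3': 0}
print(simulate("round_robin"))        # {'cas1': 100, 'cas2': 100, 'cas3': 100}
```

With least connections the broken server soaks up 299 of the 300 requests. With round robin it only gets its third, so two-thirds of the requests still land on a server that can do something with them. Not perfect, but a lot better.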

 

And what do Microsoft do? Well, in a way, it doesn’t matter what MS do – their setup and budget are probably very different to yours – but they have chosen to go with round robin, as detailed in this presentation from Andrew Ehrensing, given at TechEd Australia in 2011:

http://channel9.msdn.com/Events/TechEd/Australia/Tech-Ed-Australia-2011/EXL304 (slide 32)

The slide notes say this:

“Outlook.com / MSIT learnings around Round Robin

1.) When using least connections, 3 node pool.   When server goes down for maintenance and comes back online, gets POUNDED with new connections

2.) When a server “misbehaves” it may be “healthy by the LB healthcheck” but not processing new inbound connections, but all new connections keep directed there causing an outage”

 

I’d certainly not suggest you ignore the load balancer manufacturer’s recommendations, either. But be aware that a problem that may appear a little esoteric can actually occur in the real world.

 

So now you know. As an aside, this customer has experienced two outages in recent months, both caused by things performing poorly – not badly enough to fail completely and trigger high availability, but not well enough to provide a service. Most high availability relies on fairly simple tests to see if a service is available – “can a connection be made on a given port?” – with no regard for whether anything useful can be done through that port. This is great if the service fails or the server crashes, but not so good if, as in this instance, authentication (authN) is broken. High availability features are great, but they do not replace the need for proper planning and effective monitoring, which would have saved this particular customer.
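
To make the gap between "the port answers" and "the service works" concrete, here is a minimal Python sketch using only the standard library. Everything in it is an assumption for illustration: `port_is_open` and `service_is_usable` are hypothetical helper names, the `/health` path is invented, and the little server that always returns 503 is just a stand-in for an unhappy CAS. A real check against Exchange would need to authenticate and do some actual work, which this does not attempt.

```python
import socket
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def port_is_open(host, port, timeout=2.0):
    """Shallow check: can a TCP connection be made at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def service_is_usable(url, timeout=2.0):
    """Deeper check: does a real request come back with a 2xx status?"""
    # Ignore any system proxy so the probe goes straight to the server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses (such as 503), refusals, and timeouts.
        return False


class UnhappyHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a server that answers but cannot do its job."""

    def do_GET(self):
        self.send_response(503)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


# Demo: start the unhappy server on a free local port and probe it both ways.
server = HTTPServer(("127.0.0.1", 0), UnhappyHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

shallow = port_is_open(host, port)
deep = service_is_usable(f"http://{host}:{port}/health")

server.shutdown()
server.server_close()

print(shallow, deep)  # True False
```

The port check says the server is fine; the request that actually asks it to do something says otherwise. That second kind of test is the one that would have noticed this outage.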