This is a series of posts:
Now you have multiple environments, each consisting of multiple data centers, each consisting of multiple scale units. How do you wire them all together so that you are well prepared for a disaster?
There are various kinds of services (stateless and stateful), and the patterns of traffic they serve (inbound and outbound) vary just as much. I’m lucky enough to work mostly with stateless services that serve inbound traffic. That is, there is no state per se, and the data to be processed is the HTTP requests coming from users over the internet. Specifically, I work on an ARM resource provider (RP) for Azure Device Update (ADU). So below I’ll explain how to use Azure Traffic Manager (ATM, or colloquially just TM) to route traffic to this kind of service. Other kinds might require a different model.
The model I’m proposing here is rooted in two aspects described earlier:
- Each data center has multiple scale units
- Each data center has its failover pair
First, the reliability of a data center. TM works just fine routing traffic to a single scale unit in a data center while being ready for a second one to be stood up and added to the rotation. That is thanks to the probes, which run periodically (the interval is easily configurable), check each endpoint, and mark it active (or not). The priority mode suits this option best: the first endpoint would have priority 10 and the second 20. The numbers are arbitrary, but you get the idea. In TM a lower number means a higher priority, so the endpoints with higher numbers kick in only when those with lower numbers are down.
Normally, if you have just one cluster up and running, the second endpoint is always inactive and traffic is always served by the cluster behind the first endpoint. In an emergency, if you have to delete that cluster, you create another one. Its DNS name/IP is known in advance and already preconfigured on the TM profile. This way you won’t need to change anything in TM, and the new cluster will start serving traffic as soon as the probes mark its endpoint active.
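To make the priority mode concrete, here is a minimal sketch in Python. It is only a toy model of the selection rule, not the Traffic Manager API: the `Endpoint` class, the `pick_endpoint` function, and the cluster names are hypothetical, and the `healthy` flag stands in for the result of the latest probe.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str       # hypothetical cluster name
    priority: int   # lower number = preferred endpoint
    healthy: bool   # stand-in for the latest probe result


def pick_endpoint(endpoints):
    """Return the healthy endpoint with the lowest priority number, or None."""
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.priority, default=None)


# One cluster is up; the second endpoint is preconfigured but not stood up yet.
endpoints = [
    Endpoint("cluster-1", priority=10, healthy=True),
    Endpoint("cluster-2", priority=20, healthy=False),
]

print(pick_endpoint(endpoints).name)  # cluster-1
```

If `cluster-1` is deleted and `cluster-2` is created behind the preconfigured endpoint, flipping the two `healthy` flags makes `pick_endpoint` return `cluster-2` with no change to the list itself, which mirrors the "no reconfiguration needed" property described above.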
Another option is to have two endpoints and two clusters always up, running, and serving traffic. It’s needed when technical limitations or other considerations make one cluster not enough. In this case the weighted mode with the same weight for both clusters works well.
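The weighted mode can be modeled in the same spirit. The sketch below is an illustration under assumptions, not TM’s actual implementation: `pick_weighted`, the tuple layout, and the weight value 50 are all made up for the example. It picks among healthy endpoints with probability proportional to their weights, so two equal weights give a roughly even split.

```python
import random
from collections import Counter


def pick_weighted(endpoints, rng=random):
    """Pick one healthy endpoint at random, proportionally to its weight.

    endpoints: list of (name, weight, healthy) tuples (hypothetical layout).
    Returns the endpoint name, or None if nothing is healthy.
    """
    healthy = [(name, weight) for name, weight, ok in endpoints if ok and weight > 0]
    if not healthy:
        return None
    names, weights = zip(*healthy)
    return rng.choices(names, weights=weights, k=1)[0]


# Both clusters are up and carry the same weight.
both_up = [("cluster-1", 50, True), ("cluster-2", 50, True)]

# Count where 10,000 simulated requests land; the split should be close to even.
counts = Counter(pick_weighted(both_up) for _ in range(10_000))
print(counts)
```

When one of the two clusters is marked unhealthy, it drops out of the candidate list and all traffic goes to the remaining one.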
You’ve secured the reliability of a single region: one cluster goes down, another takes its place and continues to serve traffic. Now let’s shift the focus and see what happens if not just one cluster but the whole region goes down. This is unlikely to happen unless a really bad deployment takes place, and more likely than not it will be one of your own.
Azure has grouped regions into so-called failover pairs. This means that, by contract, there won’t be a deployment to both regions simultaneously, and at least one will stay healthy. For you, it means you can have another TM profile with two endpoints in the priority mode:
- The first endpoint is in Region A, e.g. West US with DNS westus.service.example.com
- The second endpoint is in Region B, e.g. East US with DNS eastus.service.example.com
If Region A is completely down, which would happen only when all clusters in that region have for some reason become unavailable, then and only then will traffic be routed to Region B. This has its own complications, such as increased latency and increased load on what’s now a single region with doubled traffic, which again increases latency. But serving customers slowly is better than not serving them at all.
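The two levels can be put together in one more toy model. Again, this is a sketch under assumptions rather than how TM is implemented: the `route` function and the tuple layout are hypothetical, the priorities 10 and 20 are arbitrary, and a region is treated as down only when every one of its clusters is unhealthy.

```python
def route(regions):
    """Return the DNS name of the region that should serve traffic, or None.

    regions: list of (dns_name, priority, [cluster_healthy, ...]) tuples.
    A region is a candidate while at least one of its clusters is healthy;
    among candidates, the lowest priority number wins.
    """
    alive = [(priority, dns) for dns, priority, clusters in regions if any(clusters)]
    return min(alive)[1] if alive else None


WEST = "westus.service.example.com"
EAST = "eastus.service.example.com"

# One West US cluster is down, but the region as a whole is still up.
print(route([(WEST, 10, [False, True]), (EAST, 20, [True, True])]))
# westus.service.example.com

# All West US clusters are down, so traffic fails over to East US.
print(route([(WEST, 10, [False, False]), (EAST, 20, [True, True])]))
# eastus.service.example.com
```

The first call shows why losing a single cluster does not trigger a cross-region failover: the region-level decision only changes once the whole region has nothing healthy left.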