This is a series of posts:
When designing your service’s infrastructure, you need to remember that your deployment (or scale, more below) unit can go down at any point of time for any period of time. And it doesn’t matter what’s the underlying technology is, whether it’s a Service Fabric cluster, a Kubernetes cluster, or a WebForms application running off Azure Websites aka App Service.
Usually a deployment is to blame, whether it was you or your upstream dependency. Behind a deployment usually there was a change. Then there was a mistake. And then a human being.
A maxim I learned in college (I’m paraphrasing here from Russian, thought) says:
Any found bug is at least one before the last one.
Because human engineers tend to make mistakes while making changes, there always would be one more bug out there.
What you cannot do? Change the human’s nature. What you can do though? Prepare yourself and your service’s infrastructure to a failure.
Let’s consider two scenarios when your deployment has failed:
- It has failed and the service now is in unrecoverable state so you have to delete everything in order to start from scratch. For example, consequent deployment fail with 500 because upstream dependency fails.
- It has failed and the service is in unrecoverable state but you cannot delete everything in order to start from scratch because something blocks you. For example, a security incident has occurred and the security team asks do not touch anything. Or the service team needs time to investigate the reasons for the failure so asks to do not change anything
What you do in either case? The answer lies in the ways how should’ve you modeled your infrastructure to be better prepared.
Let’s divide infrastructure into multiple layers, each with its role and lifecycle, also security and compliance boundaries. Often each layer also corresponds to its own set of secrets (certificates, mostly) that are shared downwards but are isolated upwards.
- Data center
- Scale unit
Let’s describe and explain each of them. The terminology is mine, might diverge from similar but more widely accepted in the industry. I’m happy to adjust it based on the feedback.
Cross-cloud. Super global across all clouds. Everything what’s happening over public Internet. The best example would be public DNS and email. Even sovereign (national) clouds use both public Internet and DNS, until we’re talking about air gapped solutions.
Cloud. Super global within a cloud and across its environments. Same as above but different clouds are now isolated from each other. However, there is still no isolation between environments. It should be relatively rarely used and not be considered to be a permanent solution, until it’s strictly necessary or otherwise impossible. But even so you should immediately start seeking a way to escape it. An example for would be a secret for an external monitoring mechanism when all environments and endpoints are monitored by the single external service.
Global. Considering the existence of the prior two layers, it’s not universally global. But it divides the plane into two principal parts that provide the minimum necessary separation: production and pre-production. An example would be a secret for AAD application, which has Prod and PPE versions of it. Or root DNS zone service.example.com.
Environment. Separated from one another by various physical boundaries, share nothing in common. For example, the Integration environment uses DNS zone int.service.example.com while the Test environment uses test.service.example.com.
Data center. In other words, a region in a cloud. Represents all the resources and the secrets that are necessary to serve traffic (or do other work) in particular geographical location but those which are not a part of a scale unit (see below). What means that there resources and secrets will be created before a scale unit is created and will continue to exist if a scale unit is deleted. Each environment would consist of at least one (or more) such data center. They can be further grouped into pairs or subdivided into availability zones. The candidate resource types would be Key Vaults (you don’t want to recreate secrets every time), Managed Identities (for same reason), IPs (created once will act as static), regional DNS records (e.g. westus2.int.service.example.com), Traffic Manager profiles this DNS record is a CNAME to.
Scale unit. The smallest unit of deployment. On-prem analogue would be a server, in the cloud it’s a VM scale set, a Service Fabric cluster, a Kubernetes cluster, etc. Groups all the resources needed to create such cluster. These resources should be deleted and recreated all together if something goes wrong. Each data center would consist of at least one (or more) such scale unit. The reasons for creating more than one would be: scalability, when one cluster is not enough to sustain the load, and reliability, when one goes down and you cannot failover traffic off the region.
To be continued…