Reliable and scalable infrastructure: Secrets

In the previous post we discussed probably the most important aspect of running a service: the handling of live traffic. Without it, it’s not a service but a bunch of robots wasting your time and money.

Now let’s discuss the next most important aspect: secrets. Perhaps a service can run without any, but only if it’s a static website. And even a static website needs an SSL certificate, so… Any real-world application needs to write its data somewhere, e.g. to a database, or read another application’s data, e.g. from a web service. So it needs to authenticate, so it needs a secret (whether it’s a password or a certificate), so it needs to keep that secret somewhere and access it from there somehow.

As mentioned above, the main dimension for secrets is the type (or kind): passwords and certificates. One can even combine the two by requiring a passphrase to access the private key of a certificate.

Going forward we’ll be discussing certificates, as they’re the primary authentication mechanism employed by modern web applications in the cloud.

What really matters is the difference in the mechanisms used to store and access one or the other. For example, there is a whole subsystem called PKI (public key infrastructure) for certificates, while there is basically nothing built-in for plain text passwords, such as those used to access AAD applications.

Another important dimension for secrets to discuss is regionality: whether a secret is unique within each individual region where your service is hosted, or shared by a group of regions (let’s say North America, Europe, Asia), or shared by all regions (which effectively makes it global).

Similarly to the principle of least privilege, the idea is to scope a secret down as far as technically feasible, ideally to a single region. All secrets should be regional as long as they can be; SSL certificates are a good example. The opposite would be a certificate used to encrypt JWTs (aka JWE). Since a token encrypted in one region may have to be decrypted in another, the same certificate, private key included, must be available in every region, which makes such a certificate a global secret.

There is no single obvious reason to prefer many narrowly scoped secrets over a few shared ones, or vice versa. Each strategy has its pros and cons, such as:

  • The growing number of secrets increases the overall cost of maintenance.
  • You’ll need a secrets inventory, which then must be kept up to date. Otherwise it defeats the purpose of having one.
  • More certificates will expire more often, so you’ll need to keep an eye on every one of them.
  • On the other hand, a breach in one region would not automatically mean a breach in another or, worst of all, in every region. That is the difference between losing a single region and having your whole production environment taken over.
    • If this ever happens, you want to be able to shut the attacked region down, fail over the traffic, and handle it without affecting the customers.
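To make the trade-off a bit more tangible, here is a minimal sketch of a secrets inventory in Python. Everything in it is an assumption made for illustration: the `SecretRecord` fields, the secret names, the region names and the dates are hypothetical and not taken from any real system. It only shows how a shared secret widens the set of regions affected by a breach, and how an inventory helps to spot upcoming expirations.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class SecretRecord:
    # Hypothetical inventory entry; the field names are illustrative.
    name: str
    kind: str              # "certificate" or "password"
    regions: frozenset     # regions where this secret is deployed
    expires_on: date


def blast_radius(inventory, breached_region):
    """Return every region sharing at least one secret with the breached one."""
    affected = set()
    for record in inventory:
        if breached_region in record.regions:
            affected |= record.regions
    return affected


def expiring_within(inventory, today, days):
    """Return the sorted names of secrets that expire within `days` days."""
    deadline = today + timedelta(days=days)
    return sorted(r.name for r in inventory if r.expires_on <= deadline)


# Two regional SSL certificates and one secret shared by both regions.
inventory = [
    SecretRecord("ssl-westus", "certificate",
                 frozenset({"westus"}), date(2030, 3, 1)),
    SecretRecord("ssl-northeurope", "certificate",
                 frozenset({"northeurope"}), date(2030, 6, 1)),
    SecretRecord("token-encryption", "certificate",
                 frozenset({"westus", "northeurope"}), date(2031, 1, 1)),
]

print(blast_radius(inventory, "westus"))
print(expiring_within(inventory, date(2030, 2, 1), days=60))
```

With only the two regional certificates in the inventory, a breach of one region stays within that region; the shared secret is what pulls the second region into the blast radius.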

Too many certificates that expire too often mean a hell of thumbprints to update in the configuration, right? Wrong. What should one do instead? Switch to validation by subject name and issuer (or SNI, for short; not to be confused with the TLS Server Name Indication extension).

In this case the service (or the underlying compute) trusts the root certificate and, subsequently, all certificates issued under the umbrella of this trusted root. As a result, it doesn’t matter how often a certificate expires or what its thumbprint is; the service continues to use and trust it regardless.
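The difference between the two validation styles can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular platform implements it: `PresentedCert`, `TrustRule` and all the names and thumbprints are hypothetical, and the sketch assumes the certificate chain has already been verified up to the trusted root by the TLS stack or the platform. It only shows why a renewed certificate breaks a thumbprint pin but still satisfies a subject name and issuer rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PresentedCert:
    # Hypothetical, simplified view of a certificate.
    subject: str
    issuer: str
    thumbprint: str


@dataclass(frozen=True)
class TrustRule:
    # Hypothetical configuration entry: a subject plus the issuers allowed for it.
    subject: str
    allowed_issuers: frozenset


def is_trusted_by_thumbprint(cert, pinned_thumbprints):
    """Thumbprint pinning: the configuration must change on every renewal."""
    return cert.thumbprint in pinned_thumbprints


def is_trusted_by_subject_and_issuer(cert, rules):
    """Subject name and issuer matching: survives renewal under the same issuer.

    Assumes chain validation to the trusted root happened elsewhere.
    """
    return any(
        cert.subject == rule.subject and cert.issuer in rule.allowed_issuers
        for rule in rules
    )


old_cert = PresentedCert("CN=api.example.com", "CN=Example Issuing CA", "AA11")
renewed_cert = PresentedCert("CN=api.example.com", "CN=Example Issuing CA", "BB22")

pinned = {"AA11"}
rules = [TrustRule("CN=api.example.com", frozenset({"CN=Example Issuing CA"}))]

print(is_trusted_by_thumbprint(renewed_cert, pinned))
print(is_trusted_by_subject_and_issuer(renewed_cert, rules))
```

The renewed certificate has a new thumbprint, so the pin rejects it until someone updates the configuration, while the subject name and issuer rule keeps accepting it.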

One important nuance, though, is the recommendation to renew (aka roll) the certificate in advance, well before it expires. This way you give yourself enough time to handle any errors that might occur during the renewal before the certificate expires and causes an outage.
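A minimal sketch of such a check in Python could look like this. The 80% threshold is purely an assumption picked for the example, as is the `renewal_due` helper itself; choose whatever margin gives you enough time to react to a failed renewal.

```python
from datetime import datetime, timedelta, timezone


def renewal_due(not_before, not_after, now, lifetime_fraction=0.8):
    """Return True once the given fraction of the validity period has elapsed.

    The default of 0.8 is an illustrative assumption, not a standard.
    """
    lifetime = not_after - not_before
    renew_at = not_before + lifetime * lifetime_fraction
    return now >= renew_at


# A hypothetical certificate valid for 100 days: renewal is due from day 80 on.
issued = datetime(2030, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=100)

print(renewal_due(issued, expires, now=issued + timedelta(days=60)))
print(renewal_due(issued, expires, now=issued + timedelta(days=85)))
```

In this example that leaves 20 days between the first renewal attempt and the actual expiration to investigate and retry.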

The last but not least aspect to discuss is the separation of secrets delivery from secrets consumption. It’s less methodological, more technological and practical, and still provides important advantages. How exactly? A naïve implementation of consuming a secret involves fetching it first. But is that really necessary? Can we do better? Yes, we can.

In order to follow the Single Responsibility Principle (SRP, for short) and encapsulate each function, we can split the work into two steps:

  1. Fetch a secret (in this case, a certificate) from a remote location, such as Azure Key Vault, and install it into a local store. For the code that does that, it doesn’t matter how and when the secret will be consumed, its role ends here.
  2. Read the secret from the local store. For the code that does that, it doesn’t matter how and when the secret was fetched; its role starts here.
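The two steps above can be sketched as two independent functions in Python. This is a sketch under loud assumptions: the local store is modeled as a plain directory, `deliver_secret` and `read_secret` are hypothetical names, and `fetch` is a caller-supplied callable standing in for whatever client actually talks to the remote location (no real Key Vault API is used or implied here).

```python
import os
import tempfile
from pathlib import Path


def deliver_secret(name, fetch, store_dir):
    """Delivery side: fetch the secret and install it into the local store.

    `fetch` is a hypothetical callable that takes a name and returns bytes.
    """
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    payload = fetch(name)
    # Write to a temporary file first, then rename it into place, so that
    # a concurrent reader never observes a half-written secret.
    fd, tmp_path = tempfile.mkstemp(dir=store)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
        os.replace(tmp_path, store / name)
    except BaseException:
        os.unlink(tmp_path)
        raise


def read_secret(name, store_dir):
    """Consumption side: read from the local store, unaware of the delivery."""
    return (Path(store_dir) / name).read_bytes()


def fake_fetch(name):
    # Stand-in for a remote call; returns placeholder bytes for illustration.
    return b"placeholder-certificate-bytes-for-" + name.encode()


with tempfile.TemporaryDirectory() as store_dir:
    deliver_secret("ssl-westus", fake_fetch, store_dir)
    print(read_secret("ssl-westus", store_dir))
```

The only contract between the two sides is the location and the name of the secret in the local store, which is what lets them run on different schedules or live in different applications.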

Practically speaking, it means that these two operations can be performed not just on two different timelines but by two different applications, written by different people, using different platforms and/or programming languages. Basically, this is the microservices architecture applied to secrets.

P.S. I’d like to thank and acknowledge Andrey Fedyashov, my fellow colleague at Microsoft and friend, who shamed me (I mean encouraged) into finishing this series.
