Why Azure SQL DBs become Inaccessible

tl;dr Because your database lost access to Azure Key Vault (AKV) or Managed HSM. You can find the culprit in your error log – just search for “AzureKeyVault” and it’ll give you a clue. For those looking for the super-short answer, well, there you go: restore the access, revalidate the key and you’ll be good to go.

Let me share my credentials first – I’ve been working as an on-call engineer for Azure SQL Managed Instance and Azure SQL DB for close to two years. One of the areas my team owns is Transparent Data Encryption (TDE), which means we handle the related incidents as well. So, everything I write here comes from very first-hand experience and real-life issues that we’ve triaged, debugged and solved.

As I said above, the main reason Azure SQL DBs become inaccessible is that the database has lost access to Azure Key Vault (or Managed HSM). And trust me when I say this: no matter how unbelievable it sounds, it always IS the case. The Inaccessible state was literally invented for databases that lost access to their primary encryption protector (more about this later on).

Here are some of the most common scenarios that I’ve observed:

  • Overly restrictive NSG rules on the subnet that start blocking access to AKV (or Managed HSM)
  • Switching to Private Link to AKV, but not configuring it properly
  • Removing the identity and losing access to AKV
  • Moving your database(s) to another tenant and losing access to AKV again
  • Not accounting for VLFs that are still encrypted with an older key (what I refer to as “latent keys” later on)
  • Introducing an overly restrictive firewall on the AKV side
  • Setting up a GeoDR scenario where the geo-secondary is not properly configured to work with AKV keys
  • Misconfiguring DNS servers such that secondary DNS servers return inconsistent results
  • Restoring a database from a server that used a different encryption key
  • Switching to TDE-SMK and immediately revoking access to the TDE-CMK
  • etc.

In many cases it can even be a combination of multiple factors, but at the end of the day it always boils down to the fact that the database couldn’t access its encryption key; and all hell broke loose from there.

Anyway, before I jump into all the variations, it’s worth covering how Transparent Data Encryption (TDE) works in the first place. I’m truly convinced that if you understand the very simple dynamics behind it (yes, it really boils down to very simple stuff), you can prevent most of these issues from occurring in the first place. So, let’s talk about that.

Very simple explanation of how TDE works

Let me preface this by stating that this is, intentionally, a very oversimplified explanation. It does, however, communicate the most important things that everything else builds upon.

At the very basic level, there are two places where the database stores your data – the MDF and LDF files. The MDF is pretty much the current state of your database (hence the Main Database File, or MDF for short), whereas the LDF keeps a record of everything that led up to the current state (hence the name Log Database File, or LDF for short). Dissecting them could easily eat up a whole book, so let’s keep it at a high level for now.

Any data that you store in the database (e.g. in your tables) is unencrypted by default. This means that it’s stored unencrypted in the MDF and LDF files as well, which further means that anyone who gains access to these files could easily read all the data, even if they weren’t supposed to. As you can imagine, this isn’t ideal. So let’s welcome Transparent Data Encryption (TDE) into the game.

As the name implies, TDE ensures that your data is encrypted before being stored in the MDF and LDF files. That’s the “Data Encryption” part. The “Transparent” part implies that you, as a user, don’t have to worry about it and that the database handles all of this for you. Data is encrypted right before being stored in the MDF (or LDF) files (i.e. on the storage medium) and decrypted before being pulled into working memory.

From a very high-level perspective, the mechanics of TDE are very simple. You have a thing called the “Database Encryption Key” (DEK) which encrypts your data in the MDF and LDF files. So, right before your data is written to disk, the Database Encryption Key (DEK) encrypts it, and right before it’s pulled into your working memory, it’s decrypted. Pretty simple.

The interesting thing about the DEK is that it’s NEVER stored on the storage medium unencrypted. Never ever ever. The only time it’s unencrypted is while it’s in the process’ working memory, but never on the storage medium itself.

So if the DEK encrypts your actual data, there has to be something that encrypts the DEK, right? And indeed there is – that something is either a Certificate or an Asymmetric Key (which I’d argue boils down to the same thing eventually, but that’s a different story). In the Azure SQL world, we refer to these two options as TDE-SMK and TDE-CMK. TDE-SMK stands for TDE with Service-managed Key, whereas TDE-CMK stands for TDE with Customer-managed Key. I guess you could figure it out based on the names, right? The former is managed by us (i.e. Azure) and we ensure that it’s up to date, rotated regularly, always available, etc. The latter is managed by you (i.e. the customer) and it’s up to you to ensure that it’s up to date, rotated, backed up properly, etc. TDE-CMK is occasionally referred to as Bring Your Own Key (BYOK) because it boils down to, you know, you providing your own key.
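
If you’re ever unsure which of the two a given logical server is using, you can check from PowerShell. Here’s a minimal sketch, assuming the Az.Sql module is installed and you’re signed in; the resource group and server names are placeholders:

# Placeholders – replace with your own resource group and server name
Get-AzSqlServerTransparentDataEncryptionProtector -ResourceGroupName 'my-rg' -ServerName 'my-sql-server'
# Type = ServiceManaged means TDE-SMK; Type = AzureKeyVault means TDE-CMK (BYOK)
# (for Managed Instance, the analogous cmdlet is Get-AzSqlInstanceTransparentDataEncryptionProtector)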

Given that this article is about the Inaccessible state, which by design happens only if you are using a Customer-managed Key (i.e. TDE-CMK / BYOK), I’ll focus the rest of the article on that one.

If you opt for TDE-CMK, your key needs to be kept in an external service. That external service is usually either Azure Key Vault (AKV) or Managed HSM. From Azure SQL’s point of view, they provide the exact same functionality, but their inner workings are quite different, which, again, is a story of its own.

Going back to how data is encrypted, we have Database Encryption Key (DEK) protecting our raw data, and then on top of that, we have an asymmetric key (stored in AKV or Managed HSM) that encrypts the DEK. As I said before, the DEK is never stored unencrypted. It’s always kept encrypted by the asymmetric key that you provide.

The general workflow is as follows – when you start up the database, we first check whether it’s encrypted by a DEK. If it is, we need to get the DEK decrypted first. This is done by contacting your key management solution and asking it (politely) to decrypt it for us. Once it’s decrypted, we keep it only in working memory while the database is alive and active. The moment that process dies, we lose access to the DEK and have to have it decrypted again on startup. This turns out to be rather important.
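
If you’re curious about the state of the DEK in your own database, you can peek at sys.dm_database_encryption_keys yourself. Here’s a minimal sketch using Invoke-Sqlcmd from the SqlServer PowerShell module; the server and database names are placeholders, and you’d add whatever authentication parameters (e.g. -Username/-Password or -AccessToken) fit your setup:

# Placeholders – replace with your own server and database
Invoke-Sqlcmd -ServerInstance 'my-server.database.windows.net' -Database 'my-db' -Query "
    SELECT encryption_state, encryptor_type, encryptor_thumbprint
    FROM sys.dm_database_encryption_keys
    WHERE database_id = DB_ID();"
# encryption_state = 3 means encrypted; encryptor_type tells you whether the DEK is
# protected by a certificate (TDE-SMK) or an asymmetric key (TDE-CMK)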

Now, where’s the catch? Well, the catch is that if your key management solution is unavailable for whatever reason (to be discussed later on), we can’t really decrypt the DEK, which in turn means that we can’t really do anything with your data. This effectively renders the database unusable, but we opted for a fancier term – Inaccessible.

So, as I said at the very beginning of the article – having your database show up as “Inaccessible” really means that Azure SQL lost access to the key management solution (AKV or Managed HSM) and, as a consequence, the Database Encryption Key (DEK) can’t be decrypted and we can’t really do much about it except ask you to restore the access.

Now I can totally hear some of you thinking – well, there must be cases where the database goes to the Inaccessible state due to some bug on your side. But trust me when I tell you, I’ve seen tons of weird cases, and 99% of them were due to some kind of misconfiguration (which I’ll talk about later on). I did see cases where the database became inaccessible due to, say, an AKV regional outage, an Azure Networking outage, or some kind of failure in our underlying infrastructure. So yes, it does happen, but it’s incredibly rare, we are usually aware of it, and we restore all databases as soon as we fix the underlying issue.

I’m not saying this to play the blame game or anything. Really. The sole reason I say this is that if you see your database(s) marked as Inaccessible, your best bet really is to look into your key management solution and see what is preventing access to it. I’ll now focus on the most common scenarios that we observe.

Common causes of DBs becoming Inaccessible

As I mentioned at the beginning of this article, everything I state here is based on actual cases that we debugged and triaged during on-call shifts. I’ll try to split them into categories, but I do suggest you at least skim through all of them to build up enough awareness.

One important thing to keep in mind when debugging TDE-CMK related issues: your error log WILL ALWAYS tell you if the issue is related to Azure Key Vault. Always. It might not tell you the EXACT cause, but you should be able to at least distinguish the category of the problem.
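
On a Managed Instance you can grep the error log for those entries directly. Here’s a minimal sketch (sticking to PowerShell like the rest of the examples), with placeholder connection details and assuming you can authenticate against the instance:

# Placeholders – replace with your own Managed Instance and add authentication parameters as needed
Invoke-Sqlcmd -ServerInstance 'my-managed-instance.database.windows.net' -Database 'master' -Query "
    EXEC sys.xp_readerrorlog 0, 1, N'AzureKeyVault';"
# Returns only the error log entries mentioning AzureKeyVault – usually enough to spot the category of the problem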

Networking Issues

These are, by far, the most common cause of databases going to the Inaccessible state. The reasons are honestly way too many, and I could likely dedicate a full article to all the peculiar networking-related cases that we’ve observed, but I’ll try to keep it short here.

It always boils down to the fact that Azure SQL can’t connect to your Key Vault which means that we can’t decrypt your Database Encryption Key (DEK) and you get the rest. Your database becomes Inaccessible as a consequence of that.

Here are some common issues that we observed:

  • Misconfigured NSG rules. This is, by far, the most common cause. You apply NSGs to the subnet where your SQL instance is, and those rules turn out to be too restrictive, eventually blocking the traffic towards AKV. What’s more, in most cases you had NSGs that previously allowed traffic, but someone modified them and made them more restrictive.
  • Misconfigured Private Link. This is also a very, very common one. You switch to using Private Link and a Private Endpoint to access AKV, but it either ends up being misconfigured from the start or someone applies restrictive NSGs that suddenly start blocking the traffic via Private Link.
  • DNS shenanigans. This one is always amusing to debug. What usually happens is that you configure multiple DNS servers, where the primary DNS returns the proper IP address for AKV, but the secondary servers, which get queried rarely, return a different IP. When that happens, it usually turns out to be a public IP or some other IP which eventually gets blocked. These are generally tough to debug and predict, and I’ll talk more about them in a dedicated section below.
  • Firewall on the Key Vault side. This is also a fairly common one – you apply (too) restrictive access on the AKV side, and it starts dropping the traffic from your Azure SQL instance.
  • Hidden appliances. This is a term I made up. It refers to the range of Network Virtual Appliances that you can integrate into your network. What usually happens is that you add some kind of firewall or similar appliance that in turn starts dropping the traffic between Azure SQL and your key vault.

As you can see, the root causes are many, and I do mean MANY. But the good news is that, in my experience, they are quite easy to test against. Here is a link to the “TCP Connectivity Checker” article. In a nutshell, it boils down to executing these PowerShell commands:

Test-NetConnection your.akv.host.com -Port 443

Invoke-WebRequest https://your.akv.host.com/healthstatus

In case there are any networking issues, one of these commands will tell you. Usually, if Test-NetConnection fails, it most likely means you misconfigured your NSGs, whereas if Invoke-WebRequest fails, it usually implies issues with NVAs, like firewalls for example.
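
If you want the script to interpret the result for you, something like this minimal sketch does the trick (same placeholder hostname as above):

# Same placeholder hostname as above – replace with your Key Vault's host
$tcp = Test-NetConnection your.akv.host.com -Port 443
if (-not $tcp.TcpTestSucceeded) {
    Write-Output 'TCP to port 443 is blocked – start by checking your NSG rules'
}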

Identity Issues

Another common issue that we observe is managed identities (both system-assigned and user-assigned) being dropped. This is problematic because dropping the identity means you lose access to the Key Vault, which in turn means the database is immediately sent to the Inaccessible state.

The good news, as usual, is that you can immediately spot this kind of error in your error log. It usually looks like the following:

2024-10-19 12:48:38.18 spid91s     [AzureKeyVaultClientHelper::CheckDbAkvWrapUnwrap]: Failed WRAP access check with msg: [error:128, info:400, state:0] The error code is missing a error message. MoreDetails: AADSTS700016: Application with identifier 'xx-xx-xx-xx-xx' was not found in the directory 'Xxxx'. This can happen if the application has not been installed by the administrator of the tenant or consented to by any user in the tenant. You may have sent your authentication request to the wrong tenant.

To resolve the error, either restore the user-assigned managed identity or recreate the system-assigned managed identity, and grant it access to the Key Vault.
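
As a sketch of the “grant access” part, assuming your vault uses access policies (for an RBAC-enabled vault you’d assign a role such as Key Vault Crypto Service Encryption User instead) and with placeholder names throughout:

# Placeholders – your vault name and the object ID of the server's managed identity
Set-AzKeyVaultAccessPolicy -VaultName 'my-vault' -ObjectId '00000000-0000-0000-0000-000000000000' -PermissionsToKeys get, wrapKey, unwrapKey
# get, wrapKey and unwrapKey are the key permissions Azure SQL needs for TDE-CMK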

Also be mindful of the fact that messing with Identities might cause you to lose access to Microsoft Entra ID, which could lead to most (or all) of your logins failing. I guess it’s safe to say that you don’t want to mess up the Identity setup.

Latent Keys

Latent keys, a term I fully made up, are an especially nasty case. Namely, the trouble is that changing the encryption protector (by either moving to TDE-SMK or switching to another key) DOES NOT mean that your database immediately switches to it. Let me repeat that:

Changing the encryption protector (e.g. switching to a new key or to TDE-SMK) is not an instantaneous operation! You need to keep access to the old key for a little while, or else you risk having your DB go Inaccessible.

Why do I say this? Because we’ve seen tons of issues where you, say, switch to a new asymmetric key and immediately revoke access to the old one. What actually happens in the background is that there may still be some VLFs that need access to the old key, and failing to provide that access results in the Inaccessible state.

To ensure that you are safe to drop the old key, query sys.dm_database_encryption_keys and sys.dm_db_log_info. A good rule of thumb is – if you see the thumbprint of the old key in either of the views above, you do not want to revoke/remove it just yet. Wait until it’s completely gone and then get rid of it.
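
Here’s a minimal sketch of that check, again via Invoke-Sqlcmd with placeholder connection details. The idea is simply to collect every encryptor thumbprint still referenced by the DEK or by any VLF:

# Placeholders – replace with your own server/database and add authentication parameters as needed
Invoke-Sqlcmd -ServerInstance 'my-server.database.windows.net' -Database 'my-db' -Query "
    SELECT encryptor_thumbprint AS thumbprint
    FROM sys.dm_database_encryption_keys
    WHERE database_id = DB_ID()
    UNION
    SELECT vlf_encryptor_thumbprint
    FROM sys.dm_db_log_info(DB_ID())
    WHERE vlf_encryptor_thumbprint IS NOT NULL;"
# If the old key's thumbprint still shows up in this list, keep access to that key for now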

The DNS Doom

I’m quite certain that anyone who has ever had to debug an inexplicable system failure eventually learned that it was, in one way or another, related to DNS. And guess what – it’s no different with Azure SQL databases!

DNS is, generally speaking, a common denominator for most of this stuff. Whenever you are trying to reach AKV (or Microsoft Entra), you eventually have to ask DNS to resolve the hostname to an actual IP address. And if that resolution goes wrong – well, you are in a pickle, because your connection will travel via the wrong path.

One of the tricky cases I observed recently was an Azure SQL Managed Instance that had two DNS servers set up. One was a private DNS server and the other was Azure’s public DNS. They were using a Private Endpoint to connect to the Key Vault, and they had a GeoDR setup as well. The problematic instance was the geo-secondary one.

What they reported was that every once in a while, without any predictable or explainable cause, their failover group would drop. It literally appeared to be dropping at completely random times. Quite weird indeed.

What turned out to be the case was that MOST OF THE TIME, the private DNS would properly resolve the Key Vault’s hostname to a private IP. And when it did – all worked fine.

However, every once in a while, the private DNS would fail to respond in time and the instance would fall back to the secondary DNS, which in this case was the public one. And the public one would respond with the public IP for the Key Vault, eventually sending the connection over the internet via the public IP.

On the other side of this, the AKV firewall was rejecting any connections that weren’t coming via the Private Endpoint. And I guess you get the gist at this point? What happens once in a while is – the private DNS fails, the connection travels via the public IP, AKV rejects the connection due to the firewall and responds with HTTP 403, and SQL MI treats this as “the key has been revoked” and immediately triggers the Inaccessible workflow, which, among other things, breaks the failover group. Bummer. The proper solution would be to either add additional redundant DNS servers or introduce an Azure Private DNS zone so that the public DNS also responds with the private IP.

Anyway, as you can see, DNS misconfigurations can lead to a lot of trouble. But just like with the networking issues, you can easily test for them using PowerShell and SQL Agent. You can read more here, but the gist of it is that you want to execute the following PowerShell commands through SQL Agent:

Get-DnsClientServerAddress

This will return the list of DNS servers configured inside your instance, and then you can see the IP they reply with using the following command:

Resolve-DnsName bing.com -Server 1.2.3.4

Where “Server” is the IP of one of the DNS servers from the previous command. This will quickly tell you if you have a mismatch.
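
If you want to check every configured DNS server in one go, here’s a minimal sketch (the Key Vault hostname is a placeholder):

# Placeholder – replace with your Key Vault's hostname
$akvHost = 'my-vault.vault.azure.net'
# Ask every configured DNS server for the same name and compare the answers
$dnsServers = (Get-DnsClientServerAddress -AddressFamily IPv4).ServerAddresses | Select-Object -Unique
foreach ($dns in $dnsServers) {
    $answer = (Resolve-DnsName -Name $akvHost -Server $dns -Type A -ErrorAction SilentlyContinue).IPAddress
    Write-Output "$dns -> $($answer -join ', ')"
}
# If one server returns a private IP and another returns a public one, you've found your culprit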

Private Link Issues

If I had to pick, I’d say this is likely one of the most common issues we observe. You opt in to using Private Link to AKV in order to effectively bring the AKV service into your VNet, which is great. The only trouble is that if it’s misconfigured, it completely cuts off access to the key, and your database eventually ends up in the Inaccessible state.

There’s literally a plethora of things that can go wrong here, and sometimes it’s hard to pinpoint exactly what. But I’ll try to outline some of the common scenarios.

The most common one that we observe is that the NSG rules applied to your subnet reject the traffic to the Private Endpoint. This is usually caused by the fact that you previously used a public IP and configured your NSG rules to allow traffic towards Azure’s public services. But once you switch to Private Link, your AKV hostname starts resolving to a private IP address, which in turn gets blocked, and you get the rest.

Like with most of the issues above, the simplest way to test for this is to run the TCP connectivity checker after you’ve switched to Private Link and ensure that traffic can flow freely to the destination.
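
It’s also worth double-checking what the AKV hostname actually resolves to from inside the subnet. With a correctly configured Private Endpoint and Private DNS zone you’d expect a private IP; a public IP means the Private Link path isn’t being used at all. A quick sketch, with a placeholder hostname:

# Placeholder – replace with your Key Vault's hostname
Resolve-DnsName -Name 'my-vault.vault.azure.net' -Type A
# Expect the privatelink CNAME and an IP from your VNet's address range;
# a public IP here means traffic will not go through the Private Endpoint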

In Summary

And there you go – a list of the most common issues, along with some advice on how to avoid the Inaccessible doom. I hope you found it useful!

Where to learn more

There is a ton of great info on our official page, and the Docs team is always making sure to keep it up to date, so do make sure to check it out.
