Keeping data identical at two locations is becoming increasingly important in a highly available IT world. A couple of years ago it was an expensive enterprise-level luxury, but recently the demand can be found in SMB environments too. The method is called mirroring, and it can be implemented in two ways.
Asynchronous – Data is synchronized at defined intervals. In between, there is a difference (delta) between source and target.
Synchronous – Data transfer is transaction consistent, i.e. the data is identical on both sides at all times. A write operation is only considered complete when both the source and the target side have confirmed the write.
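The difference between the two modes can be illustrated with a small toy model. This is only a sketch: the class names `Site`, `SyncMirror` and `AsyncMirror` are made up for illustration and do not correspond to any storage product's API.

```python
class Site:
    """One side of the mirror; keeps its blocks in a dict (hypothetical model)."""

    def __init__(self):
        self.blocks = {}

    def write(self, key, value):
        self.blocks[key] = value
        return True  # confirmation of the write


class SyncMirror:
    """Synchronous mirror: a write completes only when both sides confirm."""

    def __init__(self, source, target):
        self.source, self.target = source, target

    def write(self, key, value):
        return self.source.write(key, value) and self.target.write(key, value)


class AsyncMirror:
    """Asynchronous mirror: writes are confirmed by the source alone and
    transferred to the target later, when sync() runs."""

    def __init__(self, source, target):
        self.source, self.target = source, target
        self.pending = {}  # writes not yet transferred to the target

    def write(self, key, value):
        confirmed = self.source.write(key, value)
        self.pending[key] = value
        return confirmed

    def sync(self):
        # Represents the transfer that happens at each defined interval.
        for key, value in self.pending.items():
            self.target.write(key, value)
        self.pending.clear()

    def delta(self):
        # Keys whose content currently differs between source and target.
        return {k for k, v in self.source.blocks.items()
                if self.target.blocks.get(k) != v}
```

With `AsyncMirror`, `delta()` is non-empty between a write and the next `sync()`. That is exactly the data that would be missing on the target if the source site failed at that moment.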
A prerequisite for high availability is mirroring of data (synchronous or asynchronous). If the data is available at two locations (data centers), a further design question arises: Should the storage target act as a fallback copy in case of emergency (Active-Passive), or should the data be actively used in both locations (Active-Active)?
Active-Passive – Only the active side works and data is transferred to the passive side (synchronous or asynchronous). In case of a failure, the system switches automatically or manually and the previously passive side becomes active. It remains so until a failback is triggered. This method guarantees full performance even in the event of a total site failure. Resources must be equal on both sides. The disadvantage is that at most 50% of the total resources can be used.
Active-Active – Resources of both sides can be used in parallel and the hardware is utilized more efficiently. However, this means that in the event of a failure, half of the resources are lost and full performance cannot be guaranteed. Active-Active designs require a synchronous mirror, as both sides have to work with identical data.
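The trade-off between the two designs comes down to simple arithmetic. Here is a minimal sketch, assuming two equally sized sites; the function name `usable_load` and the design labels are hypothetical.

```python
def usable_load(total_resources, design):
    """Return (usable capacity in normal operation,
               capacity remaining after a total failure of one site).

    Assumes both sites hold exactly half of total_resources.
    """
    per_site = total_resources / 2
    if design == "active-passive":
        # Only one site carries load, so a site failure costs no performance.
        return per_site, per_site
    if design == "active-active":
        # Both sites carry load, so a site failure removes half of it.
        return total_resources, per_site
    raise ValueError(f"unknown design: {design}")
```

With 100 units of total resources, Active-Passive gives 50 units before and after a site failure. Active-Active gives 100 units in normal operation but only 50 after a failure, so workloads sized above 50 can no longer be served at full performance.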
Active-Active clusters exist in many different forms. There’s classic SAN storage with integrated mirroring, or software-defined storage (SDS), where the mirroring is done not in hardware but in the software layer. One example is DataCore SANsymphony. VMware vSAN Stretched Cluster plays a special role and will not be covered in this post.
In the following section I will discuss a special pitfall of LUN based active-active constructs, which is often overlooked, but can lead to data loss in case of an error. VMware vSAN is not affected because its stretched cluster is based on a different design which prevents the following issue.
None of the issues above fit the problem I observed. A good starting point is a look into vua.log on the affected host.
Unfortunately that didn’t help either. So we took another close look at the VMware upgrade path matrix. A direct host upgrade from ESXi 6.0 to ESXi 6.7U3 is supported, but while re-checking the matrix our attention was drawn to a small footnote.
KB 76555 says there’s an issue with expired VIB certificates on hosts below a specific build number.
ESXi 6.0 GA before build 9239799
ESXi 6.5 GA before build 8294253
In fact, our ESXi 6.0 host had build 7967664 (U3e), which is in the critical range. So we had to install patches up to July 2018 (ESXi600-201807001) first. After that, the upgrade to ESXi 6.7U3 went flawlessly.
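Checking a host against the build thresholds listed above is easy to script. The following is only a sketch: the thresholds are the two build numbers quoted in this post, and the names `FIXED_BUILDS` and `needs_patch_first` are made up for illustration. How you read the build number from the host is left out.

```python
# First build per version that is outside the critical range,
# taken from the list quoted in this post (KB 76555).
FIXED_BUILDS = {
    "6.0": 9239799,
    "6.5": 8294253,
}


def needs_patch_first(version, build):
    """Return True if the host must be patched before the upgrade,
    False if its build is recent enough, and None if the version
    is not in the list above (no statement possible)."""
    fixed_build = FIXED_BUILDS.get(version)
    if fixed_build is None:
        return None
    return build < fixed_build


# The host from this post: ESXi 6.0, build 7967664 (U3e).
print(needs_patch_first("6.0", 7967664))
```

For the host described here, the check returns `True`, which matches what we found out the hard way.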
What went wrong?
Of course we checked the matrix during the planning phase in early March 2020. That’s standard operating procedure. Unfortunately something changed in the meantime: the footnote was added. KB 76555 was updated in May 2020, and the issue affects upgrades to versions of ESXi 6.7 released after April 28th, 2020.
Take-home message: Check your design and matrices again right before the project starts.