VMware Bitfusion and Tanzu – Part 2 : Bitfusion server setup

This will be a multi-part post focused on the VMware Bitfusion product. I will give an introduction to the technology, show how to set up a Bitfusion server, and explain how to use its services from Kubernetes pods.

Bitfusion Server setup preparation

A Bitfusion Server Cluster must meet the following requirements:

  • vSphere 7 or later
  • At least 10 Gbit LAN for the Bitfusion data traffic in smaller or PoC deployments. High bandwidth and low latency are essential; 40 Gbit or even 100 Gbit is recommended.
  • Nvidia GPU with CUDA functionality and DirectPath I/O support:
    • Pascal P40
    • Tesla V100
    • T4 Tensor
    • A100 Tensor
  • At least three Bitfusion servers per cluster for high availability

This setup guide assumes that the graphics cards have already been installed in the ESXi 7+ hosts and that the hosts have joined a cluster in vCenter.
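If you want to verify beforehand that the GPUs are visible to vSphere and capable of DirectPath I/O, a small pyVmomi sketch such as the one below can help. This is only a minimal sketch, assuming pyVmomi is installed; the vCenter address and credentials are placeholders.

    # Minimal sketch: list NVIDIA PCI devices per host and whether they are
    # passthrough (DirectPath I/O) capable. Connection details are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()  # lab only; use proper certificates in production
    si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                      pwd="********", sslContext=ctx)
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)

    for host in view.view:
        # Map PCI device id -> passthrough capability reported by the host
        passthru = {p.id: p.passthruCapable for p in host.config.pciPassthruInfo}
        for dev in host.hardware.pciDevice:
            if "NVIDIA" in (dev.vendorName or ""):
                print(f"{host.name}: {dev.deviceName} "
                      f"(passthrough capable: {passthru.get(dev.id, False)})")

    view.Destroy()
    Disconnect(si)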

Continue reading “VMware Bitfusion and Tanzu – Part 2 : Bitfusion server setup”

VMware Bitfusion and Tanzu – Part 1: A primer to Bitfusion

This will be a multi-part post focused on the VMware Bitfusion product. I will give an introduction to the technology, show how to set up a Bitfusion server, and explain how to use its services from Kubernetes pods.

What is Bitfusion?

In August 2019, VMware acquired Bitfusion, a leader in GPU virtualization. Bitfusion provides a software platform that decouples physical accelerators such as GPUs from the compute servers that use them. It is not designed for graphics rendering, but rather for machine learning (ML) and artificial intelligence (AI). As of today, the Bitfusion client and server components run only on selected Linux platforms and support ML applications such as TensorFlow.

Why are GPUs so important for ML/AI applications?

Processors (CPU, Central Processing Unit) in current systems are optimized to process serial tasks in the shortest possible time and to switch quickly between tasks. GPUs (Graphics Processing Units), on the other hand, can process a large number of computing operations in parallel. The original intended application is in the name: the GPU was meant to offload the CPU during graphics rendering by taking over all rendering and polygon calculations. In the mid-90s, some 3D games still let you choose between CPU and GPU rendering. Even then, the difference was like night and day: the GPU handled the necessary polygon calculations much faster and more smoothly.

A fine comparison of GPU and CPU architecture is described by Niels Hagoort in his blog post “Exploring the GPU Architecture”.

However, due to their architecture, GPUs are not only ideal for graphics applications, but also for any application in which a very large number of arithmetic operations must be executed in parallel. This includes blockchain, ML, AI and any kind of data analysis (number crunching).
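As a quick illustration of this point, the following minimal TensorFlow sketch times the same large matrix multiplication on CPU and GPU. It assumes TensorFlow with GPU support is installed; the matrix size and device names are just examples.

    # Minimal sketch: compare a large matrix multiplication on CPU vs. GPU.
    # Assumes TensorFlow with GPU support; matrix size is an arbitrary example.
    import time
    import tensorflow as tf

    N = 4000
    a = tf.random.normal([N, N])
    b = tf.random.normal([N, N])

    def bench(device):
        with tf.device(device):
            tf.matmul(a, b)                 # warm-up (kernel setup, data placement)
            start = time.perf_counter()
            c = tf.matmul(a, b)
            _ = c.numpy()                   # force the computation to finish
            return time.perf_counter() - start

    print(f"CPU: {bench('/CPU:0'):.3f} s")
    if tf.config.list_physical_devices('GPU'):
        print(f"GPU: {bench('/GPU:0'):.3f} s")
    else:
        print("No GPU visible to TensorFlow")

In a Bitfusion environment, a script like this would typically be launched through the Bitfusion client (e.g. something like bitfusion run -n 1 -- python matmul_bench.py), so the GPU it sees is actually a remote, shared one.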

Continue reading “VMware Bitfusion and Tanzu – Part 1: A primer to Bitfusion”

Monitor Tanzu K8s Compliance with Runecast Analyzer

Checking a cluster for security compliance or hidden problems has become a standard task. There are automated tools for the job, such as VMware Skyline or Runecast Analyzer. In addition to standard vSphere clusters, the latter can also check vSAN, NSX-T, AWS, Kubernetes and, since version 5.0, Azure for compliance.

In this blog post I’d like to outline how to connect a vSphere with Tanzu [*] environment to Runecast Analyzer. [* native Kubernetes Pods and TKG on vSphere]

Some steps are simplified because this is a lab environment; I will point this out where applicable.

Before we can register Tanzu in Runecast Analyzer, we need some information (a short sketch after the list shows one way to obtain the access token):

  • IP address or FQDN of the SupervisorControlPlane
  • Service account with access to the SupervisorControlPlane
  • Service account access token
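One possible way to retrieve such a token is with the official Kubernetes Python client, sketched below. The service account name, namespace and kubeconfig context are placeholders, and the sketch assumes a cluster version that still auto-creates a token secret for the service account; on newer releases you would create the token secret yourself.

    # Minimal sketch: read a service account's token from its secret.
    # Names and context are placeholders; assumes a token secret exists for the SA.
    import base64
    from kubernetes import client, config

    config.load_kube_config(context="supervisor-cluster")   # placeholder context
    v1 = client.CoreV1Api()

    sa = v1.read_namespaced_service_account("runecast-sa", "kube-system")
    secret_name = sa.secrets[0].name                         # auto-created token secret
    secret = v1.read_namespaced_secret(secret_name, "kube-system")

    token = base64.b64decode(secret.data["token"]).decode()
    print(token)   # paste this access token into Runecast Analyzer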
Continue reading “Monitor Tanzu K8s Compliance with Runecast Analyzer”

Heads up! Watch your NIC order when adding more hosts to VCF

VMware Cloud Foundation is a unified SDDC platform for the hybrid cloud. It is based on VMware’s compute, storage, and network virtualization.

VCF can be expanded with additional workload domains by adding further hosts, or it can be stretched across two availability zones (AZ). The expansion is initiated by and runs under the control of the SDDC-Manager. The procedure is fairly straightforward, and SDDC-Manager does all the configuration tasks in the background, such as forming vSAN clusters, networks, kernel ports, vCenters and NSX control planes. The high-level workflow is:

  • Set up the hosts with the ESXi base image
  • Configure a management IP address
  • Set root credentials
  • Configure DNS and NTP
  • Import the new hosts into SDDC-Manager
  • Deploy the new workload domain (WLD)

There is a pitfall that is easily overlooked: the order of the new host’s NICs. Before we can import new hosts, we are presented with a checklist of host requirements. The hosts need to have two NICs with at least 10 Gbit.

While reading the list, there is a little detail that is easy to miss. “Traditional numbering” means that both NICs must be named vmnic0 and vmnic1. Unfortunately, this seems to be hard-coded and cannot be changed (as of the current version 4.2). To make matters worse, many server systems have onboard 1 Gbit network adapters. There is a KB article that explains how VMware ESXi determines the order in which names are assigned to network devices: it starts with the onboard NICs and then continues with the PCIe cards. As a result, you might end up with the two 1 Gbit onboard NICs as vmnic0 and vmnic1, in which case the bringup of the VCF expansion will fail.

While you can choose NICs during the initial VCF bringup, this is not possible during an expansion, and this time there is no such thing as a bringup sheet. You also cannot select more than two NICs when using SDDC-Manager; in that case you need to use API calls.
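Before commissioning the hosts it can therefore be worth checking how the NICs were enumerated. A minimal pyVmomi sketch like the one below (hostnames and credentials are placeholders) lists each host’s vmnics with their PCI address and link speed, so you can spot 1 Gbit onboard NICs sitting on vmnic0/vmnic1 before the expansion fails.

    # Minimal sketch: list vmnic name, PCI address and link speed per host.
    # Connection details are placeholders; run against vCenter or a single ESXi host.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    ctx = ssl._create_unverified_context()  # lab only
    si = SmartConnect(host="vcenter.lab.local", user="administrator@vsphere.local",
                      pwd="********", sslContext=ctx)
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)

    for host in view.view:
        for pnic in host.config.network.pnic:
            speed = pnic.linkSpeed.speedMb if pnic.linkSpeed else 0   # 0 = link down
            print(f"{host.name}: {pnic.device}  PCI {pnic.pci}  {speed} Mbit/s")

    view.Destroy()
    Disconnect(si)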

Workaround

Currently there is no other way than to disable the onboard NICs in the system BIOS. If your desired NICs still show a higher number, you will need to move the PCIe card into a slot with a lower number.