You don’t need an enterprise cluster in order to get an impression of VMware Tanzu and Kunernetes. Thanks to the Tanzu Community Edition (TCE), now anyone can try it out for themselves – for free. The functionality offered is not limited in comparison to commercial Tanzu versions. The only thing you don’t get with TCE is professional support from VMware. Support is provided by the community via forums, Slack groups or Github. This is perfectly sufficient for a PoC cluster or the CKA exam training.
Deployment is pretty fast and after a couple of minutes you will have a functional Tanzu cluster.
The TCE can be deployed in two variants either as a standalone cluster or as a managed cluster.
A fast and resource-efficient way of deployment without a management cluster. Ideal for small tests and demos. The standalone cluster offers no lifecycle management. Instead, it has a small footprint and can also be used on small environments.
Like commercial Tanzu versions, there is a management cluster and 1 to n workload clusters. It comes with lifecycle management and cluster API. Thus, declarative configuration files can be used to define your Kubernetes cluster. For example, the number of nodes in the management cluster, the number of worker nodes, the version of the Ubuntu image or the Kubernetes version. Cluster API ensures compliance with the declaration. For example, if a worker node fails, it will be replaced automatically.
By using multiple nodes, the managed cluster of course also requires considerably more resources.
TCE can be deployed either locally on a workstation by using Docker, in your own lab/datacenter on vSphere, or in the cloud on Azure or aws.
I have a licensed Tanzu with vSAN and NSX-T integration up and running in my lab. So TCE on vSphere would not really make sense here. Cloud resources on aws or Azure are expensive. Therefore, I would like to describe the smallest possible and most economical deployment of a standalone cluster using Docker. To do so, I will use a VM on VMware workstation. Alternatively, a VMware player or any other kind of hypervisor can be used.
This will be a multi-part post focused on the VMware Bitfusion product. I will give an introduction to the technology, how to set up a Bitfusion server and how to use its services from Kubernetes pods.
Part 3 : Using Bitfusion from Kubernetes pods and TKGS. (this article)
We saw in parts 1 and 2 what Bitfusion is and how to set up a Bitfusion Server cluster. The challenging part is to make this Bitfusion cluster usable from Kubernetes pods.
In order for containers to access Bitfusion GPU resources, a few general conditions must be met.
I assume in this tutorial that we have a configured vSphere-Tanzu cluster available, as well as a namespace, a user, a storage class and the Kubernetes CLI tools. The network can be organized with either NSX-T or distributed vSwitches and a load balancer such as the AVI load balancer.
In the PoC described, Tanzu on vSphere was used without NSX-T for simplicity. The AVI load balancer, now officially called NSX-Advanced load balancer, was used.
We also need a Linux system with access to Github or a mirror to prepare the cluster.
The procedure in a nutshell:
Create TKGS cluster
Get Bitfusion baremetal token laden and create K8s secret
Recently I activated Tanzu with NSX-T in my homelab. After some hurdles in the planning phase, the configuration worked fine and also north-south routing worked flawlessly. My edge nodes established BGP peering with the physical router and advertised new routes. New segments are immediately available without further configuratiom on the router.
One feature that distinguishes my lab from a production environment is that it doesn’t run 24/7. After the work is done, the whole cluster is shut down and the system is powered off. An idle cluster makes a lot of noise and consumes unnecessary energy.
Recently I booted the lab and observed that no communication with the router or DNS server was possible from my NSX segments. A perfect case for troubleshooting.
First I checked the Geneve tunnels between the transport nodes. Here everything was fine and every transport node was able to communicate with every other transport node. The root cause was quickly located in the edge nodes. Neither a reboot of the edges nor a vMotion to another host did improve the situation.
The Edges weren’t completely offline. They were administrable using the management network. Traceroute was working via T1 and T0 service routers up to the fastpath interface fp-eth0. From there, no packets were forwarded.
The interface fp-eth0 is connected to the distributed port group “Edge-Trunk” on vSwitch VDS-NSX. A quick check in the vSphere client showed that the uplink ports of both edges were blocked. Not in the “down” state, but blocked.
At this point, I would ask a customer what he has changed. But I am very sure that I did not make any changes to the system or the configuration. Yes, they all say that 😉