This will be a multi-part post focused on the VMware Bitfusion product. I will give an introduction to the technology, show how to set up a Bitfusion server, and explain how to use its services from Kubernetes pods.
In August 2019, VMware acquired Bitfusion, a leader in GPU virtualization. Bitfusion provides a software platform that decouples specific physical resources from compute servers. It is not designed for graphics rendering, but rather for machine learning (ML) and artificial intelligence (AI). As of today, Bitfusion systems (client and server) run only on selected Linux platforms and support ML applications such as TensorFlow.
Why are GPUs so important for ML/AI applications?
Processors (Central Processing Units / CPUs) in current systems are optimized to process serial tasks in the shortest possible time and to switch quickly between tasks. GPUs (Graphics Processing Units), on the other hand, can process a large number of computing operations in parallel. Their original purpose is in the name: the GPU was meant to offload the CPU in graphics rendering by taking over all rendering and polygon calculations. In the mid-90s, some 3D games still let you choose between rendering on the CPU or on the GPU. Even then, the difference was like night and day. The GPU performed the necessary polygon calculations much faster and more smoothly.
However, due to their architecture, GPUs are not only ideal for graphics applications, but for all applications where a very large number of arithmetic operations have to be executed in parallel. This includes blockchain, ML, AI and any kind of data analysis (number crunching).
When deploying workloads, you may encounter warnings or errors. Kubernetes pods are no exception. Problems can be more easily solved by taking a look at the logs. But how do you find the latest logs of a particular pod?
The standard command for this is:
kubectl get events
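We can sort the output by timestamp and filter on a specific pod. Note that the -n option selects the namespace, not the pod; the pod itself is matched with a field selector. Piping the output through nl additionally numbers the lines.
kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> --field-selector involvedObject.name=<podname>
kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> --field-selector involvedObject.name=<podname> | nl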
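If you need this regularly, the command can be wrapped in a small shell function. The following is only a minimal sketch: the function name pod_events is hypothetical, and it assumes that kubectl is already configured for the cluster. The -n option takes the namespace, while the field selector limits the list to events whose involved object is a pod with the given name.

```shell
# Hypothetical helper (not part of kubectl): list the events of one pod,
# sorted by creation timestamp.
# Usage: pod_events <namespace> <podname>
pod_events() {
  if [ "$#" -ne 2 ]; then
    echo "usage: pod_events <namespace> <podname>" >&2
    return 1
  fi
  # -n selects the namespace; the field selector narrows the result to
  # events whose involved object is a Pod with the given name.
  kubectl get events \
    -n "$1" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$2" \
    --sort-by=.metadata.creationTimestamp
}
```

A call such as pod_events <namespace> <podname> | nl then gives the same numbered listing as before.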
Live display of events
In the Linux world, there is the tail command to display the most recent entries of a log file. In Kubernetes the analogous command is:
I’d like to draw your attention to a new and useful feature that was introduced with vSphere 7 Update 2: the Update Planner. It is easily overlooked in the abundance of new features, but it does a very good job in preparing a vCenter update.
A requirement for the Update Planner is participation in the Customer Experience Improvement Program (CEIP).
The first sign of a new vCenter update is a notification banner at the top of vSphere Client.
Clicking on “View Updates” will take you directly to the Update Planner. This can also be found in the menu. To do this, select the vCenter in the Hosts & Clusters view and select “Updates” > vCenter Server > Update Planner in the menu bar at the top right.
All currently available updates are displayed. In the case shown below, the vCenter is already at 7.0 Update 2, so only one possible update is listed. If several updates are available, the Update Planner can check the compatibility against each of them. To do this, select the radio button of the desired update (red box).
Once an update is selected, the action field “Generate Report” turns blue and shows the two possible sub-items “Interoperability” and “Pre-Update Checks”.
The Interoperability Check verifies not only the ESXi hosts but also the compatibility with other VMware products registered in vCenter.
Recently I activated Tanzu with NSX-T in my homelab. After some hurdles in the planning phase, the configuration worked fine and north-south routing worked flawlessly as well. My edge nodes established BGP peering with the physical router and advertised new routes. New segments are immediately available without further configuration on the router.
One feature that distinguishes my lab from a production environment is that it doesn’t run 24/7. After the work is done, the whole cluster is shut down and the system is powered off. An idle cluster makes a lot of noise and consumes unnecessary energy.
Recently I booted the lab and observed that no communication with the router or DNS server was possible from my NSX segments. A perfect case for troubleshooting.
First I checked the Geneve tunnels between the transport nodes. Everything was fine here, and every transport node was able to communicate with every other transport node. The root cause was quickly located in the edge nodes. Neither a reboot of the edges nor a vMotion to another host improved the situation.
The edges weren’t completely offline. They could still be administered over the management network. Traceroute was working via the T1 and T0 service routers up to the fastpath interface fp-eth0. From there, no packets were forwarded.
The interface fp-eth0 is connected to the distributed port group “Edge-Trunk” on vSwitch VDS-NSX. A quick check in the vSphere client showed that the uplink ports of both edges were blocked. Not in the “down” state, but blocked.
At this point, I would ask a customer what they had changed. But I am very sure that I did not make any changes to the system or the configuration. Yes, they all say that 😉