Empowering CUDA Developers with Virtual Desktops (Part 3)


Woot!!! You’ve made it this far, or maybe you started here. In Part 1 of this blog we looked at the problem of installing the NVIDIA CUDA Toolkit on a virtual machine (VM). In Part 2 we walked through how to actually install the CUDA Toolkit on a VM. This post covers why installing the NVIDIA CUDA Toolkit on a VM is a big deal for users and organizations. We’re going to let the genie out of the bottle to grant our wishes! This post answers the “why would you want to do something like that?” question.

The first part of that question, why I would do something like this, has a really simple answer. I like trying out new technology, but I don’t want to build everything on physical hardware; that gets really expensive. Even if I have a single box that I multi-boot, I wind up with several different hard drives and I can only run one thing at a time.

That’s where virtualization makes trying things like this out so very awesome. With virtualization, I get my wishes granted.

  1. I get the ability to have multiple types of projects running in my environment; for example, I can run the same project on both Linux and Windows to see which I like better.
  2. I get mistake protection. Oftentimes I’m fiddling with the kernel, and one slipped keystroke can destroy my whole OS. VM snapshots give me that protection: I can take a snapshot and roll back when I fat-finger something.
  3. Lastly, I can transport and share what I create. Since a VM is just a set of files, I can export it and move it to a new system, or even share it with someone else who wants to look at what I’ve built (it’s great for troubleshooting).

All three of these things are really easy to do with VMs, and not so much with hardware. So I get my three wishes!

[Image: Genie from Aladdin holding up three fingers, saying you get three wishes]

Now for the second part of that question: why would you want to virtualize the NVIDIA CUDA Toolkit, or for that matter most GPU-based computational workloads? I’m going to address this for both individuals and for organizations.

We should first address why individual users would want virtualized GPU environments.

Developer Happiness

First off, by going virtual it becomes super simple to keep up with new technology as it’s released. For example, NVIDIA releases a new GPU. Wouldn’t it be awesome to have that new GPU in your system without having to rebuild the entire host? You can, and it’s pretty straightforward. If NVIDIA supports the GPU in virtualized environments (the Volta GPUs are the only class of enterprise GPUs that are not currently supported) and VMware has it on its HCL, then move the VM to a new host with the new GPU, change its vGPU profile, and if necessary update the driver in the VM. That’s it; there’s no reason to start from scratch and reset everything.
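
As a quick sanity check after a move like that, you can confirm from inside the guest that the driver sees the vGPU and the toolkit still works. Here’s a minimal Python sketch; it assumes nvidia-smi and nvcc are on the PATH, which is my assumption and not something from the earlier posts:

```python
import subprocess

# Post-migration sanity check from inside the guest VM:
# does the driver see the (new) vGPU, and is the CUDA toolkit still present?
checks = [
    ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
    ["nvcc", "--version"],
]

for cmd in checks:
    result = subprocess.run(cmd, capture_output=True, text=True)
    status = "OK" if result.returncode == 0 else "FAILED"
    print(f"[{status}] {' '.join(cmd)}")
    print((result.stdout or result.stderr).strip())
```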

Speaking of new VMs: any developers out there wish they could have multiple development environments so they could work on one project per desktop, but can’t get approval for multiple systems, and don’t want all those systems stacked up under their desk anyway? Virtualization can address this in a few different ways. First, if there is an unlimited budget (I’ve yet to find anyone with that), you can spin up as many identical development systems as you want and just move between them like tabs on your desktop. The second and more realistic way is to support one or two running development environments at a time. When you are done with one, shut it down, and that releases the resources back to the pool where they can be used to run other environments. Think of it as multi-booting a single box on steroids.

[Image: four computers]

It’s great having access to all these extra VMs, but what happens when I’m done with a project? Does the VM just vanish? Well, it can if you want. Or, wouldn’t it be great if you could archive the desktop and save it like you do project files? You can do that! Since VMs are files, they can be archived, which means you can save the development environment for a given project. It’s no longer necessary to lose time rebuilding an environment when you revisit an old project.
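
To make that concrete, here’s a rough sketch of what archiving a finished project’s VM could look like, using Python to drive VMware’s ovftool and export the powered-off VM to a single OVA file. The vCenter address, credentials, inventory path, VM name, and archive location are all placeholders I’ve invented for illustration:

```python
import subprocess

# Placeholder values; substitute your own vCenter, inventory path, and archive target.
VCENTER = "vcenter.example.com"
USER = "administrator%40vsphere.local"     # note: '@' in the username is URL-encoded as %40
VM_PATH = "Datacenter/vm/cuda-dev-01"      # inventory path to the powered-off dev VM
ARCHIVE = "/archive/cuda-dev-01.ova"       # archived desktop lives next to the project files

# ovftool prompts for the password since it is not embedded in the source locator.
source = f"vi://{USER}@{VCENTER}/{VM_PATH}"
subprocess.run(["ovftool", source, ARCHIVE], check=True)
```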

Wouldn’t it be cool to have versioning of your development environment, not just versions of your code but of the actual environment? Virtualizing a workstation allows this with snapshots (note you can’t snapshot a running VM with a vGPU in it at this time). Having snapshots gives you the ability to move around to different points in time in your development environment. A great example of this is the work I did for this set of blogs: I built a base VM, then snapshotted it as I went along. If something didn’t work, I just reverted back to my branch point and tried again. This saved me hours of work rebuilding my test VM.
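
If you’d rather script those branch points than click through the vSphere client, the open-source pyVmomi SDK can drive snapshots for you. Below is a minimal sketch under my own assumptions: a vCenter at vcenter.example.com and a development VM called cuda-dev-01 (and, per the note above, the VM would need to be powered off or have no vGPU attached while snapshotting):

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

# Connect to vCenter (placeholder host and credentials; cert checks relaxed for a lab).
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="********", sslContext=ctx)

# Find the development VM by name.
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "cuda-dev-01")

# Take a branch-point snapshot before fiddling with the kernel or the CUDA driver.
WaitForTask(vm.CreateSnapshot_Task(name="before-cuda-upgrade",
                                   description="Known-good state prior to toolkit upgrade",
                                   memory=False, quiesce=False))

# Later, if the change goes sideways, roll back to that branch point and try again.
for node in vm.snapshot.rootSnapshotList:
    if node.name == "before-cuda-upgrade":
        WaitForTask(node.snapshot.RevertToSnapshot_Task())

Disconnect(si)
```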

This probably sounds great, but it also sounds like a lot of extra work… Every time a new project starts I’d have to do the exact same things to set up a development environment all over again. You know, set global variables, install this package, recompile the kernel headers, etc. No one wants that! Which is another reason virtualizing rocks!!! You can set up a VM exactly like you want and use it as the template for any or all of your development VMs. Then whenever a new project is started, a workspace is already ready to go and there is no need to repeat all the standard installation tasks.
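
Here’s what spinning a fresh workspace out of that golden-image template might look like with pyVmomi. As before, this is a sketch under assumptions: the template name (cuda-dev-template), the resource pool, and the new VM name are placeholders of mine, not something defined in the earlier parts of this series:

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="********", sslContext=ctx)
content = si.RetrieveContent()

def find(vimtype, name):
    """Look up an inventory object (VM, resource pool, etc.) by name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    return next(obj for obj in view.view if obj.name == name)

template = find(vim.VirtualMachine, "cuda-dev-template")  # the pre-built golden image
pool = find(vim.ResourcePool, "Developers")               # where the new workspace will run
folder = template.parent                                   # put the clone next to the template

# Clone the template into a ready-to-go workspace for the new project and power it on.
spec = vim.vm.CloneSpec(location=vim.vm.RelocateSpec(pool=pool), powerOn=True, template=False)
WaitForTask(template.CloneVM_Task(folder=folder, name="cuda-project-42", spec=spec))

Disconnect(si)
```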

Organizational Happiness

I could keep going on with user scenarios, but I know IT admins are chomping at the bit to find out why this is good for their organization. I’d like to switch now to some of the reasons this is a big deal for organizations. Many of these build on the points from above.

Being able to deliver the CUDA Toolkit this way enables a lot of cool options for organizations. The one that springs to mind first is the enhanced security the organization gains from virtualizing these developers. No longer is there an expensive system sitting under someone’s desk; the system has been moved into the data center. There is less chance of something wandering off.

The typical response to this is that it can already be done with physical systems by putting the host in a secure area and letting developers VNC/RDP/RDSH into the machine; same result. Which is true to an extent. However, with virtualization it’s possible to secure the data at a very granular level. With VMware it’s possible to control which devices are allowed to connect to the VM. That means you can disable removable storage for the VM (and any other USB devices you want) and prevent users from copying data and walking out the door with valuable IP.

That may be all well and good, but what keeps users from copying and pasting code or anything else from their development VM to their local device? That’s another cool feature of VMware Horizon: you can disable copy and paste capabilities for VMs. This helps keep digital assets where they are supposed to be, inside the organization.
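
Horizon drives this through pool policies and agent settings, but under the covers the per-VM clipboard switches are just advanced configuration options on the virtual machine. Here’s a rough pyVmomi sketch that sets them directly; the VM name and connection details are placeholders, and in a real Horizon deployment you would more likely manage this centrally through policy:

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="********", sslContext=ctx)
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "cuda-dev-01")

# Advanced settings that keep clipboard data from leaving the guest.
spec = vim.vm.ConfigSpec(extraConfig=[
    vim.option.OptionValue(key="isolation.tools.copy.disable", value="TRUE"),
    vim.option.OptionValue(key="isolation.tools.paste.disable", value="TRUE"),
    vim.option.OptionValue(key="isolation.tools.setGUIOptions.enable", value="FALSE"),
])
WaitForTask(vm.ReconfigVM_Task(spec=spec))

Disconnect(si)
```

The settings take effect on the VM’s next power cycle, and the USB and removable-storage restrictions mentioned above can be layered on top of them.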

That covers one form of data protection, but it would also be great to protect the developer’s desktop from damage. Above I talked about versioning and archiving the developer’s environment. This is another awesome advantage of virtualization (not specific to GPU-based systems): data protection becomes so much easier. You can back up all the developers’ systems. That way, when something gets removed or they corrupt their image, you can just recover the last backup and move on.

Wouldn’t it be great if it were possible to automate delivery of all these systems for developers, rather than ordering a new system, configuring it to organizational policy, and then carrying it to the developer? Not to mention the fact that developers all have their own “special” systems which are completely different from the rest of the organization and are never purchased in a standard way, so there are no discounts to be applied… By virtualizing a developer’s system, all of a sudden things become standardized. The developer can have a system that matches everyone else’s; there’s no need for “special” orders. That means you can standardize on systems in the datacenter too!

Why, you might ask? Because chances are IT already orders something like a Dell R740 or Cisco C240 M4 and has special pricing for it. Thus the only significant variation is the GPUs being installed, which happen to be the same ones used for HPC, Machine Learning (ML), Deep Learning (DL), and VDI. That means it’s probably a standard, repeatable order for IT to place, saving the organization time and money.

This also provides a great life cycle plan. The newest servers can start out hosting HPC, ML, DL, and high-end developer systems; then, once they’ve aged a bit and people need the next big thing, they can be migrated into less demanding roles such as supporting developers with lighter requirements or hosting typical user VDI. This allows the organization to realize additional financial advantages while keeping its developers outfitted with the latest hardware.

You may also hear the claim that performance won’t be on par with a traditional physical system. It actually should be pretty darn close. Because of the architecture used, the hypervisor consumes very few resources and has minimal impact on calls to hardware such as processors and GPUs, so results will be similar between physical and virtual hosts.

I have two things left to cover. The first is the perception that there will be tons of unused resources on the ESXi hosts that house these VMs (which is strangely contradictory to the point above). The typical rationale goes: you have a developer’s system that consumes 80% of the resources of a physical host, and you can’t put another VM or two on the same host because it might impact the performance of the developer’s system. Here’s a simple way around that: use shares. That’s right, use resource shares inside of VMware. The way a share works is that when there is resource contention, whichever machine has the most shares gets priority on the resources! So if the developer’s VM has 10,000 shares and the secondary system has, say, 10 shares, guess which one will win contention arguments for resources? The developer’s system! And when there’s no contention, both systems just keep right on trucking. (This is probably my favorite benefit that I’ve talked about in this blog.)

[Image: Benefits of Virtualizing the CUDA Toolkit]
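
Back to those shares for a moment: here’s a minimal pyVmomi sketch of weighting VMs this way. The share counts and VM names are purely illustrative, and the same approach works for memory shares via the memoryAllocation field:

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="********", sslContext=ctx)
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.VirtualMachine], True)

def set_cpu_shares(vm_name, shares):
    """Give a VM a custom CPU share value so it wins (or yields) under contention."""
    vm = next(v for v in view.view if v.name == vm_name)
    alloc = vim.ResourceAllocationInfo(shares=vim.SharesInfo(level="custom", shares=shares))
    WaitForTask(vm.ReconfigVM_Task(spec=vim.vm.ConfigSpec(cpuAllocation=alloc)))

set_cpu_shares("cuda-dev-01", 10000)  # the developer's VM wins contention arguments
set_cpu_shares("scratch-vm-01", 10)   # the opportunistic workload politely backs off

Disconnect(si)
```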

The last item I’ll cover for organizations is the sharing of resources. When a developer’s system sits at his or her desk, it’s hard to share those resources with other developers when the system is not in use. By virtualizing developer environments, resources can be redistributed to other areas of the organization. Now instead of resources sitting idle under a desk, they can be part of a shared pool. For example, developers in the United States start shutting down their systems (or letting them go idle) around 5 PM, and developers in India start logging in at about the same time… wouldn’t it be great if both could share those resources? Or developers go idle at around 5 and the HPC, ML, or DL systems kick in and start using the idle resources to speed up computing operations.

Hopefully these reasons resonate with both developers and organizations. By enabling developers with VMs configured to leverage GPUs, significant benefits can be gained at both the individual and organizational level (a small sample is shown in the graphic). The genie is now out of the lamp and it can fulfill the wishes of many.

Be on the lookout for additional blogs on things I’ve learned from virtualizing the NVIDIA CUDA Toolkit. Hopefully you’ve enjoyed the blogs on this topic thus far and they have been helpful. If you have questions or comments, please be sure to post them below or contact me directly.

 
