Number of vGPUs Available in vSphere


I’ve been working on a PowerCLI function for the last few months in my free time, and now I’d like to share it with everyone. This is a pretty spiffy function for those folks working with vGPUs, and it’s not just for VDI; it can help those looking to virtualize ML/DL systems too. (Think VDI by day, compute by night.) The function calculates the vGPU carrying capacity of a vSphere environment.

What that means is: say I have an environment with 30 hosts, each of which has a bunch of NVIDIA GPUs in it, and on those hosts are a bunch of VMs with various vGPU profiles attached. This function will determine how many more VMs can be powered on with a given vGPU profile.

Yup, you read that right: how many more VMs using a given vGPU profile can be powered on.

PowerCLI Code Snippet for Carrying Capacity Function
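
I’m not going to paste the whole function in this post (it’s in the GitHub repo linked below), but to give a feel for the approach, here’s a minimal sketch of the idea: figure out how many instances of a profile the physical GPUs could hold, subtract the vGPU devices already attached to powered-on VMs, and return what’s left. The per-GPU capacity table and function name below are illustrative assumptions only, and the sketch ignores the GPU model matching and profile mixing rules the real function has to handle.

# Minimal sketch only, not the published function. Assumes a hand-made lookup
# table of how many vGPUs of each profile fit on one physical GPU (illustrative values).
$ProfilesPerGpu = @{ "grid_p40-2q" = 12; "grid_p40-4q" = 6 }

function Get-RoughVgpuCapacity {
    param([string]$vGPUType)

    if (-not $ProfilesPerGpu.ContainsKey($vGPUType)) { return 0 }

    # Count physical GPUs configured for shared direct (vGPU) graphics across all hosts
    $gpuCount = 0
    foreach ($vmhost in Get-VMHost) {
        $gpuCount += @($vmhost.ExtensionData.Config.GraphicsInfo |
            Where-Object { $_.GraphicsType -eq "sharedDirect" }).Count
    }

    # Count vGPU devices of this profile already attached to powered-on VMs
    $inUse = 0
    foreach ($vm in (Get-VM | Where-Object { $_.PowerState -eq "PoweredOn" })) {
        $inUse += @($vm.ExtensionData.Config.Hardware.Device |
            Where-Object { $_.Backing -is [VMware.Vim.VirtualPCIPassthroughVmiopBackingInfo] -and
                           $_.Backing.Vgpu -eq $vGPUType }).Count
    }

    return ($gpuCount * $ProfilesPerGpu[$vGPUType]) - $inUse
}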

Now let’s make it a bit more interesting. What if you could specify the cluster, or any VI container, in your vSphere environment for that calculation? Well, the function I’m sharing does that too.

Let’s take it a step further: what if I only want to calculate my vGPU carrying capacity for powered-on hosts, or hosts in maintenance mode, or disconnected hosts? The function will do that too.

How about mixed GPUs? For example, M60s, P40s, and P4s all in the same environment. The function deals with it all.

I’ve put the script with the function up on GitHub here: https://github.com/wondernerd/vGPUCapacity

Feel free to do whatever you want with the script; I’ve published it under the GNU version 3 license, so it’s free for you to use however you like.

I’m not going to spend much time talking about the concepts and constructs behind it, but I will spend some time talking about how to use the function.

The bulk of the script file is a function that does all the heavy lifting. The function, vGPUSystemCapacity, takes three arguments: one is required, and the other two are optional. The function returns the number of VMs that can be started with the given profile, and if an error occurs it returns -1.

 vGPUSystemCapacity vGPUType as String [vGPULocations as string] [vGPUHostState as string {connected,disconnected,notresponding,maintenance}] 
returns int [-1 on error] 
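
To try it yourself, load the function into your session first and make sure you’re connected to vCenter. The filename and vCenter address below are just examples; use whatever the script in the repo is actually called.

# Dot-source the downloaded script so the function is available in this session
# (filename is an example; check the repo for the actual name)
. .\vGPUCapacity.ps1

# Connect to vCenter before calling the function
Connect-VIServer -Server vcenter.example.com

vGPUSystemCapacity "grid_p40-2q"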

The required argument is a string corresponding to the vGPU profile, in the format “grid_p40-2q”. The format is “grid_” followed by the physical GPU type (“p40”), followed by a dash, followed by the vGPU profile (“2q”). The vGPU profiles can be found in the NVIDIA vGPU User Guide. This is shown in the following example of a function call requesting the results for a “grid_p40-2q” vGPU profile:

vGPUSystemCapacity "grid_p40-2q" 
200

Invalid vGPU profiles do not cause errors, so if you were to pass the function a value of “ColdPizza” for the vGPU card type, the function will return 0, as the system cannot support any “ColdPizza” type vGPUs.

vGPUSystemCapacity "ColdPizza" 
0
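
If you’re not sure which profile strings your hosts actually support, you can pull the list straight from the host configuration. This is just a quick sanity check against the vSphere API, separate from the function itself:

# List the vGPU profile names each host reports as supported
Get-VMHost | ForEach-Object {
    [PSCustomObject]@{
        Host     = $_.Name
        Profiles = ($_.ExtensionData.Config.SharedPassthruGpuTypes -join ", ")
    }
}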

When the function is called with two arguments, the second argument is a string that corresponds to the VIContainer (i.e., cluster) you want to calculate the carrying capacity of. For example, if I have a cluster named “Production” I would pass that to the function as its second argument. You can also pass a wildcard character to capture all valid VIContainers. When no second argument is passed, “*” is the default value, which includes everything in the vSphere environment. The example below builds on the previous example, capturing only vGPUs in the cluster “Production.” You can read more about the VIContainer type in the PowerCLI cmdlet reference.

vGPUSystemCapacity "grid_p40-4q" "Production" 
100
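
And since the second argument is just a string, it’s easy to loop over your clusters and report capacity for each one. A quick illustration, assuming the function is already loaded:

# Report carrying capacity for one profile across every cluster
foreach ($cluster in Get-Cluster) {
    "{0}: {1}" -f $cluster.Name, (vGPUSystemCapacity "grid_p40-4q" $cluster.Name)
}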

The third variation of the function takes the host state into account when calculating the carrying capacity. The third value is a VMHostState value passed to the function as a string. The valid values for host state are “connected”, “disconnected”, “notresponding”, and “maintenance”. You can read about these in the PowerCLI cmdlet reference as well. The cool thing about these states is that you can string them together as a comma-delimited list to capture multiple state types at once. When no string is passed, the function defaults to “connected,disconnected,notresponding,maintenance” and will gather all host states. Continuing on from our previous example, if we wanted to see the vGPU carrying capacity for connected hosts and hosts in maintenance mode, we would call the function like this:

vGPUSystemCapacity "grid_p40-4q" "Production" "connected,maintenance" 
80

I built and tested this function on VMware PowerCLI 11.0.0 build 10380590 and on PowerShell 5.1.14409.1005. It should be backwards compatible several generations, back to the point where the vGPU device backing was added to PowerCLI, though I’m not sure when that was.
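
If you want to check what you’re running before trying it, the versions are easy to pull:

# PowerCLI module version
Get-Module -ListAvailable VMware.PowerCLI | Select-Object Name, Version

# PowerShell version
$PSVersionTable.PSVersion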

That gets you through working with the vGPUSystemCapacity function I created. If you’ve made it this far, you may already have some ideas about what you can do with it. Here are some things I’d like to do with it:

  • Use it to monitor how many more VMs of a given type I can instantiate on a system
  • Capture usage patterns throughout the day, letting me know when I’m at peak vGPU utilization in my environment (a rough sampling sketch follows this list)
  • Use this as a core function to enable VDI by day and compute by night.
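
For the first two bullets, a simple sampling loop gets you most of the way there. Here’s a rough sketch that polls the capacity for one profile every 15 minutes and appends it to a CSV; the profile name, interval, and file path are just placeholders:

# Rough sampling loop: log remaining capacity for one profile over time.
# Profile name, interval, and output path are placeholders; adjust to taste.
$vgpuProfile = "grid_p40-2q"
$logFile     = "C:\Temp\vgpu-capacity.csv"

while ($true) {
    [PSCustomObject]@{
        TimeStamp = Get-Date -Format "s"
        Profile   = $vgpuProfile
        Capacity  = vGPUSystemCapacity $vgpuProfile
    } | Export-Csv -Path $logFile -Append -NoTypeInformation

    Start-Sleep -Seconds (15 * 60)   # sample every 15 minutes
}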

Let’s touch on this last bullet a bit. VDI by day and compute by night is a term lots of folks are throwing around these days; in fact, I’ve done a blog post on it myself. The premise is very simple: GPUs are expensive, so why let them sit idle in the data center when no one is using them? Capture back that time by letting them crunch on some business problems, traditionally at night. To do that we need to know one important thing: at any given time, how many vGPUs of a given type are available to perform compute tasks?
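
With the carrying capacity number available, a very rough sketch of the night-time half might look like this. The “ComputeNight” tag, the profile, and the assumption that every compute VM uses the same profile are all placeholders of mine, not anything built into the function:

# Very rough sketch of "compute by night": after hours, power on tagged compute
# VMs only while there is spare capacity for their vGPU profile.
$computeProfile = "grid_p40-2q"
$available      = vGPUSystemCapacity $computeProfile

if ($available -gt 0) {
    Get-VM -Tag "ComputeNight" |
        Where-Object { $_.PowerState -eq "PoweredOff" } |
        Select-Object -First $available |
        ForEach-Object { Start-VM -VM $_ -Confirm:$false }
}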

Now if only there were some sort of function that could tell me that number… then maybe it would be possible to create a PowerCLI script that could manage all of that for me… Hmmm… I wonder what I’m working on next???

That gets us to the end of this post. Hopefully this script is helpful. If you have improvements, post them below or on GitHub. If you run into questions about using the script, drop me a note in the comments below. And if you do something cool with it, please be sure to share it with the community.

May your servers keep running and your data center always be chilled.
