Nvidia’s Manuvir Das: We ‘Mimicked’ VMware For Enterprise AI

The chipmaker’s new Nvidia AI Enterprise software was not only developed in tandem with VMware, but it was also modeled after the virtualization giant’s software to help spur enterprise adoption of AI, Nvidia’s Manuvir Das tells CRN in an interview. ‘We almost want this to be as though Nvidia AI Enterprise is a new software stack created by some team at VMware,’ he says.

Making AI Accessible For Every Enterprise

Nvidia is getting very serious about bringing AI and other GPU-accelerated applications to enterprise customers, and it wants to get things right, which is why the chipmaker is emulating VMware, the dominant virtualization vendor in the data center market.

This mimicry can be seen in the chipmaker’s new Nvidia AI Enterprise, a suite of software applications, tools and frameworks for running applications accelerated by Nvidia’s GPUs on top of VMware’s vSphere platform in existing data center infrastructure. The idea is to make spinning up virtual machines and containers with GPU compute just as easy as it is for general-purpose CPU compute.

[Related: How Nvidia Is Helping Partners ‘Democratize AI’ For Enterprises]

In an interview with CRN, Manuvir Das, Nvidia’s head of enterprise computing, said Nvidia AI Enterprise was not only developed in tandem with VMware, but it was also modeled after the virtualization giant’s software and strategy, from the way vSphere works to the licensing model, to incentives for partners.

“We almost want this to be as though Nvidia AI Enterprise is a new software stack created by some team at VMware,” said Das, a former Dell EMC executive who also played a crucial role in the development of Microsoft Azure. “So everything — the sales motion, VMware’s team, all their channel partners, the incentives they provide to their channel in terms of margin and discounting and [market development funds] and all those things — we’ve just mimicked.”

This means Nvidia AI Enterprise could be just as easy for VMware partners to sell as it is for Nvidia partners, which will open up new services opportunities for both groups, between which there is already some overlap. The goal was to meet the data center market on its own terms instead of creating a new structure for GPU-accelerated applications, according to Das.

“In fact, the way we did the pricing for Nvidia AI Enterprise, it matches the pricing of vSphere, and even the model is priced per [CPU] socket just like vSphere,” he said. “It’s not priced by GPU.”

What follows is a transcript of Das’ remarks from his interview with CRN, where he talked about the importance of Nvidia AI Enterprise to the chipmaker’s enterprise strategy and how the Nvidia-Certified Systems program and Nvidia EGX platform will serve as vessels for those efforts. The transcript has been lightly edited for clarity.

The Big Enterprise Push With Nvidia EGX, Nvidia AI Enterprise

This year represents a big pivot for Nvidia with enterprise customers because, of course, we had a lot of announcements about new use cases and new libraries and all that that we keep doing. But the big shift is, for the first time, we’re saying that for enterprise customers, we’re ready to democratize AI, and we’re ready to make this thing usable by every enterprise customer.

And so really there are two things that I want to highlight for you in what we’re doing this year. First is on the hardware front: As you know, the GPUs are not exactly cheap, and typically when people have done AI, they build servers where they pack them with GPUs and put them in a special part of the data center [that needs] special cooling, all this kind of stuff, and you basically use those servers for one thing, which is your AI workloads.

But what we’ve done here with [Nvidia’s] EGX [reference architecture] is we worked with the OEMs to incorporate GPUs into their volume servers, their 1U, their 2U [configurations]. The Dell PowerEdge series, for example, is one. This is a server that a customer would procure for $10,000 to $12,000; it now has one GPU in it, and its cost goes up by like a couple of thousand dollars. That’s the mental model here with the configurations we’ve been working on.

And the intent is, if you think about customers on premises, there’s so much pressure to go to the cloud as it is, and so every time I’m racking and stacking servers in my on-premises data center, what’s the reason for doing it? A lot of the conversations that we’ve had — and I used to be at Dell Technologies before I came to Nvidia, so I’ve seen that side of it — are about, “okay, we’re doing software-defined data centers now, and we’re putting in commodity hardware, servers that can be used for a variety of purposes, and we’ve got a [request for proposal] out, we’re doing a tech refresh — what’s the next server we put in at some scale and volume to run a variety of workloads?”

And so now we’re in that conversation with the OEMs, and we’re saying, “okay, the server that you present for the RFP has a little bit of GPU in it.” And the idea is, there’s a mix of workloads. For some workloads, actually the GPU will not be used. And it’s a no-regrets thing because it’s not like the server changed dramatically in some way. But for the workloads where GPU is used, it’s significantly better. And so from a workload mix point of view, it’s the right thing to do to stack your data center with these kinds of servers with GPUs in them. That’s one part of it.

And then the other part of it, [the Nvidia AI Enterprise software suite we built] working with VMware, is that it comes into the same [data center] estate. So for the IT admin, it’s the same: You’re creating a pool of servers with VMware. Some of them have GPUs in them. And as you’re provisioning workloads on the same server, some [virtual machines] would use the GPUs because they’re running those kinds of workloads. Other VMs, they just ignore the GPU because they’re running some other workloads that are not accelerated by GPUs yet.

A great example is, some of the early customers we’ve been working with, we had one conversation yesterday where the IT people said, “Our data scientists have been using the cloud in some way. We have a large VMware-based estate, and we are trying to give them access to GPUs on-premises so that they can start to do their work, and we don’t quite know how to do that because we have to build a silo for that. And so we will.” So my counterpart at VMware and I were explaining to them how they can do that now [with Nvidia AI Enterprise], and they’re very excited about the idea because it’s the same pool, the same farm. It’s just another kind of instance type managed by vSphere, but this instance includes some part of a GPU. That’s my final point to you.

The other part of it is, because of our technology with the Ampere GPUs where you can split the GPU into sub-parts, vSphere understands all that now. And so every data scientist can get just as much GPU as they need. So that was the basic philosophy of what we’re trying to do with EGX.
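
As a rough illustration of the GPU slicing Das describes, here is a minimal sketch using the Ampere-generation Multi-Instance GPU (MIG) feature. It assumes the nvidia-ml-py (pynvml) bindings and a GPU that has already been partitioned into MIG instances; each slice it lists is the kind of unit vSphere can hand to an individual VM or data scientist.

```python
# A minimal sketch (editorial example, not from the interview): enumerate the
# MIG slices on GPU 0, assuming the nvidia-ml-py (pynvml) package and a GPU
# that has already been partitioned into MIG instances.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    current_mode, _pending = pynvml.nvmlDeviceGetMigMode(handle)
except pynvml.NVMLError:
    current_mode = pynvml.NVML_DEVICE_MIG_DISABLE  # GPU has no MIG support

if current_mode == pynvml.NVML_DEVICE_MIG_ENABLE:
    for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(handle)):
        try:
            mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(handle, i)
        except pynvml.NVMLError:
            continue  # this slot has no MIG instance configured
        mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
        print(f"MIG slice {i}: {mem.total // (1024 ** 2)} MiB of framebuffer")
else:
    print("GPU 0 is not running in MIG mode")

pynvml.nvmlShutdown()
```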

What Makes EGX Servers Different From Other Nvidia-Certified Servers

Think of EGX as a subset of Nvidia-Certified. And the reason is because when we started the Nvidia-Certified program, it was a continuation of our existing model where [Hewlett Packard Enterprise] is building a server with four GPUs in it, four A100s, etc., and that’s all part of Nvidia-Certified. But that’s not really EGX. That’s more like creating a Range Rover SUV. What we’re talking about is a little mini SUV here, so, of course, every server for EGX is Nvidia-Certified, but its focus is these smaller volume servers, these 1U, 2U kinds of servers.

[EGX can be used] for edge computing or for a farm in a data center. So, for example, with AI, there are actually two quite distinct use cases. One is where you do training, and you do deep learning to train your models. But then the other is inference, where you’ve got the model and now you’re just using it in real time to answer questions. Inference is something that, if you look back at the last decade, has typically been done on the CPU, whereas the training has been done on the GPU. And in the last couple of years we really changed that, and now we have all the software to do the inference on the GPU as well. And so I think that’s a very good use case for EGX within the data center. And then another use case is data analytics, Spark, those kinds of things. So I think you’ll see EGX servers in the data center, and you’ll definitely see them on the edge, of course.
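
To make the training-versus-inference point concrete, the sketch below shows that the same inference code can target a GPU when the server has one and fall back to the CPU when it does not; it assumes PyTorch and a recent torchvision, and uses an untrained ResNet-50 purely as a stand-in model.

```python
# A minimal inference sketch (editorial example): run a forward pass on the
# GPU if one is present, otherwise on the CPU. The model and input batch are
# placeholders; assumes PyTorch and a recent torchvision.
import torch
from torchvision.models import resnet50

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = resnet50(weights=None).eval().to(device)  # stand-in model, no trained weights
batch = torch.randn(8, 3, 224, 224, device=device)  # stand-in input batch

with torch.no_grad():
    logits = model(batch)

print(f"ran inference on {device}, output shape {tuple(logits.shape)}")
```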

Nvidia’s Strategy For Selling Nvidia-Certified Servers With Nvidia AI Enterprise

We’re doing it really as a bundle. So we also have the software, which we announced in conjunction with VMware. Our software is called Nvidia AI Enterprise. We’ve specifically licensed that and set it up the same way as vSphere. The way we are propagating this with the channel and the OEMs is as a solution bundle, where it’s the certified server, the vSphere license and the Nvidia AI Enterprise license, and that’s the AI solution that the channel carries, and that the OEMs carry.

The Amount Of Work It Would Take To Replicate Nvidia AI Enterprise’s Capabilities

So the alternative really would be to run in a sort of unmanaged bare-metal environment, where you basically deploy some flavor of Linux, like Ubuntu or Red Hat Enterprise Linux, on the server. And you could certainly [use Nvidia’s] NGC [software catalog], where we provide containers for different kinds of workloads. So that would be the alternative, a sort of DIY model where you take the same Nvidia-Certified server, you deploy some flavor of Linux on bare metal and then you pick up containers from Nvidia. So the comparison would be with that DIY model.

The first thing is that it’s an unmanaged environment, and it’s not a shareable pool the way vSphere creates one with a cluster. You could try to deploy Kubernetes on your own and all of that, but that has its own challenges. Today, bare-metal Kubernetes is not really easy for somebody to do on their own. And then you would have to get all the hardware capabilities plumbed through, not just the GPU, but the optimizations we’ve done. So that would be the first difference.
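
For a sense of what that DIY plumbing involves, here is a minimal sketch of one small piece of it: asking Kubernetes for a GPU via the official Kubernetes Python client. It assumes a cluster where the NVIDIA device plugin is already installed and advertising the nvidia.com/gpu resource; the pod name and container image tag are illustrative.

```python
# A minimal sketch (editorial example): schedule a pod that requests one GPU,
# assuming a cluster where the NVIDIA device plugin already exposes the
# nvidia.com/gpu resource. The pod name and image tag are illustrative.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda",
                image="nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04",  # example image tag
                command=["nvidia-smi"],  # prints the GPU the pod was given
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},  # one whole GPU (or one MIG slice)
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```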

And then the second difference would be, yes, we provide these containers in NGC, but that’s again a DIY model where, “Here’s last night’s build of the software and go for it.” So NGC is really actually meant for the developer ecosystem that’s developing applications and things, so they get the greatest tools from us. For an enterprise customer, it’s not really a viable model because, for example, there’s no support model with that for any period of time. And enterprise customers are not sitting there and saying, “Oh, I’m going to upgrade to this week’s container, because last week’s is gone.” They need support for some period of time. For example, who picks up the phone if something breaks on Saturday night? So all the enterprise-grade [service level agreement] stuff is not available.

And then the final part is, there are a lot of performance optimizations we’ve done specifically in the Nvidia AI Enterprise software, which again would not be available if you just go and pick up the containers in that fashion. That’s the comparison. The comparison is DIY, which many of our customers do today. They struggle with it, quite honestly, because it’s not easy.

The Services Opportunities For Partners With Nvidia AI Enterprise

One great example is, there’s this whole area of MLOps, which is particularly important to enterprise customers. If you think about these models for AI now, they’re just another form of software, because you end up using this model, you ask it a question, it gives you an answer and you use that in your application. It’s a form of software. If you think of how enterprise companies use software, you go through validation testing of whatever version of VMware or Microsoft software it is, and then you deploy it.

But on the other hand, what’s happening with data science is, “Oh, my data science team told me this model from three days ago is really good, I should use it. And now they’ve got a new one from yesterday, I should use it.” And this is a very uncomfortable kind of thing for an enterprise customer. Like, what’s the lineage? How do I know where this model came from? What’s the data set it was trained on? If I want to go back and change something, who can reproduce that?

When you do software, there’s a whole chain of development, if you will. People use GitHub, things like that. There’s versioning, all these kinds of things that you get used to, which really need to be there for the data science world. There’s a number of startups that have begun working in this area. And if I’m a startup, the challenge for me always is, I want to do innovation up here on a capability like that, like experiment management, data set management, but then I have to do all this work underneath.

But now, what we’re seeing with Nvidia AI Enterprise and vSphere is we’ve done all that work, so ecosystem, please arrive, do interesting solutions for MLOps incrementally on top of what we got and go sell that and drive revenue from that, and we’ve done all the plumbing. So it’s not that different from something like Windows Server, where the model is Windows provides a whole bunch of capabilities and there’s been a huge ecosystem of [independent software vendors] that have put software on top of that, right, so that’s what we envision here, for sure.
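
To make the lineage problem Das describes concrete, here is a minimal, hypothetical sketch of the kind of record an MLOps layer might attach to each trained model version; the field names, paths and helper functions are illustrative and not part of any Nvidia or VMware product.

```python
# A minimal, hypothetical lineage record for one trained model version:
# which dataset it was trained on, which code revision produced it and when.
# All names and paths below are illustrative.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def fingerprint(path: str) -> str:
    """Content hash of a dataset or model artifact, for reproducibility checks."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def record_lineage(model_path: str, dataset_path: str, git_commit: str) -> dict:
    """Write a sidecar JSON file describing where this model version came from."""
    record = {
        "model_sha256": fingerprint(model_path),
        "dataset_sha256": fingerprint(dataset_path),
        "code_revision": git_commit,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    Path(model_path + ".lineage.json").write_text(json.dumps(record, indent=2))
    return record


# Example call (paths and commit hash are placeholders):
# record_lineage("models/churn_v7.pt", "data/train_2021q2.parquet", "9f3c2ab")
```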

The Importance Of Getting Partners To Sell Nvidia AI Enterprise

It’s a huge priority, and that’s where I think it didn’t really come out in the plethora of announcements we have. Literally think of it this way: Fundamentally for Nvidia, in this whole compute space, we’ve got two missions now. One mission is, for the people already using AI and all the work that’s already happening, move the state of the art with the hardware and the software, make the hardware better, new libraries for new things. And then parallel to that is mission No. 2, which is equally important, which is democratize AI. So at the highest strategic level of the company, these are two parallel missions now.

As [Nvidia CEO] Jensen [Huang] likes to say, “You have to wait for the conditions to be right.” And we are now at the point where we feel the conditions are right, not just because there’s maturity in the software and the hardware, but also we work with customers a lot, and so from a customer point of view we really see this now.

On Getting VMware Partners To Sell Nvidia AI Enterprise

From the outset, the mental model we set with VMware is, we almost want this to be as though Nvidia AI Enterprise is a new software stack created by some team at VMware. So everything — the sales motion, VMware’s team, all their channel partners, the incentives they provide to their channel in terms of margin and discounting and [market development funds] and all those things — we’ve just mimicked.

In fact, the way we did the pricing for Nvidia AI Enterprise, it matches the pricing of vSphere, and even the model is priced per [CPU] socket just like vSphere. It’s not priced by GPU. Everything was done with that in mind. The joke we have internally is, the way we’ve done AI up till now, it’s like you go to the Romans and you say, “You need better roads? You’ve got to move to Florence.” But this time we are saying, “Do as the Romans do. How about we give you better roads in Rome and don’t ask you to move to Florence?” That’s kind of the mental model behind everything.

On The Signals Indicating Growing Enterprise Interest In AI

We are very strong partners with the cloud service providers, and we provide them with a lot of GPUs that they then offer as instance types in the cloud. And so we see two things. One is, we see ever-growing demand for GPUs in the cloud. And the cloud service providers, they don’t ask a vendor for more capability unless they see demand and they see usage, so we see a lot of AI activity happening in the cloud. That’s one thing. So from that, we know that enterprise customers are doing a lot more AI.

At the same time, the second thing we see is a number of customers that we talked to, for example in financial services or healthcare, who say, “If my only choice is to do AI in the cloud, I’m kind of stuck, because there’s a reason I have an on-premises data center. When I want to put a little bit of AI into my hospital, in the basement of the hospital where I’ve got servers for different things, and I want to put a couple of servers there to help transcribe this and transcribe that and do that, how can I do that in the cloud? That’s a non-starter for me.”

We see a lot of demand for things at the edge. Everybody who’s got a physical presence is putting a camera in their stores. They just are, because [they’re using] AI to understand what’s going on with the people. So on the one hand, the signal is the activity in the cloud, which shows that a wide variety of enterprise customers are beginning to adopt AI. And then the second signal is all these different verticals where it’s hard to be in the cloud. And so the combination of those two things tells us that, with the OEMs, we need to provide a solution on-prem.