Catch the video interview here: HPC for the Enterprise
Matchett: Hi, I'm Mike Matchett with Small World Big Data. Today we're going to talk about how HPC technologies are finally coming to enterprise I.T. in a big way through the workloads we're all hearing about. For AI, machine learning, and processing massive amounts of data, we're all going to need more power in our data center. And I don't mean data center literally, but the center, wherever it is, where we process our data. Today I've got Kurt Kuckein from DDN. He's a senior director of marketing, and we're going to talk about something new they're doing with a partner of theirs called NVIDIA. You may have heard of them. GPUs from NVIDIA on one hand and supercomputing file system technology from DDN on the other, married together into a reference architecture. It's pretty cool stuff. Welcome.
Kuckein: Thanks for having me on.
Matchett: So let's get right into it. What is this called? DDN A3I... what does that mean?
Kuckein: Sure. These are our architectures that are specific for AI: any-scale AI architectures, which is where we derive the A3I name. They are storage systems tuned specifically for the needs of AI and machine learning workloads.
Matchett: So DDN has long been known as kind of the high-end storage provider for supercomputing, and for workloads in enterprises that do massive simulations at that supercomputing edge. We've talked about this before: is AI really the new workload for the enterprise? What's going on there?
Kuckein: Well, that's where we're seeing a lot of interest in HPC in the enterprise, but from customers who don't necessarily have experience with these types of big data workloads. They're asking: what's the easiest way for me to deploy this kind of thing? Do I need to hire an HPC architect? Do I have to hire a bunch of eggheads? You know, not that I'm not a smart guy as the I.T. administrator, but it's just a place where I don't have experience. So these reference architectures are designed to be kind of the easy button for those enterprise I.T. folks. They don't have to deeply learn concepts like parallel file systems, which might be new for them. It's really designed so that they can deploy something comprehensive, easily supportable, and scalable over time. Those aren't things they have to worry about.
Matchett: So the reference architecture is not just DDN products; you've worked closely with NVIDIA here, because there's a close tie between those two things. You've got the networking figured out, the racking, and everything else. If I'm in I.T. and someone asks me to build a supercomputing cluster, well, there goes the rest of my life. Instead you say: here's a blueprint, and it works, and we'll support it with one call no matter what parts are in there. Right? It's pretty much a convergence kind of value proposition.
Kuckein: It is exactly that, and there are additional pieces beyond that. It's not just us standing behind it, or us and NVIDIA; we're enabling resellers to be able to support this solution themselves. And there's a comprehensive set of information behind this reference architecture, so you're not just getting a nice pretty picture of "here's a bunch of stuff in a rack," but real sizing guides to be able to say: "Okay, here's my problem, here's the framework I'm using, here's the type of workflow I anticipate. Where do I start? Do I get a single DGX-1 from NVIDIA and one storage system? Or should I be looking at something more scalable right off the bat, to meet the needs I anticipate over the next six months? And then what's going to happen three to five years from now, when my project is wildly successful?"
Matchett: We won't spend a lot of time on the general idea of reference architectures; I just think it's really cool that this is a reference architecture for supercomputing that I can now buy for my enterprise. Let's dig into the architecture a bit: what goes into the recipe? There's obviously some DDN storage, and we keep mentioning NVIDIA DGX-1s, which are fairly new to market themselves. Explain both of those parts a bit, and then how they work together.
Kuckein: The DGX-1 is an architecture from NVIDIA that's really designed to be kind of the easy button for AI. It includes your entire operating environment, and it includes frameworks that are tuned by NVIDIA to really work for customers. Our reference architecture combines that with the backend storage required to really drive those workloads. We take a little bit of a different approach from some of the other enterprise storage vendors on the market who may be doing similar things in AI, in that they're taking a more traditional architectural approach, something like NFS, which most customers are familiar with in the data center. That's really designed for fairly easy sharing, but it was built for more conservative workloads and isn't as scalable as a parallel file system. We're taking a parallel file system, something we've used in scalable architectures that drive multiple terabytes per second of performance and hundreds of petabytes of storage capacity within a single namespace, and really shrinking it down into something consumable for the enterprise I.T. data center. There are other advantages to that technology. If you look at the DGX-1 itself, it was designed to utilize an RDMA network for attachment to external storage, so we're using InfiniBand or 100 GbE to drive the ultimate performance from our storage systems all the way into the DGX. This kind of comprehensive approach not only makes for a storage system that feeds data out really fast; it accelerates the application, and it even enables the DGX-1 itself to be more performant, in that it frees up the internal network for inter-GPU traffic.
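To make that striping point concrete, here's a toy Python sketch, purely illustrative and not DDN's actual client code, of why reading a file's stripes concurrently from many storage targets beats funneling everything through a single server. The stripe size, per-target bandwidth, and stripe width are all made-up assumptions, and real parallel file system clients do this in the kernel, not in Python.

```python
# Toy sketch: striped, parallel reads vs. a single-server funnel.
from concurrent.futures import ThreadPoolExecutor
import time

STRIPE_MB = 64        # assumed stripe size
TARGET_BW_MBPS = 500  # assumed bandwidth of one storage target
NUM_TARGETS = 8       # assumed stripe width (number of storage targets)

def read_stripe(index: int) -> int:
    """Simulate fetching one stripe from storage target index % NUM_TARGETS."""
    time.sleep(STRIPE_MB / TARGET_BW_MBPS)  # stand-in for network/disk time
    return STRIPE_MB

def read_serial(num_stripes: int) -> float:
    """One stream at a time: everything funnels through a single server."""
    start = time.perf_counter()
    for i in range(num_stripes):
        read_stripe(i)
    return time.perf_counter() - start

def read_parallel(num_stripes: int) -> float:
    """One stream per target: the parallel file system pattern."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=NUM_TARGETS) as pool:
        list(pool.map(read_stripe, range(num_stripes)))
    return time.perf_counter() - start

if __name__ == "__main__":
    stripes = 16  # a 1 GB file at 64 MB stripes
    print(f"serial:   {read_serial(stripes):.2f} s")
    print(f"parallel: {read_parallel(stripes):.2f} s (~{NUM_TARGETS}x faster)")
```

Under these assumptions the aggregate read bandwidth scales with the number of storage targets, which is the property that lets one namespace grow to the terabytes-per-second range Kuckein describes.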
Matchett: So there are a couple of performance benefits there, and a capacity benefit. You can spend a lot of money buying DGX-1s; you want to drive them as fast and as fully as you can, in both directions. By removing the storage latency you can feed that thing and keep it doing more work, less waiting, and less thrashing. And by keeping it fully utilized you're going to get more out of the box, throughput-wise. So there are a lot of reasons there, and I'm sure there's a TCO calculation somewhere where someone can say, "Hey, if I buy DGX-1s with a bunch of GPUs and do this, I only get this much return, but if I put that DDN storage layer underneath it, I get a couple of multipliers that are really going to work."
Kuckein: We have published use cases that bear that out. We've had partners like Parabricks, who are doing genomic exploration using GPUs. By moving from a CPU architecture to a GPU architecture, they take genomic analysis from a week-long scenario to a day-long scenario. And that's just using GPUs with whatever storage you have lying around. Put a DDN system behind it, and you take that day-long scenario down to a couple of hours. So something that took a week to run last year, now you're talking about a couple of hours to run the genomic sequence.
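The rough arithmetic behind that claim compounds like this, in a back-of-the-envelope sketch where "a couple of hours" is pinned at two hours purely for illustration:

```python
# Back-of-the-envelope compounding of the speedups described above.
# Illustrative numbers only: "a couple of hours" is assumed to be 2.
cpu_hours = 7 * 24   # week-long run on CPUs
gpu_hours = 24       # day-long run after moving to GPUs
gpu_ddn_hours = 2    # a couple of hours with DDN storage behind the GPUs

print(f"GPU speedup:        {cpu_hours / gpu_hours:.0f}x")      # 7x
print(f"storage speedup:    {gpu_hours / gpu_ddn_hours:.0f}x")  # 12x
print(f"end-to-end speedup: {cpu_hours / gpu_ddn_hours:.0f}x")  # 84x
```

The point is that the two multipliers stack: the GPU migration and the storage upgrade each buy roughly an order of magnitude, and together they turn a week into an afternoon.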
Matchett: I can't wait until it shrinks down to my desktop and I can do this in a couple of minutes. But we're getting there; we're getting very close to that. Tell me a little bit about the other use cases. This comes around full circle to where we started this conversation: AI workloads for the enterprise, or for more enterprises than just the ones that used to run simulations. What do we see coming down the line that people could do with this?
Kuckein: We have examples from across industries. AI is such a broad and inclusive term, and even the frameworks out there are so broad and applicable to so many people, that we're seeing this everywhere. You've got your very AI-centric types of customers: autonomous retail, autonomous vehicles. These are places people directly associate with AI. But we're also seeing it in more traditional industries, like manufacturing, for anomaly detection and quicker identification of faulty parts; you can automate a lot of that work. We're seeing it in financial services; again, fraud detection, things like that. Speeding up those processes even more by applying GPU capability to their processing is just transformative. And we see it in the government space, with more and more autonomous vehicles being leveraged by those folks. There's a need for inference in the field, to be able to do real-time processing, but the development of these systems is also done through training and constant retraining, to refine the operation of the ones out in the field.
Matchett: There are a couple of types of things you're selling under this A3I banner: a performance version and a capacity version, and you might mix and match them, whether you're doing something closer to the edge or something capacity-oriented in the data center. Where can someone find out more information about this? I assume we come to your website, but is there anything special they should look for?
Kuckein: Yeah, we have a whole section on A3I on our website, and right now we even have a banner right up front that links directly to all the info they could want. On the website are things like white papers, the reference architectures themselves, and customer use cases, so they can really figure out: how can I apply these concepts to my own workloads and be successful?
Matchett: I can't wait for folks to open the door to the warehouse, start pulling in this rack that's basically their own supercomputer, plug it in, and feel the power they'll now have at their fingertips in their enterprises. Very exciting stuff. Thank you, Kurt, for being here.
Kuckein: For sure.
Matchett: Thanks again for watching, and I encourage you folks to stay tuned. We're going to cover more and more exciting solutions in this space, including, I hear, some more things coming out from DDN soon. So take care. Talk to you all soon.