Transcript
Hi, Dave Littman, Truth in IT. Welcome to today's webcast. Today we are supercharging AI in academia. Today's webcast is sponsored by DDN, and we're talking about how DDN's partnership with Purdue University advances data research. In just a second I'm going to hand things over to our panel, led by Mike Matchett. You know Mike: he's CEO and principal analyst with Small World Big Data. Mike will be joined by Kurt Kuckein, who is vice president of marketing with DDN, and Preston Smith, executive director of the Rosen Center for Advanced Computing at Purdue University. We expect today's event to run about 30 to 40 minutes, and of course we will be taking your questions and comments in the chat room. We have staff standing by from Truth in IT to answer any questions you may have about today's video or audio feed, staff standing by on LinkedIn Live to help out with any questions there, and staff from DDN to answer any questions you may have about today's content. But without further ado, let me hand things over to our panel and Mike Matchett. Mike.

Yeah, thanks, Dave. I'm super excited today. I've got both Preston and Kurt here. Sorry, Kurt, I'm more excited about Preston today. We're going to talk about what's going on at Purdue in the advanced computing center that you run there, Preston. Let's just start with that. Could you tell us about the scope of this center that you're directing? What do you do at Purdue?

Yeah, thank you. I'm the executive director of the Rosen Center for Advanced Computing. In a general sense, we're the campus high performance computing provider for Purdue University. We support about 300 computational science research labs from over 60 departments across the university, ranging from mechanical engineering to chemistry to plant sciences to the humanities and the business school. And we have users from all three Purdue campuses around the state of Indiana.

All right. So when you say advanced computing, give us an example of some of the kinds of computing you do there. What's the scale, and what's the advanced part of that?

Sure. Naturally, Purdue is well known for engineering; engineering is Purdue's bread and butter. The vast majority, about two thirds, of our high performance computing utilization comes from the engineering school. Our largest communities of computational scientists are in aeronautical and mechanical engineering: people doing finite element analysis, fluid dynamics, and computations for hypersonics, aviation, and advanced manufacturing, plus computational chemistry and high energy physics. We operate a center within the Rosen Center that's a Tier 2 facility for the CMS experiment on the Large Hadron Collider at CERN, so it's very data intensive. But the other half of Purdue, in addition to engineering, is agriculture, and increasingly agriculture is a very data intensive field as well. We're seeing farmers use drones for remote sensing, so they need fast data pipes for moving their data to the high performance computing resources, plus image processing and sensor networks with IoT. Things that you never would have guessed 20 years ago that farmers would need to do are now increasingly requiring high performance computing, and we're very happy to help our agricultural researchers apply those technologies to their science.

All right, just one more question before I get to Kurt. When we talk about high performance computing, traditionally I think of these massively scale-out kinds of applications: large clusters, maybe doing simulations, things with a lot of interconnect to them, a lot of scratch storage needs. Tell us, though, are you seeing a lot of new AI work coming along? I think everyone's going to ask that anyway.

Yeah, absolutely. We certainly do see a lot of the traditional HPC modeling and simulation use cases, where you need the high speed, low latency interconnects for parallel computing; those CFD people I mentioned are great examples. But we also see lots of people whose computations are more embarrassingly parallel: they run lots and lots of loosely coupled work, which works great on those HPC architectures as well.
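For readers who want that term made concrete: "embarrassingly parallel" just means many independent tasks with no communication between them, so they scale out trivially across cores or nodes. A minimal illustrative sketch in Python; the drone-image workload is a made-up stand-in, not Purdue code:

```python
# Illustrative sketch: embarrassingly parallel work is many independent tasks,
# so a simple process pool spreads it across all the cores of a fat HPC node.
from concurrent.futures import ProcessPoolExecutor

def process_field_image(image_id: int) -> float:
    """Stand-in for per-image work, e.g. scoring one drone photo of a field."""
    # Real code would load the image and run a model; we fake a score here.
    return (image_id * 37 % 100) / 100.0

if __name__ == "__main__":
    image_ids = range(10_000)  # thousands of independent inputs
    with ProcessPoolExecutor() as pool:  # one worker per core by default
        scores = list(pool.map(process_field_image, image_ids, chunksize=256))
    print(f"processed {len(scores)} images, mean score {sum(scores)/len(scores):.2f}")
```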
But increasingly over the last several years, AI is creeping more and more into all sorts of scientific domains. A few years ago it was really the computer scientists, or maybe the computer vision people, who were driving the state of the art in AI, but now our user community comes from all over the university. We're seeing people from the humanities trying to use AI, along with agriculture and the business school. If there's a scientific domain we serve, somebody is trying to apply AI to it. One metric we look at is our AI compute resource: about a year and a half ago, when we ran these numbers, we had I think 74 PIs running on it, and a year later, at the end of 2023, it was up to over 140. So the number of PIs using the central resources to do AI doubled in the span of a year.

That's really interesting. Just to unpack that a bit: I think of engineering uses, and we talked about some of those, but now supercomputing demand has spread across everybody; it's infiltrated from all directions, and your workload sounds like it's doubling in that sense. I don't even think we have time to talk about scheduling and how people share the resources you have. But how has this stressed your infrastructure? What do you look at and worry about with all that demand coming in?

Well, certainly with AI, one of the things that keeps me awake at night as a resource provider is that we have to meet the demand for GPUs. Like everybody else, everybody wants all the GPUs they can get their hands on. And as a university, Purdue has gone 13 straight years without a tuition increase. We're very proud of our ability to contain costs in higher ed, but that means every part of us, including our HPC center, has to be fiscally aware and manage our budgets.
So finding budgets for buying thousands of H100 GPUs, as an example, just isn't something we're able to do. We have to be very creative, and finding the resources to get GPUs to meet that AI need is important. And I'm sure many of the listeners on your broadcast will recognize the second-order challenges that come with buying all those GPUs: you have to deal with facilities, power, cooling, and all the space you need to put those GPUs into.

All right, we're going to take a break from that, because I definitely want to dive deeper into the AI and GPU needs and how you spread those across things. But I want to ask a little bit from the DDN perspective. Obviously you have expertise in providing infrastructure, particularly storage, through EXAScaler and now Infinia, to these places. How are you seeing that develop in the marketplace, particularly with research centers like the one Preston is running?

Yeah. We are definitely encountering the same challenges Preston mentioned right off the top: the fact that our customers are constrained in terms of the number of GPUs they can get and the utilization they want to get out of those GPUs, coupled with budget constraints, real estate constraints, and power constraints that are very real. Right here in Silicon Valley, the ability to build new data center space is allocated out for three or four years, so we're not getting additional space to put these GPUs anywhere. And that's true in lots of geographies; it's not just here in the US. I'm hearing the same things in Europe, if not an even greater challenge. So one of the key things we're pursuing with our solutions is making sure the GPUs that are in place get maximum utilization. We're constantly trying to cut down on overhead with things like increased write performance, so that time spent on unproductive work, say checkpointing during an operation, is minimized. That's time when the GPU is not computing on, say, a training run; it's just writing data to disk. We minimize those types of events in terms of time, so that the cycles are fully used on those GPUs to do productive computing. Similarly, around packaging, we try to fit the most performance into the smallest package, so that most of the data center space can be allocated to this powerful tool, the GPUs. We've gone from needing racks and racks of systems to provide gigabytes or terabytes per second of performance to doing that in just a couple of systems. We're now approaching 100 GB per second in a single 2U system: a super efficient footprint within the data center that gives these GPUs access to shared storage that's as good as if it were local storage.
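To see why that write bandwidth translates directly into GPU utilization, here is a back-of-the-envelope sketch in Python. The checkpoint size, interval, and bandwidth figures are illustrative assumptions, not DDN or Purdue numbers:

```python
# Rough arithmetic: what fraction of wall-clock time does a training run
# spend stalled writing checkpoints, as a function of storage write bandwidth?

def checkpoint_overhead(ckpt_size_tb: float,
                        write_gb_per_s: float,
                        interval_min: float) -> float:
    """Fraction of time lost to synchronous checkpoint writes."""
    write_s = ckpt_size_tb * 1000 / write_gb_per_s   # seconds to flush one checkpoint
    return write_s / (interval_min * 60 + write_s)

# Example: a 2 TB checkpoint written every 30 minutes.
for bw in (10, 50, 100):  # assumed storage write bandwidth, GB/s
    lost = checkpoint_overhead(ckpt_size_tb=2, write_gb_per_s=bw, interval_min=30)
    print(f"{bw:>3} GB/s -> {lost:.1%} of GPU time lost to checkpointing")
```

Under these assumptions, going from 10 GB/s to 100 GB/s of write bandwidth cuts the stall from roughly 10% of GPU time to about 1%, which is the utilization argument Kurt is making.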
Yeah, I can almost see the data center you must have there, Preston, sloshing back and forth: storage was expanding, there's ever more data coming in, agricultural data and video data and so on, and then it's nope, now I've got to crunch that back down, because the GPUs are coming in and taking over floor space. How do you stay on top of that battle of where, and how much of, each resource you keep on the floor?

Yeah, that's a great question. One of the things about being a university center is that we don't have the luxury of having one consistent workflow to optimize for. We obviously have a few major ones, but we have to be able to support applications from CFD to bioinformatics to AI on the same system, so a machine has to do everything pretty well. And over the last couple of years, as both AI and GPUs have become important and CPU systems have gotten so much denser (128- and 192-core compute nodes are now our norm), we've definitely seen the I/O needs for these systems evolve. The rough rules of thumb that said if we're building a system of size X, the storage needs parameters of Y, have really gone out the window as core counts have gotten denser, GPUs have come in, and the workloads have changed. But we've benefited a lot from the scale and density Kurt talked about. Just this last year, on our throughput AI machine, we replaced the storage system underneath it because it was five years old or so, and we went from two petabytes in a rack and a half to, I think, five petabytes in about a third of a rack. That increased density and performance was a real win for us, because it frees up floor space, power, and cooling that we turn right around and use to host more GPUs.

Yeah, it's kind of funny, because I didn't think we'd be having a capacity discussion; this is really about performance at some level. But it's fascinating to me; I used to do a lot of capacity planning for data centers. So let's talk about performance for a bit, and maybe a little about reliability. When you're designing what goes in next, the different generations that come through, how do you even think about that? Since you have such a wide variety of workloads, how do you aim for certain performance points, and how do you think about generations of infrastructure coming through your center?

Generally, the performance points we optimize for are probably IOPS and capacity, and the bandwidth just falls out: at the end of the day you get enough spindles to reach the capacity, and then the performance is good for all of the applications we see. But the most demanding applications in terms of storage needs have been bioinformatics and AI training, all of those things that need to access lots and lots of files very, very rapidly.
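A hedged sketch of that sizing logic in Python: pick the drive count from the capacity target, and see what aggregate performance "falls out." The per-drive figures are illustrative assumptions, and raw aggregates overstate what a real file system delivers after RAID, parity, and network overheads:

```python
import math

def size_for_capacity(target_pb: float,
                      drive_tb: float = 15.36,      # assumed NVMe drive size
                      drive_gb_per_s: float = 3.0,  # assumed per-drive streaming write
                      drive_kiops: float = 150.0):  # assumed per-drive random IOPS
    """Choose drive count for capacity; bandwidth and IOPS come along for free."""
    drives = math.ceil(target_pb * 1000 / drive_tb)
    # Raw aggregates only; delivered numbers are well below this in practice.
    print(f"{target_pb} PB -> {drives} drives, "
          f"~{drives * drive_gb_per_s:,.0f} GB/s streaming, "
          f"~{drives * drive_kiops / 1000:,.1f}M IOPS aggregate")

size_for_capacity(5)   # e.g. the ~5 PB refresh Preston mentions above
```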
Right. And the applications that use GPUs, I should clarify, are increasing that demand for both capacity and performance, would you say?

Yeah, certainly both.

And just to talk a little bit about reliability and restarts: is that a big concern for you? I mean, you're not an enterprise; you're not flying airplanes daily, or space shuttles, in real-life operations. But with so many workloads and so many things going on, it's probably pretty critical that the infrastructure remains available and running near full utilization. So how do you approach that at the center?

Yeah, I think your analogy is spot on. We're not doing healthcare, we're not landing people on the moon, so there's no life-or-death requirement for the machines to be up and running. We aim for probably 96% uptime, one and a half nines, so to speak, and we achieve that easily. That said, no scientist likes having the cluster down for days on end and having their grad students sit idle, so we certainly try to be proactive and make sure the resource is appropriately utilized: that we don't have a runaway science code driving performance down, or hardware failures making the entire resource unavailable for everybody.

Yeah. And back to you, Kurt, a little bit, because what I see standing back is that whatever happens in the university space eventually, I won't say trickles down, but eventually emerges and flows out into commercial use, enterprise use, or even smaller. What they're pioneering very quickly these days drives what happens outside the university. It's no longer 20 years later; it's probably two months later. So what do you guys see?

I think that is one of the major changes we've seen: the demand for this type of computing has exploded in the commercial space. Before, maybe you had a small HPC team if you were a manufacturer, because you were doing some of the work Preston mentioned around CFD, or you were in the defense space, so things like hypersonic flight were something you were researching right alongside the universities. Now, because of AI, the demand is much more present, much more quickly, and folks in the enterprise space are adopting this technology just as fast, right along with the expectation that it's going to operate like their other enterprise computing does. So things like resilience and availability are really, really important to them, and that's an area we're pursuing very heavily as well. One area where the use of a parallel file system benefits those customers is the fact that there isn't one connection to the storage: all the clients know everything about all the data. So when an individual link goes down, or even a GPU fails (and when we're talking about thousands of GPUs, and in some circumstances tens of thousands, you're inevitably going to have failures), being resilient enough to account for those and keep those systems up is hyper important.

Right, at that scale every failure just magnifies in impact. Tell us a little bit, Preston, give us an inside look at what you're doing at the cutting edge. How are you helping your university and your folks
go that next mile, maybe beyond what folks might normally expect to see?

Yeah. Some of the interesting things our center has been working on with researchers around campus: we're certainly involved in some of the national defense initiatives that Purdue is a leader in, doing things with all sorts of aerospace research. We do a lot with controlled unclassified information, that intersection of the defense space with university research. And increasingly, as a resource provider, we're looking at cloud-like interfaces to things. Like many of us who run HPC centers, we're used to the batch computing paradigm with things like Slurm, running jobs in batch, but we've been working with a number of researchers to provide composable, Kubernetes-based systems adjacent to the HPC systems, which opens up all sorts of interesting possibilities for things like stateful databases, portals, and inference engines: all of that cloud-like experience people are starting to expect, particularly in the AI world, but leveraging our HPC investment, physically or network-adjacent to the HPC system. There are a lot of really interesting opportunities in that composable world.

Yeah, we know we can go out today and get AI GPUs as a service from a couple of different providers; getting HPC as a service is a different animal yet again. So it's interesting that you're pushing those buttons.

I think that's also another interesting space where the commercial world is driving the technology, in that these cloud-like services are just an expected way of operating now.

That interface. Yeah, exactly.

I mean, there is no concept of doing it another way. You don't want to wait weeks to provision things. Your folks on staff want to be able to click a few buttons, because they know they can go elsewhere if it's not on demand. So we're similarly working in that direction, in that we have our new platform, Infinia, which is really specifically designed to offer those types of features, not just to commercial enterprises, but also to folks who are doing research work at scale.

I think on the storage side, that cloud-like interface is a very interesting place to explore. In HPC, POSIX file systems are still the interface to data that many research codes look to and that people are used to. But when you look at the cloud world, with object stores and that S3 interface, it's really attractive, it's really flexible, and it can be really performant. So I think it'll be an interesting time, if scientific codes can adopt that directly, or abstract it in some way, and take advantage of those new types of architecture.
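As a concrete illustration of the two interfaces Preston contrasts, here is a minimal Python sketch: a POSIX read as most research codes do it today, next to an S3-style object read via boto3. The file path, bucket, key, and endpoint are hypothetical, and the S3 branch assumes boto3 is installed and credentials are configured:

```python
import boto3

# POSIX: the file system hides location and layout behind open()/read().
with open("/scratch/project/sample.dat", "rb") as f:   # hypothetical path
    posix_bytes = f.read()

# S3 object interface: explicit bucket, key, and endpoint, so the same call
# can point at any S3-compatible store, in the cloud or on premises.
s3 = boto3.client("s3", endpoint_url="https://s3.example.edu")  # hypothetical endpoint
obj = s3.get_object(Bucket="project-data", Key="sample.dat")    # hypothetical names
object_bytes = obj["Body"].read()
```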
Yeah, it really sounds like your center, or even your job, is evolving from providing infrastructure for engineering students to play with to being an on-demand resource across the entire university, for anybody who has applications that need that infrastructure. And given the applications you've mentioned already, AI, agriculture, it sounds like every department is going to start coming around and saying, I need this, and I need it now, and I don't want to spend eight months planning for it. So where do you see storage going at the Rosen Center? What would you like to see built out?

Well, I think the big changes and the big opportunities are going to be around that object interface to storage; I think that's going to be an interesting transition in the next couple of years. I think increasingly the I/O needs are going to require ubiquitous flash everywhere. We've traditionally been judicious, putting flash tiers here and there for the right use case and being cost sensitive about it, but these AI workloads are going to require lots and lots of flash, so that's going to be a different thing to manage. The other really interesting thing, and I'd say this is more of a data question than a storage question, is doing all the things around FAIR data. How do you make the data findable, discoverable? All of those library-type concerns. If there's a data set a Purdue researcher has generated or licensed, how do we tell everybody at the university that this is a thing we have, and if you want to run science against it, here's how you can get to it? Those sorts of questions are also very interesting, and maybe a little more abstract than the storage problem.

I think that is exactly what we find when we talk to customers in the corporate world: they've got 14 different global divisions, they've licensed data here and there, they don't know what they have, they don't know how to use it or leverage it, and they're probably only looking at one or two applications. Whereas you've got this challenge in spades; you've got all sorts of things. So, Kurt, I'm going to ask you about this. I know some of these are things I've heard you guys talking about addressing, but where is that going with DDN?

Yeah, it's for sure a huge challenge. Just being able to discover and manage the data in a centralized way is a huge problem out there today, and more and more places are choosing different ways to solve it. We see things built into the product, like a database that aids in tracking, along with automation, as going a long way toward solving some of these problems. You also have the distributed sensors out there that Preston mentioned at the beginning; these are huge challenges, in that in some cases they're bringing in really high resolution data, generating a lot of data. So what do you do with the data being generated out there on the edge? How do you figure out which of it to bring into a core data center, which of it just needs to stay there for reference, which of it you can actually dispose of, and how do you make sure the thing you're going to need in two weeks is still available and can be brought back quickly? Those are major challenges, but we see what we're basically setting up as a data plane as a core element of satisfying that need: if you have one plane that can manage from edge to center to cloud, then it is touching all of the data, and you have the database that also features the ability to automate things like data movement. Then you can create the awareness of what's out there, where it is, and how to get it to the place it needs to be, when it needs to be there, without being subject to all sorts of manual intervention from administrators.
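A hedged sketch of the kind of placement decision that automation has to make over and over; the metadata fields, tiers, and thresholds here are invented for illustration and are not DDN's actual API:

```python
# Illustrative edge-to-core placement policy: given simple metadata about a
# dataset, decide whether it moves to the core, stays at the edge, or is dropped.
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    size_tb: float
    days_since_access: int
    needed_for_training: bool

def placement(d: Dataset) -> str:
    if d.needed_for_training:
        return "replicate to core"   # hot: move it next to the GPUs
    if d.days_since_access <= 14:
        return "keep at edge"        # warm: likely to be recalled soon
    return "archive or dispose"      # cold: free up the expensive tiers

for d in [Dataset("drone-flight-0712", 4.2, 2, True),
          Dataset("sensor-log-q1", 0.8, 10, False),
          Dataset("raw-video-2022", 60.0, 400, False)]:
    print(f"{d.name}: {placement(d)}")
```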
Oh, that's great. So, given where we're at, I could probably start to drill down and we could talk about bits and bytes and bolts for the next five hours, but I think we've covered what we wanted to talk about today. I've got one last question for you, Preston. If you have one piece of advice for someone watching this today who says, I'm trying to stay ahead, I want to be on top of things, I want to keep one foot in the future, and who looks to you as providing that leadership, what would you tell them?

I'd probably answer that without a technical answer. I would advise somebody, particularly if you're building a university-based center, or even a research lab or a corporate one, that the key thing you need to be able to convey is the value proposition. This has been one of my own interests, as a scientist myself: to build awareness among people in roles like mine so they can talk about these technology needs not just as technology needs, but as the way they return value to the organization. If you're in financial services or healthcare, HPC is the thing that allows your product to be produced, or made more cheaply, or whatever else. And in higher ed, the high performance computing and data investment is what allows us to train students, to win research grants, and to let our scientists get publications. So frame your needs in those terms, and I think you'll be successful in delivering the services your research community needs.

Awesome, awesome. And I will just mention that I have a rising senior here at home who's looking at universities and trying to figure out where to apply, and I'm sure he's going to be very interested in looking at Purdue, having watched this when it comes out. So thank you very much for that.

The tuition has been frozen for 13 years now.

Thirteen years. Awesome. Kurt, one last thing for you guys. What would you want someone to walk away with, if they're looking at this and thinking, I need to find out more about this, what's going on in this world, and how is it going to affect me?

Yeah, for sure. Well, interestingly, we just relaunched our website, so we have a brand new set of information out there for people to explore; definitely go check that out. Reach out to us. We have a ton of expertise built up over the last few years assisting folks like Preston, but also helping folks build commercial resources, especially with this explosion in AI. Folks are turning to DDN to manage their at-scale growth.
You see AI consumer after AI consumer, AI clouds, all of these places turning to us to figure out how to manage this really large data problem, keep it as simple as possible, and keep utilization as high as possible.

Awesome. Thank you guys for being here today. I had a blast. Thank you, Preston. Thank you, Kurt.

Thank you. My pleasure.

Great. And with that, I'm going to turn it back over to Dave. Dave, back to you.

Okay. Awesome, Mike, thank you. Great job. Preston and Kurt, do you guys have time for a couple of questions? I'd love to have you hang around. Okay, awesome. So this one came in; it has to do with how you balance cybersecurity and data sharing. Let me ask that to you first, Preston, and then we'll pass it on to Kurt.

Yeah. At a university like Purdue, cybersecurity is very important. To those ends, we were very excited that within the last couple of months we were certified ISO 27001, which is very exciting for the university: having that baseline cybersecurity configuration. But on our HPC systems it's important too. We like to keep our researchers' data located only within their lab. We offer a service called the Research Data Depot, which gives them full control of their data: they can share it among their graduate students, it's closed off from everybody else, and we have tools that allow them to share individual files or directories with their collaborators as they see fit.

Okay, great. Thank you, Preston. Kurt, anything to add there?

Yeah, I think we see more and more organizations who need that type of capability, and we're building it right into our products. Both EXAScaler and Infinia feature multi-tenancy, and with Infinia it's actually a first-order development feature, really part of the core of the system, so you can maintain really secure multi-tenant access. You can have data that's shared widely within the system, and also data that's really tightened down, so that only certain folks accessing the system can get to it. That, combined with data governance, I think will be the next level of security features built into the DDN product line.

Okay, awesome. We only have time for one or two more questions, so let me get this one over to you, Preston. How do you manage campus-wide redundancy?

Yeah, that's a great question. We do it by offering storage to our researchers in a couple of different tiers. One tier is tightly coupled with our HPC systems, and there I'll only half-jokingly say we address redundancy by not addressing it: we optimize those purely for speed. Our scratch systems on the high performance computing systems (being in the state of Indiana, I always talk about them like an Indy car) are built to go fast; they might hit the wall, but they're not going to come apart completely. But we do provide other storage systems for researchers that are much more optimized for situations where they don't want to worry about their data perhaps getting purged, with multi-site redundancy within the file system, snapshots for data protection, and other tools like that.

Okay, awesome. Okay, last question. Kurt, let me send this over to you.
Does DDN offer HPC, or storage for HPC, as a managed service, or do you enable systems integrators to offer it as a managed service?

We don't directly offer that, but we partner with a lot of the leading providers. One of the big things right now is interest in GPU cloud systems, and a lot of those GPU cloud vendors are integrating DDN as the back end of their service, just because, again, utilization is very high with the DDN systems, and the ability to scale those systems as needed to accommodate the interest of their customers is definitely there. So folks like Lambda Labs or Vultr in the US, or Scaleway internationally, are all on-demand services using DDN on the back end to provide those systems.

Okay, fabulous. That's all the time we have for today. Thank you so much for joining us. Many thanks to DDN for sponsoring today's event; without their generous support, we could not have held it. Many thanks to Preston Smith with Purdue University, and to Kurt Kuckein, obviously, with DDN.