Transcript
Hi, Mike Matchett with Small World Big Data, and I'm here today talking, of course, about artificial intelligence and how you deploy it at scale, with speed and efficiency, in your organization. Every IT shop is wrestling with its AI infrastructure decisions, its MLOps, its DevOps: how do I get some bang for the buck I'm investing, and not just waste all that money on GPUs? Hang on, I've got a solution for you in just a second with Pipeshift. Hey, welcome to our show, Arko. You have some emerging technology there with Pipeshift. Before we dive too far into AI and some of my favorite things about it, maybe tell us a little bit about yourself. How did you get involved in AI? What gets you excited? And what are you doing with Pipeshift in general?

Thanks for having me here, Mike. How I got into AI is a long story; it started a long time ago, during my undergrad about six or seven years back, when we were building robots for different defense use cases. We ended up using a lot of traditional ML models, deploying them on cloud and on edge, so I genuinely worked on classic AI/ML before generative AI arrived. Once GenAI was in town, we ended up building a large AI deployment that ran inside an enterprise of 1,000-plus employees, fully powered by a Llama 2 model. As a team, we are huge proponents of open source; for the longest time all my devices ran Linux rather than Windows, and I think that's where the Llama 2 bug struck me. And then we saw firsthand how hard it is to scale large workloads on these new-age models: the models have become huge, the infrastructure that supports them is all GPU-powered, and that's expensive. Enterprises that are now past their POC cycles and moving to bring some of this tech in-house are grappling with questions like: how do I set up the infrastructure? What kind of investments do I need to make? How do I scale it? For the last two years, all they were interfacing with were APIs handed to them by the model providers, and deploying models themselves requires a lot of tooling and orchestration. These enterprises have billions of dollars of budget behind them, and billions of dollars already invested in setting up their existing infrastructure. Do they scrap all of that? Do they stand up new infrastructure? That's the problem we decided to solve after running that large workload, and that's how Pipeshift came into being.

All right. Just a little more before we dive into Pipeshift itself: let's talk about the problems someone faces when they're setting up their AI infrastructure. They can use public clouds, obviously, but they get a big bill, or an unknown bill; they don't know what they're going to pay because it's pay-as-you-go, so they end up surprised. Then they decide, well, we should own some of our own real estate here: we think we're going to have AI forever, and we're going to be competitive in it. So they start looking at buying and investing in nodes, clusters, and basically GPUs. What tend to be the bad things that happen to them when they start doing that?
What problems do they run into? You mentioned a few of them before, but maybe just give us a top-five list here.

Yeah. I'd probably break it down into a two-phase journey when they're first moving away from, say, closed source, trying to use open-source models and set up their in-house AI infra. The best way to start is cloud, for sure, because that gives you predictability about how much infrastructure you'll need to invest in once you start moving some of it offline, to on-prem or to private or co-location data centers, whatever you prefer at that point. So the first step is usually reserving smaller nodes or individual GPUs for various workloads on your cloud provider. The providers tend to push you to buy their platforms, but that's vendor lock-in, which leads right back to the same problem with vendors: cost creep. Instead, most companies we have spoken to try to build that tooling and orchestration on their own, and that problem only grows on-prem, because a large part of the software tooling simply isn't available there.

Now, what problems do they face there? A very common one is autoscaling. Traditionally you'd think autoscaling is a solved problem, because CPUs scale up and down so easily. With GPUs that's not really the case, because GPU compute is almost always maxed out by the nature of how GPUs work. And, as we were chatting off the record, VRAM is suddenly very different in this world. So you have to do things like GPU profiling, and then scale up or scale down based on that profiling. Even AWS SageMaker, such a popular product, doesn't come with that kind of autoscaling, which says something about how difficult the tooling for this new-age infrastructure is to figure out.

The second problem is that these LLMs, or vision models, or any of these AI models, are huge, orders of magnitude bigger than traditional models. A small 8-billion-parameter model, and I call it small, is about 16 GB of memory, so you need a card with at least 24 GB just to run it, which is insane compared to what ML models used to take, when you could run them at the edge on your PC without even needing an Apple silicon chip. With workloads that large, GPU memory starts choking. So how do you free up GPU memory to get optimal performance from these models, whether that's inference speed, latency, or the throughput capacity of a single GPU? As you scale, you can't just throw more GPUs at it, because individual GPUs cost a lot; you can't suddenly run the workload on ten GPUs when it still isn't breaking even on the investment behind it. Those are the two major problems we've seen, and then comes the issue of how you orchestrate all of this across your existing infra.
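Before moving on to orchestration, the memory math above can be made concrete with a rough back-of-the-envelope sketch. The 40% overhead allowance below is an assumption for illustration only; real KV-cache and activation overhead depends on batch size, sequence length, and the serving engine.

```python
# Rough GPU memory estimate for serving an LLM: an illustrative sketch,
# not Pipeshift's tooling. All numbers are approximations.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

def estimate_serving_memory_gb(params_billion: float,
                               precision: str = "fp16",
                               overhead_fraction: float = 0.4) -> float:
    """Weights plus a rough allowance for KV cache, activations, and runtime
    overhead (assumed ~40% here; the real figure varies with batch size,
    sequence length, and serving engine)."""
    weights_gb = params_billion * BYTES_PER_PARAM[precision]
    return weights_gb * (1 + overhead_fraction)

if __name__ == "__main__":
    for precision in ("fp16", "int8", "int4"):
        need = estimate_serving_memory_gb(8, precision)
        print(f"8B model @ {precision}: ~{need:.1f} GB of GPU memory")
    # fp16 lands around 22 GB, which is why a 24 GB card is a practical
    # floor, and why quantization is a common memory-optimization lever.
```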
Most companies we work with, or have spoken to, tend to have a multi-cloud or hybrid-cloud setup: some of it on cloud, some of it on-prem, and even on cloud they may have hedged their lock-in by spreading across different providers. How do you orchestrate workloads across all of them? At core, an A100 running in AWS, an A100 running in GCP, and an A100 running on-prem, and the same goes for any other GPU, should perform the same. But they don't, because of all the software bloat the cloud providers layer on top. So how do you strip all of that down, get to the bare VM instance, the EC2 in traditional AWS language, and run things on top of Kubernetes so that you get the same performance no matter where the GPUs are? At a high level, those are the three major problems. Then come the tooling issues: traditional VPC deployments, multi-region deployments, SLAs, cold starts, scale-ups, all of that. But those are probably solvable over the long term, compared to the three above, which are the threshold bottlenecks you have to cross just to get on-prem.

As a long-term capacity planner, I really appreciate that the bottleneck boils down to how I control, manage, plan, and optimize a key resource, i.e. the GPU, when it's not really about the utilization of the GPU. If I have a GPU, I'm going to try to drive it at 100%, so I can't use utilization as the threshold that tells me when I need another one. I have to look at these other things, as you mentioned. I'd be looking at memory, so memory becomes a key thing, and people don't natively have a mind for planning around memory that well. Is it about throughput? Is it about latency? These are all non-linear things, as a capacity planner might point out, and we don't have non-linear forecasting abilities as human beings, right? We need some help. So people are definitely in that situation: they've got infrastructure, they're acquiring GPUs, and now they're trying to get their efficiency up, get their throughput up, get the workloads flowing through, and they need some help. You know, we talked a little bit about how a company's experts should be working on solving the company's problems, not the utility aspects of those problems, and they should be requiring that.
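That point about utilization being the wrong scaling signal can be made concrete with a small sketch. The metric names and thresholds below are hypothetical, not Pipeshift's or any vendor's actual API; the idea is simply that the scale-out decision keys off memory headroom, latency, and queue depth rather than raw GPU utilization.

```python
# Illustrative only: a toy scale-out decision that ignores raw GPU
# utilization (which is ~100% on a busy inference node anyway) and keys
# off memory headroom, latency, and queue depth instead. Metric names and
# thresholds are assumptions, not any vendor's real API.
from dataclasses import dataclass

@dataclass
class GpuNodeMetrics:
    vram_used_gb: float      # current GPU memory in use
    vram_total_gb: float     # total GPU memory on the card
    p95_latency_ms: float    # observed p95 request latency
    queue_depth: int         # requests waiting for a slot

def should_scale_out(m: GpuNodeMetrics,
                     latency_slo_ms: float = 500.0,
                     min_vram_headroom: float = 0.10,
                     max_queue_depth: int = 32) -> bool:
    """Scale out when memory headroom is nearly gone, latency blows past
    the SLO, or the request queue is backing up."""
    headroom = 1.0 - m.vram_used_gb / m.vram_total_gb
    return (headroom < min_vram_headroom
            or m.p95_latency_ms > latency_slo_ms
            or m.queue_depth > max_queue_depth)

print(should_scale_out(GpuNodeMetrics(22.5, 24.0, 380.0, 8)))   # True: low headroom
print(should_scale_out(GpuNodeMetrics(18.0, 24.0, 320.0, 4)))   # False: healthy
```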
So let's talk about Pipeshift, then, just a little bit. Pipeshift comes in and you're saying: look, we can take over those aspects of operations, the DevOps, the MLOps, the rest of it. What does Pipeshift do? Can you summarize it for us, and maybe we'll dive into it a little bit?

Yeah. At core, what we offer companies is an end-to-end orchestration platform for all their GenAI workloads that are based on open-source components. These can be embeddings, these can be vector databases, and most importantly these can be models: language models, vision models, audio models, whatever form of model they want to deploy on their own infrastructure. And we take care of the entire tooling behind it, everything from how you run them most optimally to how you scale them up and down, how you schedule workloads, and how you set up load balancers so that when you scale up suddenly, your GPUs don't go down and crash. All of that is handled by our platform and our orchestration framework, which runs across different clouds and different environments.

And finally, we have a proprietary framework we've built called MAGIC, which, funnily enough, stands for Modular Architecture for GPU-based Inference Clusters. The name reflects what we've come to understand: orchestration for this new age of workloads has to be modular at its core. There is no one-size-fits-all. If a company comes and tells you, here's an API, use this API, that doesn't work, because maybe your use case is running voice bots, and for voice bots the single most critical thing to optimize for is latency. But maybe you're an insurance company processing hundreds of thousands of documents on the back end every single day, where no customer has any user-experience issue whether you run it through the day or through the night. There, what you truly need to optimize for is throughput: how many tokens can I get out of my GPUs at any given point in time? Latency is out the window as a limiting factor. So if you look at the whole stack and your use cases that way, you'll always find there are one or two of four factors you're optimizing for: cost, throughput, latency, and quality. That's what Pipeshift's core infra lets you solve: a modular stack that gives you orchestration across your infrastructure, plus an inference and training stack that sits on your cluster, in your environment, optimized for the workload you're running. That way you get the highest dollar efficiency from your existing infra investments, keeping your CapEx low so you don't need to invest heavily in new infrastructure, and keeping your OpEx low going forward so there's a minimum of recurring cost in keeping all of that up and running.

So we're really talking about maximizing the utilization of a GPU, not in terms of its raw utilization, but in terms of its power to help the company. We need to deploy models where we've set the parameters for the service levels we want: the throughput of that model, the latency or response time of that model, how much that model is costing us, basically how big it is and how long it runs, and, like I said, the quality of the answer. Quality isn't something you normally set on a regular business workload, where you simply want the correct answer, but with AI models there are good, better, and still better answers depending on how much you want to invest. So those are all different levers you might want to pull as someone doing MLOps or DevOps on these things. And you're providing an interface where someone can make those service-level settings happen from an IT perspective, and then you'll schedule and balance those workloads into the model environment, into the deployment, as the MAGIC, as you say, goes to work under the hood.
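To sketch how those service-level levers can translate into different serving settings, here is a small illustration. The field names and values are assumptions made up for the example, not Pipeshift's configuration schema or any specific inference engine's API: a latency-sensitive voice bot keeps batches small and queuing minimal, while a throughput-oriented batch document pipeline does the opposite.

```python
# Illustrative sketch of how an optimization target might map to serving
# parameters. Field names and values are assumptions, not Pipeshift's
# actual configuration or any specific engine's API.
from dataclasses import dataclass

@dataclass
class ServingProfile:
    max_batch_size: int        # requests batched per forward pass
    max_queue_wait_ms: int     # how long requests may wait to form a batch
    kv_cache_fraction: float   # share of VRAM reserved for KV cache
    notes: str

def profile_for(target: str) -> ServingProfile:
    """Pick rough serving settings for a given optimization target."""
    if target == "latency":       # e.g. a voice bot: answer fast, batch little
        return ServingProfile(4, 5, 0.30, "small batches, minimal queuing")
    if target == "throughput":    # e.g. overnight document processing
        return ServingProfile(64, 200, 0.70, "large batches, deep queues")
    raise ValueError(f"unknown target: {target}")

for target in ("latency", "throughput"):
    print(target, profile_for(target))
```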
And having peeked at it, a lot of what that magic does under the hood has to do with memory-optimization techniques, which is something other people don't really consider all that much, right? The API doesn't ask you how you want to optimize your GPU memory on these services, so that's cool. There is a lot more we could dive into, but you do have some resources for people who want to look at this and understand just how much they can accelerate their AI infrastructure deployments, get back to the business of their data science and data modeling, and not have to worry so much about this end of things and maximizing their infrastructure usage. Where would you point people, Arko, to go and get more information?

Because, as you noted, it's a little subjective to every customer, we tend to share the right resources with them based on their context. We have our website, and if they want to have a call with us directly, there's a call link on the website, or they can just hit me up at AC@Pipeshift.com. However they reach us, we'll have a chat and give them the best plan to optimize their infra. Because at core, as you mentioned, what they should focus on is data; that's paramount to them, because AI at its core is data, and it's garbage in, garbage out. Their team should focus on the data while we take care of the tooling for them, so that their data becomes a model and a business outcome the fastest, with complete flexibility and control.

Right. So it's about making your own AI infrastructure efficient, handling all the underlying knobs, bells, and whistles, and integrating the ton of different things people would otherwise have to deploy on their own to run this pretty complex environment if they were building it from scratch. So don't do that. Look at Pipeshift at Pipeshift.com. Thank you for being here today and explaining that to us.

Thanks, Mike. Thanks a lot. Thanks for having me.

All right. Check it out. Take care.