Mike Matchett of Small World Big Data talks about machine learning with Aaron Friedman, VP of Operations at Wallaroo. In this discussion, Friedman explains the importance of the deployment phase when working with ML platforms. When models are stuck in development, training, data wrangling, and shaping, precious time and ROI are wasted. Wallaroo is an enterprise platform for production AI that handles the last mile of your machine learning journey; its SDK is even designed to be imported directly into whatever notebook your business uses, such as Databricks, Zeppelin, or Jupyter. Your models are meant to be in production. Keep them there with Wallaroo.
Mike Matchett: [00:00:00] Hi, Mike Matchett with Small World Big Data, and I'm here talking about one of my favorite topics, which is machine learning, and how you get machine learning to that really useful point. We know lots of people have projects and ideas. They have data scientists maybe exploring their large and growing data sets. They might be stuck doing feature analysis. But how do you really get those machine learning models into production usefully? And once you get them there, how do you maintain them and use them at scale? This ends up being one of the key problems with machine learning in larger enterprises or with bigger data sets, particularly for people who really depend on those models for massively scaled benefit. So we've got today a company called Wallaroo. I've got Aaron Friedman, who's the VP of operations, with me. Welcome, Aaron.
Aaron Friedman: [00:00:49] Thank you. Glad to be here.
Mike Matchett: [00:00:51] All right. So Wallaroo, you know, we talked a little bit before, doesn't really have an origin for its name. It clearly sounds like something from Australia. We're going to get past that. And you have a little wallaby face on the logo, so we'll go past it. But tell me a little bit about yourself and Wallaroo, and why you guys looked at this machine learning pipeline problem and said, this is where we should stick a fork in it. Why this deployment phase?
Aaron Friedman: [00:01:17] Sure. So the history of the company is that our founder's vision actually came out of high-frequency trading, where he operated as one of the quants doing a lot of the replacement of the day traders. And what came out of that is the data scientists, quants as they called them, would come up with a model, and then they would have to hand it off to another team. And it could take, you know, days, weeks, even months sometimes to take that model they came up with and get it into production, right? And then once it was in production, they had to figure out, is it actually doing what it was supposed to do? Is it making us money? Is it losing us money? So on and so forth. And in high-frequency trading, I mean, you're talking about millions of dollars a minute. Right.
Mike Matchett: [00:01:59] I was going to say, the time windows for doing something correct in trading are milliseconds, even microseconds, and you're talking about deployment times for a model that can be days or weeks. That doesn't even make sense. It doesn't mesh.
Aaron Friedman: [00:02:14] It doesn't. And so what came out of there were two things. One was an open-source language called Pony, and one was the original form of an engine purpose-built for ML inferencing, called Wallaroo, or the Wallaroo engine. So fast forward: Vid actually has ownership of this, he creates the Wallaroo engine, and we transition from Pony into Rust. We basically did that in order to do a couple of things. One is, look at other things in the market, how people are trying to solve the last mile of ML, how they're trying to get ML into production. And let's be clear, the only place you get ROI out of ML is when it's in production. Not in training, not in data wrangling, not in the science experiment. In production. And you're only as good as the last time that model was deployed. And so if you're looking at that, then I need to be able to deploy in seconds. I need an engine that can inference in microseconds. I need to be able to move and see and have insight into that engine if the data has changed, anything, to ensure that I'm still providing that business value. And the last thing is, there's really no such thing as one model. There are use cases, and most use cases are a combination of multiple models. So you have the concept of a pipeline, which is handing the output of one model off to another model. Then there's the concept of, all right, I need multiple pipelines in order to do the use case. A great example of that is dynamic pricing. One of our first customers needed to solve how to do dynamic pricing for a region, and they had 1,800 regions. That one dynamic pricing use case turned out to be seven models that we had to deploy across 1,800 regions, and we can do that deployment literally in seconds. I think it's down to about 30 seconds to do that.
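The pipeline idea Friedman describes, one model's output feeding the next, can be sketched as simple function composition. This is a toy illustration with made-up stand-in models, not Wallaroo's actual engine or API:

```python
# Toy model pipeline: each "model" is a callable, and the pipeline
# feeds one model's output into the next (all names are illustrative).
def scale_features(x):          # stand-in for a preprocessing model
    return [v / 100.0 for v in x]

def score(x):                   # stand-in for a scoring model
    return sum(x)

def price(s):                   # stand-in for a pricing model
    return round(10.0 + 5.0 * s, 2)

def make_pipeline(*steps):
    """Chain models so each step receives the previous step's output."""
    def run(payload):
        for step in steps:
            payload = step(payload)
        return payload
    return run

dynamic_pricing = make_pipeline(scale_features, score, price)
print(dynamic_pricing([50, 150, 100]))  # → 25.0
```

A real use case like the dynamic pricing example would then stamp out one such pipeline (or parameterized variant) per region.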
Mike Matchett: [00:04:09] All right. So let's talk about this. There are really a couple of aspects here that are important. One is the speed of deployment, getting something into production, so let's dive into that in a minute. But first, let's talk about the sheer performance of the model in production that you're offering, which is another key part of this. You mentioned you use this language a lot of people are hyped up about called Rust. You know, we know it's this cool thing that allows you to write in a comfortable, strongly typed, and highly efficient language. What do you guys like about Rust versus some other highly efficient language?

Aaron Friedman: [00:04:43] Well, I want to be clear, Rust is one aspect that gives us performance. The other thing is that we're leveraging C libraries. We're basically taking any type of training framework that the data scientist wants to use and then saving it into something that the Wallaroo engine understands, so that we can use Rust, which runs at C speeds and leverages C libraries. That way we can run as fast as possible.
Mike Matchett: [00:05:07] Okay, so no matter what I built the model in upstream, your, I don't know, transcompiler is converting it into the language that you guys run, which is highly efficient and fairly low level, right?
Aaron Friedman: [00:05:20] That's correct. And the reason we did that is we want to make sure that data scientists can use the tools that they know and love, get out of their way, and let them say, hey, experiment with what you want, whether that be scikit-learn or XGBoost or Vowpal Wabbit or Prophet or Hugging Face, and just go do what you need to do. And then we'll handle all that back-end piping and plumbing to basically allow it to scale and to run as fast as possible.
Mike Matchett: [00:05:46] All right. And we'll maybe touch again on performance here, because that's just really a key part of this. But let's move to that other half, which is getting this into deployment fast. What is it about Wallaroo that is different from the normal pipeline tools people may have adopted, where, you know, you build your model, you have a catalog it goes into, it tracks features maybe and does some things, and then puts the model out there? Why are you guys saying, hey, that's not sufficient, we're doing it this way?
Aaron Friedman: [00:06:12] Yeah, so a couple of things. One is we actually built an SDK that sits in front of a bunch of APIs and can just be imported directly into your notebook. That could be a Databricks notebook, that could be a Zeppelin notebook, it could be a Jupyter notebook, right? That way, you can interact directly with Wallaroo inside of that environment you're comfortable with. And so for Wallaroo, once you have a model, once you've saved that model, getting it into Wallaroo is literally upload and then run or execute, and that model is up and live right directly out of your notebook. What we're seeing other folks doing differently in the market, or honestly not that differently, is trying to take open-source tools out there, put wrappers around them, and basically have an entire end-to-end ML platform. Again, we're only focused on deployment. We're not focused on development or training or data wrangling or data shaping, right? And so what you end up with is either folks who are trying to use engines like Spark, which really weren't designed for ML inferencing, in order to do ML, or they're trying to take a DevOps approach, put wrappers on and containerize the model itself, and then deploy as many containers as possible to make it work. Unfortunately, the downfall of that is, if you have one model or maybe even 10 models, that's manageable. When you have 10 models that need to talk to each other in a model pipeline, that becomes very complicated. When you start dealing with, you know, 10,000 models, 100,000 models, it's impossible to manage.
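The upload-then-run workflow Friedman describes can be sketched generically. The class and method names below are entirely hypothetical stand-ins to illustrate the two-step pattern (upload a saved model, then deploy it live from the same notebook session); they are not Wallaroo's actual SDK:

```python
# Toy sketch of an upload-then-deploy platform client.
# All names here are illustrative assumptions, not a real SDK.
class ToyPlatform:
    def __init__(self):
        self._models = {}    # uploaded, not yet serving
        self._live = {}      # deployed and serving inferences

    def upload_model(self, name, fn):
        """Step 1: register a saved model artifact."""
        self._models[name] = fn
        return name

    def deploy(self, name):
        """Step 2: promote the uploaded model to live serving."""
        self._live[name] = self._models[name]

    def infer(self, name, payload):
        """Run an inference against a deployed model."""
        return self._live[name](payload)

# From a notebook cell, the flow is just upload → deploy → infer:
platform = ToyPlatform()
platform.upload_model("house-price", lambda sqft: 150 * sqft + 20_000)
platform.deploy("house-price")
print(platform.infer("house-price", 1_000))  # → 170000
```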
Mike Matchett: [00:07:42] All right. So there's this thing forming in my mind, and I probably need to dive down into some white papers at some point on this. But there's this idea, with the current state of the art in machine learning pipelines people might use, as we were talking about: there are definitely a lot of upstream things, whether it's, like you mentioned, Spark or, you know, from Databricks, and there are tools from Google and tools from Amazon and tools from a couple of independent folks and Python tools and so on. If they try to tackle the whole end-to-end thing, then you have data scientists really doing the DevOps job and the deployment job and the rest of it, and they're getting into territory they're not really expert in. And here you're saying, look, at the point of production, this is the point of the sword. We have to put this into a production-quality system that is tuned enough for financial services, right? This thing comes out of that world, where the engine is hyper-performant, hyper-scalable. But let me ask you this: you said something curious about containers, that you're not necessarily putting one model per container, which does seem to be where everyone's going. What are you doing differently that you're not just going to put one model per container? Why not?
Aaron Friedman: [00:08:47] Sure. So what we've done is we still deploy inside of Kubernetes, but we made it so that I can actually put multiple Wallaroo engines inside of a single container, or inside of a Kubernetes VM for lack of a better term, right? So in that instance, if I had a 16-core VM, I could literally put four Wallaroo engines inside of it. Each Wallaroo engine can handle multiple model executions and pipeline executions. Therefore I have a reduction of footprint, a reduction of infrastructure cost and server costs. And the reason we did that, and the reason we say that's better than just containerizing the one model and scaling it up or down based on what you're trying to achieve or your throughput, is that you're not getting efficiencies there. You're not able to go through and say, all right, let me have one VM that's got four engines that I can execute, you know, 20 models against, right? And the only way you can achieve that is if you start from the ground up. You actually have to build an engine from the ground up that can do high scale and distribute compute. That way, everything is happening inside the engine, as opposed to, hey, I need this container to do one piece, hand it off to the next container, hand it off to the next piece, and then hide that behind an API for an application to be able to use.
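The footprint argument above can be made concrete with some back-of-the-envelope arithmetic. The numbers here (2 cores per model container, 5 models per engine) are assumptions chosen purely for illustration, not figures from the interview:

```python
# Illustrative footprint comparison: one container per model vs.
# multiple engines per VM, each engine serving several models.
models = 20
cores_per_container = 2                         # assumed sizing

# Approach A: one container per model
containers_a = models
cores_a = containers_a * cores_per_container    # 20 * 2 = 40 cores

# Approach B: one 16-core VM running 4 engines, each serving 5 models
engines_per_vm = 4
models_per_engine = 5
vms_b = models // (engines_per_vm * models_per_engine)  # 1 VM
cores_b = vms_b * 16                            # 16 cores

print(cores_a, cores_b)  # → 40 16
```

Under these toy assumptions, packing models into shared engines cuts the core count by more than half, which is the kind of infrastructure reduction Friedman is pointing at.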
Mike Matchett: [00:10:09] All right, so you have this fairly low-level engine that can handle model graphs, I guess model pipelines, within itself. So you're not just passing the output from one place to another place to another place to make this thing work. And that sounds very efficient, but it also sounds like it's a piece of hardware. But this is not hardware, right? Again, rolling backwards, you wrote this in Rust; it's software. Where does this thing run, then? If it's not a container per model, what does it run in?
Aaron Friedman: [00:10:36] In the Wallaroo engine. So we're installed software; it actually installs inside of Kubernetes. The engine itself is running inside of a Kubernetes VM or a container itself, right? It's just that I can put multiple inside of there. And that can be in the cloud, that can be on-prem. The engine itself can actually run at the edge; we actually have an edge version of it coming out soon, as well as a community version coming out really soon.
Mike Matchett: [00:11:02] All right, so definitely an ability to put that there. It does seem like if you're going down this route, you're aiming at people that have a lot of models. Tell me a couple of examples of folks and how many models they need to run to actually make their machine learning pay off.
Aaron Friedman: [00:11:17] Well, that's a good question. Really, you only need to run one model in production to get the ROI; it just depends on what that model is, right? Or that use case, I should say. So I gave the example of dynamic pricing, which is actually seven models across multiple regions, right? That ended up being, for that use case, over 10,000 models. I know of one other client that is trying to do a loyalty program that is roughly about 100,000 models, because they have to be deployed across the United States and multiple stores. But here, let me rewind a little bit. You can actually have only one model and still see the value of Wallaroo. You could be trying to get that first model into production. It's just that we're focused on, once you're ready to go into production, you can see the value in us and we can scale with you. And so you are correct about the folks that are the most interested: you know, the sixth-largest bank on the planet has definitely become a client, and they see the value in us. They actually did a bake-off between us and Databricks, and we came out 13 times faster on 80 percent less infrastructure for a really complicated security model.
Mike Matchett: [00:12:25] All right, right. Let's focus on that just for a second. So not only are you faster, an order of magnitude faster there, it sounds like, but the cost dropped tremendously, probably for a number of reasons. There's the amount of infrastructure you need; obviously, the time to do the inferencing can have a business cost, whether it's a millisecond or a microsecond, depending on where you're at; the amount of memory, the amount of instances you need, the amount of containers you support, the amount of cloud infrastructure you'd have to basically rent. That all drops tremendously when you have this kind of approach, right? That is correct. Now, do you find that folks tend to focus on this mostly for their streaming kind of approaches? Or, you know, a lot of folks are still kind of doing the batch kind of thing. Where do you guys support the best?
Aaron Friedman: [00:13:15] We actually support all three: batch, micro-batching, and streaming. In fact, that security use case I just mentioned, that was all batching as it's happening, and it's rolled up. On the flip side of it, dynamic pricing is real time.
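The distinction between streaming and micro-batching can be sketched with the same model function served two ways. This is an illustrative toy, with a stand-in `predict` function, not Wallaroo's serving code:

```python
# Same stand-in model, served two ways: per-event (streaming) and
# grouped into small batches that are scored together (micro-batching).
def predict(event):
    return event * 2  # placeholder for a real model

stream = [1, 2, 3, 4, 5, 6]

# Streaming: one inference per arriving event
streaming_results = [predict(e) for e in stream]

# Micro-batching: accumulate events into small batches, score each batch
batch_size = 3
batches = [stream[i:i + batch_size] for i in range(0, len(stream), batch_size)]
batched_results = [[predict(e) for e in batch] for batch in batches]

print(streaming_results)  # → [2, 4, 6, 8, 10, 12]
print(batched_results)    # → [[2, 4, 6], [8, 10, 12]]
```

Both paths produce the same scores; the difference is latency versus throughput, which is why a security use case can batch and roll up while dynamic pricing runs per event in real time.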
Mike Matchett: [00:13:29] OK, so regardless of the sort of architecture of your model pipeline, you guys are right in there. And this would, by the way, if you're sort of drawing this out in your head as you're watching along, give you a consistent point of production no matter what your upstream data science teams are using, allowing you to have multiple upstream vertical teams doing different things and all coming to production on the same platform, which is actually a big benefit, rather than having them all on their own.
Aaron Friedman: [00:13:56] We see a lot of that, actually. So folks that are, you know, huge Snowflake users, and they're actually even developing models inside of Snowflake, but they don't want to execute those models there. We see folks in Databricks, H2O, DataRobot, you know. We are very dedicated to integrating, and not saying, hey, you have to rip and replace in order to use us. There are absolutely legitimate reasons why you'd want to use Vertex AI or SageMaker to develop your model, but not to execute your model in them, because it's more expensive, they don't give you the latencies and SLAs that you need, and they don't scale appropriately for what you're trying to achieve with your business.
Mike Matchett: [00:14:37] All right, so I really like what you're talking about here. I think you said this earlier to me: machine learning is hard, but the last mile is really hard, and you guys are just hyper-focused there. If someone is really interested in learning more about Wallaroo, maybe looking under the hood a little bit, what would you have them do?
Aaron Friedman: [00:14:57] Sure, the easiest place is wallaroo.ai. We have a ton of information up there. We actually have a blog section, and there's a neat piece on what makes Wallaroo different that really talks through how we built everything from the ground up, plus a great blog on why we chose Rust and moved away from Pony. So there's a lot of good information there. You're also welcome to reach out to me directly at aaron@wallaroo.ai.
Mike Matchett: [00:15:21] All right, that's awesome, Aaron. So those of you out there who are watching this: if you have machine learning initiatives in your organization, if you know people are struggling to deploy models into production in different divisions, or even in just one, even if there's only one model, it sounds like you guys have something that could really optimize that and make it production quality and production ready, when you put that kind of data science model into operations and really get it going. And also, I forgot, Aaron, one more thing: you guys do a lot of observability around this and help tune those models in production. I meant to focus on that; we're going to have to do another spotlight on it. So look into that if that's something you need out there. Appreciate it, Aaron. Any last words?
Aaron Friedman: [00:16:06] No, just, as you said, we do anomaly detection and experimentation frameworks to support all this for model insights. So again, everything's at wallaroo.ai, and I'd be more than happy to chat with you.
Mike Matchett: [00:16:17] All right. Take care and check it out.