Transcript
Hi, Mike Matchett here with Small World Big Data. Today we're talking about something I've spent a long part of my career doing, which is performance management, in a way. We're talking about it in modern terms, though: how do we get in front of all the data we're creating about application performance, system performance, and everything else? It's just floods of stuff coming out of, say, our Kubernetes environments. We're going to be talking with ControlTheory about that in just a minute, and it's going to be really interesting, so stay tuned.

Hey Bob, welcome to our show.

Great to be here, Mike.

All right. So you've started this thing called ControlTheory. As I mentioned, I've got some history in performance management and capacity planning going way back to when servers were pets and cattle was something we ate. The world has moved forward quite a bit since then. We went through virtualization, cloud hosting, VMs and cloud instances, and then microservices, and the landscape has really changed. Maybe you could catch us up a little: what are some of the key things people should be thinking about today as they try to bring those old perspectives on managing performance into this new world?

Well, for sure. When we started ControlTheory, we were really looking at this concept of observability. Over the last five to ten years, observability has taken a lot of the concepts you talked about, monitoring, logs, metrics, traces, put them into a kind of platform, and actually elevated the space. There's been a lot of innovation in observability around looking at your infrastructure, your application performance, your end-user experience, and all of these things have come together. Our previous startup, before StackEngine, was a company called CopperEgg, and we did cloud monitoring. That was a while ago. We watched the cloud monitoring space evolve; we competed with a little company called Datadog back then, which has done quite well since. We were acquired early in the process, but we've seen that market grow and become very consolidated, and in some ways I think that has created a lot of opportunity. The way we look at it, the market went through an innovation phase, now it's going through a consolidation phase, and there's a lot of frustration out there. As people look at observability, many of them are dealing with cost issues, telemetry volumes, and complexity, while trying to bring all that together with improving root cause analysis and reducing MTTR. What we're looking at as a company is how to move the industry forward and bring more control, this concept of controllability we'll talk about more, into the market. I think that's really what people should be looking at: where the market is going, how we can make it better, and the new open standards to consider. These are all things we can unpack as we talk through what we're doing and where we think the market is moving.
I know one of the more interesting things is that you bring this concept of controllability and feedback into something we might have considered pretty passive in the past: we're just watching volumes of data come in. But before we dive into what you actually do in observability, what's the big challenge in observability today? Obviously there's a volume issue, but aren't there solutions out there, open source or otherwise, that more or less address this already?

Yeah. The biggest issue we see immediately, and it surprised all of us as we spent the last year or two talking to hundreds of customers in this space, is the escalating, spiraling costs everyone is dealing with. People are now paying six or seven figures; sometimes 20 or even 30 percent of their cloud bill is related to observability.

I thought you meant the cloud bill itself was spiraling out of control, but you're talking about just doing the observability.

Just the observability part. Cloud costs have grown, and the cloud cost management market has been a real focus for the last five or six years, all about how to control cloud spend, but observability has grown at an even greater rate than the cloud itself, we think. So that's the biggest issue we found, and it's a side effect, a symptom, of bigger issues around efficiency within observability. Not only are people paying too much, the costs are unexpected. There's plenty of news out there about people getting million-dollar bills they didn't expect, and a lot of the time you don't know where the issues are until the bill arrives the next month. It reminds me of the inefficiencies we had before we put controls, orchestration, and automation into things like Kubernetes.

The second area we found, and it's where we started, is that you're really not getting your ROI. The results of observability are supposed to be detecting problems better and faster, improving root cause analysis, and bringing mean time to repair down. We talk to CIOs, CTOs, and VPs of engineering, and they're not getting the data they want out of their systems. They want to know more about availability, performance, health, COGS, these higher-level metrics. So there are two problems. One is the cost, which is creating a lot of frustration and needs to be fixed almost immediately. The second is even bigger: with everything they're paying for, they're really not getting the ROI they want. It's two sides of the same coin.

Just to put a pin in it: what drives the cost? In the past you licensed an expensive product to do some of this work, but today is it a subscription based on some kind of volume?

Yeah. It's based particularly on how much you're ingesting, meaning the throughput or flow of information coming in, the analysis or indexing you do on that information, and the retention of it.
Those are three of the biggest cost factors. The fourth cost factor in observability today is custom metrics, the metrics you as a business care most about, and there are problems with how you structure those, so they end up being the most expensive things the observability vendors charge you for. Not only do the vendors want all your data to come in, and charge you for it, but the stuff of most value is the stuff that costs the most money. That's two major frustrations right there. A lot of it is because the pipes feeding the observability platforms are stupid: one-way, fat, dumb pipes. That's how we saw it a year and a half, two years ago as we looked at this marketplace, and the technology really hasn't changed. It's dumb pipes sending data into a database or some retention system with a dashboard on top, and then you get charged for that ingestion, indexing, and retention. So we said there has to be a better way; there has to be more intelligence we can put into the system. That's really where we got started. If we just had more intelligence about what to send, when to send it, and who to send it to, we could help people lower costs and also get more of the important data up to the analytics that are trying to solve problems like root cause and MTTR.

Right. So it sounds like you're describing becoming a kind of curator of the data coming off a complex, maybe distributed, hybrid microservice system, tens of thousands of microservices all producing potential log data, and feeding the downstream tools somebody has to subscribe to at great cost. If they take all the data being exhausted out of these systems and put it in there, the cost goes way up. So are you saying you're going to sit in the middle and add some intelligence at that point? Is that right?

Yeah. When we looked at how to architect the solution, first we said it needs to be open. So, OpenTelemetry, which is a Cloud Native Computing Foundation project, a whole framework for collection and instrumentation that has been developing for three to five years and is reaching a good level of maturity. We thought, let's start with OpenTelemetry, because as observability changes, that's the first domino to fall. OpenTelemetry works with all the existing instrumentation that's out there, which is great, so you don't have to change anything. But if you also start adopting the OpenTelemetry standard, it gives you much more flexibility and removes the vendor lock-in and proprietary-solution problems that have occurred. That's one of the hardest things with current vendors: once you're in, it's like a roach motel, hard to get out. So we looked at that open set of solutions and asked, can we build a control plane on top of it, with this idea of feedback loops providing more intelligence? Those are the two concepts: a control plane that sits on top of the data, and a layer in the middle that does filtering, deduplication, correlation, sampling, and enrichment, a whole range of processes and transformations that can happen before the data gets in. So only the important data gets in, and you only get charged for the things you need.
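As a rough, hypothetical illustration of the kind of in-flight processing described here, a stage that filters low-value records and collapses duplicates before anything reaches a paid backend, here is a minimal Python sketch. The record shape and the rules are invented for illustration; this is not ControlTheory's pipeline or the OpenTelemetry Collector API.

```python
from dataclasses import dataclass, field

@dataclass
class LogRecord:
    severity: str                          # e.g. "DEBUG", "INFO", "ERROR"
    message: str
    attributes: dict = field(default_factory=dict)

class PipelineStage:
    """Drops low-value records and collapses exact duplicates before export."""

    def __init__(self, drop_severities=("DEBUG",), window=1000):
        self.drop_severities = set(drop_severities)
        self.window = window
        self._seen = []                    # recently forwarded (severity, message) pairs

    def process(self, records):
        kept = []
        for rec in records:
            if rec.severity in self.drop_severities:
                continue                   # filter: never forward debug noise
            key = (rec.severity, rec.message)
            if key in self._seen:
                continue                   # dedup: an identical record was already sent
            self._seen.append(key)
            self._seen = self._seen[-self.window:]
            kept.append(rec)
        return kept

if __name__ == "__main__":
    stage = PipelineStage()
    batch = [
        LogRecord("DEBUG", "cache miss for key 42"),
        LogRecord("ERROR", "payment service timeout"),
        LogRecord("ERROR", "payment service timeout"),   # duplicate, dropped
        LogRecord("INFO", "request handled in 12ms"),
    ]
    for rec in stage.process(batch):
        print(rec.severity, rec.message)   # only the unique ERROR and the INFO survive
```

In a real deployment this logic would live in a collector or pipeline processor rather than in application code, but the shape of the decision is the same: drop, deduplicate, or forward before the data is ever billed.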
More importantly, you can start using this middleware, so to speak, this control plane, to send the important data when you need it, refine that, and continuously learn. That's really where the concept of controllability came in, the idea of adding that control loop.

So give us an example or two of active control, because, again, I tend to think of collecting data and maybe filtering it as more passive or reactive.

Sure. One of the most common problems is a log spike. It happens for a variety of reasons, say developers put in the wrong debug or info statements, and we've heard of people finding out about it a month later when the bill comes in, or it happens on a Friday and they find out on Monday. These spikes can suddenly flood in, cause higher charges and higher costs, and actually hide other issues that are happening. So the first question is: can we detect those spikes and alert you to them? That's phase one. But can we also put in more controls so they don't happen unless you want them to? Can we filter out the debug and info statements? Or can an incident or root-cause system actually signal that it wants that information? That's a nice feedback loop: please give me more troubleshooting detail now, so turn on debug for the next hour; okay, I'm done, turn it back off. Versus having it all happen unexpectedly. None of that intelligence is there right now; it just happens. With controllability it can be automated. We have the ability to set baselines and understand what's normal, with continuous improvement around what is and isn't a spike, and also to understand that maybe it's okay to send more data because it's a new feature. All of that intelligence can be built into the rules and policies within the infrastructure, and it can be very simple, with a human in the loop who says, yes, turn it off.

The other thing that's super interesting to us is that people don't have a good understanding of their observability. The first step for a lot of folks is understanding where that spike or that cost came from. It could be a bunch of logs, or a bunch of traces you're running, and you want the attribution: which Kubernetes cluster it came from, which application, which development team. Right now they may not have that attribution. We can trace it back, enrich the information, and understand where it came from. So: oh, it's a spike, they just had a GitHub push, it's a new release, we're doing an A/B test, this is okay. Or maybe they left it on from a feature release a month ago and it's still running. That happens all the time; developers forget to turn it off, and it's in the next release and the next release, and so on. These stories happen to everybody, and it's like whack-a-mole: you're always trying to figure out where it came from. There are no controls and no visibility, and the bill usually lands with the CIO or the accounting team asking why there's a 10 percent rise, and you have to go back and do the post-analysis. Putting that intelligence in up front is really the solution.
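To make the baseline and "turn on debug for an hour, then turn it back off" ideas concrete, here is a small, hypothetical Python sketch. The per-minute counts, severity strings, window sizes, and thresholds are all assumptions for illustration, not ControlTheory's actual mechanism.

```python
import time
from collections import deque
from statistics import mean, pstdev

class SpikeDetector:
    """Flags per-minute log volume that is far above a rolling baseline."""

    def __init__(self, window=60, sigma=3.0):
        self.history = deque(maxlen=window)    # last `window` per-minute counts
        self.sigma = sigma

    def observe(self, count):
        is_spike = False
        if len(self.history) >= 10:            # need some history before judging
            baseline = mean(self.history)
            spread = pstdev(self.history) or 1.0
            is_spike = count > baseline + self.sigma * spread
        self.history.append(count)
        return is_spike

class DebugWindow:
    """Lets DEBUG records through only for a limited time after being opened."""

    def __init__(self):
        self._until = 0.0

    def open(self, seconds=3600):
        # e.g. an incident tool signals "give me more detail for the next hour"
        self._until = time.time() + seconds

    def allows(self, severity):
        return severity != "DEBUG" or time.time() < self._until

if __name__ == "__main__":
    detector = SpikeDetector()
    for count in [1000, 1100, 950, 1050, 980, 1020, 990, 1010, 1005, 995]:
        detector.observe(count)                # build a baseline of normal minutes
    print(detector.observe(9000))              # True: e.g. a debug flag shipped by mistake

    window = DebugWindow()
    print(window.allows("DEBUG"))              # False: debug is filtered by default
    window.open(seconds=3600)                  # incident declared: open for one hour
    print(window.allows("DEBUG"))              # True: detail flows, then shuts off again
```

In practice the open() call would be driven by the incident or root-cause system signaling back into the pipeline, and a spike alert from the detector could feed the same control loop.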
That's interesting. It makes me think, though: you now have a control plane sitting between the customer's system and their observability platform of choice, whatever that is, Datadog or whoever, and you're not trying to replace the observability platform itself.

That's a great point.

So you're sitting in the middle there. Does that mean you get all the data and have to filter through it, and that you see it on your side? Or is the implication of a control plane that you don't?

Exactly. Architecturally, the control plane and data plane pattern is one we learned back at Oracle Cloud; our last company, StackEngine, was acquired into Oracle Cloud. It's a very scalable pattern, and the data stays with the customer. We add a collector and some controls within their environment, but we're not looking at their data; we're a control plane that sits on top of it. They own their data, they have their own agency, and it can be on-prem or up in the cloud. The control plane is basically controlling and signaling into their data plane. We add collectors if they don't have them, and some control mechanisms, but the idea is that the data stays with the customer, and our control plane sits above it, taking signals in, sending signals out, and managing the data. The data, the privacy, and the security stay within the customer's environment, which a lot of people like now, because they're worried about sending all their data to someone else who might train their LLMs on it. They want to control their data and know where it goes.

Well, and it's a much tougher kind of SaaS service to offer if you've got the data flowing through your own systems, right?

Right. The scale is an issue and the pricing becomes an issue, and this lets us price not by volume, which is how everyone else prices, but by control planes and agents, by the intelligence we're providing.
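A tiny sketch of the control-plane and data-plane split being described, in which only policy crosses the boundary and the telemetry itself never leaves the customer's environment. The names and the policy format here are hypothetical.

```python
import json

# Control plane side: publishes rules, never sees raw telemetry.
def publish_policy():
    return json.dumps({"drop_severities": ["DEBUG", "INFO"],
                       "max_records_per_minute": 5000})

# Data plane side: an agent next to the customer's collectors applies the rules locally.
class LocalAgent:
    def __init__(self, policy_json):
        self.policy = json.loads(policy_json)
        self.sent_this_minute = 0

    def should_forward(self, record):
        if record["severity"] in self.policy["drop_severities"]:
            return False                       # low-value record stays local
        if self.sent_this_minute >= self.policy["max_records_per_minute"]:
            return False                       # rate cap: spike protection applied locally
        self.sent_this_minute += 1
        return True

if __name__ == "__main__":
    agent = LocalAgent(publish_policy())       # only the rule crosses the boundary
    print(agent.should_forward({"severity": "ERROR", "message": "checkout failed"}))  # True
    print(agent.should_forward({"severity": "DEBUG", "message": "cache warm"}))       # False
```

The design point is that the agent enforces rules where the data lives, while the control plane only publishes and updates those rules.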
So the vision I'm getting is that this allows someone to be much more intentional and deliberate, not necessarily about what the system originally generates, but about what gets passed through. And it seems like that, in turn, enables someone to be more deliberate and intentional about how they instrument their system to start with. Maybe you could explain what that might unlock, or what kind of behavioral benefit there is.

Sure. The source of all this information goes back to the code itself, and OpenTelemetry provides a lot of existing instrumentation. What we're able to see within the control plane is whether there's missing information or other problems, and whether we can signal back and identify code that should be changed. If I'm looking for root cause and the information is not there, we can identify what's missing and signal that back into the development cycle. If you look at the whole software supply chain, it goes from source code into collection into analysis, and we want control points all the way through. Our first point is in flight, catching the issue where the problem occurs. The second phase is going back to the source and fixing it so it doesn't happen again. That requires understanding where the telemetry came from, what the problem is, where it changed, and then working with the root-cause side to figure out what's missing. I think that's the next major opportunity: not just catching the problem and fixing it in flight, but preventing it from happening at the code level, and then setting more policies and gates. Your code should meet these standards for observability, and if it doesn't, it needs to go back. Policies around what's allowed and what isn't: is your observability tracking the right IDs, is it identifying itself, is it tied to the releases that are going on? All of those things give you much more intelligence across the board. It's a full-spectrum solution, something that runs through the full cycle.

Yeah. What I like here, just from my own dabbling in programming at scale in complex environments, is that in the past, if I wanted to change, say, the level of logging or debugging or instrumentation, because I want every new microservice to kick something out and tell me about itself, or because it's too much, I had to go back, re-edit those services, and redeploy them. You comment out the debugging lines, you set debug levels, you have all this commented-out code you go back and uncomment if you want it. That is a pain when you've got tens of thousands of microservices. So it sounds like we can get to a better best practice and move that decision making, what to collect, when to collect it, and what to keep, up into your control plane, really enabling a bit more developer productivity and creating a better standard practice: here's what you do down there.

Yeah, exactly. And as we look forward to new copilots and AI enablement on the development side, more code should be coming through, maybe more code people don't know about, so having these controls in place is going to be important going forward too. It's already a problem. We call it "winter is coming": AI is coming, and it's going to bring more issues to solve on top of the ones we already have.

Right, right. And it'd be great to be able to set a standard there: if you're doing AI code generation, it's got to do X, Y, and Z, and don't worry about it, we'll take care of whether it's something we preserve or archive.
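As a sketch of the "gates on your code" idea, here is a hypothetical CI-style check that fails when sampled spans are missing required attributes. The required keys are borrowed from common OpenTelemetry resource attributes, but the policy itself is an assumption, not a published standard or ControlTheory's implementation.

```python
# Hypothetical observability policy: every span must carry these attributes.
REQUIRED_ATTRS = {"service.name", "service.version", "deployment.environment"}

def missing_attributes(span_attributes):
    """Return the attribute keys a span lacks under the team's policy."""
    return REQUIRED_ATTRS - set(span_attributes)

def gate(spans):
    """CI-style gate: collect every sampled span that violates the policy."""
    failures = []
    for index, attrs in enumerate(spans):
        missing = missing_attributes(attrs)
        if missing:
            failures.append((index, sorted(missing)))
    return failures

if __name__ == "__main__":
    sample_spans = [
        {"service.name": "checkout", "service.version": "1.4.2",
         "deployment.environment": "prod"},
        {"service.name": "checkout"},          # missing version and environment
    ]
    problems = gate(sample_spans)
    if problems:
        for index, missing in problems:
            print(f"span {index} missing attributes: {missing}")
        raise SystemExit(1)                    # block the release until instrumentation is fixed
    print("observability policy satisfied")
```

A platform team could run a check like this in CI or from the control plane, so instrumentation gaps are caught before a release rather than during an incident.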
I know there are things you're doing about sampling, too. I saw something where, as another example, I've got something I'm normally keeping a certain aggregate level of information on, but when something goes abnormal, as you said, I can deliberately turn on a more detailed trace for that case, or keep the error trace.

Yeah, that's another great example. When people have an issue with too much data, a lot of the time they'll just do random sampling, head sampling, where they take maybe 1 percent of the data, and they miss the information. That's the standard practice in observability today. One of the things we can put in place is tail sampling, which looks at the full trace and figures out, oh, there was an issue, there was an error, so I'm only going to send the error conditions up. Maybe I send all my other traces off to AWS S3 for what they call cold storage, or put them into some less costly storage. You can do this with your logs too. If you want them later for compliance, or for regression analysis, send everything off to somewhere you won't spend as much money. You can also rehydrate it later and bring it back in for analysis.

So I could take this great volume of data and not just drop it on the floor; I can fork it and keep it. If I've got security or compliance requirements I'm also addressing with my observability instrumentation, I'm not constrained to keep putting all of that into Datadog or whatever is more expensive. I can keep it on the side; I'm trying to come up with the right word for that.

There you go. It's like we asked ourselves, why can't we do this now? I want to send just the important stuff, and everything else should go over there. But today everything has to go to one place, and you pay money for that; that's where the cost comes from. It also gets lost, and if you're doing head sampling you're going to miss the important spikes. So send the important spikes up and send everything else somewhere else. That rerouting and filtering is just better smarts, better intelligence. It makes observability better, lifts it up to where it needs to be, and then you can solve problems better. So we're getting to that next phase, which is root cause analysis, reducing MTTR, and having better availability and performance information.
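Here is a minimal sketch of the tail-sampling-plus-routing pattern just described: keep whole traces that contain an error, and send everything else to cheaper cold storage. The storage calls are stand-in functions, not a real S3 or backend client.

```python
def trace_has_error(spans):
    """Tail sampling decides on the completed trace, not just its first span."""
    return any(span.get("status") == "ERROR" for span in spans)

def route_trace(trace_id, spans, hot_backend, cold_storage):
    if trace_has_error(spans):
        hot_backend(trace_id, spans)           # full fidelity, indexed, expensive
    else:
        cold_storage(trace_id, spans)          # cheap archive that can be rehydrated later

if __name__ == "__main__":
    sent_hot, sent_cold = [], []
    traces = {
        "t1": [{"name": "GET /cart", "status": "OK"},
               {"name": "charge-card", "status": "ERROR"}],
        "t2": [{"name": "GET /health", "status": "OK"}],
    }
    for trace_id, spans in traces.items():
        route_trace(trace_id, spans,
                    hot_backend=lambda t, s: sent_hot.append(t),
                    cold_storage=lambda t, s: sent_cold.append(t))
    print("hot:", sent_hot)      # ['t1']: only the erroring trace costs full price
    print("cold:", sent_cold)    # ['t2']: everything else is archived cheaply
```

Head sampling would have decided up front, often at 1 percent, before knowing whether an error occurred; tail sampling waits for the full trace, which is why it can keep every error without keeping everything.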
I mean, we're not even talking yet about enhancing the effect of actually doing a better job of performance and availability management; we're just talking about enabling it, and I think we're doing well even there. So it sounds like once someone reaches a certain size or complexity, you could enable developers to just do a better job of instrumentation by telling them, do this and this and this, and the job of determining what to keep, what to look at, and when to drill down becomes something you can pull out and make explicit. How long would it take to implement something like that on a standard cluster?

It's a very fast implementation process to install the collectors. People tend to roll out the policies in phases, maybe with a human in the loop to make decisions. Maybe at first they just figure out what everyone is actually doing: an analysis of all the data coming through to see what the state is, then feeding that back into, say, platform engineering, working from what's needed back into development and setting policies, and also tracking how compliant you are with those policies and looking for spikes where something got out unexpectedly. That can happen very quickly; it's pretty seamless. And the beauty of that feedback loop is continuous learning, so it can adjust as you go. Sampling, understanding exactly what's happening, setting auto thresholds: people understand those, it's common sense, it's how things have worked in observability all along. Let's move them closer so we actually detect problems and send the right data, but also move that information back to developers, so the intelligence keeps moving back to where it has the most value, which is down at the source code itself.

Yeah, because I've heard you say a couple of times now, look, it's not just about cost, it's about reducing MTTR at the end of the day.

Right. The first thing to solve is, I want to stop the spend and prevent it from happening in the future. Then let's go back, figure out why it happened, and make it better. It's a one-two punch, and I think that's where the market is going: let's get more efficiency there.

And by being able to send data streams in multiple directions, you're really enabling a more robust ecosystem to grow around those higher-value disciplines, right? If I have to do everything in one downstream tool, I'm dependent on that vendor to have built it into that solution. But now I can also say, hey, maybe I can run this stuff through some LLM thing I just developed over here, or run it through my security tooling as well and look for security events, or use it for more sophisticated cloud capacity planning in addition to root cause analysis. You're really enabling an ecosystem. It's not simply ControlTheory, it's a control point, right?

Yes, and I think it's an evolution of the space. You democratize and open up the collection with OpenTelemetry, you put a control layer on top of that for intelligence, and that enables this next generation of observability to happen, where you, the user, don't have to deal with all that manual work; we take care of it.

Right. And you're not going back to the developers separately for each little thing and saying, now add instrumentation just for this. And it's not like you're going to have 18 different instrumentation pipelines of data coming off this, right? That would be inefficient. So I think there's a lot here.
There's certainly room to take a look at that. If someone wants to learn a little more about ControlTheory and start to experiment with it, where would you recommend they start?

I would hit ControlTheory.com. We have lots of information there, including a lot of how-to videos that walk you through what we do, but also material to get people up to speed on things like OpenTelemetry and the latest standards so they can educate themselves. It reminds us a lot of Kubernetes ten years ago: education is important, people need to learn from the bottom up and then have support from the top down from their management. So come to the trade shows, come to things like KubeCon, look at the CNCF, attend webinars, get more information, and learn about what's happening in the space. And definitely come to ControlTheory and learn more about what we're doing; we'd love to work with you. We have free trials and free demos, and we can help you through the process. We see this transition happening now, with a lot of education required and some complexity people need to work through, and we're here to help.

Yeah. It totally makes sense to me that if you've only got, say, six microservices, this is a bit much. But once you put together even a couple of developers and a couple of different environments, and you start to stitch together an app that spans a bunch of different domains, it gets complex really fast in terms of troubleshooting.

It does, and right away. Even if your volume isn't petabytes of data, the complexity alone is a motivator. Just turning on Kubernetes monitoring within these environments can flood you, it's amazing, let alone what runs on top of it and all your application data. It's a non-trivial task. We can see how we got to where we are today, but there's a new phase happening, and we're building up the intelligence to make it more achievable going forward.

Awesome. Well, thank you, Bob, for being here today and taking the time to explain all that. I look forward to seeing what people can actually do with better observability. I think even six months from now we'll be hearing some great stories from people saying, hey, we got there, this unlocked that, and we were able to do some cool stuff. So thank you for coming today, Bob.

Yeah. Thanks, Mike.

All right. Take care, folks. ControlTheory.com.