Transcript
Mike: Hi, Mike Matchett here with Small World Big Data. We're here talking again about my favorite thing: big data, how we manage it, and what we do with petabytes and petabytes of it in this day of HPC-like workloads. Especially with the onslaught of AI, more and more folks are keeping more and more data. How do we keep that moving around the world for our enterprises and organizations? I've got Arcitecta here today. I've got Jason, and we're going to talk in just a second about what they're up to. Hey, Jason, welcome back to our show.

Jason: Thank you, Mike. It's really good to be here.

Mike: On the one hand, we just keep watching this increasing use of workloads, even by enterprises, that look more and more like HPC workloads. Their data sets are getting bigger, and the things people are trying to do with them start to multiply. Whether we're talking about AI or just getting a handle on where their data is, even if it's database data or old-school workloads, the data sets are getting larger. What are you seeing? What are the broadest trends in the market today?

Jason: It's a classic thing to say, but the amount of data and the size of things continue to grow. In the last week I ran into somebody who has individual files that are a petabyte in size, per file.

Mike: That's pretty incredible. Per file.

Jason: And we are seeing our customers getting into the range of hundreds of terabytes per file, and customers getting up into the trillions of files to be managed. That's a massive amount of data. I've also just come back from the Supercomputing conference, where 7.2 terabits per second of networking came into that building, and over five days they moved 121 petabytes of data. That's an enormous amount. The research community globally is looking to move a lot of data around. The Square Kilometre Array in Western Australia is going to be shipping out a petabyte of data around the globe each and every day. So we've got a couple of things going on here: the amount of data people are producing is exploding, but we also need to move it around the globe to where people are located, both commercially and in research. Taking the data to the talent, if you like.

Mike: Right. We're going to talk about this a little more, because there's more to the story of whether we move the data or move the compute. But I want to follow the path of data growth and what folks are doing with it, and what Arcitecta has been doing. Should we help manage the data where it is? Should we move the data? Should we help folks grow their on-premises data centers, or help them take advantage of the cloud? Where do you see that going today with most of your users?

Jason: There are two sides to this. One is that you've got systems where you need to store potentially hundreds of petabytes of data, and in a traditional high-performance compute environment a lot of that data would have been stored in scratch storage. What we're aiming to do is reduce the size of scratch storage and multiplex data on and off larger research or data archives, into and out of your high-performance compute. So that's on site, in a data center.
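[Editor's note: to make that multiplexing idea concrete, here is a minimal sketch of the stage-in / compute / stage-out pattern Jason describes, where a small scratch tier is filled from a large archive only for the duration of a job. All paths and function names are hypothetical illustrations, not Arcitecta's API.]

```python
import shutil
from pathlib import Path

# Illustrative locations: a large, cheap archive tier and a small, fast scratch tier.
ARCHIVE = Path("/archive/project42")   # hypothetical archive mount
SCRATCH = Path("/scratch/project42")   # hypothetical scratch mount

def stage_in(names):
    """Copy only the inputs a job needs from the archive onto scratch."""
    SCRATCH.mkdir(parents=True, exist_ok=True)
    for name in names:
        shutil.copy2(ARCHIVE / name, SCRATCH / name)

def stage_out(names):
    """Copy results back to the archive, then free the scratch space."""
    for name in names:
        shutil.copy2(SCRATCH / name, ARCHIVE / name)
        (SCRATCH / name).unlink()

def run_job(inputs, compute, outputs):
    """Multiplex one job's working set on and off the small scratch tier."""
    stage_in(inputs)
    compute(SCRATCH)          # the HPC job reads and writes scratch at full speed
    stage_out(outputs)

# Usage (hypothetical job): run_job(["genome.fa"], my_solver, ["alignment.bam"])
```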
Jason: But equally, as we've got very large amounts of data being produced, you sometimes need to leave the data in place, geographically distributed. In that case, you need to take the compute to the data. Leave the data there: if you need to analyze what's in your data, you want to distribute that analysis and compute to the location of the data, without moving the data. So we're seeing both of those patterns. You need to be able to scale into very large systems, and to do that cost-effectively, and traditional file systems and storage are not as cost-effective as they could be. And then you've got the other paradigm, global file systems, where we're taking the data to where the people are working: follow-the-sun workflows, people working in distributed environments, or acquiring data in a remote location and being able to analyze and compute on it. We're bringing all of that together under the one platform.

Mike: So let's talk a little bit about Arcitecta: what you've specialized in and what folks originally might have known you for. You've been doing this for what, 20 years? It was data management and data migration: how do I efficiently get the data to where it needs to be, and how do I do all those other good things to data at that meta level? How do I make sure it's secure, make sure it's tracked, handle the lifecycle considerations? But from what you're describing, you're really starting to push down into what might traditionally be the storage domain. How did you find that to be necessary?

Jason: We started in the space of managing data independent of the way in which it's stored, so we're traditionally a data management company, the layer that sits above the storage. We've also got the capacity to move data at speed over the WAN, with WAN optimization, because often you've got data and you need to take it somewhere, to share it, or for compute, or otherwise. It's all part of a larger ecosystem around a broad definition of data management, which includes acquisition, transmission, preservation, transformation, conversion from one format to another, and multi-protocol access. We started at the top and layered ourselves over other people's storage. But as the amount of data explodes, a lot of those storage systems are very expensive. They're very good at what they do, they've been around for a while, but they're also very expensive. If we sit above them and hide them, then we have the capacity to push ourselves further down the stack and take over the storage. So what we've done recently is to drive block storage directly and remove the need for a traditional clustered file system underneath us. That gives us the overall ecosystem of data management, all the things that we do, with a very cost-effective solution for very large amounts of data, whether local to a given rack or data center, or geographically distributed globally.
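[Editor's note: Jason mentions moving data at speed over the WAN with WAN optimization. One common ingredient of such optimization, offered purely as a generic illustration and not as Arcitecta's implementation, is splitting a large file into chunks and sending them over several parallel streams with per-chunk checksums, so a single connection's latency doesn't cap throughput. A minimal sketch, where the transport function is a stand-in:]

```python
import concurrent.futures
import hashlib
import os

CHUNK = 64 * 1024 * 1024  # 64 MiB chunks; real systems tune this to the link

def send_chunk(path, offset, length, send):
    """Read one chunk, hand it to a transport, and return its checksum."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    send(offset, data)  # 'send' is a stand-in for the actual WAN transport
    return offset, hashlib.sha256(data).hexdigest()

def parallel_transfer(path, send, streams=8):
    """Ship a large file as chunks over several concurrent streams."""
    size = os.path.getsize(path)
    with concurrent.futures.ThreadPoolExecutor(max_workers=streams) as pool:
        futures = [pool.submit(send_chunk, path, off, min(CHUNK, size - off), send)
                   for off in range(0, size, CHUNK)]
        # Collect per-chunk checksums so the receiver can verify integrity.
        return dict(f.result() for f in futures)

# Usage (hypothetical transport): sums = parallel_transfer("big.dat", my_wan_send)
```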
Mike: There are a couple of ways my mind is trying to absorb this. If we looked at a vendor who sells a clustered file system that's really high-performance oriented, they're adding capabilities, but they're adding them to a system that's anchored somewhere. What you've done with the data management layer, which is more global in orientation, is bring some of those capabilities down into the storage more directly, meaning you don't need those features implemented in each point location; in fact, that's probably no longer the optimal thing to be doing. And from what I understand, you're not necessarily telling a given customer they no longer need an HPC parallel file system to run their supercomputing data center. They still need those things, but not all the capacity that surrounds them, or the capacity expansions. And when it comes to sharing that data, they should get off that platform and start to look to Arcitecta. Right?

Jason: Maybe one day someone will have the one file system that does HPC and every other workload, but that's a very big ask. So we're not after replacing your parallel file system or your high-performance flash storage for your high-performance compute. What we're trying to do is help people reduce the size of that, allow very large data archives off to the side of it, and multiplex data on and off that scratch storage. Also, we do find some people run their HPC workloads directly off us. We present as multi-protocol: NFS, SMB, S3, and a lot of other protocols. Sometimes, depending on the workload, those pipelines run faster, typically because the CPUs in our machines are faster than those in the traditional storage and we're talking to the storage underneath in parallel, but your mileage will vary. So the general pattern is to allow people to shrink their HPC scratch and have these larger cost-optimized storage systems, where we place the data at the right place at the right time, we make sure it's genetically diverse in terms of storage, we make it resilient, so you can't delete things without jumping through hoops, and we enable that data to be geographically distributed.

Mike: So if I'm trying to add this up on my fingers: you've got the idea of doing data management across the entire data set, unified management in one place, which makes sense, especially as I grow into petabytes and petabytes, distributed petabytes; I don't want to be doing this multiple times. You've also got ways now to implement storage directly over block services, so you bring your global file system directly over block storage that someone can implement. As you said, it's a really efficient capacity play, and in some cases the performance is good enough, even if not for supercomputer workloads. A lot of workloads and work processes these days are getting more complex, with more components to them, and they don't all need scratch storage, but they could all work better with performant storage. So we've got that.
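[Editor's note: the "right place at the right time", storage diversity, and delete protection Jason describes amount to declarative policy over data. Here is a small hypothetical sketch of what such a placement-and-retention policy check could look like; the rule names and the engine are invented for illustration and are not Arcitecta's configuration language.]

```python
from dataclasses import dataclass

@dataclass
class Replica:
    site: str        # e.g. "sydney", "los-angeles"
    medium: str      # e.g. "flash", "disk", "tape"

# A hypothetical policy: three copies, on at least two media types and two
# sites, with no deletion before the retention period expires.
POLICY = {"min_copies": 3, "min_media": 2, "min_sites": 2, "retain_days": 3650}

def compliant(replicas, age_days=0, delete_requested=False):
    """Check a file's replica set against the placement/retention policy."""
    if delete_requested and age_days < POLICY["retain_days"]:
        return False  # the 'jumping through hoops' case: early delete refused
    return (len(replicas) >= POLICY["min_copies"]
            and len({r.medium for r in replicas}) >= POLICY["min_media"]
            and len({r.site for r in replicas}) >= POLICY["min_sites"])

# Usage:
rs = [Replica("sydney", "flash"), Replica("sydney", "tape"),
      Replica("los-angeles", "disk")]
assert compliant(rs)                  # diverse media and sites: accepted
assert not compliant(rs, 100, True)   # deletion inside retention: refused
```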
Mike: And then you teased this a little: putting compute down into the storage. I caught that. Tell me about what you said there, about pushing compute into the storage.

Jason: Right. When we're dealing with data, we've got an optimization problem, and there's a conundrum: do we take the data to the compute, or do we take the compute to the data? The thing is, you might want to do both, depending on what compute you've got, what resources you've got for compute, and where the data is. I'll give a really simple example. Say I have a global file system with one lot of storage in Sydney, Australia, and one lot of storage in Los Angeles in the US. If you were to bring the data to the compute to compute checksums or check hashes and make sure the data is still valid, you'd move a lot of data across the Pacific. That's not right. So what we've got is the capacity to send an operation to the storage, which then does the hashing or checksumming locally on that storage. That's a very simple example, but you can extend it to metadata extraction, analytics, machine learning, and so on, all distributed out to where the data exists. This is very efficient: it's relatively very cheap to send the instruction set to the data and do the compute locally, versus bringing the data to the compute. Now, this is all about flexibility. If you've got enough compute power in the storage, if the appliance out there managing the storage has hardware acceleration or just high-performance CPUs in it, this is very viable and probably much cheaper than shipping the data across the globe. But if you've got an algorithm that can only run in a computing center on some continent, you may actually have to move the data to the compute. So we manage all of this, and we cost-optimize it for you. You say you want to achieve an outcome, and we'll determine whether we take the computation to the data or bring the data to the compute.

Mike: Wow. You shouldn't have to worry about this. Hearing you say that: that's a thing we've been talking about here for years, looking to see someone come up with it. Be the intelligent system for me. I tell you what I want to have happen, here's the compute I need to run, here's the data I need to access, and you, the system, figure it out for me, so I can get out of the game of setting all the little storage knobs, bells, and whistles, and buying all the little components and stitching them together. I have a business problem to solve, or a development application to run, or some workload to get done; the system should solve that for me.

Jason: Yeah. And this didn't just happen today. We've been working away at this for the last two decades, building the platform that gives us, effectively, an operating system for data that is local and distributed. That's why I said we started from the top down. We didn't want to invent the storage, but we've made our way down into the storage, and these distributed storage ecosystems are now very easy for us to build. We've created and deployed the data management ecosystem we have today, lots of customers are using it, and it's now really easy for us to assemble these kinds of patterns and play with where the compute and the data are located.
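[Editor's note: Jason's checksum example reduces to comparing two costs: shipping the bytes to the compute versus shipping a tiny instruction to the bytes. A toy sketch of that decision follows, with made-up numbers and function names; nothing here is Arcitecta's actual scheduler.]

```python
INSTRUCTION_BYTES = 4 * 1024          # an operation description is tiny

def bytes_moved(data_bytes, compute_at_data):
    """How many bytes cross the WAN under each strategy."""
    return INSTRUCTION_BYTES if compute_at_data else data_bytes

def plan(data_bytes, op_runs_at_storage_site=True):
    """Pick the cheaper strategy, unless the algorithm is pinned elsewhere."""
    if not op_runs_at_storage_site:
        return "move data to compute"  # e.g. code tied to one continent's cluster
    return ("send compute to data"
            if bytes_moved(data_bytes, True) < bytes_moved(data_bytes, False)
            else "move data to compute")

# Verifying a 100 TB dataset in Sydney from Los Angeles:
print(plan(100 * 10**12))             # -> send compute to data
print(plan(100 * 10**12, False))      # -> move data to compute
```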
Mike: We don't have a whole lot more time here, Jason, and we've hardly talked about all the things your global file system can do. I know there's a lot more of that, and we didn't really get to some of the things you've long been doing with data management. If someone wants to learn more, where would you send them?

Jason: Send an email to talk@arcitecta.com, or come to our website, arcitecta.com. I'd encourage people to have a conversation with us and see what's possible.

Mike: Amazing things I'm hearing from you, and I'm looking forward to our next discussion about how we take things like high-speed, global, performant file systems and make them components of a larger architecture, one that's more intelligent about where our data is and optimizes based on the priorities we have for cost, space, access, and getting our workloads done. Let's have that infrastructure work for us, right? Love to hear it.

Jason: No worries. Thanks, Mike.

Mike: All right, thank you. Check it out: arcitecta.com. Take care, guys.