Catch the full video here: Take the Hammer to Object & NAS Storage
Mike Matchett: Hi, I'm Mike Matchett from Small World Big Data. We're going to talk today about this evolution in big data. What does it mean to make data as a service, and how does somebody really take advantage of distributed data today? Hybrid clouds, putting things in the cloud, putting things on premises, and making it all work and look like one global namespace to make it easier for customers: it's pretty complicated stuff. I've got David Flynn here today. He's the CEO of Hammerspace, and Hammerspace just launched. Welcome to the show, David.
David Flynn: Glad to be here Mike.
Mike Matchett: So you were at Fusion-io, and there I think you were telling me the SAN sucks, so we want to stick stuff in the server. And now you're at Hammerspace and that message has come forward to NAS: you're telling me that NAS today kinda sucks and we want something new. So what are you doing at Hammerspace?
David Flynn: Yeah, yeah. Well, it's data as a service, but the way to think of it is this. In the SAN world, because that's blocks, you put a file system on it at the client, and the client is what adds the metadata. Metadata is important because that's how we perceive the data: we perceive the data as what it's called in the directory and who has access to it. The data about the data is what we work with and think of as the data. So in the SAN world the file system is on the client, and that allows the storage to be very fast. In the NAS world the file system is down in the shared storage, and that allows many people to share it, see that same metadata, and talk about that as the data. What we're doing is taking that file system, the metadata, our view of what the data is, and serving it from a separate control plane. So data as a service is made possible by separating the metadata from the actual strings of bytes of data that are distributed in whatever storage. By separating the metadata, the data about the data, and replicating it everywhere, you can have the view of your data everywhere, and where the data gets realized, where instances and copies of it get put around the environment, becomes something that can be automated and transparent.
David Flynn: And data has gravity, so it's not like we want to toss the data around everywhere every time we want to analyze what's in it and where it's at. We don't make millions of copies of it, for sure, right? It's best that the data lives where it's used. So that's the fundamental problem with having the data and the metadata coupled: every time you copy it, it's now a new data item, and it's like cutting off the Hydra's head. You just have more data to manage. If you separate the metadata and serve it as a separate control plane, then when you move instances of the data around you still only have one data item, even if under the covers there are lots of replicas of it. As a consumer of the data you see one, and the system automatically presents you with one of those copies as you need it.
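The "one data item, many replicas" idea described above can be illustrated with a small sketch. This is not Hammerspace's implementation; the class names, paths, and location names are all hypothetical, and the model assumes a metadata service that only tracks names and copy locations.

```python
from dataclasses import dataclass, field


@dataclass
class DataItem:
    """One logical file as the consumer sees it (hypothetical model)."""
    path: str
    tags: dict = field(default_factory=dict)
    replicas: set = field(default_factory=set)  # locations holding a copy


class MetadataService:
    """Toy control plane: tracks names and locations, never the bytes."""

    def __init__(self):
        self.items = {}

    def create(self, path, location):
        self.items[path] = DataItem(path, replicas={location})

    def add_replica(self, path, location):
        # Copying the bytes adds a location, not a new data item.
        self.items[path].replicas.add(location)

    def resolve(self, path, client_site):
        # Prefer a copy at the client's site, otherwise fall back to any copy.
        replicas = self.items[path].replicas
        return client_site if client_site in replicas else sorted(replicas)[0]


svc = MetadataService()
svc.create("/projects/scan.tif", "nas-onprem")
svc.add_replica("/projects/scan.tif", "s3-cloud")

print(len(svc.items))                                  # 1
print(svc.resolve("/projects/scan.tif", "s3-cloud"))   # s3-cloud
print(svc.resolve("/projects/scan.tif", "edge-site"))  # nas-onprem
```

The catalog still holds a single entry after the second copy is registered, which is the contrast with coupled metadata, where each copy would show up as a separate file to manage.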
Mike Matchett: I'm going to give you credit for something you probably don't take credit for. I think the original premise of software-defined storage was separating the control plane and the management from the data plane, truly separating them, and you guys have actually done that, but you don't talk about it that way. But that really unlocks lots of different values. You can now layer services on top. One of the things we talked about was even just performance: when I'm looking through the metadata to find files, or I want to search things or apply policies, I don't have to crawl through the data and get in the way of people actually consuming the data, right?
David Flynn: So it's like you don't have to wait for an elephant to be eaten before you can get at the little sugar cube. Metadata has very high value density, and it needs to be served with ultra-low latency. Having it separated means you're not waiting behind the big packet traffic of the file IO. It's inherently faster, and that's the cool part about this. One of the other things I'd like to point out, if I may, is that with the separation of metadata from the data, we can focus on what really matters, and that's how to enable the data consumer to find what he needs and work with it. That's where tags, labels, keywords, and attributes come in: having rich, enhanced metadata and the tools to do slicing and dicing, bucketing, dynamic directories or catalogs, collections, and summations. Ways to analyze the data through sophisticated metadata aid agility from a data consumer's point of view. So this isn't ultimately about the storage anymore. This is about cloud-ifying the world of shared file data by making it so that you don't have to worry about the infrastructure. If cloudification has one high-level concept, it's the separation of the infrastructure concerns from the user concerns, and in the world of shared file that hasn't been done. What makes it possible is presenting your world of data through a service that's metadata driven and separated from where the data actually resides. Now the data can reside not only in different types of storage systems but even at different sites at the same time, yet still be the same one piece of data.
Mike Matchett: So let's talk about this just to give a practical example. If I have a NAS on premises, I have some cloud storage, and that cloud storage could be file and/or object, right? And I might have some edge storage. Your Hammerspace solution is going to provide a global namespace across all of that, right?
David Flynn: Your data exists in an enhanced file-system-like view, something that can be mounted as a file system or even talked to as an object store. You can browse it through your Explorer or file finder. You can use it like a file system; your applications can mount it and use it like file across all of those sites. And the data itself can exist on high-performance flash, maybe even in the server itself, all the way back to cheap and deep object storage, S3, and all of that. This truly separates where the data physically resides from the view of the data, i.e. the metadata presentation of it as a mountable, navigable file system, and by the way, with snapshots, so you have the time history of that data namespace. So a Hammerspace is a place where data exists fully abstracted from the infrastructure. It represents the cloud-ification of unstructured shared data in a way that object storage hasn't given you, because object storage is a total compromise on performance; it assumes no performance, really. Shared file, the NAS world, hasn't given us that either, and you obviously can't do it with block storage because it doesn't even have the metadata.
Mike Matchett: So this all sounds great, but there's another kind of trick that you must be doing: I can access this Hammerspace from wherever I'm at, but the data is not actually here yet. What's going on to bring the data to me once I decide what I want to access?
David Flynn: So the answer there is you have to have two models. One is an on-demand model, where the mere act of trying to read or access something causes it to get materialized nearby so that you can access it. And then if you start accessing it with a lot of IO intensity, it gets promoted into something very high performance. So you can think of that as a spectrum of promotion and demotion based on use: promoting it to the right data center, and then promoting it to the right performance level within the data center. And the beauty of our technology is we can do that transparently. We can move data from one storage container to another without disrupting access. You can even be writing a file when the system decides to move it somewhere else, and it doesn't interrupt the IO stream; it doesn't even induce a barrier or a flush. We call it data virtualization, and it forms the foundation of being able to hide the infrastructure: it can rebalance things across the infrastructure without changing the perceived continuity of access. It has to be there.
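A minimal sketch of the on-demand model described above, assuming three made-up tiers and a made-up read-count threshold. It only models the placement decision; real data movement, and the non-disruptive move during active IO, are out of scope here, and none of the names come from Hammerspace.

```python
class TieredStore:
    """Toy model of on-demand placement; tiers and threshold are hypothetical."""

    def __init__(self, hot_threshold=3):
        self.location = {}  # path -> tier currently holding the bytes
        self.reads = {}     # path -> number of reads seen so far
        self.hot_threshold = hot_threshold

    def put(self, path, tier="remote-object"):
        self.location[path] = tier
        self.reads[path] = 0

    def read(self, path):
        self.reads[path] += 1
        tier = self.location[path]
        if tier == "remote-object":
            # First touch: materialize a copy near the reader.
            self.location[path] = "local-nas"
        elif tier == "local-nas" and self.reads[path] >= self.hot_threshold:
            # Sustained access: promote to the fast tier.
            self.location[path] = "local-flash"
        return self.location[path]


store = TieredStore()
store.put("/data/model.bin")
tiers = [store.read("/data/model.bin") for _ in range(3)]
print(tiers)  # ['local-nas', 'local-nas', 'local-flash']
```

The caller uses the same path for every read; only the tier recorded behind that path changes.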
David Flynn: And likewise, if I think about it the other way around, if all I'm doing is looking at the metadata, that can give me a lot of value, because there's an awful lot of information about the data and that's really just what I want.
Mike Matchett: We're really thinking about metadata, with all the tagging and other things that we could add to this to make it really rich, unlike some object stores that we know and love...
Mike Matchett: Or the file systems of yesteryear even though you don't have tags.
If I'm doing that, I'm not impacting it; I'm not actually moving the data. The data today stays where it is.
David Flynn: The beauty of metadata is that it has such high value density that you can replicate it aggressively everywhere, and the data itself can follow on demand. Or, and here's where the system ties together, the second model is proactive, based on policy statements of intent. That itself is metadata: our platform has data about the data that says how you might want it proactively aligned across the infrastructure so that it can be optimally used. We call those service level objectives. Those objectives are part of this enhanced metadata.
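The proactive model, where statements of intent are themselves metadata, could be sketched roughly as follows. The objective format, tag names, and location names are invented for illustration and do not reflect Hammerspace's actual objective syntax.

```python
def placements_for(tags, objectives):
    """Return every location required by objectives whose condition matches the tags."""
    required = set()
    for condition, locations in objectives:
        # An objective applies when all of its condition tags match the file's tags.
        if all(tags.get(key) == value for key, value in condition.items()):
            required |= set(locations)
    return required


# Hypothetical objectives: (tag condition, locations where a copy should exist).
objectives = [
    ({"project": "genomics"}, ["onprem-flash"]),
    ({"retention": "archive"}, ["cloud-object"]),
]

file_tags = {"project": "genomics", "retention": "archive"}
wanted = placements_for(file_tags, objectives)
print(sorted(wanted))  # ['cloud-object', 'onprem-flash']
```

Because the objectives are plain data keyed on tags, changing a tag changes the desired placement without touching the file or the application that reads it.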
Mike Matchett: We don't have a whole lot more time; we started talking about metadata as a service.
David Flynn: We think of metadata as the data. Especially if the data just arrives when you go to open it.
Mike Matchett: So yeah, it's like there's metadata on stuff, and I actually do think we have some time for that. You're telling me there's detailed telemetry on everything going on, which creates metadata even about accessing the data, in some ways we won't go into at all. But what are you doing with that? Because that's some cool stuff too.
David Flynn: Think of this fundamentally as a metadata-driven, separated control plane that allows us to have the traditional file system metadata, enhanced metadata from the user and operator, or even harvested metadata from third-party applications. One of the things that we recently demonstrated was putting Hammerspace next to your existing NAS environment. We replicate the data into the cloud, and within the cloud it automatically runs analytics like image recognition, or maybe checks for compliance information, and the resulting metadata gets put back into the Hammerspace. You can use it to predicate decisions about data sovereignty, immutability, whatever you want. Those things can be articulated as a mapping from this enhanced metadata down into the objectives, the service levels, the controls. Think of the metadata as being put in a feedback control loop where it's able to manage the data across the hybrid cloud automatically. But what you're talking about is another source of metadata, and that's the telemetry that says what's accessed when and what kind of performance was seen. In our world every last IO is accounted for: how much it affected performance, how long it took, how many bytes were transferred. All of that information, at least in aggregate form, at varying levels of granularity, gets recorded down to the individual data object being accessed and the clients and users accessing it. That information about the data, that metadata, is extremely valuable as well. So what we're seeing is that for unstructured data to truly be manageable within the hybrid cloud, we have to separate the metadata and have it be a separate platform that has enough power to subjugate the data: where the data gets substantiated and how the data gets presented as a file system. And we put all of that together so that in the end you, the data user, can have self-service control and can have concerns about the infrastructure dissolve into the background.
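As a rough illustration of telemetry becoming metadata, the sketch below aggregates per-IO records by data object and client. The record layout (path, client, bytes, latency in milliseconds) is an assumption made for this example, not a description of what Hammerspace records.

```python
from collections import defaultdict


def aggregate(io_events):
    """Summarize per-IO records into counters keyed by (path, client)."""
    stats = defaultdict(lambda: {"ops": 0, "bytes": 0, "total_ms": 0.0})
    for path, client, nbytes, latency_ms in io_events:
        entry = stats[(path, client)]
        entry["ops"] += 1
        entry["bytes"] += nbytes
        entry["total_ms"] += latency_ms
    return dict(stats)


# Hypothetical IO records: (path, client, bytes transferred, latency in ms).
events = [
    ("/videos/a.mp4", "client-1", 100, 2.0),
    ("/videos/a.mp4", "client-1", 200, 3.0),
    ("/videos/a.mp4", "client-2", 50, 1.0),
]

summary = aggregate(events)
print(summary[("/videos/a.mp4", "client-1")])
# {'ops': 2, 'bytes': 300, 'total_ms': 5.0}
```

Aggregates like these could then feed the same kind of placement decision as user-supplied tags, which is the feedback loop described in the answer above.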
Mike Matchett: I mean, you're basically making a storage brain here, so that you no longer have to care about where the data is. You've opened so many boxes here today that I think we'd want to dive into another 10-minute session.
David Flynn: We haven't talked about the machine learning and AI part.
Mike Matchett: It's one thing to have the metadata to say what is wanted, and it's something else to have the system know how to lay out the data across the infrastructure to meet that. And that too is a really interesting topic. So maybe we'll have to get back together again. Just quickly, where do I go if I want to look at Hammerspace to get more information?
David Flynn: Hammerspace.com. It's all one word.
Mike Matchett: OK. Look this up, and everyone should, because this is some crazy new cool stuff that's likely the future of storage, right?
David Flynn: I think we're going to look back in five or 10 years and go, "How on earth did I manage data, copying it around and moving it around between these silos?" We should be able to think of data as having an existence in a Hammerspace that is not really coupled to where the bits physically reside.
Mike Matchett: I mean, that's the whole promise of virtualization and cloud right there. Thank you, David, for being on the show today.
David Flynn: My pleasure.
Mike Matchett: All right. Thank you for watching. There are so many more things to dive into, especially about this idea and some of the ideas we brought in here too. I'm sure we're going to have David back to talk about Hammerspace some more.