Transcript
Hi, Mike Matchett with Small World Big Data, and we are here today talking about, of course, my favorite topic: AI. How do you scale it? How do you implement it? How do you get it working for you? Lots of people are running into challenges with their infrastructure, getting it big enough to run these huge LLMs. They're looking at RAG, they're looking at ways to augment their LLMs with extra data, but even that starts to raise some scalability issues. We've got Kioxia here today talking about the vector database solution they've built, open source, to help us all with that. Hold on a second and we'll get right into it.

Hey, welcome to our show, Rory.

Hi. Great to be here.

So tell us a little bit about what you're doing with AI. How did you even start to do AI things at what most people think of as a flash hardware company? How did you cross that barrier?

Well, that's a great question. A lot of people really wonder what an SSD manufacturer, a flash manufacturer, is doing in AI, and why they should even listen to us. It turns out we've been involved in AI for many years now, starting back in 2017. We installed machine vision systems in our factories to inspect the wafers as they come down the lines, and we developed a lot of the machine vision software ourselves. Our research institute has published many papers on AI. In general, we see it as a huge opportunity because, as should be apparent by now, AI runs on, lives on, and breathes data, and anything that increases the need for data and high-speed storage is a good thing for us.

All right. So let's talk a little bit about RAG, the retrieval-augmented generation that people are doing with their LLMs. First of all, why, and what does RAG do for people that they just can't get from, say, ChatGPT directly?

That's a great question. RAG is actually one of the primary areas of focus for me right now within Kioxia. These large language models that are in the news today are really, really expensive to make, and they take a long time and a lot of processing power. It can take months for a large language model to be distilled down into the form you can use in your enterprise. But they're trained on publicly available data, and since they take months to train, by the time they're published they're already a bit out of date. The wonderful thing about retrieval-augmented generation, or RAG, is that it allows you to supplement these large language models, which were created at great expense, with your own private data as well as up-to-the-minute data, so you can really ground your results and get more timely and accurate answers from your AI systems.

Yeah. So RAG is this idea that you take your own data, chunk it up in some way, and feed the chunks relevant to the prompt into the larger, static LLM, and it augments the result. It gets smarter because it gets specific data, or current data, or private data. But that's still a bit of a mystery to people; it's a black box. RAG involves something I've heard called vectorization. Maybe you could explain what that is and why it becomes a problem for people.

Right. To really take advantage of RAG, you have to preprocess your own data: your private data or your up-to-the-minute data. That involves creating small pieces called embeddings, or what you referred to as chunks. These embeddings are tiny snippets of your data, and each one is classified along several dimensions within a vector. If you're talking about visual data, those dimensions could be color, or shape, or size, or anything else that describes the data. So you break your data up into little pieces, you quantize it along the dimensions that are of interest to you, and you form vectors. Then you insert those vectors into a vector database and create an index for it. When someone wants to perform a query, the query is submitted to the vector database, which looks up the best-matching vectors and brings back the relevant context to feed into the large language model and generate these augmented results.
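To make that pipeline concrete, here is a minimal sketch of the retrieve-and-augment step Rory just walked through. The embedding function and the final LLM call are placeholders standing in for whatever models you actually use; they are assumptions for illustration, not any particular product's API.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: a real system would call an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)              # e.g., a 384-dimension vector
    return v / np.linalg.norm(v)              # normalize so dot product = cosine similarity

# Offline: chunk your private, up-to-the-minute data and index the vectors.
chunks = [
    "Q3 revenue grew 12% on strong SSD demand.",
    "The new data center comes online in March.",
    "Support tickets dropped sharply after the firmware update.",
]
index = np.stack([embed(c) for c in chunks])  # the "vector database"

# Online: vectorize the question and look up the best-matching chunks...
question = "How did SSD sales affect revenue?"
scores = index @ embed(question)
context = [chunks[i] for i in np.argsort(scores)[::-1][:2]]

# ...then feed the retrieved context to the LLM alongside the question.
prompt = "Answer using this context:\n" + "\n".join(context) + "\n\nQuestion: " + question
# answer = call_llm(prompt)                   # placeholder for the LLM call
```

In a production system the in-memory array and sort above would be a real vector database with a real index, which is exactly where the scaling questions discussed next come in.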
And so this vector database becomes a critical piece of infrastructure to host. What are some of the challenges with doing that?

Well, as you want more relevant, higher-accuracy results, you end up classifying your data along more axes, generating more dimensions, and in fact generating more vectors. The scale of these vector databases is growing with no end in sight. There are now deployments with over a trillion vectors in the vector database, and each of those vectors can have several hundred dimensions describing that particular node in the vector space.

So we're trying to get ahead of the LLM and offload, or preload, that work right away, but now we've got a trillion-row database to manage in front of it. That doesn't sound feasible.

Yeah. I should point out that a trillion vectors is a very, very large database, but it's not uncommon for vectorization, particularly with multimodal data, to explode the size of the data set by a factor of anywhere between 2 and 10x. So the demands on your storage, and in particular on the memory subsystems that hold the indices for these vector databases, are growing without end right now. Truly a challenge.

And as I understand it, when you create these vector databases, the point is fast lookups, so you hold them in memory, and memory becomes the critical resource. And you guys are a flash outfit. So tell me the story of what you're developing to get this stuff out of memory and into flash.

Well, first I'm going to do a call-out and say we're standing on the shoulders of giants here. A while back, Microsoft did some research into a technique called DiskANN, or disk-based approximate nearest neighbor search. That technology was developed specifically to address the memory-pressure problem: to move the vectors out of memory onto SSDs, and to shrink the indices by quantizing the vectors, reducing their size at the expense of some accuracy, so they fit more easily into DRAM.
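Some rough arithmetic shows why that memory pressure bites. The vector count and dimensionality below come from the figures Rory cites; the four-byte floats and the size of a compressed code are illustrative assumptions.

```python
# Back-of-the-envelope index sizes for the scale discussed above.
num_vectors = 1_000_000_000_000      # "over a trillion vectors"
dims = 300                           # "several hundred dimensions"
TB = 1024**4

full_precision = num_vectors * dims * 4    # float32 vectors held in DRAM
compressed     = num_vectors * 64          # assumed ~64-byte quantized code per vector

print(f"full-precision vectors: ~{full_precision / TB:,.0f} TB")   # roughly 1,100 TB
print(f"quantized codes only:   ~{compressed / TB:,.0f} TB")       # roughly 58 TB
# Quantizing helps enormously, but even the compressed index dwarfs the DRAM of a
# single server, which is why pushing data out to SSD matters so much at this scale.
```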
But when they quantized those vectors and gave up some accuracy, they didn't stop there. They developed some very innovative techniques to post-process the results and regain very high accuracy, even though the indices don't use the full-precision vectors. So Microsoft was the first to do this; they created DiskANN. We at Kioxia embraced this technology and wanted to extend it. Whereas Microsoft took the first step of reducing the memory footprint, we flattened the memory footprint, and as far as we can tell, there are no limits to the scalability now. This serves multiple purposes. The first is that it makes it practical to have very, very large databases. But AI is also proliferating; it is everywhere, and it's going into edge devices and handheld devices, which have very limited memory. So it's very applicable at the small end of the scale too.

All right. And I understand your solution, which we're pronouncing "AiSAQ". How do you spell it?

A-i-S-A-Q: all-in-storage approximate nearest neighbor quantized search.

And there will be a test later for you folks watching, so get that down. But AiSAQ is open source that you've built and are offering to the community, right? So if I take my vector database, get it out of DRAM, and put it on, essentially, SSDs or flash, I think I can gain a lot of benefits. I assume I get a little bit of lag, that it's going to be a little slower, but correct me if I'm wrong. What are some of the things I gain if I'm sitting there saying, I've got to build an AI cluster, or I'm trying to rent a whole bunch of resources in the cloud? What do I really benefit from doing this?

Well, the first thing is that there's a bit of magic behind what Microsoft originally did. Their disk-based solution can actually outperform the in-memory vector databases at high levels of query accuracy. The way they did that, once again, is by quantizing the vectors; the lower-precision vectors are easier to perform arithmetic on, so the initial part of the search is less work before they have to go to the SSD, retrieve the full-precision vectors, and do the final work. So at higher levels of recall accuracy, it actually crosses over and outperforms the memory-based solutions. That's an interesting aside, and then we built upon it even further.
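That crossover comes from the two-stage search Rory describes: a cheap pass over small quantized vectors held in memory, then an exact re-rank using only the handful of full-precision vectors fetched from the SSD. Here is an illustrative sketch; the int8 scaling below is a simple stand-in for real product quantization, not the actual DiskANN or AiSAQ implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
full = rng.standard_normal((100_000, 128)).astype(np.float32)      # full precision, "on the SSD"
full /= np.linalg.norm(full, axis=1, keepdims=True)
codes = np.round(full / np.abs(full).max() * 127).astype(np.int8)  # small codes kept in DRAM

def search(query: np.ndarray, k: int = 10, candidates: int = 200) -> np.ndarray:
    q = query / np.linalg.norm(query)
    # Stage 1: cheap approximate scoring against the compressed codes in memory.
    approx = codes.astype(np.float32) @ q
    cand = np.argpartition(approx, -candidates)[-candidates:]
    # Stage 2: fetch only those candidates' full-precision vectors (a few SSD reads
    # in a real system) and re-rank them exactly to recover the lost accuracy.
    exact = full[cand] @ q
    return cand[np.argsort(exact)[::-1][:k]]

print(search(rng.standard_normal(128).astype(np.float32)))
```

The point is that the expensive full-precision math touches only a few hundred candidates, which is why the disk round trips do not have to dominate at high recall.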
Okay, so you've got some bigger advantages. And I understand that, speaking as a former capacity planner, the more we can get out of memory, the less of that critical and limited resource I need. But there are some clustering benefits here too, right? Because if it's on disk, it looks like I could share this database rather than keep multiple copies of it, one in each node.

Right. The benefits of reducing that memory footprint and flattening it out are manifold. In a multi-tenant environment, you can run more instances of vector databases on the same machine, because they each take far less memory. That's one benefit. Another benefit is that if you're doing any sort of on-demand processing in a vector database, you don't have to preload the index before you can do a query, because the index is actually out on the SSD, so the time to first response is greatly shortened in the AiSAQ solution. And then, as you alluded to, if you have a network storage or shared storage environment where these vector databases and their indices all live out on the storage, and you're trying to scale out your service, you can provision a new node, attach it logically to that shared storage space, and you're up and running without having to do any sort of initialization of the database. So you can easily respond to burst demands in your environment.
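The sketch below illustrates that attach-and-serve idea in miniature: the vectors live in a plain file that stands in for an index on shared storage, and a freshly provisioned node memory-maps it and answers a query immediately instead of first loading a huge index into DRAM. The file layout and path are assumptions for illustration, not AiSAQ's actual on-disk format.

```python
import numpy as np

dims = 128
path = "vectors.f32"    # stand-in for a file on a shared-storage mount

# Elsewhere, once: some node wrote the vectors out to shared storage.
np.random.default_rng(0).standard_normal((50_000, dims), dtype=np.float32).tofile(path)

# A new node "attaches": no bulk load, just a memory map backed by the storage.
vectors = np.memmap(path, dtype=np.float32, mode="r").reshape(-1, dims)

# It can serve queries right away; pages are read from storage on demand.
query = np.random.default_rng(1).standard_normal(dims).astype(np.float32)
print("best match:", int(np.argmax(vectors @ query)))
```

Because nothing has to be staged into memory first, several such nodes, or several tenants on one machine, can share the same files and start answering queries as soon as they attach.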
And you had mentioned AI moving toward the edge, which we think is a really big trend. That's where people are going to want to apply AI, where the business is happening. There's an advantage here too, because you don't have to have terabytes of DRAM in your edge devices.

That's right. I mentioned those large vector databases earlier. Using the traditional approaches, when you get up to about 100 billion vectors in your vector database, it requires hundreds of terabytes of DRAM for the indices. Microsoft flattened that down to only requiring tens of terabytes, but that's still a lot. With AiSAQ, you can get down to a flat footprint of about 200GB of DRAM, regardless of the size of the vector database.

Yeah. And not to put you guys down, but flash is relatively cheap these days, right, and getting cheaper.

Thanks.

But putting ten terabytes of DRAM on an edge node would still be pretty expensive. I don't think your next cell phone is going to have terabytes of DRAM in it.

No.

Oh man, you should see how many windows I have open; I could use that. But anyway, we are looking at a pretty significant advantage to having an on-disk vector database for RAG that doesn't lag in performance and can actually deliver better performance, right? It seems like a good technology. And you're doing this in open source, which is an unusual thing; we talked about that. If someone wants to find out more about this, or maybe get their hands on that open source and kick the tires on it, what would you point them at?

So the Kioxia America GitHub repository is where you'll find all the open-source projects, and DiskANN and AiSAQ are listed prominently there. We don't have a web landing page for this project yet, so I would suggest going to the GitHub repository and monitoring it for updated information. Eventually, I'm sure we'll have a landing page for it.

Sure. And you can go to Kioxia.com and look for information about the flash solutions; you really do make money selling flash, to start with. That's great. And I know there's more coming; offline, you've been hinting at things for me, so we're looking forward to having you back to tell us what's coming up next and what more you're releasing. I'm just super excited to hear about this kind of advance coming from a sector of the market I didn't exactly expect it to come from, right? You're not selling GPUs, you're selling the flash. It's great to see everyone contributing here. So thank you for being here today, Rory.

My pleasure.

All right. And check it out. If you're in the AI space, or you're trying to build an AI solution, here is an on-disk, out-of-memory, scalable vector database for your RAG solutions, which you all need. That's how you keep your LLMs relevant and current. So take care, folks.