View the video here: Anaconda for Python
Mike Matchett: Hi, I'm Mike Matchett with Small World Big Data, and we're about to talk with Matthew Lodge, the Senior VP of Product for Anaconda. That's the big snake, of course. Anaconda in this world, in the big data space, means Python. Python, the big snake eating its tail, making the whole world revolve around lifecycle stuff. Why do I need a Python? A whole kit and utility platform, even? We're going to find out, because Python is open source. I should just be able to download it with whatever I have. In fact, it's even already installed on my MacBook. So let's talk to Matthew. Welcome, Matthew.
Matthew Lodge: Great, thanks very much for having me on the show.
Mike Matchett: So maybe we could start a little bit with: why would an enterprise need Anaconda? Why do we need a whole big thing for Python?
Matthew Lodge: Yeah, that's a great question. So the founders of Anaconda, you know, built a whole new package manager for Python. You might say Python already has a package manager, pip. You pip install something and, you know, what's the problem? Now, the data science world is a lot more complicated than your regular Python packages, particularly because there are a lot of mixed languages, so you have bits written in C and C++. As soon as you get into that area... well, TensorFlow is a really great example. That's the machine learning library from Google. The instructions for installing TensorFlow on Linux run about four pages. You have to have root, you need to compile some C++ source code in the middle, and if you manage to get through to the end of that, you'll have TensorFlow. About 40 hours later you'll have TensorFlow. And you'll see people high-fiving each other on Twitter because they got TensorFlow to work. So instead you say conda install tensorflow and there it is. We've compiled it for your particular platform and that's what you get. And that's why the Anaconda distribution today is up to about 7 million users.
Mike Matchett: Seven million, and I think your download rate is also up in the millions, right? Yes, I've got that correct. Incredible number.
Matthew Lodge: Yeah. Two and a half million downloads a month. Yeah.
Mike Matchett: All right. And I downloaded it. I even have the Anaconda sticker on the back of my laptop, which I can't show you here, because it's a great Python environment. Everything is packaged for Python. One of the values that comes out of you guys having packaged up everything, and having it be as easy as a Python install, is that if I'm in an IT enterprise shop I can create my own repositories, right? I can go through and say I just want to include these things because they're licensed properly for enterprise, and then only let my data scientists work in that sandbox, right?
Matthew Lodge: That's right, yes. You can essentially customize your own distribution. You can have whatever versions you want, and you can eliminate libraries that have certain licenses; maybe you don't like GPL and don't want to have any of those. So we let you do all of that. The advantage for the IT organization is that they get control and, you know, they can audit that whole process. But for the data scientist nothing changes. They just do the conda install, and now they're installing from a local repository instead of the public one we provide on the internet.
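The curation step described above can be pictured as a license filter over a package catalog. The following is a minimal sketch only, not Anaconda's tooling: the `Package` record, the `APPROVED_LICENSES` policy, and the catalog entries are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Package:
    """One catalog entry (hypothetical structure, for illustration)."""
    name: str
    version: str
    license: str


# Hypothetical policy: licenses this example organization has approved.
APPROVED_LICENSES = {"BSD-3-Clause", "MIT", "Apache-2.0"}


def curate(packages, approved=APPROVED_LICENSES):
    """Return only the packages whose license is on the approved list."""
    return [p for p in packages if p.license in approved]


# Made-up catalog entries; these are not real packages.
catalog = [
    Package("fastmath", "1.2.0", "BSD-3-Clause"),
    Package("plotkit", "0.9.1", "GPL-3.0"),
    Package("tablelib", "2.0.3", "MIT"),
]

mirror = curate(catalog)
print([p.name for p in mirror])
```

In this sketch the GPL-licensed entry is left out of the internal mirror, while the data scientist's install command would stay the same and simply resolve against the filtered set.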
Mike Matchett: There's no doubt that Python is the language of data scientists today. And you also have R in there, though, for those people who are still statisticians and like to do things the difficult way, as I like to say. But Python's the core thing. Tell me a little bit more now about Anaconda Enterprise. So you've got this common download, two million people a month are downloading this thing, but now you have Enterprise. What does Enterprise do for somebody?
Matthew Lodge: Anaconda Enterprise is designed to solve the problem of how you support a large population of data scientists in a typical enterprise environment. So we have this open source product. Anyone can use it on their own laptop, and it makes it very easy for them to do their data science in Python or R. When you have a thousand data scientists, or hundreds of data scientists, inside your organization, you have a different set of problems. You want those data scientists to collaborate with each other. We already talked about having a private repository for compliance reasons, but also none of this data science does you any good if you can't put it into production, right? So: I developed my program or trained my machine learning model, I'm confident that it works and it's making good predictions. Now how do I go from that to something that is part of how the company does business? The IT world is very familiar with how to do that for software in general. But now you need to be able to do it for Python and R models, and take those and put them into production. And your average data scientist is not a software developer. They don't think of themselves that way. They're writing code in order to build a model, but they're not DevOps experts who have pipelines and all of that. Dev, test, moving things through the pipeline is not what they do. So Anaconda Enterprise also automates that step. It enables you to go from a model into production basically by hitting one button, and under the covers what we're doing is using Docker, Kubernetes, container-based technology to build out that application as a fleet of containers that runs on top of Kubernetes. So now that is something that can run in production and serve predictions, perhaps, and be integrated with other software. It's deployed as an API, as an endpoint, and you can call it from other software.
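To make "deployed as an API, as an endpoint" concrete, here is a minimal sketch of a model wrapped in a JSON-over-HTTP handler using only the Python standard library. It does not show how Anaconda Enterprise builds or deploys anything: the toy linear model, `WEIGHTS`, `BIAS`, `handle_request`, `PredictionHandler`, and `serve` are all assumptions made for illustration, and the container and Kubernetes layers are left out entirely.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for a trained model: hypothetical fixed weights and bias.
WEIGHTS = [0.5, -0.25]
BIAS = 1.0


def predict(features):
    """Score one input with the toy linear model."""
    return BIAS + sum(w * x for w, x in zip(WEIGHTS, features))


def handle_request(body: bytes):
    """Turn a JSON request body into an (HTTP status, JSON bytes) pair."""
    try:
        features = json.loads(body)["features"]
        if len(features) != len(WEIGHTS):
            raise ValueError("wrong number of features")
        status, result = 200, {"prediction": predict(features)}
    except (ValueError, KeyError, TypeError) as exc:
        # Malformed JSON, a missing key, or non-numeric input ends up here.
        status, result = 400, {"error": str(exc)}
    return status, json.dumps(result).encode("utf-8")


class PredictionHandler(BaseHTTPRequestHandler):
    """Expose handle_request over HTTP POST."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_request(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Run the endpoint locally; call this explicitly to start serving."""
    HTTPServer(("127.0.0.1", port), PredictionHandler).serve_forever()
```

Keeping the request logic in `handle_request`, separate from the HTTP class, means other software (or a test) can exercise the prediction path without starting a server.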
Mike Matchett: Right. So, you know, we were talking, and you've got libraries in there to make use of GPUs, which you do with your partner NVIDIA, and Dask for parallel threading. And the fact that it deploys on Kubernetes should give a hint to people that this makes these things deployable into cloud-like architectures, hybrid clouds and cloud. You can target lots of different things. So you've taken that data scientist from somebody who just builds a model and throws it over the wall to someone who can actually, you know, manipulate it as it's going forward, in a couple of steps, right?
Matthew Lodge: Yeah, I mean, the big difference between data science and maybe traditional software is this: traditional software is fully defined by the source code that you have in version control. You've got code and configuration sitting in version control that you run, and everything's fine. But in the data science world you may have a model running in production that's learning all the time and being retrained.
Mike Matchett: It's changing and your data is changing as well.
Matthew Lodge: And so it's a more complex pipeline, a more complex workflow. There are now artifacts being generated by that trained model in production that you want to capture in version control, for example, so that you could really reproduce that model if the regulator... We have customers who do things like make lending decisions using machine learning, and regulators care about things like fairness in lending regulations. So they might want to see how the model made a decision back in June of last year. You must be able to reproduce that model.
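One way to picture what "capturing artifacts" for reproducibility could involve is a small record that ties a trained model to the exact data, parameters, and code version that produced it. This is a sketch under stated assumptions, not a description of any product feature: `fingerprint`, `training_record`, and the field names are hypothetical.

```python
import hashlib
import json


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of some bytes."""
    return hashlib.sha256(data).hexdigest()


def training_record(training_data: bytes, params: dict, code_version: str) -> dict:
    """Describe one training run so it can be identified and repeated later."""
    record = {
        "code_version": code_version,                 # e.g. a version-control commit id
        "data_sha256": fingerprint(training_data),    # pins the exact training data
        "params": params,                             # hyperparameters used
    }
    # Serialize with sorted keys so the same inputs always yield the same id.
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    record["record_id"] = fingerprint(canonical)
    return record


# Made-up inputs, for illustration only.
june_run = training_record(b"applicant rows ...", {"depth": 4, "rate": 0.1}, "abc123")
print(june_run["record_id"])
```

Storing such a record alongside the model in version control gives a later reviewer a way to check that a retrained model really used the same inputs as the one that made the original decision.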
Mike Matchett: Now what I like is that we go, in just a few minutes, from a conversation about how do I install a language on my laptop faster to how do I actually take advantage of all my big data in an environment where I don't want to manage infrastructure, right? And you're bridging that whole thing. So the conversation starts down here, but then quickly you're like, wait a minute, the real value here is that I can now deal with my intelligence as a machine learning model. I can build the model, train the model, deploy the model. My focus is on the model, but I don't even look at the containers. I don't look at the virtual machines underneath that. I don't look at the infrastructure stack going down, down, down, down; that's over here. I can focus on the model all the way through.
Matthew Lodge: Right. Yes. The data scientist is going to focus on that model, and you let the IT folks manage the infrastructure. They're in charge of that pipeline and all those other things.
Mike Matchett: Yeah. So tell me about how some people have really been able to make use of this enterprise-level functionality. What are they able to get to, and how have they been able to see some real benefits from this?
Matthew Lodge: Yeah, lots of different examples. A good one is Citibank, which is a customer and also an investor in the company now through Citi Ventures. They're doing things like anti-money laundering. And they're building these models not just for that but for credit card decisions too. So, you know, if you're approaching the limit of your credit card, should Citi increase your limit or should they decline the transaction if you're about to go over? They're building all of those in Python using things like TensorFlow, XGBoost and these different machine learning frameworks, and then deploying them into production using the data that they have in their data lake. They've got big Hadoop deployments, thousands of nodes, lots of data, so they're able to make use of that data and build these models in Python that, you know, rapidly change over time. And so the data scientists can maintain the models as things change, as they train them and improve the predictive capability, but they can fully integrate them with the big data infrastructure that they run today.
Mike Matchett: Which is a challenge, as we talk to other people who just built a data lake and are not getting use out of it. Part of the reason is not that the data lake doesn't have useful data in it; it's that the other half of this is building analytics that can make use of it and getting them deployed.
Matthew Lodge: And that has traditionally been Java-centric. We've had customers say, well, in the past what we do is we take Python, we translate it into Java or Scala, and then that will be the thing that we deploy, because we know how to deploy Java, although we don't know how to deploy Python. And so, you know, when you need to keep iterating on the model, and the model's learning all the time, that's not viable anymore. You can't do that. So this is new for data science. Look, the Python or R model is a living, breathing thing that needs to run in production.
Mike Matchett: That's great, because I tried learning Scala once myself, and I like, you know, functional programming; I get the idea. But I don't think I could translate machine learning models into it anytime soon. That'd be a big shift. Obviously we can go to the website and find the download for the Python community edition and the Anaconda package. It's all great. I've done it. What should someone do, though, if they're more on the IT side, looking for more information about the enterprise stuff?
Matthew Lodge: So we have a webinar series where we often cover more of the enterprise IT topics. Those are all recorded, so you can watch any of the replays, and we have new ones all the time. That's a really good place to start. And then on the website we also have more information for IT folks.
Mike Matchett: Awesome, awesome. Well, I think that's all we have time for today. I mean, there's a ton more to talk about, because you guys are coming up with new stuff all the time and I know there are some exciting things, but this is a big shift, so I think this is a good thing for people to consider and look at. Thank you for being here today, Matthew.
Matthew Lodge: Great, thank you for having me on.
Mike Matchett: Thanks Anaconda. Thank you guys for watching and I'm sure we're going to have some more coverage of what's going on in the data science world coming up soon. Thanks.
Matthew Lodge: Thank you.