Reining in Data Sprawl: A Live Demo with Chris Dearden of Komprise

Every year we create more data and delete less of it. Some estimates have the growth rate at 42% per year - effectively doubling every 2 years, resulting in data sprawl that can quickly outpace the budget to manage it. How do we break out of this cycle?

Chris Dearden of Komprise shows us how we can quickly and thoroughly analyze large sets of data and get the insight needed to create and execute an intelligent, durable plan to move and store the data in real time.

Video Excerpts

Providing Visibility

Building a Data Management Plan

Using Deep Analytics

Full Transcript

Lee Razo:

Welcome, Chris. It's good to have you back. We've had a couple of chalk-talk webinar sessions on the Komprise data management solution.

And I wanted to record this follow-up session with you because I think that this would be a valuable resource for people to refer to on a regular basis, where you actually show us how the Komprise software platform works and how it addresses some of the issues.

Maybe you could just give, like, a really quick recap of the high-level points from our sessions, and then show us how the Komprise platform addresses them, and we'll take it from there.

Chris Dearden:

Right. Thanks, Lee. So in the previous couple of sessions, we talked about the initial problem of storage growing and outgrowing the budget to manage and to keep expanding that storage.

And it's growing at really quite a rate, you know.

We're seeing forty percent year on year.

So that's - you know, you're looking at your data doubling every couple of years or so.

And we do that because we're creating more data, we're deleting less of it.

How could you manage this without just always buying brand new storage? How can you break out of that cycle? And we spoke a bit about how you need to analyse that data.

Once you've analysed it, then you could make some decisions about, you know, where to best place that data.

And if you could actually make those decisions actionable and then move the data according to a policy, then you do have a way of saving that data.
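The growth figures quoted above can be checked with a little compound-growth arithmetic. The following is a minimal Python sketch; the function names and the 100 TB starting point are illustrative assumptions, not anything taken from the Komprise platform.

```python
import math


def doubling_time_years(annual_growth_rate: float) -> float:
    """Years needed for a quantity to double at a fixed compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth_rate)


def projected_capacity_tb(current_tb: float, annual_growth_rate: float, years: int) -> float:
    """Capacity after `years` of compound growth, starting from `current_tb`."""
    return current_tb * (1 + annual_growth_rate) ** years


if __name__ == "__main__":
    # 40% is the figure quoted in the session; 42% is the estimate from the introduction.
    for rate in (0.40, 0.42):
        print(f"{rate:.0%} per year: doubles in about {doubling_time_years(rate):.2f} years")
    # Hypothetical estate of 100 TB growing at 40% per year.
    print(f"100 TB after 2 years: {projected_capacity_tb(100, 0.40, 2):.0f} TB")
```

Both rates give a doubling time very close to two years, which is consistent with the "doubling every couple of years" rule of thumb.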

Lee Razo:

That's great, Chris. Can you show us a little bit how that actually works in real time?

Chris Dearden:

Absolutely. I'll take you through the platform and how we do what we do, and a little bit about why we do it as well.

So I'm actually going to start a little bit behind the scenes as to what you actually need to deploy Komprise.

The screen you're looking at, at the moment is our director.

And this is the control plane.

This is where all of the information is surfaced, but it's purely just the control plane that sits in the cloud.

What you're going to need to do is to deploy some virtual machines, some appliances which are going to sit close to your storage and your data centre.

And they do all the hard work, and we call those the Observers.

They're very straightforward to deploy from a virtual appliance file.

Just download it.

Put it into your hypervisor of choice, and then give it some network information.

They'll connect to the director and essentially carry out any of the instructions that we issue from the interface. Once we've got those set up, we need to add our source storage. And Komprise can talk to pretty much anything that is serving out SMB or NFS, so anything that you can be putting files on.

Out of the box, we are set up to talk to most of the common platforms that you might see, whether that's various types of Windows server or some of the more dedicated file servers - file platforms like NetApp or Dell EMC [inaudible 00:03:23].

Once we've added that storage in, we discover and look for the shares that we're interested in.

Once we've chosen the shares that we'd like to look at, we will enable those. Once that share's been enabled, it's going to send the message to the Observer.

And the Observer is going to start to crawl the file share.

So it's going to work its way through all of the files in the directories, and we're going to be looking specifically for that file system metadata.

So we don't need to open the files up.

We're just looking to find out what the file system knows about them.

How big is the file? Who owns it? Most importantly, when was it last accessed? And we're going to collect that data together and we're going to aggregate it and summarise it at a share level, and that's going to give us this picture of how hot or indeed cold is our data. And we represent that in the plan screen.
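The crawl described above reads only what the file system already knows about each file. As a rough illustration of the idea, and not of how the Observer is actually implemented, the Python sketch below walks a directory tree, reads size and last-access time without opening any file, and totals bytes per age bucket. The bucket boundaries and function names are assumptions made for the example.

```python
import os
import stat
import time
from collections import defaultdict

DAY = 86400  # seconds in a day

# Hypothetical age buckets as (label, upper bound in days); not Komprise's actual thresholds.
BUCKETS = [("< 3 months", 90), ("3-12 months", 365), ("1-3 years", 3 * 365)]
OLDEST = "> 3 years"


def bucket_for(age_days):
    """Return the label of the first bucket whose upper bound exceeds the age."""
    for label, limit in BUCKETS:
        if age_days < limit:
            return label
    return OLDEST


def summarise_share(root, now=None):
    """Total bytes per last-access age bucket under `root`, using metadata only."""
    now = time.time() if now is None else now
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # metadata lookup; the file is never opened
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # ignore symlinks, sockets and other special files
            age_days = (now - st.st_atime) / DAY
            totals[bucket_for(age_days)] += st.st_size
    return dict(totals)
```

Calling `summarise_share("/path/to/share")` returns a dictionary of bytes per bucket, which is the kind of summary a hot-versus-cold chart can be drawn from. Note that last-access times are only as reliable as the file system's access-time settings.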

Lee Razo:

So this actually will work with your on-prem and your cloud data storage.

Chris Dearden:

Yes. If it's accessible. We do have a platform which is designed specifically to look at object storage.

That does a very similar thing. But the Observer is going to crawl through the object storage bucket and look at the metadata around the objects contained in it.

So once we've collected all that data together, we're going to summarise it on this screen here.

And we summarise it in the graph that we refer to internally as the "data doughnut".

And this just really shows us how hot, you know, the percentage volume of data across the entire estate, across all of those enabled shares from different platforms, how much of it is hot and how much of it is cold.

Our standard setup is that anything that is blue hasn't been touched for a year or more. Now you can see in this particular little demo environment that just over half of my data hasn't been touched for over a year.

And that's quite conservative compared to what we would see in the field, where frequently 60 to 70% of data hasn't been touched in a year or more.

And if it's not been used, then perhaps there's somewhere better that you can put it.

Taking, you know, a little bit more into the analysis and going just that little bit deeper, we can start to break down some of that data by the other bits of metadata that we've collected.

So we can look at the breakdown of files by type.

Have we got files that perhaps are in the wrong place? Have you got backups being kept in team drives? Have you got database files being kept in personal drives? The wrong type of data in the wrong place can be very expensive.

And if you're storing data that really should be sitting in tier two on tier one storage, then that's very expensive because not only have you got to provide the capacity for it, but you've got to be able to, you know, replicate and back that data up. So, you know, a small amount of data at the front end can actually require a fairly significant investment throughout the storage cycle.

Lee Razo:

Yeah, like one of the topics that you'd mentioned in your earlier sessions about ROT data - redundant, obsolete, trivial - and how we ended up having to buy storage at today's prices to store data that we may use years from now or maybe we used it years ago. And this is a really great way to get some insight on that.

Chris Dearden:

Absolutely, Lee. So it's a great way to start to uncover where that ROT data is sitting.

Another great example of the obsolete data is data from people who are no longer with the company.

Looking at the space by the top owners can really highlight that very quickly. It's very common in academic circles, where you might well have data that belongs to students who graduated five years ago.

That data is probably not really the sort of data you want sitting on your tier one storage.

So we can identify that very quickly.

Lee Razo:

Yeah. It's a really great overview.

Chris Dearden:

So taking some of these data, we can start to build up a little bit of a plan about what we want to do with it.

And in Komprise, our data management plan consists of a number of groups.

And within each group, it's going to be one or more file shares that will have the same policy applied to them.

So in that policy, we can do three things.

We could move data.

So I will move it to a target.

So the target is going to be another file platform or an object platform that could be on-prem, could be in the cloud.

And when we move that file to that target, we're going to replace it with a dynamic link.

So as far as the end user is concerned, the file appears to be still there.

There's still a file with that name and that icon.

But when they open it, or if they do decide to open it, bearing in mind it's not going to be moved unless it was over a year old.

What if someone does decide to double-click on that file? It will open and it will open seamlessly without having to deploy any agents.

So the data is still there, it's still able to be used.

It's just not being used at the moment, so we're putting it somewhere cheaper.
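To make the move-and-link idea concrete, here is a minimal Python sketch that assumes a plain POSIX symbolic link stands in for the dynamic link. It is not Komprise's mechanism; the function name, the one-year default, and the flat target directory are all assumptions for illustration.

```python
import os
import shutil
import time

DAY = 86400  # seconds in a day


def tier_file(path, target_dir, min_age_days=365, now=None):
    """Move `path` into `target_dir` if it has not been accessed for
    `min_age_days`, leaving a symbolic link at the original location.

    Returns True if the file was moved, False otherwise.
    """
    now = time.time() if now is None else now
    if os.path.islink(path):
        return False  # already replaced by a link on an earlier run
    st = os.stat(path)
    if now - st.st_atime < min_age_days * DAY:
        return False  # still warm; leave it where it is
    os.makedirs(target_dir, exist_ok=True)
    # Simplification: a flat target directory, so name collisions are not handled.
    dest = os.path.join(target_dir, os.path.basename(path))
    shutil.move(path, dest)
    os.symlink(os.path.abspath(dest), path)  # the original path still opens the data
    return True
```

After a call such as `tier_file("/share/scan.dat", "/archive")`, anything that opens the original path is transparently redirected to the moved copy, which is the behaviour the end user sees in the description above.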

Lee Razo:

Yeah. I think, actually, you just answered my next question. So there are - indeed no agents are needed for this. This is actually something that works in the background?

Chris Dearden:

That's right.

Because we use, sort of, file system-native constructs, so there's nothing proprietary.

We're using symbolic links.

We use some specific features of some of our partner file systems, like the NetApp, to do this so that an end user, you know, on their desktop or from their VDI session, they need to open up a file.

Or maybe even an application needs to access a file.

Look at medical systems, pathology systems that create huge volumes of data.

You know, terabytes of data.

And that can just be for a small hospital.

And that data is extremely important for a relatively short period of time, and it needs to be accessed very quickly for that time.

But, ultimately, you can't afford to have the capacity of the performance tier to store everything, so you need to find a way of being able to move that data somewhere a little bit cheaper after a year - a period of time, but still be able to access it.

When a doctor needs to get that data, he's got to find the patient's records and he's going to want to open that scan that he did last year.

He's not going to want to wait for someone to go and get the data back from a tape or even, as I saw in one particular customer environment, for someone to go and switch the archiving server back on, which is the ultimate offline way of keeping the data.

Lee Razo:

Yeah, no, that's really great.

And I really - I think we can't overstate the importance of, like you were saying earlier, no agents, nothing proprietary.

Also, you know, if a customer stops using Komprise, they still have access to their data just the same.

Chris Dearden:

Absolutely. Absolutely. And it's not just moving the data that we can do with Komprise.

We can actually take the decision that, you know what, this data is so old that, ultimately, we just want to get rid of it. It's already been backed up.

It's probably sitting maybe on some year-end tapes that are happily in a tape vault somewhere.

We can actually mark the data for deletion.

Now Komprise is not...

It's a very, very risk-averse company when it comes to data loss.

So we don't actually delete any data. But what we do is we will move it into a hidden folder.

So as far as the end user's concerned, that data's gone.

The storage admin would then go in and clear and purge that trash folder eventually once they're satisfied that the data really isn't needed.

And that's what we call our "confine policy".

So, again, we could set the policy to confine data that might be, in this case, over three years old.

So that last section of data, that's the data we just want to delete.

We don't need it.

And as we play around with these policies, the purple section inside the doughnut will change, and that will reflect how much data we could be moving to the target.

And as well as that moving, we like to help our customers build up the business case because doing this ultimately is to save some money.

Because you either don't have the budget, or you'd like to spend some of the budget that perhaps previously would have been on the automatic storage refreshes on something more interesting, something that's actually going to bring a greater business value than just providing file storage.

And we do that with our cost model.

Lee Razo:

Just a question also. If I have some of this data, especially the archive data in a public cloud, how does this work with things like egress charges? You know, this is notoriously expensive to read back data, or you know, to download it out of the cloud. How do you access that?

Chris Dearden:

Absolutely. So when you're pulling data back, you know, egress is always going to be a consideration.

That's why we want to be careful with the amount of data that we put into the cloud.

Putting too much data into a cloud platform can actually cause...

I mean, not as many problems, but it can still - it can cause the same sort of level of problems as you might have by putting not enough and having to expand your on-prem storage.

Because you can get the surprise bills because you suddenly archive too much data, and your users are now trying to pull it all back.

And your cloud provider is out buying themselves a brand new car because there can be a big shock.

So we model it, and we know, generally speaking, once data's gone to archive from a certain age, how much of it is going to be recalled. And it's surprisingly little.

You know, we're looking at a single-digit percentage of the data that goes off ever coming back.

And we know approximately what the cost is.

And all of these fields are designed to be filled in by our customers and partners.

So this is running with their data.

In fact, our cloud platform, we pull this data in dynamically simply because the cloud providers have a costing API that, you know, they're very transparent about their costs.

I've yet to find an API for a data centre storage provider that's quite so transparent.

But when you're faced with the upgrade quote to buy another couple of shelves of your preferred storage, then you suddenly get a very acute - you become very acutely aware of how much your storage is costing you per terabyte per year.
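The trade-off between archive savings and recall charges can be expressed as simple arithmetic. The Python sketch below is a deliberately simplified model under stated assumptions: every field name and every number is hypothetical, and it is not the cost model built into the product. It only shows why a low recall rate keeps egress from eroding the saving.

```python
from dataclasses import dataclass


@dataclass
class CostInputs:
    """Hypothetical cost-model inputs; real figures come from the customer or provider."""
    primary_cost_per_tb_year: float   # fully loaded cost of tier-one storage
    archive_cost_per_tb_year: float   # cost of the cheaper target tier
    egress_cost_per_tb: float         # charge for reading data back out
    recall_fraction_per_year: float   # share of archived data recalled each year


def annual_saving(cold_tb: float, c: CostInputs) -> float:
    """Yearly saving from moving `cold_tb` off primary storage, net of recall charges."""
    cost_if_left = cold_tb * c.primary_cost_per_tb_year
    archive_cost = cold_tb * c.archive_cost_per_tb_year
    egress_cost = cold_tb * c.recall_fraction_per_year * c.egress_cost_per_tb
    return cost_if_left - (archive_cost + egress_cost)


if __name__ == "__main__":
    # Illustrative numbers only: 50 TB of cold data with a 5% yearly recall rate.
    inputs = CostInputs(1000.0, 100.0, 90.0, 0.05)
    print(f"Estimated annual saving: {annual_saving(50, inputs):,.0f}")
```

Raising `recall_fraction_per_year` in this sketch shows how quickly the saving shrinks if too much warm data is archived and users start pulling it back.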

Lee Razo:

That's right. And it's sticker shock, right? Yeah.

Chris Dearden:

Absolutely. Absolutely. And whilst a storage vendor is not going to suddenly give you your money back because you're using slightly less of the storage that you've already paid for.

It's around that cost avoidance: maybe you don't have to expand it every 18 months and buy three years' worth of storage at today's prices, you know, when you know that storage prices are going to be dropping over time.

Lee Razo:

Yeah. Yeah. That makes a lot of sense. So this is great.

Chris Dearden:

So once you've got all this set up, you would activate your plan.

And every seven days, Komprise is going to run the policies.

That's just a default figure.

You can make it more or less frequent if you like.

But, remember, we're dealing with cold data.

This data's not really going anywhere.

So it doesn't matter if it takes us a little bit of time to move it.

Lee Razo:

Yeah. It's been there three years in this case.

Chris Dearden:

Exactly. And this archiving-type operation, this isn't a once-only thing.

This is not just a spring clean and forget about it.

It's a continual thing.

So, you know, every week you're going to have data that's kind of aged out.

Once that data's aged out, we'll move it down to that tier.

So once you're set up, it's pretty set-and-forget.

Lee Razo:

Yeah. Yeah, I mean, this is really a big challenge for most companies I've ever come in contact with or worked with, you know.

Not only just managing the amount of the data, but the - like you said, the kind of data that's in there.

The complexity, but also, you know, those - I think you mentioned P90X videos and stuff like that that's hiding in there.

So this is really a great overview.

Chris Dearden:

Absolutely. So talking about, sort of, the life cycle of a file server. Now the file server, it's born, it gets provisioned or maybe it gets purchased if it's physical hardware, and it's...

Day one, it's really happy.

It's got loads of capacity, loads of performance, it's really up to date.

And it gets filled up.

And because it's brand new, people just use it as a ground to put anything on, you know.

There's some temporary swing space needed for projects.

They're going to throw it into that server.

And over time the server gets full and it starts to slow down a little bit and maybe its support contract starts to come towards the end of its life.

And you decide that you want to free up some capacity, so you move the cold data from that server to another tier of storage.

Now that server reaches the end of its life. What do you want to do? With a lot of platforms, and with, even - if that was, sort of - had some native tiering, you know, there's plenty of storage providers that will automatically expand their cold data into the cloud.

Problem is, is when you want to buy a new one because how do you just move the hot data without having to pull data back? And that's something that Komprise is extremely good at. We've got our own migration engine built in.

And what this engine does is it will share-by-share allow us to move data between an old source and a new destination.

So we buy ourselves a brand new server and it's just going to move that hot data.

And once we've moved all the hot data, we'll update the symbolic links that we've created to point to the target.

So we do it without having to rehydrate any of the cold data.

Which means that migration is done quicker.

It's under more control.

We use an iterative technique, which allows us to start the migration way before the planned cut-over time.

So one of the big traditional problems with migrations is that you have a necessary window to do that cut-over in.

And there's this big rush to try and get everything done within that window.

Whereas with Komprise, we can start the migration weeks before.

And we get down to the stage where, actually, we know we're only copying one day's worth of data.

And once we get to that stage, we know exactly how long our cut-over's going to take because it's just that iteration of these are the files that have changed or were open during the last 24 hours, and that's just what we're going to be copying.

So this is a lot of control and the ability to reduce the potential outage windows caused by migration because they can be pretty significant.
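The iterative approach can be pictured as repeated passes that each copy only what changed since the previous pass, so the final pass during cut-over is small and predictable. The Python sketch below illustrates that general technique using size and modification time as the change test. It is not Komprise's migration engine, and it leaves out deletions, permissions, and the handling of links to already-moved cold data.

```python
import os
import shutil


def sync_pass(src_root, dst_root):
    """Copy files that are missing or changed at the destination.

    Returns the sorted list of relative paths copied in this pass.
    """
    copied = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel_dir = os.path.relpath(dirpath, src_root)
        dst_dir = os.path.normpath(os.path.join(dst_root, rel_dir))
        os.makedirs(dst_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_dir, name)
            s = os.stat(src)
            try:
                d = os.stat(dst)
                # Treat the file as unchanged if size and whole-second mtime match.
                unchanged = (d.st_size == s.st_size
                             and int(d.st_mtime) == int(s.st_mtime))
            except FileNotFoundError:
                unchanged = False
            if not unchanged:
                shutil.copy2(src, dst)  # copy2 also preserves the modification time
                copied.append(os.path.normpath(os.path.join(rel_dir, name)))
    return sorted(copied)
```

Running `sync_pass` repeatedly in the weeks before cut-over means each pass returns a shorter list, and the length of the last few passes gives a realistic estimate of how long the final one will take.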

Lee Razo:

Yeah. Basically, it's serving as an abstraction layer, then, that's independent from the physical location.

So it gives a lot of flexibility here then, it looks like.

Chris Dearden:

Yeah. It just gives us the, you know - a lot of the migration techniques can be done with a lot of scripting and what I would refer to as, sort of, the Excel sheet of pain, trying to manage this migration of how far did we get the last time.

Or this migration step got 99% of the way through, then the network crashed, so we've got to start again from scratch.

And that's quite an unpleasant thing to do. Whereas, managing it with Komprise, you know, it's all part of the platform.

This is an included feature in the product.

You don't have to buy extra modules to be able to do it.

So, you know, we think it really adds a lot of value.

And helps complete that life cycle, you know, of whether you're consolidating multiple servers.

I'm working with a customer at the moment who's been - you know, they've got maybe 10, 15 Windows file servers that are attached to a SAN, but it's probably about old enough to vote.

You know, they need to get the data off there and they want to consolidate that down to maybe to one file server.

They then use Komprise to move the data off first. And the little data that's left that's actually active and hot, we can move that straight away into a brand new server.

Lee Razo:

Yeah. Fantastic.

Chris Dearden:

The last thing I wanted to touch on in the demo, you mentioned it earlier.

This was something we spoke about on the very first chalk talk, was around deep analytics.

And really seeing what makes up that ROT data.

The data doughnut graphs that we went into earlier, they're great for seeing the high-level picture.

And you can, you know, really see what the general feeling across your environment is, with a little bit specifics on the details.

But if you actually want to see what files are making up the data that we need to move out, this is where deep analytics comes in.

So instead of just, sort of, crawling and summarising data, it builds up an index and you end up with this - essentially a data lake of all of your file system metadata. And we can query that.

We can query it pretty much like you'd be choosing something off of Amazon.

By all of the various different types of metadata that we collect, you can build up queries and see them in these results, and actually see the contents of the file - sorry, see the contents of the metadata.

We're not seeing into the file. This is just what we know about it.

You know, here's its actual path. This is exactly what the file name was. How big it was, who owned it. And we can extract this out to CSVs either for use in reporting.

And perhaps in the future, we'll be able to use some of these complex queries to feed back into that data management plan, so then to be able to have - rather than just selecting things at a share level, and maybe with some slight fine-tuning, to be able to have an extremely granular data management plan.

Saying, "Actually, all of my engineering drawings, I want to put somewhere.

You know, I want to treat my drawings different to how I want to treat the rest of my data.

In fact, if it's not an engineering drawing and it's over seven years old, get rid of it.

If it's an engineering drawing, then we want to keep it."

Lee Razo:

I can see on the left here, you have even a place to put custom file extensions, so there's a very specific non-standard type of file that's specific to your company.

You can actually report on those as well.

Chris Dearden:

That's right, that's right. We also have the ability to add tags into the index.

So we're not modifying any of the source data.

But we can add additional metadata into the index, into the information we know about the file.

And those tags are a key-value pair.

So you could, for example, have a tag - let's say you're an engineering company that builds projects around the world, so you might well have a tag with the location or a tag with the type of project to say it was a bridge in Dubai.

And you query for everything that was in that original project folder and you tag it "bridge", "location", "Dubai", "project type bridge".

And that means that someone can come along at another date and go, "I need to find all the drawings that relate to bridges." And then you have the project that will do that.

They'll be able to bring up a list of, "These are all of our drawings that we've ever done in the last 10 years that related to bridges." And that in itself is taking something which is very much a data centre kind of infrastructure level and elevating that up into something which is adding value to the business itself.

Lee Razo:

Yeah. So can you actually set the tags from here?

Chris Dearden:

You can. So you can set the tags based on the query.

Lee Razo:

Okay.

Chris Dearden:

So if I was to look for a certain query or files in certain places, with maybe a certain name type or certain extension, I'll build that query up and then I run that through the tagging engine.

So, yup, anything that comes out of that query, we need to tag.
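The query-then-tag workflow can be sketched with a small in-memory index. In the Python example below, the record fields, function names, and sample paths are all hypothetical and say nothing about how the Komprise index or its API is actually structured. The point is only that tags are key-value pairs stored alongside the metadata, while the source files stay untouched.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """One index entry: file system metadata plus user-defined key-value tags."""
    path: str
    size: int
    owner: str
    tags: dict = field(default_factory=dict)


def query(index, predicate):
    """Return every record in the index for which the predicate is true."""
    return [record for record in index if predicate(record)]


def tag_results(records, **tags):
    """Attach key-value tags to each record; only the index is modified."""
    for record in records:
        record.tags.update(tags)
    return len(records)


def find_by_tag(index, key, value):
    """Return every record carrying the given tag value."""
    return query(index, lambda record: record.tags.get(key) == value)


if __name__ == "__main__":
    index = [
        FileRecord("/projects/dubai-bridge/deck.dwg", 4_000_000, "alice"),
        FileRecord("/projects/dubai-bridge/piers.dwg", 2_500_000, "bob"),
        FileRecord("/home/carol/holiday.mp4", 900_000_000, "carol"),
    ]
    hits = query(index, lambda r: r.path.startswith("/projects/dubai-bridge/"))
    tag_results(hits, project_type="bridge", location="Dubai")
    print([r.path for r in find_by_tag(index, "project_type", "bridge")])
```

Someone searching later for `project_type == "bridge"` gets the drawings back regardless of which folder or share they now live in.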

Lee Razo:

Great. Yeah.

Chris Dearden:

There's also an API available, so we can start to look to do this perhaps in a more programmatic fashion to be able to do that tagging.

Lee Razo:

Oh, yeah. That'll be really big for a lot of companies, so that's great.

Chris Dearden:

Actually, I think, you know, it opens up a number of options. Imagine...

Now there are - one of the tools I spoke about in the first section is analytics tools that look inside the files.

Now it's not something that Komprise aims to do, but there are tools to do that, and maybe they can tag files that are sensitive.

And if you could actually take an output from a tool like that and then use it to update our index data, then that's extremely powerful because then you might be able to look for sensitive data and potentially treat sensitive data differently.

Lee Razo:

Yeah. That's great.

Chris Dearden:

It's a really, really powerful part of the product.

It's quite understated at the moment, but it's certainly somewhere to look for where we're moving forward in the future.

Lee Razo:

Yeah, especially here in Europe, where I'm based, we have GDPR.

Of course that affects everybody.

So I'd imagine there's a whole lot of applications that are related to that as well.

So this has been really great, Chris.

It's good to actually see it.

You know, we watch lots of webinars.

You know, we've heard all kinds of presentations and things like that, but every now and then I just really want to see what we're talking about.

And, strangely enough, I have a very heavily populated NAS at home with all kinds of garbage and I was thinking, is there a consumer license for this sort of thing? So that it'll help me get my life sorted.

Chris Dearden:

Yeah, I think there's definitely scope to help people clear up all their junk.

Lee Razo:

Absolutely. Very well. Great. Thanks, and look forward to doing more of these types of sessions with you, Chris, so thanks for taking the time and showing us what Komprise can do for us.

Chris Dearden:

No problem. Thanks, Lee, and take care.

Chris Dearden

Sr. Systems Engineer, Komprise

Lee Razo

CEO & Chief Technologist, CloudNativeX