Wednesday 11 March 2015

How to Avoid Drowning in a Data Lake


So what exactly is a data lake? And how does one avoid drowning in it?
Excellent questions, and ones that on the face of it are relatively easy to answer. According to Gartner:
“The growing hype surrounding data lakes is causing substantial confusion in the information management space, according to Gartner, Inc. Several vendors are marketing data lakes as an essential component to capitalize on Big Data opportunities, but there is little alignment between vendors about what comprises a data lake, or how to get value from it.”*

Gartner sums it up quite well. A data lake is a marketing spin on Big Data. It is another term for basically the same thing (i.e. lots of data), although those talking about lakes would say that a lake is more contained and easier to manage than Big Data. I suppose this could have some element of truth, if it were not for the dollar sign that has just been placed before the term! To sum up: a data lake is another Big Data term, generally used for a smaller chunk of Big Data, but still a lot of it. So data lakes really are Big Data. I hope that has not confused the issue!

Now on to the more pressing issues around Big Data and data lakes. How do you manage them so that you are not overwhelmed with the amount of material at hand?

Over the last two years there has been a lot of hype around Big Data use. Hype that says all of this data can allow you to grow market share, help you build more responsive systems, or even help predict the rate at which a flu virus spreads. But this has been found to be true in only a very few cases.
Generally the hype around all of this data has been just that: hype. The promised land of sales growth or better customer experience has not materialized. Instead, we have a large number of organizations paying a river of cash to a small subset of suppliers for statistics drawn from data that is in essence useless, because it arrives in whatever form the statistical model dictates, with additional data points bolted on. This is not to say that the data itself is useless. Rather, the way we have so far been using it has little or no practical application for the purchaser because of its format.

Remember the old adage, “You can make statistics mean anything you want”? Well, add millions or even billions of data points and then create your statistics, and they will say whatever you like. Set your proof points to be ‘x’ and you will find enough of them to prove it. Whatever pattern you are seeking will appear, because the data universe is so large. This is when you have truly drowned.
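
To see how readily a large data universe obliges, here is a minimal sketch in Python (using NumPy; the sizes and seed are illustrative assumptions only) that hunts for a “pattern” in pure noise and finds one:

    import numpy as np

    np.random.seed(42)
    target = np.random.randn(100)             # the trend we "want" to explain
    candidates = np.random.randn(10000, 100)  # 10,000 unrelated random series

    # Correlate every candidate series with the target; keep the best match.
    corrs = [np.corrcoef(target, c)[0, 1] for c in candidates]
    best = max(corrs, key=abs)
    print("Best 'pattern' found in pure noise: r = %.2f" % best)
    # With this many candidates the winner is typically around r = 0.4:
    # a seemingly meaningful correlation that proves nothing at all.

Search a big enough haystack and you will always find a needle, whether or not anyone dropped one.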
I believe the best way to manage and use this data is to break it into bite-sized chunks. Let us leave lakes well behind and start to talk “puddles.” A data puddle is easy to manage. When was the last time you heard of someone drowning in a puddle?

You have massive amounts of data – yes. But your customers are individuals.
The true power of your data lies in the small puddles about individual customers or users. Those pieces of data that can say, “Last week Jim used 140% of his allocated resources,” and that let us ask, “Why did Jim’s use spike? Does he need an upgrade? Or is it a one-off?”
Now we can go back and look at some historical data on Jim and create a smaller set of information about Jim and perhaps those directly associated with him. Once we have this data, we can start to see whether there is a pattern, and then we can start to harness the information that we have.
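
As a minimal sketch of what a puddle looks like in practice (the numbers and the three-standard-deviation threshold here are illustrative assumptions, not a prescribed method), this is how one might flag Jim’s spike against his own history in Python:

    from statistics import mean, stdev

    usage_history = [82, 95, 88, 91, 79, 85, 90]  # Jim's past weeks, % of allocation
    this_week = 140

    baseline = mean(usage_history)
    spread = stdev(usage_history)

    # Treat anything more than three standard deviations above Jim's own
    # baseline as a spike worth a human question: upgrade, or one-off?
    if this_week > baseline + 3 * spread:
        print("Jim spiked to {}% (baseline {:.0f}%); investigate.".format(
            this_week, baseline))
    else:
        print("Within Jim's normal range.")

A few dozen numbers about one user answer a question that terabytes of aggregate statistics never could.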

Instead of imposing patterns that may not exist, it makes more sense to extrapolate from smaller patterns to larger trends. By analyzing smaller, more relevant sets of data, we’re putting our data to better use and using it to increase our bottom line.

So to sum up: I believe the best way not to drown in a data lake is not to have one. We have masses of data at hand, data that is really important. Data that is Big and data that takes up terabytes of storage. To truly harness this data and not drown in it, we need to start from the basics of what we are looking for and seek it in smaller, simpler sets of data, not take it all and try to fit it to what we, or someone else, wants us to see.

* Gartner Press Release, “Gartner Says Beware of the Data Lake Fallacy,” July 28, 2014, http://www.gartner.com/newsroom/id/2809117

Lock Up Your Data and Throw Away the Key Store


Our passwords, credit cards, and email addresses are under siege daily as cloud store security breach headlines continue to hit the news.
In a lot of these stories, the data in question was encrypted. Not just hashed, but truly encrypted with keys, the presumption being that unless the thief also manages to access the key store, your information is safe.
But safe from whom? From an outside thief, yes: if your keys are themselves secured, then your information should be safe.
However, in my experience, most hacks come from an internal source: an unhappy employee, an ex-employee who was sacked this morning, or an employee with an axe to grind. The disgruntled employee can use inside knowledge to spread a virus, share documents with rivals, or misuse company and personnel data. If the organization is a cloud store or service provider that also holds and owns your encryption keys, then in any one of these cases your information is far from safe.

The recent stories about the sharing of celebrity nude photos and emails have caused individuals and companies to wonder about the security of data stored in the cloud and to ask questions such as: Is the data encrypted at the server and in transit? What level of encryption is used, and how much authentication is performed?

If that employee also has access to the keys to the cloud store, then your data is effectively no longer encrypted. This is not as far-fetched as it may seem; it has been the case in many breaches over the past few years. It is, however, hard to substantiate that statement, as the industry resolutely refuses to talk about breaches “for security reasons!”

And what about those scenarios in which a government or legal authority decides that it needs access to your corporate information? This is not necessarily theft, but it can be unwanted access. Under the US Communications Assistance for Law Enforcement Act (CALEA), a communications provider of any size must allow government agencies access to data. The service providers are not told why the data is needed, only that they must comply.

Government should have the right to do this. In fact, I believe them when they say that having this right has protected us all from many security threats. The question here, though, is one of accountability. If your supplier owns your security, then they are obliged to hand over not just the documents, but also the keys that allow this information to be decrypted, all without your knowledge.

The issue is not that the government has access; the bigger threat is the lack of knowledge about where corporate data is headed. If you owned your security, the government department would come to you directly, giving you the opportunity to pass the information across yourself, with full knowledge and the accountability that goes with it.

In summary, if you pass your security to a third party, and they own and store your encryption keys, then you have lost control of your information. It is imperative that you own your keys and store them separately from your cloud suppliers. If you do not, then your information can be stolen, or subpoenaed, without your knowledge. This in turn could cause you both monetary loss and embarrassment in front of your customers.
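
What does owning your keys look like in practice? Here is a minimal sketch using the Python cryptography package (the commented-out upload call is a hypothetical placeholder, not a real cloud API): encrypt client-side, hand the provider only ciphertext, and keep the key at home.

    from cryptography.fernet import Fernet

    key = Fernet.generate_key()   # generated and stored locally, never uploaded
    cipher = Fernet(key)

    document = b"Board minutes - confidential"
    ciphertext = cipher.encrypt(document)

    # cloud_store.upload("minutes.bin", ciphertext)   # hypothetical call; the
    #                                                 # provider sees ciphertext only
    assert cipher.decrypt(ciphertext) == document     # only the key holder can read it

Under this arrangement, a subpoena served on the provider, or an insider there, yields ciphertext alone; the request for the key has to come to you.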

How to Avoid the Most Common Mistakes of Implementing Big Data Solutions


Ok, we agree that you don’t have a Big Data problem, but you do need to manage and extract value from your ever-growing data sets.

And we agree that unnecessarily complex approaches to dealing with data hinder our ability to create value. We don’t benefit from complicating the solution.
Big Data solutions can leverage enterprise data to build market share, develop more responsive systems, and conduct more effective research, among other things.

So when you find yourself in need of purchasing a “Big Data Solution,” what can you do to minimize the risks and maximize the opportunities afforded by the project?
Here are my top 6 recommendations for avoiding the most common mistakes of implementing Big Data Solutions:
  1. Choose the correct partner – Whatever solution you choose, the technology will outlive the tenure of the team choosing it and the CIO blessing it. For the investment to bear long-term gains, you need to consider the solution that can best adapt to future needs and uses.
  2. Fully plan the project before implementation – It may sound basic but it bears repeating: plan the project in advance. This allows you to identify project owners within the organization and address project risks early. Consider current needs and uses, but also consider future applications. It helps you choose the right partner (see above) and it helps ensure a successful rollout.
  3. Remember: Big Data is NOT just about the analytics – When choosing a Big Data solution, there are considerations beyond the analytics that often get left out of the decision process. It is important to plan for data capture, search, sharing, storage, transfer, and privacy, to name just a few.
  4. Account for live data AND historical data – Know which of your project stakeholders require access to live data and which will need to work with historical data, so you can plan accordingly.
  5. Keep the data accessible – Your data is only valuable to you if you can access it readily and work with it easily.
  6. Remember to secure the data – Yes, you need to access the data easily and readily, but if it is not secure you risk exposing enterprise assets. Is any of your data being stored in the cloud? If so, is the data encrypted, and at what level? Are the keys secured? These are important questions to answer as you plan the project in order to ensure the security of your enterprise information.
Good luck!

Searching For Productivity


People use cloud storage for their work documents because it makes working remotely and sharing documents easier. Period. They generally don’t consider the security of the documents in the cloud store, or if they do, they probably consider the potential for loss to be small and a reasonable risk given the benefits to their workflow.

This is why, as a TechCrunch article recently described, the cloud isn’t going anywhere (“Why the Cloud Isn’t Going Anywhere” by Arsalan Farooq, which asks whether this is the end of the cloud as we know it).
While the article describes a number of solid reasons that the cloud isn’t going away as a work tool, even given the recent highly publicized security breaches, I think the main reason is because it streamlines workflow and makes work life easier. It does this in the same way that finding the exact information you are looking for, at exactly the moment you need it, streamlines your workflow and saves time.

We use search because it is so much more efficient than hunting for a document manually. It stands to reason that applying federated search to the post-cloud work environment, where we have files stored in multiple places, is the next step in increasing work productivity. Federated search is a natural fit for working remotely and from various computers and devices, especially when some of the content is stored in the cloud.

Federated search, or the ability to search multiple sources at once, organically supports workflow in the manner that we work today. If we increase the accuracy of our search, and search within files and emails housed all across our cloud, we get even better results and greater efficiency. Search becomes an even more powerful tool for simplifying workflow, just like using the cloud. Bringing the two together makes the cloud work for us.
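
At its core, federated search is just fanning one query out to several stores and merging the hits. A minimal sketch in Python (the three search functions below stand in for real connectors to local disk, a cloud drive, and mail; all are hypothetical):

    from concurrent.futures import ThreadPoolExecutor

    def search_local(query):  return [("local", "report_q3.docx")]
    def search_cloud(query):  return [("cloud", "report_q3_final.docx")]
    def search_email(query):  return [("email", "Re: Q3 report draft")]

    SOURCES = [search_local, search_cloud, search_email]

    def federated_search(query):
        # Query every source in parallel and flatten the hits into one list.
        with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
            result_lists = pool.map(lambda s: s(query), SOURCES)
        return [hit for hits in result_lists for hit in hits]

    for source, title in federated_search("Q3 report"):
        print("[{}] {}".format(source, title))

One query, one merged answer, no matter where the document lives.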

But we ought to address the privacy issues raised by using the cloud, especially since the cloud isn’t going away. We shouldn’t accept that a risk to our privacy is a small price to pay for increased efficiency. We should pair elevated security with powerful search, so that we increase efficiency and protect our digital assets when working in the cloud. This is why an integrated search and security tool makes sense given the way we work today. Not only is this not the end of the cloud as we know it, it is the beginning of an era of encrypted search that protects our assets and streamlines our workflow.
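
One way encrypted search can work, sketched minimally in Python below, is a toy “blind index” of keyed keyword hashes; this is illustrative only, not a production searchable-encryption design, and the key and filenames are invented:

    import hmac, hashlib

    INDEX_KEY = b"secret-index-key"  # held by the data owner, never the provider

    def token(word):
        # The index stores only keyed hashes of keywords, never the words.
        return hmac.new(INDEX_KEY, word.lower().encode(), hashlib.sha256).hexdigest()

    index = {token(w): "forecast.bin" for w in ["revenue", "forecast", "Q3"]}

    # Searching means tokenizing the query the same way and looking it up.
    print(index.get(token("Forecast")))   # -> forecast.bin
    print(index.get(token("salaries")))   # -> None

The store can answer queries without ever learning what is in the documents, which is exactly the combination of search and security argued for above.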

You Don’t Have a Big Data Problem


The term Big Data gets tossed around a lot these days, and there are so many Big Data solutions for sale to address Big Data problems that we seem to have lost sight of what Big Data really is! So, what is it?
Big Data is a term that refers to the fact that companies are storing more unstructured and unsecured data than ever before. Every single day the data pool grows. The truth, however, is that this does not constitute a Big Data problem. In fact, the volume of data you have is incredibly valuable to you.
What you have is a problem securing and accessing the data.
Unnecessarily complex approaches to dealing with the data, storing the data, managing the data, and securing the data prevent you from quickly and easily accessing it, let alone deriving value from it.
In trying to solve the problem of securing your data, you have actually created obstructions to extracting value from that data and have laid a path to more and more complexity.
We live and work in an increasingly mobile-driven and cloud-based environment where security breaches are daily occurrences. Privacy concerns are warranted and security measures are necessary so that your massive stores of unstructured and unsecured data don’t leave your network, applications and data exposed.
We need a simple, logical approach to securing our growing pools of data. We must fill the critical security gaps in cloud service providers, applications and operations systems, and we must not allow these security gaps to take hold and compromise data.
Yet the challenge of deriving value does not stem solely from the need to secure the data. While it is true that security measures have made it harder to access data, even without critical security measures in place we are not always employing the most efficient search methods available.
We need a simpler approach to searching our data, one that allows us to quickly locate exactly the information we need, no matter where it is stored, what device we are working from, or what security measures are in place.
While I do not believe we have a Big Data problem, we do have problems associated with the growth of our data, and we can take a simpler approach to solving them. In future posts on this blog I intend to explore how we can take a more direct approach to the problems associated with the growth of enterprise data, including security, privacy, and extracting value.
I hope to be the antidote to complexity when it comes to developing solutions for accessing and securing enterprise data. And by keeping the language simple I hope to show how it is possible to maintain a secure environment and have it quickly and easily accessible to those with the correct access privileges.