The Biggest Issues with Processing Data at Scale (and how to fix them)


Data is everywhere. It can reveal a great deal about your business and help you make well-informed decisions. But processing data at scale presents some challenges. So how do you resolve them?

Data is everywhere. It’s created and stored with virtually every action we take in the online world – and even those who don’t own mobile phones or Twitter accounts can’t avoid generating it through association with those of us who do. The latest Domo report on data usage estimates that Americans use more than three million gigabytes of internet data every minute – and this figure is only going to rise as time goes on.

This presents significant opportunities for forward-thinking businesses. It allows them to model the behavior of customers, develop new products, and improve existing ones. But processing data at scale also poses some big challenges.

1. Storing data

The ability to store large amounts of data cheaply and easily is one of the driving factors of the Big Data revolution. The cost of storage capacity has fallen steadily for decades (a trend often likened to Moore's Law); the problem thus concerns how we’re storing data rather than where we’re doing it. Transitioning from a hierarchical storage approach to an object-driven one will make a large data store more navigable: just think of the music library on Spotify with its searchable tags.
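To make the contrast concrete, here is a minimal, purely illustrative sketch – invented names, an in-memory dictionary, not any real object-store API – of how tag-based object storage sidesteps the rigid-hierarchy problem:

```python
# Illustrative sketch only: an in-memory stand-in for tag-based object
# storage. All class and method names here are invented for this example.

class ObjectStore:
    def __init__(self):
        self._objects = {}  # object_id -> (data, tags)

    def put(self, object_id, data, **tags):
        """Store a blob alongside free-form key/value metadata tags."""
        self._objects[object_id] = (data, tags)

    def find(self, **query):
        """Return ids of objects whose tags match every key/value in query."""
        return [
            oid for oid, (_, tags) in self._objects.items()
            if all(tags.get(k) == v for k, v in query.items())
        ]

store = ObjectStore()
store.put("track-001", b"...", artist="Radiohead", genre="rock", year=1997)
store.put("track-002", b"...", artist="Aphex Twin", genre="electronic", year=1996)

# Any tag combination can be queried -- no single folder hierarchy
# (artist/genre/year versus year/genre/artist) has to be chosen up front.
print(store.find(genre="rock"))  # ['track-001']
```

Because every object carries its own metadata, there is no need to commit in advance to one directory layout; new query patterns cost nothing, which is exactly what makes a Spotify-style searchable library navigable at scale.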

2. Reconciling data

One of the defining characteristics of the modern data-driven age is the sheer variety of formats in which that data arrives. Information might be hidden inside documents, spreadsheets and databases; it might take the form of images, words and numbers; it might be gleaned from viewing behavior on Netflix, purchases on Amazon, footage uploaded to Instagram and reviews posted on Goodreads.

For this data to be actionable, it needs to be combined into a coherent and meaningful whole, so that human decision-makers can assess it (ideally at a glance). In the future, datasets will assuredly become still larger and even more diverse, and this problem will become ever more pressing.

This challenge will, in all likelihood, spell the end of traditional data management and drive the continued rise of more scalable platforms – those that incorporate open-source ecosystems like Hadoop and leverage data lakes and warehouses where appropriate. It’s only through technical innovation of this sort that such vast quantities of data can be consolidated at the required pace.
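As a toy illustration of the reconciliation step – the sources, field names, and values below are all invented, not any particular platform's API – records arriving in two different shapes can be mapped onto one common schema before analysis:

```python
# Hedged sketch: normalising records from heterogeneous sources into one
# common schema. Sources and field names are hypothetical.
import csv
import io
import json

common = []  # every record normalised to {"user", "item", "action"}

# Source 1: JSON events from a (hypothetical) streaming service
json_events = '[{"member": "alice", "title": "Stranger Things", "event": "watched"}]'
for e in json.loads(json_events):
    common.append({"user": e["member"], "item": e["title"], "action": e["event"]})

# Source 2: CSV export from a (hypothetical) web shop
csv_rows = "customer,product,activity\nbob,toaster,purchased\n"
for r in csv.DictReader(io.StringIO(csv_rows)):
    common.append({"user": r["customer"], "item": r["product"], "action": r["activity"]})

# Both records now share one vocabulary and can be analysed together.
print(common)
```

The hard part at scale is not the mapping itself but agreeing on the target schema and keeping hundreds of such mappings current – which is precisely the workload that data lakes and platforms like Hadoop are built to absorb.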

3. Overcoming machine-learned biases

One of the most exciting branches of data-driven machine learning is predictive artificial intelligence. By feeding a machine millions of labeled pictures, we can improve its ability to recognize patterns. So, an artificial intelligence might be devised to distinguish a bee from a three, as explained in this video by the inimitable CGP Grey.

The same machine-learning algorithms might be used to predict how predisposed a person might be to vote a certain way, buy a certain product, or even commit a crime.

But these algorithms are not infallible. A concerning demonstration of this comes from a real-world machine-learning experiment, which trained an algorithm to distinguish pictures of wolves from pictures of huskies. It learned to do this with an apparently high degree of accuracy. It was then discovered that the machine wasn’t just looking at the animals themselves, but at their surroundings. Wolves are more likely to be photographed in snow, and thus pictures of animals in snow are more likely to be erroneously judged as wolves.

What does this imply? That for AI-driven data analysis to be useful in the real world, it needs to be fed the right quality of data – data that’s varied and representative, as well as free from duplication and logical conflicts. There’s no point, after all, in accumulating masses of data if that data is misleading.

4. Privacy and security

Collecting masses of personal data presents considerable privacy and security concerns. These problems have attracted the attention of lawmakers, most notably those responsible for the EU’s significant GDPR legislation, under which Google was recently fined a whopping $57 million. Given a string of high-profile cyberattacks on targets as diverse as Sony’s PlayStation Network and Ashley Madison, companies that collect customer data have an obligation to safeguard that information and not let it reach outside parties (deliberately or otherwise).

Privacy and security concerns will likely be addressed through increasingly sophisticated encryption and storage technologies – including, perhaps, blockchain, which has in recent years enjoyed widespread attention outside the cryptocurrency world. Since attackers will undoubtedly develop more sophisticated methods in the years and decades to come, those given custody over large amounts of data will need to be proactive when it comes to protecting it.
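As one small, concrete example of proactive protection – a sketch using only Python's standard library, with an illustrative key and record – direct identifiers can be pseudonymised with a keyed hash before the data is stored or shared, so a leak of the dataset alone does not expose who is in it:

```python
# Illustrative sketch: pseudonymising a direct identifier with HMAC-SHA256.
# The key and record are made up; in production the key would be kept in a
# secrets manager, never alongside the data it protects.
import hashlib
import hmac

SECRET_KEY = b"example-key-kept-out-of-the-dataset"

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier with a stable, keyed pseudonym."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()

record = {"email": "jane@example.com", "purchases": 12}
safe_record = {"user_id": pseudonymise(record["email"]),
               "purchases": record["purchases"]}

# The same input always yields the same pseudonym, so records can still be
# joined across datasets -- but without the key the email is not recoverable.
print(safe_record["user_id"][:16])
```

Keyed hashing is only one layer, of course – encryption at rest and in transit, access controls, and audit logging sit alongside it – but it illustrates the principle of minimising what an attacker gains from any single breach.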

This will, of course, require the right technical expertise. And this is the fundamental problem behind each and every data-processing issue we’ve talked about.

5. Lack of talent

The explosion in big data has had considerable implications for engineers, data scientists and analysts, whose pay packets have skyrocketed in recent years. Modern businesses, inundated with more data than they know how to make use of, have never been more desperate for professional help. Last year, the world’s largest survey of IT leaders, the Harvey Nash/KPMG CIO Survey, noted that:

“For the fourth year running ‘big data and analytics’ expertise remains the number one skill in short supply, with almost half (46 per cent) of respondents placing it first on their list.”

Several solutions present themselves.

  1. Firms might divert more budgetary resources into recruiting and retaining data-savvy professionals, or acquire other firms where those skills are already present.
  2. They might develop training programs through which existing staff might gain the skills necessary to deal with the age of big data.
  3. Or they might automate the process so that the average worker can realize the company’s big-data goals without the aid of specialist help.

While the last of these three options might appear the most fanciful in the short term, in the long term it’s a near-certainty that new programs and algorithms will play a significant role in data analysis – just as GUIs, mice and keyboards once put computing in the hands of non-specialists.

Ultimately, however, these programs and algorithms will require talented human beings to develop and use them – and thus investment in the right skills is surely a requirement for any company that wishes to survive the data-driven future.

Insights for Professionals provide free access to the latest thought leadership from global brands. We deliver subscriber value by creating and gathering specialist content for senior professionals. To view more IT content, click here.
