What is Hadoop? Everything You Need to Know

What it comes to data, bigger is better. Almost every business now relies on data in order to run efficiently and there's a constant demand for more detailed, more specific and more accurate information that organizations can evaluate in order to improve their decision-making.

Therefore, big data has become a top IT priority across all sectors. But with data volumes increasing all the time - some estimates suggest global data volumes will hit 163 zettabytes by 2025 - being able to effectively analyze this quickly and cost-effectively will be a top priority. And this is why so many businesses are turning to Hadoop.

Introducing Hadoop

Hadoop is a framework for storing and processing large quantities of data from software firm Apache. First released in 2006 and made widely available to the public in 2011 through the Apache Software Foundation, it's a relatively recent addition to the IT landscape, but one that has quickly become invaluable to many businesses. If you were wondering, it takes its name from a stuffed toy elephant belonging to co-creator Doug Cutting's son - which is still the framework's logo.

A key feature of Hadoop is that it is an open-source platform, supported by a global community of users and developers. With its base code freely available, in theory anyone can pick up the technology and get started, but there are a range of organizations offering commercial Hadoop solutions, which offer the expertise and support most business users will need.

What does it consist of?

Any Hadoop framework can be broken down into four key constituent parts, each of which is responsible for specific tasks with a big data analytics platform, and work together to create a single, powerful tool for managing data.

The storage component of the framework is known as the Hadoop Distributed File System, or HDFS. This aims to cost-effectively store large quantities of data by breaking it into easily manageable chunks. These are then spread over numerous servers, which allows them to be streamed and retrieved very quickly - and also makes storage easily expandable, as capacity, computational power and bandwidth can all be added to simply by provisioning more hardware.

Next, there is MapReduce, which is where the actual processing of data takes place. It works by splitting the data into smaller chunks, with the 'map' part of the process sifting through data to divide it into key-value pairs, which the 'reduce' part then collates, resulting in a much more manageable set of data.

The other two components of Hadoop consist of Hadoop Common, which contains all the utilities, libraries, packages and other files that the framework relies on, and YARN. Standing for Yet Another Resource Negotiator, this is a cluster management tool that offers a central platform for research management across Hadoop infrastructure.

What are the benefits of adopting Hadoop?

One of the biggest reasons for Hadoop's popularity is that, when deployed correctly, it is a highly cost-effective way of conducting extremely powerful analytics operations. Hadoop clusters can be provisioned using commodity hardware, which means that both upfront costs and operational expenditures can be a fraction of other big data technologies. Plus, its open-source nature means there are no expensive licensing issues to worry about.

This cost-effectiveness also makes the technology highly scalable. Because of the way it distributes data sets across many inexpensive servers that work in parallel, it is straightforward to add resources whenever they are needed, so an application that runs across thousands of nodes and thousands of terabytes of data should function and deliver results no differently to smaller-scale operations.

Another effect of this way of handling data is that Hadoop is extremely fast. While not typically suited for genuine real-time processing, it can process terabytes and even petabytes of data in just hours or minutes - something that could take days for tools built on relational databases. As the data is stored and processed on the same servers, this also greatly reduces the time needed to run analytics activities.

Are there issues to be aware of?

While there are many positives associated with Hadoop, the framework is not without its drawbacks, and one of the biggest challenges for organizations is that it is not a platform for beginners. Hadoop is notorious for its complexity and requires a strong level of technical knowledge to unlock its full potential, and this has somewhat limited its adoption - especially among smaller enterprises that may struggle to find programmers with the necessary skills.

Another factor that may hinder Hadoop is that while it has been tailored specifically to handle very large sets, this means it is not an especially efficient solution for handling smaller-scale data needs. This, combined with the previously-mentioned complexity issues, may mean it is difficult for many companies to make the most of Hadoop.

Other issues to be aware of include the potential for security issues to arise. The framework relies heavily on Java, a technology that has become a key target for cybercriminals. This may make it more vulnerable than alternatives and require businesses to focus more closely on manually adding additional security protections. By default, it also does not offer key protections such as full encryption.

None of these issues pose insurmountable obstacles to a successful Hadoop deployment, but they do highlight how Hadoop is not a technology to be taken lightly. Any investment in the framework requires careful planning and a long-term commitment in order to reap the full benefits of what is - in the right hands - a hugely powerful tool for big data analysis.

Tech Insights for Professionals

Insights for Professionals provide free access to the latest thought leadership from global brands. We deliver subscriber value by creating and gathering specialist content for senior professionals.