The amount of data held by businesses is exploding. As firms operate in increasingly digital environments and draw on far more sources of information, the storage solutions used to manage this wealth of data face greater demand than ever.
According to IDC, for example, the total amount of data held worldwide will reach 175 zettabytes (ZB) by 2025. While around half of this will be stored in public cloud environments, a large amount will still be stored in the core of the network.
A figure of 175 ZB is hard to visualize, but if it were all stored on Blu-ray discs, the stack would be tall enough to reach the moon not just once, but 23 times.
The need for storage solutions in the coming years will therefore be huge. However, much of the data businesses hold is likely to be redundant, as it’ll be stored in multiple locations throughout the company, taking up valuable space. Removing duplicated data can reduce costs and improve productivity, especially in areas such as backup storage, where keeping data volumes to a minimum is vital for cost-effectiveness.
The problems caused by data duplication
Duplicated data can lead to a range of issues for organizations. As well as adding to the cost of the hardware needed to store it, duplicated data can cause confusion and inefficiency if people are working from different records. If amendments and updates are only made to one set of records, individuals can quickly end up working from inaccurate information.
This can affect everything from forecasting to customer satisfaction, and lead to poor decision-making and wasted effort. According to Kissmetrics, poor-quality data can cost firms up to 20% of their revenue, and duplication is one of the most common causes of this.
5 key factors to ensure your deduplication efforts are successful
Prevention is better than cure, and the best way to avoid this issue is to ensure data is accurate and not duplicated when it's created. Even so, there’ll still be existing records that need to be cleansed. There are many data deduplication tools you can use to achieve this, but there are a few key factors you should keep in mind when setting out to clean your data.
1. Choose the most suitable method
There are several types of deduplication, and they all work by looking for repeated patterns within chunks of data. Where they differ is in how they go about this. For instance, file-level deduplication compares entire files and is typically the cheapest and fastest option, but it can be less efficient, as any change to a file, however small, results in a whole new copy being stored. Block-level and byte-level tools, on the other hand, work at the sub-file level and can spot duplicated data more accurately. However, this approach is more time- and resource-intensive.
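To make the difference between file-level and block-level deduplication concrete, here is a minimal Python sketch. It assumes fixed-size 4 KB blocks and SHA-256 fingerprints, which are choices made for this illustration (commercial tools often use variable-size chunks), and the function names are hypothetical.

```python
import hashlib


def file_level_stored_bytes(files):
    """Bytes kept when identical whole files are stored only once."""
    seen = set()
    stored = 0
    for data in files:
        digest = hashlib.sha256(data).digest()
        if digest not in seen:  # a file with new content is stored in full
            seen.add(digest)
            stored += len(data)
    return stored


def block_level_stored_bytes(files, block_size=4096):
    """Bytes kept when identical fixed-size blocks are stored only once."""
    seen = set()
    stored = 0
    for data in files:
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            digest = hashlib.sha256(block).digest()
            if digest not in seen:  # only blocks not seen before are stored
                seen.add(digest)
                stored += len(block)
    return stored


# Illustrative data: a 16 KB file made of four distinct blocks, an exact
# copy of it, and an edited version where only the last block differs.
original = b"".join(bytes([i]) * 4096 for i in range(4))
edited = original[:-4096] + bytes([9]) * 4096
files = [original, original, edited]

print("raw bytes:        ", sum(len(f) for f in files))
print("file-level stored:", file_level_stored_bytes(files))
print("block-level stored:", block_level_stored_bytes(files))
```

In this example the file-level approach removes the exact copy but stores the edited file in full, while the block-level approach stores only the one block that changed, at the price of hashing every block.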
2. Consider what data types you have
Not all data types are alike, and this can affect how well deduplication works. Some file types, such as photos, audio, video, imaging, or computer-generated data, don't deduplicate well, so they should be kept on non-dedupe storage where they don’t add unnecessarily to the workload. Also, encrypted data, system-state files, and files with extended attributes won't be deduplicated.
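One simple way to act on this is to route files to dedupe or non-dedupe storage based on their type. The sketch below does this by file extension. The extension list and the tier names are assumptions made for illustration, not a definitive catalogue, and an extension check can't detect encrypted data.

```python
from pathlib import PurePath

# Hypothetical, non-exhaustive list of photo, audio, and video formats
# that tend not to deduplicate well.
POOR_DEDUPE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".mp3", ".aac", ".mp4", ".mov"}


def storage_tier(path):
    """Return the illustrative tier name for a file, based on its extension."""
    suffix = PurePath(path).suffix.lower()
    return "non-dedupe" if suffix in POOR_DEDUPE_EXTENSIONS else "dedupe"


for name in ("holiday/IMG_001.JPG", "reports/q3.docx", "training/intro.mp4"):
    print(name, "->", storage_tier(name))
```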
3. Don't focus on reduction ratios
Many deduplication vendors make promises based on how much they can reduce your data by. A claim of 5:1, for example, means your required storage should fall by 80%. But a ratio of 10:1 won't be twice as effective: it cuts your space by 90%, which is only 10 percentage points more. In any case, these figures are only estimates, and actual reduction rates can vary widely based on factors such as:
- Type of data being deduplicated
- Type of backup being deduplicated
- How often, and how much, the data changes
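The diminishing returns from higher ratios can be checked with a few lines of arithmetic. This sketch assumes a ratio of N:1 means the data shrinks to 1/N of its original size, and `space_saved_percent` is a name made up for the illustration.

```python
def space_saved_percent(ratio):
    """Percentage of storage saved for a reduction ratio of `ratio`:1."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return (1 - 1 / ratio) * 100


for ratio in (2, 5, 10, 20, 50):
    print(f"{ratio}:1 -> {space_saved_percent(ratio):.1f}% saved")
```

Going from 5:1 to 10:1 adds 10 percentage points of savings, while going from 10:1 to 20:1 adds only five more, so very high headline ratios deliver less extra benefit than they appear to.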
4. Determine where to perform your deduplication
Using deduplication tools on every storage medium you own is unlikely to be useful or cost-effective. Instead, it’s usually only worthwhile for secondary locations such as backup, where cost is a primary concern. Avoid deploying deduplication tools in primary storage locations like data centers, where speed is the top priority, as the extra processing can slow performance.
5. Ensure you're taking all costs into account
It's also important to keep in mind the full range of costs that are involved in deduplication processes to ensure you aren't surprised by any hidden fees or delays. In addition to the costs of physical storage for deduplication tools, there are the maintenance and management factors to take into account. For example, if you opt for backup software with built-in deduplication technology, you'll be responsible for integrating this into your systems, so you have to consider the resources this will require.