Poor-quality data can be a serious barrier to a business's success. Here are five common problems you need to look out for, and how you can avoid them.
Businesses today depend on data. It's vital in making the right decisions about a company's future direction, ensuring that customers are offered the products or services that are most relevant to them, and developing a deeper understanding of what's going on in the business and the wider market.
Yet all of this can only be achieved if the data businesses are relying on is accurate. There is a line of thinking that given the huge amount of information businesses now possess, the sheer volume of information being processed by analytics tools will compensate for the occasional error.
But this is not necessarily the case, and if you have wider issues with your data quality, you're likely to end up making decisions based on flawed results. As the old saying goes, 'garbage in, garbage out'. Indeed, Gartner estimates that around 40 percent of enterprise data is either inaccurate, incomplete, or unavailable, and this poor data quality costs the average business around $14 million a year.
Therefore, taking the time to fully review and clean up your data before handing it over to your analytics solutions is an essential step.
With this in mind, here are five of the most common data quality issues you're likely to encounter, as well as what to do about them.
1. Duplicated data
Duplicated data is an issue every business will have to deal with. This often comes about as the result of siloed processes and multiple systems that record the same information. When these sources are pooled together for processing, having multiple copies of the same records can significantly skew the results or lead to wasted effort.
This could result in customers receiving several identical marketing materials, which can annoy users and lead to wasted time and money. Or, it could be difficult to help a customer if they contact you with a query, but have multiple entries in your system with different details.
To avoid this, data deduplication tools are a must. These use algorithms to hunt through very large data sets and identify duplicate records. In the past, such solutions may have missed cases where there were minor differences, but today's solutions are smart enough to spot even substantially different entries for the same customer.
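As a rough illustration of how such matching can work, the sketch below normalizes each record and then compares records with Python's standard `difflib.SequenceMatcher`. The field names, the sample customers and the 0.85 similarity threshold are assumptions made for this example, not the behavior of any particular deduplication product.

```python
from difflib import SequenceMatcher


def normalize(record):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(f"{record['name']} {record['email']}".lower().split())


def is_duplicate(a, b, threshold=0.85):
    """Treat two records as duplicates when their normalized text is similar enough."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def deduplicate(records, threshold=0.85):
    """Keep the first record from each group of near-duplicates."""
    unique = []
    for record in records:
        if not any(is_duplicate(record, kept, threshold) for kept in unique):
            unique.append(record)
    return unique


# Hypothetical sample data: the first three rows describe the same customer.
customers = [
    {"name": "Jane Smith", "email": "jane.smith@example.com"},
    {"name": "Jane  Smith", "email": "Jane.Smith@example.com"},
    {"name": "Jayne Smith", "email": "jane.smith@example.com"},
    {"name": "Robert Jones", "email": "r.jones@example.com"},
]
unique_customers = deduplicate(customers)
```

This version compares every record against every record already kept, so it only suits small data sets. Dedicated tools typically group likely matches first to avoid that cost.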
2. Inconsistent formats
If the same kind of information is stored in different formats, many systems will struggle to recognize items as belonging to the same category, and may return inaccurate results.
For example, dates are a common stumbling block for many systems, as there are many potential ways these could be entered into different systems. It may be especially tricky for tools to distinguish between US and European-style dates - if you have one data source that uses the DD/MM/YY format and another that uses MM/DD/YY, you can get incorrect results.
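One way to handle this, sketched below, is to record which format each source uses and convert everything to a single unambiguous form (ISO 8601, YYYY-MM-DD) on the way in. The source names `us_crm` and `eu_billing` are hypothetical.

```python
from datetime import datetime

# Hypothetical mapping: each source system declares the date format it uses.
SOURCE_DATE_FORMATS = {
    "us_crm": "%m/%d/%y",      # MM/DD/YY
    "eu_billing": "%d/%m/%y",  # DD/MM/YY
}


def to_iso_date(raw, source):
    """Parse a date with the format declared for its source and return ISO 8601 text."""
    parsed = datetime.strptime(raw.strip(), SOURCE_DATE_FORMATS[source])
    return parsed.date().isoformat()
```

With this in place, the string 03/04/21 becomes 2021-03-04 when it comes from `us_crm` and 2021-04-03 when it comes from `eu_billing`, and a value that does not fit the declared format raises an error instead of being silently misread.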
Other potential difficulties may arise from phone numbers, especially when some have area codes and others don't, while differences in how data is inputted, such as using 'Street' or 'St' when entering addresses, can also result in more duplication issues. Therefore, it's vital you specify exact formats for every piece of data to ensure consistency across every source your organization uses.
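A minimal sketch of this kind of standardization for addresses and phone numbers might look like the following. The abbreviation table is deliberately tiny, and the phone helper assumes US-style numbers where a seven-digit entry is simply missing its area code. Both are illustrative assumptions.

```python
import re

# Illustrative only: a real table would be far longer, and "St" can also mean "Saint".
STREET_ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}


def normalize_address(raw):
    """Lowercase, drop periods and commas, and expand common street abbreviations."""
    words = re.sub(r"[.,]", " ", raw.lower()).split()
    return " ".join(STREET_ABBREVIATIONS.get(word, word) for word in words)


def normalize_phone(raw, default_area_code):
    """Keep digits only and add the default area code to seven-digit numbers."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 7:
        digits = default_area_code + digits
    return digits
```

Applying the same functions to every source means that "12 High St." and "12 High Street" end up as the same value before any matching or analysis takes place.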
3. Incomplete information
Fields that aren't filled in fully, or are left blank altogether, can be a major pain for tools such as CRM software and automated marketing solutions, as well as big data algorithms. For example, entries that lack ZIP codes aren't just a pain when it comes to contacting customers directly; they can also make key analytics processes useless, as the data will lack essential geographical information that can help you spot trends and make decisions.
Ensuring records can't be created unless all essential information is included is a good start, while setting up systems to exclude incomplete entries may be another way to reduce the issues this can cause.
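Both ideas can be sketched in a few lines: a check that lists which essential fields are missing, and a helper that separates complete records from incomplete ones. The list of required fields here is an assumption for the example and would differ from business to business.

```python
# Hypothetical list of fields a record must contain to be considered complete.
REQUIRED_FIELDS = ("name", "email", "zip_code")


def missing_fields(record):
    """Return the required fields that are absent, empty, or only whitespace."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(record.get(field) or "").strip()
    ]


def split_complete(records):
    """Separate records into (complete, incomplete) lists."""
    complete, incomplete = [], []
    for record in records:
        (incomplete if missing_fields(record) else complete).append(record)
    return complete, incomplete
```

The same `missing_fields` check could be used at the point of entry to reject a new record, or later on to hold incomplete entries back from analysis until they are fixed.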
4. Multiple units and languages
As is the case with formatting, sometimes differences in language, script or units of measurement can create difficulties. There are many examples of disastrous errors being made because someone forgot to take these issues into account, such as NASA's multimillion-dollar Mars Climate Orbiter, which was lost in 1999 because one piece of software supplied figures in imperial units while the navigation software expected metric.
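One simple safeguard, sketched below, is to store the unit alongside every measurement and convert to a single standard unit before any calculation. The unit labels and the choice of metres as the standard are assumptions for this example.

```python
# Conversion factors to metres; the unit labels are a hypothetical convention.
TO_METRES = {
    "m": 1.0,
    "km": 1000.0,
    "ft": 0.3048,
    "mi": 1609.344,
}


def to_metres(value, unit):
    """Convert a length to metres, refusing values whose unit is not recognized."""
    if unit not in TO_METRES:
        raise ValueError(f"Unknown length unit: {unit!r}")
    return value * TO_METRES[unit]
```

Raising an error for an unknown unit is a deliberate choice here: a failed import is far easier to notice than a figure that is quietly wrong by a factor of three.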
Similarly, dealing with data stored in multiple languages can also create difficulties if the analytics tools don't recognize it or know how to translate it. Even special characters such as umlauts and accents can wreak havoc if a system isn't configured for them. Therefore, you have to consider these potential issues if you're dealing with international data sets and program your algorithms accordingly.
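Accented characters are a good example of why this matters: the same name can be stored as different sequences of characters and then fail to match. The sketch below uses Python's standard `unicodedata` module to put text into one consistent form. The function names are made up for this example.

```python
import unicodedata


def canonical_text(value):
    """Return a consistent form for comparison: composed characters, case ignored."""
    return unicodedata.normalize("NFC", value).casefold()


def ascii_key(value):
    """Build a looser search key by removing accents and umlauts entirely."""
    decomposed = unicodedata.normalize("NFKD", value)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# The same surname stored two ways: one precomposed character versus
# a plain "u" followed by a combining diaeresis.
composed = "M\u00fcller"
decomposed = "Mu\u0308ller"
same_name = canonical_text(composed) == canonical_text(decomposed)
```

Whether to strip accents, as `ascii_key` does, depends on the data: it helps match "Jose" to "José", but it also discards distinctions that matter in some languages.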
5. Inaccurate data
Finally, there's no point in running big data analytics or making contact with customers based on data that is just plain wrong. There could be many reasons for this, from customers giving incorrect information to a human operator making a typo when entering data manually, or inputting details into the wrong field.
These can often be among the hardest data quality issues to spot, especially if the formatting is still acceptable - entering an incorrect, but valid, social security number, for example, might go unnoticed by a database that only checks the validity of each input in isolation.
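One partial remedy is to check values against other records as well as on their own. The sketch below flags any identifier that appears with more than one date of birth, which suggests that at least one entry is wrong. The field names `national_id` and `date_of_birth` are hypothetical, and a check like this catches only some errors.

```python
from collections import defaultdict


def conflicting_ids(records, id_field="national_id", check_field="date_of_birth"):
    """Return identifiers that appear with more than one value of check_field."""
    seen = defaultdict(set)
    for record in records:
        seen[record[id_field]].add(record[check_field])
    return {
        identifier: sorted(values)
        for identifier, values in seen.items()
        if len(values) > 1
    }
```

Each flagged identifier still needs a person or a trusted reference source to decide which entry is correct.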
There's no cure for human error, but ensuring you have clear procedures that are followed consistently is a good start. Using automation tools to reduce the amount of manual work when moving data between systems is also hugely useful in reducing the risk of mistakes by tired or bored workers.
Insights for Professionals provide free access to the latest thought leadership from global brands. We deliver subscriber value by creating and gathering specialist content for senior professionals.