How to Cleanse Data at Scale


Tech Insights for ProfessionalsThe latest thought leadership for IT pros

Tuesday, July 9, 2019

Data cleansing is a vital part of any firm's analytics strategy, but how should they go about this - especially as data volumes are on the rise?

Article 6 Minutes
How to Cleanse Data at Scale

Businesses are now more reliant on data than ever before. Information is not only used in reporting and forecasting, but when combined with the latest powerful machine learning and artificial intelligence techniques, effective use of data can help transform almost every aspect of a business. This stretches from delivering more personalized services to consumers to streamlining the supply chain and manufacturing processes by identifying where efficiencies can be made.

However, all of these results are only possible if the information being fed into analytics platforms is as complete, relevant and accurate as possible, and this is still a challenge for many businesses. Particularly when dealing with unstructured data, a lack of consistency or standardization of data inputs can greatly reduce the effectiveness of a big data analytics program, as it means processes are beginning with incorrect assumptions and thus will not output usable results.

Even the most basic activities like maintaining a list of customer contact details can be hampered by poor data. If records are in the system twice, for example, you may end up annoying customers by spamming them with multiple copies of the same marketing email, or if you're sending out physical mail such as invoices, you don't want them going to addresses that don't exist.

Why data cleansing matters

Data cleansing is an essential first step before organizations can get to the primary business of actually analyzing the information they have. This involves identifying and fixing any potential issues that could compromise the quality of the data and traditionally involves looking out for data that is incorrect, incomplete or duplicated. However, there's now also a new category of data cleansing that has to been taken into account in the age of GDPR and data privacy - filtering out information that is irrelevant.

Some of these inaccuracies may be the natural result of information becoming decayed and outdated over time. Names and addresses can change as people get married or move home, for example, or if customers have died. There are few things more likely to hurt a brand's reputation than sending cheery marketing messages to the relatives of deceased customers.

Meanwhile, poorly-designed forms that allow users to enter data in non-standard or unstructured formats, or leave vital fields unfilled, can also skew the results of any data processing and analytics. For example, if you have customers around the world and ask them to fill in their date of birth, you need to provide clear fields for this and not simply leave it up to the customer to use plaintext. The other side of this is how does your system know if '04/05/90' is May 4th or April 5th?

Fixing all this is no small task. Indeed, one survey estimates that data scientists spend 60% of their time organizing and cleaning their data, compared with 19% of their time collecting data sets and just 9% actually mining it for patterns. And as the volume of data is only set to grow even further in the coming years, it may quickly become overwhelming.

The key stages of data management

Every data management system has key steps that must be undertaken before analytics processes can begin, and there are a few things businesses can do at every stage to make the data cleansing process as swift and painless as possible.


Data gathering activities are the obvious place to start. Today's data comes from multiple sources, and many of these will provide unstructured raw data that is initially hard to work with. But by planning ahead and determining a few basic rules and guidelines, you'll save yourself a lot of work later.

For example, try to ensure forms are as standardized as possible, giving users as few opportunities as possible to get creative with entries that might break things. Also, if you're sourcing data from a spreadsheet, try and make sure it all comes from the same worksheet to ensure consistency.


This stage is where the bulk of the work is done. Before you load the information into your warehouse, there are several tools available that can help minimize issues, both large and small.

Running the data against other external sources, such as third-party directories, can help identify factual errors such as incorrect addresses or missing ZIP codes. Meanwhile, in-built tools in applications like Excel can tackle smaller, but no less problematic issues, such as inconsistent formatting, extra spaces and duplicated records.


Extract, Transform and Load  (ETL) techniques that can pull data from its original source and import to the data warehouse also offer tools for cleaning data during the process. For example, effective solutions should be able to evaluate new incoming data by comparing it against historical information to spot any oddities.

The transformation process should also include a list of rules that can be applied to incoming data to ensure the format is consistent and the business logic is dependable and based on user requirements.


Data cleansing isn't finished once it's in your data warehouse. Data can still be enhanced and appended with additional information to make the reporting stage more accurate or insightful by adding to its context.

Also, don't be afraid to rely on your gut. If the outcome of a report seems unusual or outside your expectations, this might not mean your assumptions were wrong, it might mean the data was. If this happens in your reporting, it's worth going back and taking a closer look at your sources.

Simple steps to clean your data at scale

Many of the steps involved in data cleansing are actually fairly straightforward, if sometimes time consuming. For instance, spreadsheet tools that can find duplicate data and remove it, or clean up individual records by tidying and reformatting fields to eliminate unnecessary space and inconsistencies. You can also use conditional formatting to highlight any entries that are outside the norm to take a closer look at manually.

But while these are useful tools, however, they're not foolproof, particularly if there are minor differences in otherwise identical records, so you shouldn't be relying too heavily on them.

As businesses scale up and take on more data, turning to automation and machine learning tools will be essential if businesses are to keep pace. One way of improving data quality at the earliest stage is to use robotic process automation (RPA).

These basic tools are intended to streamline heavily structured but repetitive tasks, so may be well suited to some data cleansing operations. When gathering data from initial sources, an RPA bot can be programmed to evaluate each one based on a set of key business criteria in order to validate it, or send it back to the source for correction if it appears to be inaccurate or incorrectly formatted.

Machine learning is another technology that is being increasingly applied to data cleansing. To do this, you first need to teach the AI what to look for, which involves knowing what type of errors it should be searching for, how these errors can be identified, and what remedy to apply.

Once an AI system understands these basic criteria, you can let it loose - and the great thing about this technology is the more data it looks at, the more accurate it will get, making it the perfect solution for managing data cleansing at scale.

Tech Insights for Professionals

Insights for Professionals provide free access to the latest thought leadership from global brands. We deliver subscriber value by creating and gathering specialist content for senior professionals.


Join the conversation...

11/07/2019 Paul David
absolutely vital that the data is cleansed, interesting article