AI for AML: Battling bad data to make data science magic

Media buzz has painted AI as a magic solution to every problem. The reality is far more complex, and AI does not provide all the answers. The true magic of data science lies in its ability to transform raw data into actionable insights, but this transformation depends heavily on the quality of the data used and on an organisation's readiness for AI.

While many financial institutions have already begun their AI journeys, most skip important steps, including making sure the data a model is trained on is adequate for the use case. Understanding and overcoming the challenges of bad data is crucial for any successful AI implementation.

Watch DIGITExpo's keynote ‘Battling through bad data: How to succeed in AI’, or read on below.

What is bad data?

Bad data is data that is not fit for the purpose of the model.

It may be inconsistent, either because it was not collected correctly or because of problems with detection.

Bad data may contain inaccuracies, either because of how the data was collected or because of problems in processing.

Insufficient samples make it hard to draw conclusions: if your sample size is limited, you may not have enough variation to build a model.

Even with a large enough sample, bad data could have missing values. If these fields are not handled correctly, they will affect the AI model and lead to inaccurate output.

The same goes for irrelevant data, which risks misleading AI models into learning correlations that are not pertinent to the problem you are trying to solve. And even if the data set is large enough, relevant and complete, misunderstanding it risks introducing bias or building the wrong models based on your assumptions.

As such, it is imperative to fully understand and test data before using it in AI systems. There are a few ways that institutions can avoid problems created by bad data.

1. Matching the size of the data set to the size of the problem

It is important to understand what you are measuring, how much data is required, and what accommodations you need to make in the data. For example, if you measured the height of individuals at a basketball convention, even a large sample would be skewed towards taller heights, so your result would not be representative of the general population.

This example seems obvious, but it is a common sampling mistake people make when selecting data to train AI models. Ensuring that the data accurately represents the real-world scenario is crucial to avoid biases and incorrect assumptions. This shortfall is where so many AI projects fail when they go into production.
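
One way to catch the basketball-convention problem before training is to compare a sample against known reference statistics for the population the model will serve. The sketch below is illustrative only: the reference mean and standard deviation, the example heights and the function name `sample_looks_representative` are all assumptions, not figures from this article.

```python
from math import sqrt
from statistics import mean

# Hypothetical reference figures for the general population (cm).
REFERENCE_MEAN_CM = 170.0
REFERENCE_SD_CM = 10.0


def sample_looks_representative(sample, ref_mean, ref_sd, max_z=3.0):
    """Return (ok, z). z is how many standard errors the sample mean sits
    from the reference mean; ok is False when |z| exceeds max_z."""
    standard_error = ref_sd / sqrt(len(sample))
    z = (mean(sample) - ref_mean) / standard_error
    return abs(z) <= max_z, z


# Made-up heights collected at a basketball convention.
convention = [198, 201, 193, 205, 190, 196, 199, 203, 188, 197]
# Made-up heights collected from a broader group of people.
high_street = [168, 175, 162, 171, 180, 165, 158, 177, 169, 173]

for name, sample in [("convention", convention), ("high street", high_street)]:
    ok, z = sample_looks_representative(sample, REFERENCE_MEAN_CM, REFERENCE_SD_CM)
    print(f"{name}: z = {z:.1f}, representative = {ok}")
```

A check on the mean alone is crude. In practice you would probably compare the full distribution, and do so for every attribute that matters to the use case.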

2. Going far back enough with data to predict behaviour

To make accurate predictions, it is essential to have sufficient historical data. Short-term data might show short-term variations, but longer-term data is necessary to identify trends and behaviours over extended periods. How far back you need to go depends on the problem you are trying to solve, but finding the right balance helps distinguish between genuine anomalies and normal variations, providing a more robust foundation for AI models.
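
To make the effect of the look-back window concrete, here is a minimal sketch using an invented account that pays rent once every 30 days. The account, the amounts, the window lengths and the helper `is_unprecedented` are all hypothetical. The same payment looks like an anomaly against two weeks of history and like normal behaviour against three months.

```python
def build_daily_spend(days):
    """Hypothetical account: 40 a day, rising to 1240 on every 30th day (rent)."""
    return [1240.0 if day % 30 == 0 else 40.0 for day in range(1, days + 1)]


def is_unprecedented(history, value, tolerance=0.10):
    """True when every past value is more than `tolerance` below `value`."""
    return all(past < value * (1 - tolerance) for past in history)


history = build_daily_spend(119)   # days 1 to 119
todays_spend = 1240.0              # day 120, another rent payment

short_window = history[-14:]       # last two weeks: no rent day included
long_window = history[-90:]        # last three months: three rent days

print(is_unprecedented(short_window, todays_spend))  # True
print(is_unprecedented(long_window, todays_spend))   # False
```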

3. Make sure anomalies you’re seeing are really outliers

Outliers are data points that deviate significantly from the rest of the data set, and removing them usually gives a better result for more of your data. However, it is worth considering whether the anomaly is a rare event, a sampling error or a measurement error. These are very different, and not understanding the root of the anomaly can lead to false assumptions during model development and to poor performance in production when the model meets these outliers.
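
A simple way to put this into practice is to flag outliers for review instead of deleting them silently. The sketch below uses the modified z-score (median and median absolute deviation, with the commonly used constants 0.6745 and 3.5), which is one technique among many and not necessarily the one used by the author. The sample amounts and the `triage` rule are invented for illustration.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return the values whose modified z-score, based on the median and the
    median absolute deviation (MAD), exceeds `threshold` in absolute terms."""
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    if mad == 0:
        return []  # no spread around the median to measure against
    return [v for v in values if abs(0.6745 * (v - centre) / mad) > threshold]


def triage(amount):
    """Hypothetical triage rule for transaction amounts."""
    if amount < 0:
        return "likely measurement error: amount cannot be negative"
    return "needs review: possible rare event or sampling error"


amounts = [52, 48, 50, 51, 49, 47, 53, 50, 5000, -1]
for amount in flag_outliers(amounts):
    print(amount, "->", triage(amount))
```

Both 5000 and -1 are flagged here, but they call for different handling: the negative amount points to a data fault, while the large one may be exactly the rare event the model needs to learn from.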

4. Dealing with missing data appropriately

It’s important to understand whether missing data is a problem with the data pipelines or a natural feature of the data, so you can manage it properly. For instance, a lack of daily transactions might be normal for an individual, but interpreting a missing blood pressure reading as a normal value could be dangerous. How you handle this can hugely impact your model. Model pipelines must be designed to handle null values effectively: by removing rows with missing fields, by substituting missing data using appropriate methods, or by using modelling techniques that can cope with incomplete datasets.
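
The first two options can be sketched in a few lines. The field names and customer records below are made up, and median imputation is only one example of an "appropriate method". The right choice depends on why the value is missing. The third option, a modelling technique that tolerates gaps, is not shown.

```python
from statistics import median


def drop_incomplete(rows, fields):
    """Option 1: keep only rows where every listed field is present."""
    return [r for r in rows if all(r.get(f) is not None for f in fields)]


def impute_median(rows, field):
    """Option 2: fill gaps with the median of the observed values, and add a
    flag recording which rows were filled so the gap is not hidden."""
    fill = median([r[field] for r in rows if r.get(field) is not None])
    return [
        {
            **r,
            field: fill if r.get(field) is None else r[field],
            f"{field}_was_missing": r.get(field) is None,
        }
        for r in rows
    ]


# Hypothetical customer records; None marks a missing value.
customers = [
    {"id": 1, "declared_income": 3000},
    {"id": 2, "declared_income": None},
    {"id": 3, "declared_income": 5000},
    {"id": 4, "declared_income": 4000},
]

complete_only = drop_incomplete(customers, ["declared_income"])
imputed = impute_median(customers, "declared_income")
print(len(complete_only), imputed[1])
```

Keeping the `_was_missing` flag is a deliberate choice: if the gap itself carries information, such as a pipeline fault affecting one source system, the model or a reviewer can still see it.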

5. Synthetic data is your friend – if used in the right way

Synthetic data can be a powerful tool to supplement small datasets, ensuring privacy and reducing bias. However, synthetic data for training must accurately reflect the real data’s patterns, characteristics, and interdependencies. It should start with some real data or be based on well-defined assumptions. Synthetic data generation is not about creating random data but about generating data that mirrors the real-world scenarios closely. Rigorous testing is required to ensure that the synthetic data holds no unwarranted correlations or unrealistic values.
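
As a rough illustration of "start with some real data, then test", the sketch below fits a very simple model to a stand-in data set, generates synthetic records from it, and checks that the relationship between the two fields survives. Everything here is an assumption made for the example: the field names, the normally distributed income, the straight-line link between income and spend, and the helpers `fit`, `generate` and `pearson`. Real generators are usually far more sophisticated.

```python
import random
from math import sqrt
from statistics import mean, stdev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


def fit(xs, ys):
    """Summarise the real data: mean and spread of x, plus a least-squares
    line y = intercept + slope * x and the spread of its residuals."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    residual_sd = stdev([y - (intercept + slope * x) for x, y in zip(xs, ys)])
    return mx, stdev(xs), slope, intercept, residual_sd


def generate(params, n, rng):
    """Draw n synthetic (x, y) pairs from the fitted summary."""
    mx, sx, slope, intercept, residual_sd = params
    xs = [rng.gauss(mx, sx) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0, residual_sd) for x in xs]
    return xs, ys


rng = random.Random(42)
# Stand-in for real data: income and monthly spend for 200 customers.
real_income = [rng.gauss(3000, 500) for _ in range(200)]
real_spend = [0.5 * x + rng.gauss(0, 100) for x in real_income]

synth_income, synth_spend = generate(fit(real_income, real_spend), 2000, rng)

# Two of the checks the text calls for: preserved correlation, realistic values.
gap = abs(pearson(real_income, real_spend) - pearson(synth_income, synth_spend))
negatives = sum(1 for v in synth_income + synth_spend if v < 0)
print(f"correlation gap: {gap:.3f}, negative values: {negatives}")
```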

This takes us to the final point.

6. Test, test and test again

Testing is a critical component of data science. It ensures that the models developed are based on reliable data and can produce accurate, actionable insights. Thorough testing helps identify and rectify any discrepancies in the data, enabling the development of robust AI models that can withstand real-world applications.
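
Data tests can be as simple as a function that runs before every training job and reports what it finds. The following is a minimal sketch, not a complete test suite: the field names, the thresholds and the function `check_data` are hypothetical, and it covers only missing values, value ranges and duplicate identifiers.

```python
def check_data(rows, required, ranges, max_missing_rate=0.05):
    """Return a list of human-readable failures; an empty list means pass."""
    if not rows:
        return ["data set is empty"]
    failures = []
    # Too many missing values in a required field.
    for field in required:
        missing = sum(1 for r in rows if r.get(field) is None)
        if missing / len(rows) > max_missing_rate:
            failures.append(f"{field}: {missing}/{len(rows)} values missing")
    # Present values that fall outside the allowed range.
    for field, (low, high) in ranges.items():
        bad = [r[field] for r in rows
               if r.get(field) is not None and not low <= r[field] <= high]
        if bad:
            failures.append(f"{field}: {len(bad)} value(s) outside [{low}, {high}]")
    # Duplicate identifiers.
    ids = [r.get("id") for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("id: duplicate identifiers found")
    return failures


# Invented transactions with three deliberate faults.
transactions = [
    {"id": "t1", "amount": 120.0, "country": "GB"},
    {"id": "t2", "amount": -5.0, "country": "GB"},
    {"id": "t2", "amount": 80.0, "country": None},
]

for failure in check_data(transactions,
                          required=["amount", "country"],
                          ranges={"amount": (0, 1_000_000)}):
    print("FAIL", failure)
```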

Ensuring the use of high-quality, relevant data with known provenance is challenging but crucial. Smaller quantities of high-quality data, coupled with proper risk controls and supplemented by well-crafted synthetic data, can help overcome these challenges. Rigorous data preparation, continuous testing, and stakeholder engagement are essential to building trust and ensuring the efficacy of AI-powered financial crime compliance systems. By addressing the issues of bad data, financial crime compliance professionals can unlock the true potential of AI and make data science magic, because science is magic that works.

Click here to download Napier AI's eBook ‘The optimal path to AI implementation for financial crime compliance’.

Chair of the Royal Statistical Society’s Data Science and AI Section and member of the FCA’s Synthetic Data group, Janet started coding in 1984 and discovered a passion for technology. She holds degrees in both Molecular Biochemistry and Mathematics, a Master’s in Finance and a PhD in Computational Neuroscience. Janet helped both start-ups and established businesses implement and improve their AI offerings before applying her expertise as Chief Data Scientist at Napier AI. Janet regularly speaks at conferences on topics in AI including explainability, testing, efficiency, and ethics. In 2026, Janet was named to the Computing AI Leadership Index, presented Project Theseus as part of the FCA Supercharged Sandbox, and is shortlisted for Data Leader of the Year at the British Data Awards.
