Tackling machine learning enemy #1, poor data quality – an interview with Sahar Changuel, PhD

How did you get passionate about data quality?

Poor data quality is enemy number one to machine learning.

Throughout my career, I have always worked with data. While data quality was not always part of my responsibilities, poor data quality kept me from completing projects successfully at most stages of my career.

The topic of data quality is interesting to me, because it is challenging. I take pleasure in tackling complex problems and finding solutions.

Identifying the root cause, starting a project and finishing a project are not easy tasks. The element of communication is also very important and enjoyable for me. You have to communicate about the problem and the challenges, and update stakeholders on the status.

Your PhD, "Metadata for personalization and access to knowledge," is fascinating. What insights from your research are still important for use cases that will be put into production in 2023?

Having a mathematics background, I am also passionate about machine learning and algorithms. Being able to extract semantics, meaningful information from texts, creating ontologies and taxonomies, and using techniques to get results that fit user expectations is a challenge that interests me.

I completed the PhD more than 10 years ago. During the PhD I focused on how to automatically extract valuable information from unstructured data, particularly from documents. I used text mining techniques to deduce information from text data and NLP techniques to clean the data into a usable format.

One of the challenges during the PhD was to find the data and to have it in good quality. What is quality? I was working with machine learning techniques, so I needed annotated data. In supervised learning, you need data that is already classified and annotated.

I spent a large amount of time manually working with the data so that it was of good quality. After I got the data to sufficient quality, I could focus on the heart of my PhD, which was choosing which algorithm to use and what method to apply.

The challenge for any data scientist is to have data in good quality so that you can solve the problem.

This is a problem with any data, especially text. I had to understand the content and extract meaning from the text, and I wanted to make sure that my work was based on good quality data. Text analytics has grown in importance over the last decade. About 80% of the data in organizations is commonly estimated to be unstructured text. One challenge is how to extract meaning from a massive amount of text, so becoming more familiar with techniques that enable reliable text analytics is a current imperative.

Why were you interested in data quality before it got the spotlight it deserved?

Poor data quality is enemy #1 to machine learning. Machine learning can only be as good as the data you input. No matter how much time you put into the algorithm, if garbage goes in, garbage comes out.

The fact that I am a machine learning practitioner made me very sensitive to bad quality data. As data scientists, we need to tackle poor quality data. One cannot go far with machine learning with poor quality data.

The first impulse is often to hire data scientists to solve complex problems. The reality is that lots of data scientists need to deal with data quality. It is frustrating.

Data quality needs to be tackled as a subject in its own right, not as a task to be done once a month.

Why should data quality be an organizational priority beyond the data science team?

Poor quality data does not only impact machine learning; it also impacts our decisions. In a data-driven world, we are more and more dependent on data to make decisions.

If the data is not of good quality, we make bad decisions.

Why do so many organizations struggle with data quality?

There has been a lot of effort to be data-driven, to improve business efficiency and be more competitive. As data is becoming more valuable, the impact of bad data is larger than before.

There has been a lot of money invested in cloud computing and other technologies enabling companies to be more data-driven. There is now a need for investment in the quality of data.

Why have so many been putting up with bad data for so long?

The impact of data 30 years ago was not as great as the impact of data today. We are storing data in the cloud and want to hire qualified people to make the best use of the technology. When these highly qualified people want to use the data, they realize the data is not of good quality.

The volume of data increased over time. As companies increasingly rely on data to help execute business decisions, the impact of bad data is larger than before.

If a company makes a New Year's resolution that 2023 is the year they are going to have good data, how do they start?

Include data quality in business conversations: The company should start by bringing data and information into its business conversations. The business needs to realize how important data and data quality are to achieving its goals. Business and technology leaders need to ask:

What kind of data is needed?

Do we have the data we need?

Do we trust the data we have?

Is it at the quality we need to be usable?

Do we have the data quality we need to achieve the global goals of the business?

Invest in training and education in data quality:

For existing employees

We need education about data quality management, which could include training, conferences and certifications. We need qualifications covering the processes that lead to data quality. Data management needs to include the subject of data quality.

Integration of the topic in degree programs, also for non-data specializations

Data quality should be a topic not only in data and computer science programs but also, for example, in business programs. Because people are making decisions based on data, data quality is not just for techies.

Many business students become leaders and need to understand data quality so they can prioritize resources for the quality of the data on which they depend.

What is the key factor for success in a data quality project?

Identifying the root cause, and not a symptom, is critical. A remediation plan needs to tackle the root cause. One type of remediation can be training for the people creating the data, because they might not have visibility into how their work is impacting the organization’s data quality.

A larger organization might need a bigger solution. Even in a large organization, you can go for quick wins sometimes. It depends on the problem and the impact of the problem.

What are some common mistakes companies make or misconceptions about data quality?

Generally, once a company identifies a data quality issue, it launches a massive initiative to clean the data.

The primary goal of a data quality initiative should be to find out why a problem happens and prevent it from happening again. There needs to be a root cause analysis, followed by action on the root cause.

Companies should put in place preventative measures to avoid bad data quality instead of creating massive initiatives to fix data quality retroactively once problems arise.

How do you define quality?

There are standards of data quality, which are defined along different dimensions. The dimensions of data quality are:

Accuracy: Whether the data is correct or not. Generally, this is what people think of as data quality.

Coherence: Data stored in different systems should be consistent and have the same business logic.

Timeliness: Data is available at the time it needs to be utilized.

Completeness: Whether the data is complete or not, for example, if there are null values.

Uniqueness: There are no redundancies, for example one client cannot have two IDs.

Depending on the business needs and on the data quality issues, certain dimensions are more important.
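
As a rough sketch of how two of these dimensions could be measured, completeness and uniqueness can be expressed as simple ratios. The records, the field names, and the assumption that an email address identifies a client are all hypothetical and not taken from the interview.

```python
# Hypothetical client records; field names and values are illustrative only.
records = [
    {"client_id": 1, "email": "ana@example.com"},
    {"client_id": 2, "email": None},               # missing value
    {"client_id": 3, "email": "ana@example.com"},  # same client, second ID
    {"client_id": 4, "email": "cy@example.com"},
]


def completeness(rows, field):
    """Share of rows where the field is present (not None or empty)."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(field) not in (None, ""))
    return filled / len(rows)


def uniqueness(rows, field):
    """Share of distinct values among the non-missing values of the field."""
    values = [row[field] for row in rows if row.get(field) is not None]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


# Assumption: one email address corresponds to one client, so a repeated
# email signals a client that was registered under two IDs.
email_completeness = completeness(records, "email")
email_uniqueness = uniqueness(records, "email")
print(f"completeness: {email_completeness:.2f}, uniqueness: {email_uniqueness:.2f}")
```

Which of these ratios matters most depends, as noted above, on the business need.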

What are some tips for companies to make sure they finish their data quality projects and reach their 2023 data quality objectives?

The key is to prioritize the efforts. Identify the critical business need. Prioritize and identify the most important root causes and make sure the right measure is put in place.

Don’t do everything. Prioritize and plan.

What are some methods to track and maintain data quality?

Putting into place rules, controls and metrics is important.

Rules entail data quality requirements based on the data quality aspects mentioned above.

Controls are monitoring mechanisms we put in place to monitor data quality metrics on an ongoing basis.

Metrics are based on the rules, showing a certain level of required quality. (Explained further in the next questions)

Monitoring and control continue even after the project is over, and action can still be taken then.

Prevention should also be built into the process, so that we can prevent data quality problems at the time of creation.

What skillsets does a company need to hire and staff to be able to manage data quality?

I would like to clarify that data quality cannot be sustained without data governance. Data governance defines the policies, procedures, roles, and responsibilities for the effective management of data.

Successfully completing a data quality project requires knowledge and skills in business, data, and technology. It also requires the ability to communicate at many levels within the organization, to do detailed analysis, to query data, to interpret data models, and to code. You cannot expect one person to have all these skills, so many roles can be involved in the project. Good project management skills are required to organize and manage the work and to ensure your project stays on track.

Good communication skills are vital too: presentation skills, negotiation, decision making, data storytelling, listening, writing, etc.

What are some KPIs for data quality?

We define the data quality dimension, i.e. this data should have a certain accuracy or level of completeness. We make a rule and a KPI that measures the rule, and we put it in a dashboard.

There are two levels of KPIs:

Granular: For example, the rate of accuracy of emails. You then apply rules to emails, for example that emails should have a specific format, etc.

Aggregated: These are global KPIs that would be reflected in a dashboard. Some examples could be level of accuracy or level of coherence. Looking at a specific geographic area or a particular business scope, a KPI could be what percentage of the data is of good or bad quality according to specific rules.
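
The two levels can be sketched in a few lines. This is only an illustration under stated assumptions: the email pattern is a deliberately simple format rule rather than a full validator, and the customer records, the `region` field, and the helper names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed format rule: something@something.something, with no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def email_is_valid(row):
    """Granular rule: the email field must match the expected format."""
    value = row.get("email")
    return isinstance(value, str) and EMAIL_PATTERN.fullmatch(value) is not None


def pass_rate(rows, rule):
    """KPI: share of rows that satisfy the rule."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if rule(row)) / len(rows)


def pass_rate_by(rows, key, rule):
    """Aggregated KPI: the same pass rate, broken down by a business scope."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return {scope: pass_rate(group, rule) for scope, group in groups.items()}


# Hypothetical customer data.
customers = [
    {"region": "EMEA", "email": "ana@example.com"},
    {"region": "EMEA", "email": "bob[at]example.com"},  # breaks the format rule
    {"region": "APAC", "email": "cy@example.org"},
    {"region": "APAC", "email": "dee@example.org"},
]

overall = pass_rate(customers, email_is_valid)
by_region = pass_rate_by(customers, "region", email_is_valid)
print(overall, by_region)
```

Note that a format rule only approximates accuracy: an address can be well formed and still belong to nobody.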

Threshold:

Once we define the KPI, we need to define the threshold below which the data quality is considered insufficient. For example, if we have 85% data quality, we might still say that we have data quality.

The threshold has to take into account how critical the data quality is, meaning impact on the business and the risk posed by the data quality.
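
One possible way to tie the threshold to criticality is sketched below. The criticality levels and the threshold values are assumptions chosen for illustration; only the 85% figure echoes the example above.

```python
# Hypothetical thresholds per criticality level; the numbers are illustrative.
THRESHOLDS = {"high": 0.99, "medium": 0.95, "low": 0.85}


def evaluate_kpi(value, criticality):
    """Compare a KPI value with the threshold for the given criticality."""
    threshold = THRESHOLDS[criticality]
    return {"value": value, "threshold": threshold, "ok": value >= threshold}


# The same 85% result is acceptable for low-criticality data
# but not for data with a high business impact.
print(evaluate_kpi(0.85, "low"))
print(evaluate_kpi(0.85, "high"))
```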

How can you engage with stakeholders to create meaningful data quality requirements?

You ask “So what?” If the data quality is poor, what is the impact?

What is the positive effect of good quality data, and the negative effect of bad quality data?

These questions will enable discussions about the impact of data quality on the business. Both quantitative metrics and qualitative measures can be outcomes of this discussion.

Do you have any tips on how to put a monetary value on data quality? In my personal experience, data quality is seen as overhead and thus subject to underinvestment.

Before thinking of monetary value, you have to think about the business impact of data quality. As data professionals, we can help stakeholders identify the positive impact of high quality data and the negative impact of poor quality data.

Then we can establish the business case for data quality improvement. If we help stakeholders to identify the business impact of data quality, they will be more willing to invest money in a data quality project.

Going back to the email example above: if the email address is wrong, we can’t talk to the client.

What are a couple of secrets to your success?

Focus on business needs. Data quality should never be done for its own sake. Money should be spent for a business need.

Communication is key. Engage people. Gauge their reactions, gain trust.

Who is Sahar Changuel, PhD?

Sahar Changuel has a PhD in machine learning and Natural Language Processing (NLP). Over the last 10 years she has not stopped working with data, structured and unstructured, in different use cases and for different sectors: education, media, audit, finance and more. Her mandate was almost the same each time: making the data usable and exploitable in the way that best fits the business need.

In order to make the best use of data there is one fundamental requirement: data must be of good quality, which can be a real challenge in some situations. Today, Sahar is a Senior Data Manager, and as a data professional she has the responsibility to provide the highest-quality data possible and encourage confidence in it.
