Why data quality is crucial for data integration projects

About this episode:

In this episode of Behind the Data, Matthew Stibbe interviews David Howell, Director of IT and Digital Transformation at Chugai Europe. They discuss the exciting developments in the world of data, such as AI and machine learning, and the risks and challenges that data professionals need to be mindful of. David shares insights into Chugai Europe's digital transformation journey, focusing on simplification, efficiency, and automation. They also explore the importance of data quality, naming conventions, and management buy-in. David highlights a specific data integration project using CloverDX and emphasizes the need for the right tools to achieve project goals.

AI-generated transcript

Matthew Stibbe (00:01.646)
Hello, welcome to Behind the Data with CloverDX. I'm Matthew Stibbe, your host, and I'm here with David Howell, who is Director of IT and Digital Transformation at Chugai Europe. Hello, David, thanks for being on the show.

David Howell (00:14.411)
Well good morning, how are you?

Matthew Stibbe
I'm grand, thank you, I'm good, and you?

David Howell
Yep, very well.

Matthew Stibbe (00:19.598)
So before we start exploring your career and the world of data at Chugai, I'd like to start with just what is going on in the world of data that's recently happening, new technology, new ideas that are exciting you at the moment.

David Howell (00:34.507)
Well, I think like most people, everybody's talking about AI and machine learning. It's the big buzzwords. You can't even look at any sort of products, literature, go to trade shows, webinars, whatever it is, everyone's talking about AI and machine learning. And this is what is going to transform the world. And we've already seen that in many of the ways that we live today, Amazon Alexa or Siri, all those kinds of things. And it's moving forward at probably a far more accelerated rate now than it was when it first started 30 years ago.

Matthew Stibbe (01:07.278)
Well, I didn't think twice before this interview about going to ChatGPT and saying, how do I pronounce Chugai as in Chugai Pharmaceuticals? And it came back and it told me and I hope it, well, I think it was right from what you said. So it's sort of starting to pervade our world. But do you think there are any risks or anything that data professionals need to be mindful of?

David Howell (01:29.963)
I think really it's just getting that guarantee or that certainty on what it's been trained on. As I said there's some really, you know, from personal experience, there's some really good use cases, like you just said, in terms of how do I pronounce Chugai. But also, you know, I've seen some quite wild and wacky responses, again, depending on the prompts, depending on, you know, the information that you ask it from. But again, also really, as I said, what it's been trained on.

Because we've had recent examples of Google Bard referencing things like Reddit and some of the responses are not ideal. So I think there's a real risk. But again, it's no different to what we had with Google. For example, if it's indexing a website and it's presenting the facts back and you read it.

Again, it all comes down to the person taking that knowledge. You know, if you go to a library, it's probably more guaranteed that the information is correct because it's been curated, it's been reviewed, it's been peer-reviewed. You don't have that on the internet. So again, I think it really comes down to the information that you train it with.

Matthew Stibbe (02:32.782)
It's like the dying history professor telling his devoted students his last words, verify your sources, verify your sources. 

David Howell (02:41.003)
Exactly. Yeah.

Matthew Stibbe
So tell me a little bit about Chugai, and particularly Chugai Europe, tell me about your world.


David Howell (02:49.643)
So we are a pharmaceutical company. We're headquartered in Japan. So we've got a long heritage. We're a hundred years old next year. We operate across Europe. There's about 180 people and we're across the UK, France and Germany. There's two parts of our business. One is a clinical development and the second one is commercial and sort of sales marketing type activities.

Matthew Stibbe (03:14.702)
And tell me a little bit about your role there.

David Howell (03:18.251)
So I've been at Chugai for 24 years now. I head up IT and digital transformation. Digital transformation is new. That started this year in January, officially. And the real scope and role of that expansion is to try and look at solving some of our business challenges using digital transformation. So part of that is machine learning, AI, all underpinned by good, strong data management.

Matthew Stibbe (03:42.83)
Can we explore a little bit some of those challenges and slash opportunities that you're trying to solve? What do you think the biggest payoff for digital transformation will be for Chugai?

David Howell (03:55.755)
I think it's really around simplification, efficiencies and automation. I think probably like a lot of organizations of a similar size, we're probably considered a small organization. We've got lots of information across the business, spreadsheets, emails, it's just everywhere. But the challenge with that is it means that we can't make accurate, timely, informed decisions.

So we've already been on a journey to establish things like data warehousing and reporting and dashboarding off of those. And we've got some automation of data sources. But really, that's where we need to focus more, is preventing or reducing the amount of effort that people need to do to generate these insights.

Matthew Stibbe (04:42.862)
And you're about 180 people in Europe, is that right?

David Howell (04:45.355)
Yeah, across Europe, yeah.

Matthew Stibbe
And this is not medical or pharmaceutical information. This is just regular business data, sort of CRMs and things?

David Howell (04:53.227)
Business operational information. So it comes from, yeah, we've got various sources. So things like warehouse information. So how much product that we sold, how many medicines have we sold through to, you know, could be HR data. It could be, you know, could be activity in our CRM or our marketing automation tool. So yeah, website traffic, all that kind of thing.

Matthew Stibbe (05:14.446)
Can we do a little bit of a name check of the kinds of systems that you've got? I think you mentioned Salesforce. What else is in the stack there?

David Howell (05:21.931)
So Salesforce is our CRM. We also use Marketo for digital marketing. We've got, you know, most of our data source comes in from things like Excel. We've got third party information that comes from that. We're part of the Roche group as well. So we get some information from theirs and they use things like Tableau and you know, they've got their own data warehouse and they use some pretty big tools like Snowflake and some other bits and pieces, but for us here in Europe, we're more around a fairly traditional data warehouse, I would say. I refer to it as a data puddle rather than data lake.

Matthew Stibbe (05:57.038)
It's a data shed, not a data warehouse, right? Data garage.

David Howell (06:00.555)
Exactly. I mean, maybe a small warehouse. Shed. Yeah. But yeah, I mean, for the size of organization we've got, I think we're on the right journey anyway.

Matthew Stibbe (06:12.814)
And can you give me a couple of examples of what you would like to see happening when you started to join these things up and move the data around intelligently?

David Howell (06:22.443)
So we've been using, well, we started primarily with Microsoft SQL and we've had this over a number of years. I mean, I've worked at Chugai for 24 years and we'd probably use SQL for most of that time I've been here. Some systems, they're very discrete, finance systems don't really talk to CRM systems. So our plan really was to have a central data warehouse where we could start to bring in that information to a single space.

I think really for me, it's having that roadmap, that clarity and the prioritization of what data sources are most important and where do we go get them.

Matthew Stibbe (07:03.118)
So that data warehouse will be in SQL Server?

David Howell (07:05.995)
It is, yeah.

Matthew Stibbe (07:12.622)
When you're starting to do all of this, can you give me a vision of what data nirvana looks like for you? Where you're headed towards? What will people in the organization be able to do that they can't do today?

David Howell (07:28.491)
So I think for us, it's really important is there's a couple of things that I said, I mean, we made a load of mistakes when we first started with, you know, things like naming conventions and all the technical stuff, like, you know, how do you differentiate tables and where they come from and, you know, the challenges of, you know, view sprawl and we have test data in with production. So there's a whole, we didn't get it right to start with.

But I think going forward, it's really about that clarity. For me, data nirvana is having a central set of information that anybody in the business can use, subject to obviously need to know basis, but having that information in one place that's of high data quality.

Because again, just because you can automate and bring information into a warehouse doesn't mean it's good data. And that's one of the things that we've been working through on our journey so far: working with different parts of the business to say, hey, this information is not complete or it's not accurate. But again, for me, it's also getting the management buy-in. So getting the leadership team to sponsor good quality data and seeing what the benefits are, because I can't give them access to meaningful data from a Power BI dashboard or whatever we might use, have them make a decision on that, and then not have confidence in the reliability of the information that they're using.

Matthew Stibbe (08:51.726)
A few things I want to unpack in that. Can we start with naming conventions? I'm fascinated by that. What have you learned about naming conventions that you're going to apply from now on?

David Howell (09:06.571)
I mean, we've done it relatively simply. So take Salesforce, for example: we've got multiple instances of Salesforce, production and sandbox. So again, we generally use the object name. So we'll go crm.prod.accounts or crm.prod. So at least we can distinguish.

Where it's fallen down is where we've gone and done some temporary data, where we've brought in, you know, a quick data set. So for example, an updated customer list. And then data management hasn't been that great. So two months later, it's still there. So again, it's the housekeeping and keeping on top of things. But we've got a limited number of data engineers at Chugai, so it's not a huge problem. But yeah, we need to get better at that.
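A naming convention like the one David describes can be checked automatically. The following is a minimal Python sketch, assuming a `system.environment.object` pattern, a hypothetical `temp` environment for short-lived data sets, and a 30-day age limit. None of these specifics come from the episode; they only illustrate the idea of enforcing names and catching forgotten temporary tables.

```python
import re
from datetime import date, timedelta

# Assumed convention (hypothetical): <system>.<environment>.<object>, lowercase.
NAME_PATTERN = re.compile(
    r"^(?P<system>[a-z0-9_]+)\.(?P<env>prod|sandbox|temp)\.(?P<object>[a-z0-9_]+)$"
)


def parse_table_name(name):
    """Return the parts of a conforming table name, or None if it does not conform."""
    match = NAME_PATTERN.match(name)
    return match.groupdict() if match else None


def stale_temp_tables(tables, today, max_age_days=30):
    """Return the names of temp tables created more than max_age_days before today.

    `tables` maps a table name to its creation date.
    """
    cutoff = today - timedelta(days=max_age_days)
    stale = []
    for name, created in tables.items():
        parts = parse_table_name(name)
        if parts and parts["env"] == "temp" and created < cutoff:
            stale.append(name)
    return sorted(stale)
```

Run on a schedule, a check like this would surface the "two months later, it's still there" customer list before it becomes a housekeeping problem.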

Matthew Stibbe (09:53.422)
You don't need to legislate the taxonomy so much as just make sure everyone has a shared understanding of how to do it.

David Howell (09:58.187)
We are, that's purely one of the lessons that we've learned because when we then come to look at migrating to a new warehouse, what information is needed? I think one of the challenges that we picked up recently is around things like, obviously we've got the SQL database, databases and tables and views. Actually, where are they not only coming from, but where are they going to?

So for example, in Power BI where we've got multiple dashboards with multiple reports, actually which one is driving which report? So if we make a change to one, what is the downstream impact? And if we change a source system like CRM and we add a new object that we need to bring in, how does that impact the existing tables and databases? And yeah, so it's quite, yeah, I think it's not overly complicated, but it's just, you need to have some rules.
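The "what is the downstream impact?" question is essentially a graph traversal. Here is a minimal Python sketch, assuming the lineage has been written down as a simple mapping from each asset to the assets it feeds. The asset names are hypothetical and are not Chugai's actual tables or dashboards.

```python
from collections import deque

# Hypothetical lineage: each key feeds the assets listed in its value.
FEEDS = {
    "crm.prod.accounts": ["dw.accounts_clean"],
    "dw.accounts_clean": ["view.sales_by_account", "view.consent_status"],
    "view.sales_by_account": ["powerbi.sales_dashboard"],
    "view.consent_status": ["powerbi.marketing_dashboard"],
}


def downstream(asset, feeds):
    """Return every asset affected by a change to `asset`, in breadth-first order."""
    seen = []
    queue = deque(feeds.get(asset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.append(current)
        queue.extend(feeds.get(current, []))
    return seen


# Everything that could break if the CRM accounts object changes.
affected = downstream("crm.prod.accounts", FEEDS)
```

The hard part in practice is keeping the mapping up to date, which is the documentation discipline discussed next.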

Matthew Stibbe (10:50.094)
You need to model or recall or remember where those data sources are and what's downstream, so that you can manage the knock-on effects of change.

David Howell (11:00.843)
Yeah, I think it's just configuration documents. I mean, I work in IT, I hate documenting stuff. I'm not great at that.

Matthew Stibbe (11:05.87)
IT people are just slightly better than salespeople at doing their documentation, right? Just slightly.

David Howell (11:12.363)
Yeah, slightly maybe. But no, it is important. It's like, you know, when you code, you've got to put comments in because otherwise you go back to it six months later and you can't remember what that function does or what that part does. So yeah, it's, it's important, but again, we share it with other people that aren't data engineers. So the guys in the business that do the business insights, they need to understand how we present the information. And again, when we have conversations and we're looking at bringing in new source systems to the warehouse, again, we can understand from them specifically what fields are important to them, what they want to see in their views so they can build their analytical dashboards based on the information that we're presenting and likewise if we need to do any transformations, if we need to do any other processing on that data, again it's important to know that.

Matthew Stibbe (12:01.166)
So configuration files, configuration management and naming conventions are a really important part of data quality. But what else has an impact on data quality? What are you learning about that?

David Howell (12:13.643)
Users. 

Matthew Stibbe
Garbage in, garbage out, right?

David Howell
Essentially, yeah.

And flagging that early, because we had a couple of incidents where, you know, we might have to do annual reporting for legislative reasons or regulatory reasons. And then we get to the end of the year, run the report and we find that we're missing some key information. So again, looking at trying to automate that throughout the year. So maybe sort of weekly reports or whatever. So again, we can flag it and say, yeah, we're missing this piece of information for this particular record. Can we go back and find it at the time? Because it's a lot easier asking the person that's just submitted it than going back in a year's time and saying, you know, you've done three reports last January and you're missing, you know, customer ID or something. So again, that is something we're looking at to sort of strengthen our data quality and data management.
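The weekly check David describes can be very small. Here is a minimal Python sketch, assuming records arrive as dictionaries and that the required field names below are hypothetical stand-ins for whatever the annual report actually needs.

```python
# Hypothetical required fields; the real list depends on the report.
REQUIRED_FIELDS = ("customer_id", "report_date", "submitted_by")


def is_blank(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def find_incomplete(records, required=REQUIRED_FIELDS):
    """Return (record index, list of missing field names) for each incomplete record."""
    problems = []
    for index, record in enumerate(records):
        missing = [field for field in required if is_blank(record.get(field))]
        if missing:
            problems.append((index, missing))
    return problems
```

Running this on each week's submissions means the follow-up question goes to the person who has just entered the record, not to someone trying to remember it a year later.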

Matthew Stibbe (13:09.806)
And the third thing I wanted to pick out from what you said earlier was management buy-in and getting them invested in data quality and data integration. Can you tell me a challenge that you've had to work with around getting that and what did you do to sort of bring management on board for this?

David Howell (13:27.691)
So, well, I mean, there's quite a few things, but as I said, we're looking at digital transformation strategy. So AI, machine learning all needs data. So we need to feed it with good quality data.

Management buy-in is really important because it takes effort and you need people to be on board to say, I need you to spend extra time to make sure that the information is accurate, it's correct, and it's complete. If we don't do that, then any of the other benefits that we'll get from using some of these new tools and technologies won't exist, or it'll be in limited capacity, or worst case, it will be factually incorrect. And that goes back with all sorts of even common data reporting and analytics. If you're missing records or they're not complete, you're not going to get the full picture at the end. So I think it's convincing people just how important having the right data is. And again, there's things we can do, like make mandatory fields, we can make dropdown lists, we can do certain data checking on the user input side. But yeah, it really, it needs everybody's buy-in and I think leadership teams, certainly here at Chugai, they're quite supportive that we have the right data because it is now starting to drive business decisions.

Matthew Stibbe (14:46.67)
And that user experience, user interface designed to make sure that you get the good data in, using drop downs rather than text fields, for example, are there tools or processes you can use to really look at that part of the journey?

David Howell (15:02.379)
Well, we've kind of probably done it retrospectively because we've looked at where we've got the data quality issues and then gone back and fixed it. So I think going forward, it's just planning that. So when you launch new systems or whether it's a finance system, whether it's a sales marketing system, again, it's just understanding where your potential data risks are. And for some points, it might not be, I mean, having drop down lists are also limiting because, I mean, for example, if you've got a free text box, now you can glean quite a lot of information from having that text and running it through natural language processing. Whereas a drop down box, it's either A, B or C or it's yes, no, or yeah, there's pros and cons to both, but again, it depends what the output is.

Matthew Stibbe (15:51.694)
Do you think we might be, this thought just occurred to me, we might be on the verge of, thanks to ChatGPT and large language models, the tyranny of structured data being broken? In other words, you could just run unstructured text data and go, okay, tell me what countries, I've got 5,000 records in here and everyone's typed in United States, US, USA, and you go to ChatGPT: just normalize that data for me.

David Howell (16:19.595)
I'm not sure if we're quite there yet. I don't know how it would integrate in part of the pipeline or the process. One of the things that we've also worked on, I've got a colleague, a data scientist, who's been looking at sort of probability matching. So we will get some text fields like first name, last name, for example, and it will look at connotation. So like, for example, David versus Dave, and then we'll come up with a probability of matching that.
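The episode does not say how the matching is implemented, so the following is only an illustrative Python sketch of the general idea: normalize the names, map known nicknames to a canonical form, and score string similarity. It uses the standard library's `difflib.SequenceMatcher`; the nickname table is hypothetical, and the result is a similarity score between 0 and 1, not a calibrated probability.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; a real one would be much larger.
NICKNAMES = {"dave": "david", "matt": "matthew"}


def normalize(name):
    """Lowercase a name and strip surrounding whitespace."""
    return name.strip().lower()


def canonical_first_name(name):
    """Map a known nickname to its full form; leave other names unchanged."""
    cleaned = normalize(name)
    return NICKNAMES.get(cleaned, cleaned)


def match_score(first_a, last_a, first_b, last_b):
    """Return a similarity score from 0.0 to 1.0 for two first/last name pairs."""
    first = SequenceMatcher(
        None, canonical_first_name(first_a), canonical_first_name(first_b)
    ).ratio()
    last = SequenceMatcher(None, normalize(last_a), normalize(last_b)).ratio()
    return (first + last) / 2
```

With this table, "Dave Howell" and "David Howell" score as an exact match, while unrelated names score low. Where to set the cut-off between those extremes is a judgment call that depends on the data.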

Matthew Stibbe (16:44.91)
That would be amazing, wouldn't it? If you could do a kind of deduplication is something I deal a lot with, for example, of contact names in CRMs. And if you would go, okay, Dave Howell, David Howell, and they both got Chugai. Okay. That, but not on a deterministic basis, but actually sort of probabilistic based on information it might have about you in the system.

David Howell (16:51.498)
Yeah.

David Howell (17:05.803)
Exactly, and that's what we're looking at at the moment because we've got some matching systems. But yeah, there's still going to be a human involved to make the final decision at the moment because we're literally doing it as a proof of concept. But yeah, I think there'll come a point where once you've proven that the process works and, you know, it's reliable, and again, it'll come back to iterative training and looking at the outputs. But again, it's data quality. You mentioned it before, it's like, you know, with ChatGPT, you can get an output, but you need to have confidence that it's giving you the right information and it's reliable. But, you know, as I said, you get the same with Google, you know, you type in a search, depending on your search query, you'll get probably the first three and you think they're fact, which is what we've been trained to think because we've built that confidence over 20 years.

Matthew Stibbe (17:54.03)
Yes. Google is never going to come back and say to you, yeah, I don't know about that. It's just going to give you an answer. I mean, okay, I suppose you can put some search queries in that get no answer, but they're not going to be meaningful searches.

And being able to rely on the data and being able to rely on the answer just because it's giving you one. When I was a very young computer games designer, I had a dot matrix printer and I had a laser printer. And I always used to do the first draft of proposals on the dot matrix printer because it looked a bit scrappier and crappier.

So that when the client gave me feedback on something and I did it on the laser printer and it looked really sharp and really good, they had a perception of improved quality. But there wasn't actually any particular improvement. It was just the better output. And I think there's a lot of these systems can persuade us or convince us that they're right just by looking good or giving well-structured or reasonable sounding answers. It doesn't mean they're right.

David Howell (18:53.771)
Yeah.

David Howell (18:57.323)
Yeah, I think you still need to be a subject matter expert. Yeah, yeah, even, you know, there's claims that it's gonna replace doctors, it's gonna replace lawyers and, you know, these highly skilled professionals. At the end of the day, you're still gonna, you know, probably need that person to, it might accelerate the way they can make decisions, but I think you're gonna still rely on, certainly for the next five, 10 years minimum, somebody to actually sit there and make the call, because they need to be accountable, you know. OpenAI, they're not gonna be accountable, you know, for you in court, are they? Or...

Matthew Stibbe (19:27.118)
There's a very good reason why they put the disclaimer on every output, right? It might not be correct. Back in the day when I used to write for quite big American magazines, they had teams of fact checkers. You'd write something, and I remember I interviewed somebody at the German embassy about something. They rang him up and said, did you actually say this? And did you say that? That discipline, that rigor is going to become very valuable. Anyway, interesting, but let's come to a specific project that you've worked on and explore that together. So can you give me an example of a data transformation, digital transformation project using sort of connecting systems that you've worked on and tell me a little bit about that.

David Howell (20:07.499)
Yep. Okay. So probably the most recent one that I've worked on is actually using CloverDX. And we've got two systems. I mentioned Salesforce and Marketo earlier. So we've got CRM system and we've got a digital marketing automation platform. So Marketo. The challenge that we've got in our CRM, we're using a CRM that's been configured by a vendor and everything is an account.

The challenge that we have when we integrate with Marketo is they're expecting contacts as people. So CRM is configured so we have a contact object for everybody, but all of the updates, all of the fields are driven on the account. So it's got a person account and it's got what you would traditionally think of as an account, like a company, maybe a hospital or wherever else, however you carve up the NHS in say the UK.

The problem with this is when we're trying to collect email addresses or consent for people, that information is stored on the account object and not the contact. So yes, accounts will flow through to Marketo, but it's not expecting that to be the transactional account. It's the contact or the person at the account is the person that we're gonna be using within the marketing tool. So what we had to do, what we've done with CloverDX is we've built a fairly cool program that will basically look at all the accounts in CRM that have got an email address. It will then bring back the account object values. It will then compare it with the contact object values and it will work out whether or not an update for consent was done specifically in Salesforce or it's been updated in Marketo and then that value has flowed and updated the local contact value in CRM.

So again, we're using various things like so we're using that warehouse to compare against previous value and it will decide which consent value is the most accurate and up to date. And this has been pivotal because without this, we couldn't have launched our marketing automation tool, which is a huge strategically important project for us at Chugai.
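The core decision David describes, working out which side changed by comparing both current values against the last value held in the warehouse, can be sketched in a few lines. This is plain Python for illustration, not the CloverDX job itself, which the episode does not show. The function name, the return labels, and the rule for conflicts (keep the previous value and flag it for review) are assumptions.

```python
def reconcile_consent(account_value, contact_value, previous_value):
    """Decide which consent value to keep.

    `previous_value` is the last value recorded in the warehouse.
    Returns (value_to_keep, source), where source is one of
    "in_sync", "account", "contact", or "conflict".
    """
    if account_value == contact_value:
        # Both objects already agree; nothing to propagate.
        return account_value, "in_sync"
    if contact_value == previous_value:
        # Only the account moved away from the stored value (changed in CRM).
        return account_value, "account"
    if account_value == previous_value:
        # Only the contact moved away from the stored value (flowed in from Marketo).
        return contact_value, "contact"
    # Both changed, to different values. Assumed policy: keep the previous
    # value and leave the record for a person to review.
    return previous_value, "conflict"
```

For example, `reconcile_consent(True, False, False)` reports that the account changed, so its value should be copied to the contact. With a strictly two-valued consent flag the conflict branch is unreachable; it only matters when a value can also be unset.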

The other thing is we're processing this on a five-minute schedule. So again, without us using CloverDX, there's no way we could have really automated this to the same level of customer experience that we've got. So if someone goes into CRM, change the value, we send them an email within five minutes. Previously, we might do it on a daily batch file or something like that. So again, if you've spoken to one of our sales representatives and said, can I have permission to send you marketing emails and they say, yeah, here's my email address. And then a day later you receive an email saying, thank you for opting in. It doesn't look particularly professional. So again, it's really helped us, but it gives us that confidence. So when we look at things like our responsibility towards GDPR, we can be a hundred percent sure that we are capturing consent. It's reliable. And again, we've done thorough testing. So we've got that assurance and accountability that actually.

Matthew Stibbe
It's consistent across multiple systems, so you're not still being sent emails from one thing but not the other one or something like that.

David Howell (23:16.459)
Exactly. And one of the other challenges were, you know, even if our consent database primarily lives in Marketo, because our CRM system, you know, at the time, if it hadn't been updated, they could log into CRM and see, I've got marketing consent because it hasn't acted or the opt out hasn't been acted on. Then they start sending some emails from Outlook or another third party tool. And again, that really is not where we want to be as a, from a data quality perspective, we need to make sure that all of our systems are streamlined and synchronized.

Matthew Stibbe (23:48.75)
What was the biggest thing you learned from that integration project?

David Howell (23:52.619)
Understanding probably the problem, probably the first thing, and then understanding, you know, technically how the two systems work, what fields they're looking at, what the synchronization time between them, and then, you know, I've done application development, software development before, so it's relatively, once you've got that mindset, it's quite easy to come up with a solution. But yeah, for me, that was probably...

It was quite an easy thing to solve in CloverDX once I'd got an understanding of what the problem was and what the issue was. We did quite a lot of dummy testing and we did it in sandbox before we launched into production and things like that. So again, it was not a particularly long project, it took us a couple of weeks. But without that tool, we wouldn't have been able to deliver it.

Matthew Stibbe (24:41.582)
Understanding the problem is always the hard bit, I think. You can easily solve the wrong problem very efficiently. So we're almost out of time, but I'd like to ask you one final question before we wrap up. If you had one tip based on your experience to share with somebody who was starting out on a data integration project, what would it be?

David Howell (25:04.043)
I think for me, I've used multiple tools over the years. It's really around getting the right tool to fit what you're looking to do. I think I'll talk very briefly about CloverDX because I've just mentioned it, but when we were on the market looking for a new ETL tool, I wanted something that was powerful enough to do what I wanted. But again, I could start to use it quite quickly. For me, it wasn't as intuitive as some of the tools in the market, but for me, really, once I got the support from CloverDX and we started to build some of these automations, actually, I really understood the power of the tool and some of the extensibility. So we're doing some stuff with Python, we're doing some stuff with third-party Java connectors and some stuff like that.

So for me, it's understanding what you're trying to achieve. If you just want to connect two systems together, you could do something like Zapier. There's a whole load of really simple tools that connect the source system to a destination. But when you want to build some things like the marketing automation synchronization where you need to do logic, you need to do proper transformation, that's when you need a tool that will stand up to that. And I think just for me is the flexibility of the tool.

Matthew Stibbe (26:23.542)
Yeah. Chainsaws are powerful tools if you learn to use them safely and well, right? You can't get the same result with the Swiss Army Knife. At least I can't in my garden. Well, thank you. Thank you for mentioning Clover. That was very sweet. And that brings this episode to a close. If you'd like to get more practical insights about data and data engineering, or learn more about CloverDX, please visit cloverdx.com/behind-the-data. Thank you for listening and goodbye.

Our podcast takes you inside the world of data management through engaging, commute-length interviews with some of the field’s most inspiring figures. Each episode explores the stories and challenges behind innovative data solutions, featuring insights and lessons from industry pioneers and thought leaders.