About this episode:
In this conversation, James Courtney-Smith, a solutions consultant with Lucid Data Services, discusses current trends in data engineering and technology. He highlights four areas of interest: AI and its potential, the importance of data governance, data security and privacy, and the need for effective data visualization. James also shares his role as the product owner of Market Business Glossary, a centralized resource for accessing data standards and reference data in the insurance industry. He emphasizes the significance of data ownership, stakeholder management, and education in successful data projects.
- AI is an emerging technology that has the potential to transform various industries, but its true capabilities and impact are still uncertain.
- Data governance, including security, stewardship, ownership, and data quality, is a crucial aspect of effective data management.
- Data security and privacy, especially in the context of GDPR and ownership of personal data, are important considerations in the data landscape.
- Effective data visualization plays a significant role in presenting information and insights to the correct audience.
- In the insurance industry, having a centralized resource like Market Business Glossary can provide access to data standards and reference data, ensuring consistency and quality in data usage.
- Successful data projects require a focus on data ownership, stakeholder management, and education to align business needs with technology solutions.
AI-generated transcript
Matthew Stibbe (00:01.29)
Welcome to Behind the Data with CloverDX. I'm your host, Matthew Stibbe, and today I'm talking to James Courtney-Smith, who is a solutions consultant with Lucid Data Services, and he's working at LIMOSS. Great to have you on the show, James.
James (00:14.101)
Hello.
Matthew Stibbe (00:17.93)
So before we get into your world of data and start exploring that a little bit, let's start with a really practical question. What current trends in data engineering or technology are grabbing your attention at the moment?
James (00:31.63)
So at the moment I'd say there are probably four areas. The first one is AI and the use of AI with data. We need to understand whether this is just the current buzzword. There's a lot of information about AI out there, but does it actually do what we're expecting it to do? It's an exciting world, but we need to be clear that this is an emerging technology, like big data in the past.
Are we using it to its full potential? Are we understanding it correctly? And is it going to fit our purpose? A box doesn't exist out there that we could just plug data into and use AI to understand that information and spit it out. There's still technology that goes in there, and our business leaders need to understand that this is a requirement: not only the scope of how it could transform the way we deal with data, but what needs to go in. It's not an easy pill.
And that's what's capturing my attention, how we use this and how we relay those business cases forward. And the second thing I would say, I'm a devout follower of data governance. My background's predominantly been in data governance. And there is a slight concern of mine that we're losing sight of the whole idea of the pillars of data governance, things like security, stewardship, ownership, data quality.
Do these still exist like they did five, 10 years ago when data governance was really dominant and people were actually talking about this? The solution that we work on is predominantly around data standards and having a single source of data and understanding our data, the metadata about data. Are we losing sight of that and the importance of that? The other area that's capturing my attention is data security. Is that as dominant?
Are we aware? When we read about data security, we talk more about database systems and hardening them and protecting them. But are we mindful of the security around data in transit and at rest, and the processes that we use there? Privacy as well. In a post-Brexit world with the idea of GDPR, although GDPR looks like it's here to stay, are we still being mindful of that? And who owns our data? As an individual or as a consumer, my data is my data, but is it? Is the ownership of data and the use of this data being taken over by the larger corporations, the social media world? Who owns that data? And the final point is visualization. We seem to be rocketing into a new age of an information world. But how are we presenting information around data or visualizations of data to the correct audience?
Are the days gone of presenting a PowerPoint presentation with graphs, or do we need to make that data, or that visualization, more accessible and more fun, almost like the gamification of visualization? That's another area of interest that I'm quite keen to explore, just following how the market reacts to these trends.
Matthew Stibbe (03:41.834)
It's like so many things, it sounds like there's the work, but actually the work around the work, the visualization, governance, security, and so on, the meta work, if I can coin that phrase, is actually as interesting and as challenging as actually the data itself.
James (03:58.062)
Yeah, absolutely.
Matthew Stibbe (03:59.85)
Do you think AI is going to be a problem or a solution?
James (04:05.23)
I don't know, to be honest. It's so easy to jump on the bandwagon and say, yeah, it's going to solve all our problems, or take the opposite view: are we going to enter a Matrix age and be governed by computers? Or, I think, we talk about WALL-E, if you know the Pixar film. I don't think there's enough information or data to make a solid view.
At the start of trends, you get the dispersion of arguments: you get one side here and one side here. We need to take a step back and actually observe this. I think it could do incredible things in the world of science and medicine. And I think that would foster a really good relationship with the human race. I think the areas of concern are where it's used to exploit people rather than aid people. And I think that's probably where the concerns come in, such as the mining of information through, say, social media, for example, trends or GPS, all that kind of stuff. I think that's where the dangers lie. If it's not used correctly, then we could have potential problems. But at the same time, I don't think there's been such a compelling case to show and demonstrate how this is going to transform civilization at this stage. It's just hypothetical thinking. Maybe in 100 years' time, 50 years' time.
Matthew Stibbe (05:35.722)
There's quite a long, big gap between ChatGPT, for example, and science fiction. It feels like science fiction, but kind of the sci-fi future of, you know, both apocalyptic and utopian. I for one welcome our new AI overlords, by the way, if anyone's listening. All watched over by machines of loving grace and all that.
So tell me about your world of data. What specifically do you deal with at LIMOSS? What tools do you use to just introduce us to your world a bit?
James (06:13.134)
So I'm specifically the product owner of an application called Market Business Glossary. It came out of a transformation project in the London insurance market called LM TOM. And the purpose of the Market Business Glossary was to have a centralized resource for all market participants, varying from massive carriers or managing agents to smaller ones, what are known as coverholders, to access a shared data standard and reference data.
So a user could log on to our application. We get data standards or data sources or reference data from various places. And we put it in a centralized place where a user could call on it, access it, and understand any data requirements. So this could be an example such as reference data.
People in the data world appreciate that finding a definitive source of reference data is really challenging. Take, for example, county codes, UK county codes. UK counties don't exist from a data perspective. You get ceremonial ones, historic ones, unitary bodies. So when we want to capture county information, where do you go for that definitive list?
Fortunately, we provide it through our application, the Market Business Glossary. We use ISO subdivision codes to provide that information. But that's more of a generic one that can be used across industries. From our perspective, we source data that could be used across industries, but also insurance-specific data. So a user could come along and download an up-to-date list of catastrophe codes or risk codes that Lloyd's of London might release.
We are in partnership with the data standard agency. We release their reference data on their behalf. We publish it and allow the user to log on to our application and download this data, or, as we're close to releasing, call that data via APIs. Now, the reason that's so important is that in my insurance world, or the London market industry that I work in, we can have a defined set of standards that people use, people share, people can validate against, so everyone is using the same set of data principles.
Going back to my earlier points around data governance and data quality, this is important. If you imagine you've got a database over here and a database over here, feeding information to each other, a single set of standards is so crucial to this process because you can make sense of your data. The quality of your data, or the integrity of your data, is going to be there because you're using the same values as the data flows throughout the system.
So, yeah, that's in essence what I do. We provide a login, a nice UI where a user can come along, structure their data requirements, their metadata requirements, and they could develop their organization's knowledge and usage of a single data standard and shared data knowledge, really.
Matthew Stibbe (09:29.866)
Forgive me if this is naive or wrong. It sounds to me like you, James, are the librarian of data about data in the insurance world.
James (09:37.006)
Yeah, we like to think that, I think that's a good way of putting it. There was a whole point, about a year ago we deployed a Wiki style version of an insurance standard which provides all the data requirements within it. So we like to think that yes, we provide a wealth of knowledge that's a single source of truth that a user could easily access and easily extract these data requirements.
Matthew Stibbe (10:05.706)
What was your journey to the role that you're in now? Tell me a little bit about your background.
James (10:12.238)
So I came to the, I'd say, IT world or technology world as a test analyst quite a few years ago with an insurer. And then I went into a niche of data quality, pure data quality: validating data requirements from backend systems, making sure that the data was fit for purpose in terms of its accuracy, its completeness, its validity. And then from there, I moved into more of a data governance role in the charity sector. I was doing things like creating data dictionaries, building validation rules, building data requirements. And then I jumped around through various sectors. I was in financial services, local government, media. And, like I said at the start of the conversation, it was more data governance. That was my key area.
And then I came into the role at Lloyd's of London, and that's where myself and a couple of colleagues formulated the idea of building this data dictionary, data catalog, data library, as you put it, and subsequently built an application from it. It's important to note that there are applications out there, the likes of Collibra or Semanta or Data360, that have this catalog function.
However, we had a slight issue that with the application that we managed, Market Business Glossary, we needed an application that would be used by up to 50,000 people. There don't seem to be any off-the-shelf products like this. So what we did is we designed our own application.
We put together the best applications out there, best of breed. So we've got a front end that's written in Vue.js, a back end that runs off a CMS called Magnolia. We use Tableau for visualizations, but importantly for this conversation, we use CloverDX for the ETL within this. So all our APIs are published using CloverDX. We do some transformations with CloverDX, and basically the way that we flush data throughout the systems of this metadata management application is by using CloverDX. So it works nicely with this conversation and why we're discussing data at length.
Matthew Stibbe (12:44.298)
Well, thank you. Nice plug. I appreciate it. Thinking about this, if somebody listening is involved in building data dictionaries or thinking about, you know, the metadata, what are the most common failure modes or problems they need to be mindful of and avoid?
James (12:46.382)
I would say the biggest point is data ownership. Who's your business leader? Who's going to agree to this standard? You may say, and just purely on a simple level, you may build your data requirements. You may want to capture a title and you may think it's acceptable to do free text. But when you do free text, you could put any old junk in there. You could put numbers, letters, whatever you want. But what is the purpose of that field?
So by having an agreed standard and agreed support from the business owner, you define that you want to capture Mr, Mrs, Master, whatever you want to do. But you have got the stakeholder management, you've got your stakeholder backing for capturing these requirements. You've got the business case and you've got the rules and the requirements to process that data.
Maybe Mr, Mrs, Master is not the best example, but whatever field you're capturing, you've got to have the support of the stakeholders. You've got to understand the reason for capturing this data and the purpose of what that's going to lead to further downstream in your data cycle.
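The contrast James draws between a free-text title field and an agreed standard can be sketched in a few lines of Python. This is only an illustration: the list of allowed titles and the function name are assumptions made for the example, not the standard used in Market Business Glossary.

```python
# Hypothetical agreed standard for a "title" field. The values are
# illustrative only; in practice the business owner signs off the list.
ALLOWED_TITLES = {"Mr", "Mrs", "Ms", "Miss", "Master", "Dr"}


def validate_title(raw: str) -> tuple[bool, str]:
    """Return (is_valid, value) for a captured title.

    Valid input is returned in the standard's spelling; invalid input is
    returned cleaned but otherwise unchanged so it can be reviewed.
    """
    cleaned = raw.strip().rstrip(".")
    # Compare case-insensitively, but always hand back the agreed spelling.
    for title in ALLOWED_TITLES:
        if cleaned.lower() == title.lower():
            return True, title
    return False, cleaned


# Free text lets "any old junk" in; the agreed list rejects it.
for value in ["Mr.", " mrs ", "12345", "Master"]:
    print(repr(value), validate_title(value))
```

The code is the small part. The list itself only has authority if a named business owner has agreed to it, which is the point being made about data ownership.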
Matthew Stibbe (14:16.042)
And how do you extract that knowledge or how do you enlist the help of business owners to help you define those things? And how do you stop them gold plating it?
James (14:29.23)
I think it's explaining the use case, the requirements, setting out what I just said, the purpose of capturing this data, what do you want to do with it, why do you want that, and how should this be? You've got to put forward the strength of validating the data, why you need to validate the data, you've got to put forward the case that you're going to be capturing data for no purpose if you don't put any rules or standards or governance around it. So I think it's...
And back to the point of explaining about the AI, we are in an information age. We require data to provide the information. And without setting out those parameters, then you can't get to that end state. If you relay that to the business, the business is maturing and it's fast maturing. It's understanding the purpose for this. On the downside, on the negative approach to that, it's...
There's no point capturing this information. The information is going to be dud if it's not of good quality, if it's not fit for purpose. It's going to be a wasted effort. And again, explaining that to the business, the business comes around to, the business makes sense of it. The business is always operated or, say, always mainly operated by intelligent people. So.
Matthew Stibbe (15:48.778)
Well, I'm not a data engineer myself, but you mentioned this counties problem, and I come across this in my world, where people have captured, for example, country or US state names into a CRM database or marketing database. And they want to then start segmenting their data, but they've done it in a text field. So you have US, USA, United States, United States of America, America. And, you know, you can say, well, normalizing that data for 50,000 records is actually going to take quite a lot of time. If you'd captured it the right way... So what you're saying, I feel like I've got a slight little sort of window open into your world from my experience.
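The clean-up Matthew describes might look something like the following Python sketch. The alias table is a made-up example covering only the variants mentioned above plus a few more, and mapping to ISO 3166-1 alpha-2 codes is an assumption about what the target standard would be.

```python
from collections import Counter
from typing import Optional

# Hypothetical alias table: free-text variants mapped to an
# ISO 3166-1 alpha-2 country code. A real table would be far longer.
COUNTRY_ALIASES = {
    "us": "US",
    "usa": "US",
    "united states": "US",
    "united states of america": "US",
    "america": "US",
    "uk": "GB",
    "united kingdom": "GB",
    "great britain": "GB",
}


def normalize_country(raw: str) -> Optional[str]:
    """Map a free-text country value to a code, or None if unrecognized."""
    # Drop full stops, lower-case, and collapse runs of whitespace.
    key = " ".join(raw.replace(".", "").lower().split())
    return COUNTRY_ALIASES.get(key)


records = ["US", "U.S.A.", "United  States", "America", "UK", "Narnia"]
segments = Counter(normalize_country(r) for r in records)
print(segments)
```

Unrecognized values come back as `None` rather than being guessed at, so they can be routed to a person for review. Capturing the code from a managed list in the first place avoids the exercise entirely.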
Let's explore some specific project that you've worked on. Can you share an example of a particularly challenging or sort of evolving project and how you made it work? Can you give me an example and then we'll explore that.
James (16:42.541)
Yeah, I think I'd say probably the process that we're close to going live with is reference data for APIs. So reference data, as we said before, is incredibly important. Where you can, use reference data because it captures the information that you need. You've got a managed list. You can manage the process of data as it flows along and validate the data with a shared list.
So what we identified: in the application that we manage, Market Business Glossary, we publish about 500 reference datasets. These are, like I said, everything from ISO country codes to insurance-specific ones, Lloyd's catastrophe codes or Lloyd's risk codes. Now, what we realized is that the more the information age grows, the more system-to-system integration there's going to be, and the more we could support that system-to-system integration by publishing a defined data standard in a simple way. And I'll share an example with country codes again. So if a country or a subdivision changes its name, who's going to update that data? You'd have your DBA, you know, it might be a service request.
Matthew Stibbe (17:59.594)
like Burma to Myanmar or something.
James
Exactly. So your organization may have to raise a service request to have a database updated. It may take time, it may take money, or you may have a database that is out of date and not relevant. So you're trying to put your address in, you've only got Burma as an option, you haven't got Myanmar. And again, that's incorrect data.
James (18:26.286)
By enabling these reference data APIs, we could change the data in a single place. And when the APIs are called, it would have the most valid source of data. So your system or the end user, their application could be updated overnight. And this is probably a better example with Lloyd's catastrophe codes. So as you're probably aware, and most of the viewers would be aware, Lloyd's plays a central point in the worldwide insurance industry.
Now, Lloyd's produces something called a catastrophe code. If there's a big event or a big disaster, natural or manmade, Lloyd's would record a catastrophe code, so any claims that are paid out against a catastrophe can be easily processed. So what we do is we take that data, which is only available in the public domain on a webpage. We load it into our database and then we promote it to our production environment, from which an API would instantly take the new value and make it available for users to consume. So if you imagine a hundred different insurers calling our APIs, every time a new cat code is released, they have it instantaneously, or overnight, or whenever they call the API.
So, going back to your question, the exciting project that we're doing is building the API capabilities for every reference dataset that we source and publish via the glossary. It's instantly created as an API, so it could be instantly consumed by a third party or a third-party application. Even if you just call these APIs from a data quality piece of software to validate your databases, or if you use it to validate your systems or to automate your systems, the data is available, we've cleansed it, we've validated it, and we've checked it. So you know that you're going to be getting well-sourced, good-quality data. So to cut a long story short, that's the project that we're about to release in production. It's quite exciting because it automates the process of reference data management.
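From the consumer's side, the pattern described here (one published list, many systems refreshing from it) might be sketched as below. The conversation does not describe the API's endpoint, authentication, or payload, so the sketch takes a `fetch` callable as a stand-in for the real call; the class name and the `CAT-00x` values are hypothetical placeholders, not real catastrophe codes.

```python
from typing import Callable, Iterable


class ReferenceList:
    """Holds a locally cached copy of one reference dataset."""

    def __init__(self, fetch: Callable[[], Iterable[str]]):
        # `fetch` stands in for a call to a reference data API. The real
        # endpoint and payload shape are unknown here, so the caller
        # supplies whatever function retrieves the current codes.
        self._fetch = fetch
        self._codes: set[str] = set()

    def refresh(self) -> None:
        """Replace the cached codes with whatever the source returns now."""
        self._codes = set(self._fetch())

    def invalid(self, values: Iterable[str]) -> list[str]:
        """Return the values that are not in the cached reference list."""
        return [v for v in values if v not in self._codes]


# Placeholder data simulating the publisher's side.
published = ["CAT-001", "CAT-002"]
cat_codes = ReferenceList(lambda: list(published))

cat_codes.refresh()
before = cat_codes.invalid(["CAT-001", "CAT-003"])

published.append("CAT-003")  # the publisher releases a new code
cat_codes.refresh()          # for example, an overnight refresh
after = cat_codes.invalid(["CAT-001", "CAT-003"])

print(before, after)
```

The consuming system never edits the list itself. It only refreshes and validates, so a change made once at the source reaches every consumer the next time it calls.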
Matthew Stibbe (20:37.002)
Which sounds very meta meta to me, but as you were going through that project, what was the biggest learning for you? What would you have done differently? Or what did you wish you knew when you started?
James (20:51.886)
I wish someone had built our application beforehand so we'd have a single source of truth for accessing such data. One thing that was quite apparent, and maybe it's the evangelical in me, is that when you're accessing data from third parties, it comes in all shapes and sizes. Probably most data, or reference data, has been created for a specific purpose, even though it could have wide-scale applications. It would have been nice many years ago if someone came up with the idea of, let's do reference data management and somehow standardize every type of reference data, which is obviously near on impossible.
Other things that I notice: data quality is, I would say, still in its infancy in a lot of industries. So standardizing date-time formats, or the user's knowledge of illegal characters, white space within data, things that could cause problems and corrupt data downstream. I wish there was a wider knowledge about that. But in terms of this project itself, I'd say we as Lucid, my colleagues and myself, we've built very good processes to validate the data. And I think it was quite a clean, clear way of doing it.
So if I would have done things differently, I would have gone back in time and put a business case of data quality from 30 years ago. But I think we did everything quite well.
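The data quality problems James lists (inconsistent date-time formats, illegal characters, stray white space) are the kind of thing a few small checks can catch before data moves downstream. The following Python sketch is an assumption-laden illustration, not Lucid's validation process: the accepted input formats, the day-first reading of slashed dates, and the definition of an "illegal" character as an ASCII control character are all choices made for the example.

```python
import re
from datetime import datetime
from typing import Optional

# Treat ASCII control characters (including tab and newline) and DEL as
# illegal inside a single-line text value. This definition is an assumption.
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")

# Assumed input formats; slashed dates are read day-first (UK style).
KNOWN_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M")


def text_issues(value: str) -> list[str]:
    """List the white space and character problems found in a value."""
    issues = []
    if value != value.strip():
        issues.append("leading or trailing whitespace")
    if "  " in value.strip():
        issues.append("repeated internal spaces")
    if CONTROL_CHARS.search(value):
        issues.append("control character")
    return issues


def to_iso(value: str) -> Optional[str]:
    """Convert a date-time in a known format to ISO 8601, else None."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).isoformat()
        except ValueError:
            continue
    return None


print(text_issues(" Kent"), text_issues("Ke\x00nt"))
print(to_iso("31/12/2023 09:30"), to_iso("not a date"))
```

Values that match none of the known formats return `None` instead of being coerced, since a wrongly guessed date is harder to spot later than a missing one.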
Matthew Stibbe (22:36.298)
We, well, there you go. We had on our homepage at Articulate Marketing years ago on the banner, everything is possible except love and time machines. I always think time machines would be my number one invention if I could figure it out. So we're almost at the end of this interesting conversation, but I'd like to ask you one last question. What advice would you give to any team or any manager embarking on a data project? And particularly about getting the data modelling right.
James (23:12.014)
Press, it's going to sound a bit of a politician's answer, but press for maturity.
Matthew Stibbe (23:20.938)
At the time of the election, I think you're allowed, right?
James (23:24.078)
Get someone else to do it. It costs 50 million pounds and where would that come from? We don't care. Let's just embark anyway. We promise we'll do it overnight. No, I would say press for maturity.
The business is the people who are making decisions. IT, technology, we don't make decisions. We need the business. The business is the most important thing. Your relationship with the business, your stakeholder management relationship, is the most important: explaining the business case for data, pressing the business case for good quality, processes, ways of enriching data. Things don't happen via a black box.
Like we said with AI, you don't have a black box of knowledge here that you feed things into and it creates it on this side. It needs to be a collaborative approach. The business knows the purpose, or knows the content of the data, better than you do as a technology provider. So their decisions should be respected and understood. But at the same time, if there are gaps in knowledge, or there needs to be a process of education, that's where your role is, as a crossover between business stakeholders and building this application at the end. So I would say stakeholder management and education are the most important things with projects like this.
Matthew Stibbe (24:52.266)
To educate and engage with and evangelize about data management, data quality to the business owners. I think, yes, that couldn't agree more. And with that sage advice, I think that brings this episode to a close. Thank you. If you'd like to get more practical details about data management and learn more about CloverDX, please visit cloverdx.com/behind-the-data. Thank you for listening and goodbye.
James (25:06.03)
Thank you very much.