Data catalogs are on fire in 2021. The number of entrants into the space is increasing, and there seems to be a tremendous demand for adoption.
There is a lot of variation in what is expected of a data catalog. Yet, even with such variation in expectations, surely there must be some fundamental, common idea of what a data catalog is meant to be. Well, we all know what data is, so it is the “catalog” part that we need to look at more closely. And the thesis of this article is that, if we do that, we are going to find some eerie similarities with Master Data Management (MDM), particularly Product MDM.
“Catalog” is defined by the online Merriam-Webster dictionary as: a complete enumeration of items arranged systematically with descriptive details.
The items stored in a data catalog are usually termed “data assets” or just “assets”, for want of any better term. Strictly speaking, the catalog entries are not the data assets themselves but the metadata about them.
Now, data assets are things. This is not the place to get into metaphysics, but the general reality we exist in principally consists of concepts, things and events.
In a database, these correspond to Reference Data, Master Data and Event Data, respectively. From a metaphysical viewpoint, “things” are bearers of properties — they have identities, attributes, relationships and can change over time. In the world of data management, they are usually entity types like Customer, Product, Financial Instrument and so on.
So, the conclusion seems to be that:
Data Assets are another Master Data entity type.
You could try to stretch this to say:
Metadata is Master Data.
However, some metadata could be Event Data, in which case this statement would be unjustified. But for the rest of this article, to keep things simple, let’s use the term “metadata” to mean just the information about data assets that is stored in a data catalog.
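To make the idea of a data asset as a master data entity a little more concrete, here is a minimal Python sketch of what a catalog entry might carry: an identity, attributes, relationships and a way to record change over time. The class name `CatalogEntry` and all of its fields are assumptions made for illustration, not the schema of any real data catalog.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata about one data asset (hypothetical shape, not a real catalog schema)."""
    asset_id: str                                     # identity
    asset_type: str                                   # e.g. "table", "dataset", "query"
    attributes: dict = field(default_factory=dict)    # descriptive details
    related_ids: list = field(default_factory=list)   # relationships to other entries
    version: int = 1                                  # incremented whenever the entry changes

    def update(self, **changes):
        """Apply attribute changes and record that the entry has changed."""
        self.attributes.update(changes)
        self.version += 1


# Illustrative entries; the identifiers and names are made up.
orders = CatalogEntry("tbl-001", "table", {"name": "orders"})
query = CatalogEntry("qry-001", "query", {"name": "monthly_revenue"}, related_ids=["tbl-001"])
orders.update(owner="sales-data-team")
```

Seen this way, a catalog entry looks much like a Customer or Product record: it has a key, descriptive attributes, links to other records, and a history.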
If metadata is master data, it is legitimate to ask if any lessons can be learned from MDM about how we should manage metadata. Where do we see catalogs in MDM? The answer is that they are closely connected to Product MDM. Product Catalogs are often (but not always) a component of an eCommerce site, along with functionality to, for example, purchase the goods being offered.
So, at this point, we have a good reason to ask if there are any useful lessons that data catalogs can take from Product MDM, and from Product MDM catalogs in particular.
Product Life Cycle
Products have a life cycle. A notional example might be:
Ideation > Design > Prototype > Testing > Manufacturing > Discontinuation of Manufacturing > Discontinuation of Warranty and Support
These phases have to be reflected in the automated support that Product MDM provides. If our analogy holds, then a data asset, like a dataset or SQL query, should follow some kind of life cycle that the data catalog will help to manage.
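As a rough illustration of what managing a life cycle could mean in a catalog, the sketch below models an asset that can only move forward through a fixed sequence of phases. The phase names are assumptions that loosely mirror the product example above; the article does not prescribe a life cycle for data assets, and a real one might well allow branching or rollback.

```python
from enum import Enum


class Phase(Enum):
    # Hypothetical phases for a data asset, loosely mirroring the product life cycle.
    PROPOSED = "proposed"
    IN_DEVELOPMENT = "in_development"
    PUBLISHED = "published"
    DEPRECATED = "deprecated"
    RETIRED = "retired"


# Allowed transitions: each phase may only move to the next one in the sequence.
ALLOWED = {
    Phase.PROPOSED: {Phase.IN_DEVELOPMENT},
    Phase.IN_DEVELOPMENT: {Phase.PUBLISHED},
    Phase.PUBLISHED: {Phase.DEPRECATED},
    Phase.DEPRECATED: {Phase.RETIRED},
    Phase.RETIRED: set(),
}


class DataAsset:
    def __init__(self, name):
        self.name = name
        self.phase = Phase.PROPOSED
        self.history = [Phase.PROPOSED]  # every phase the asset has been in, in order

    def advance(self, new_phase):
        """Move to new_phase, rejecting any transition not listed in ALLOWED."""
        if new_phase not in ALLOWED[self.phase]:
            raise ValueError(f"{self.phase.value} -> {new_phase.value} is not allowed")
        self.phase = new_phase
        self.history.append(new_phase)
```

With this in place, a published dataset cannot be retired without first being deprecated, which is the kind of rule a catalog could enforce on behalf of the people who depend on the asset.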
An even more profound conclusion from thinking about this life cycle would be that data is a product and should be treated like a product. (We will leave that one for another day.)
Product Taxonomies
Taxonomies are huge in Product MDM. Basically, they are the ways in which products are grouped together.
Taxonomies serve two fundamental purposes:
- To help the enterprise govern, manage and report on its universe of products
- To help customers explore the products they are interested in and find the products they want to buy
It seems that any data catalog is going to contain a diverse array of different data assets, or, perhaps, if we want to go further — data products. Therefore, taxonomies are going to have to be taken seriously for data catalogs, too. There is a large body of work on product taxonomies and it is very likely that we can learn a lot from them that can be applied to data catalogs.
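To show how a taxonomy can serve both purposes listed above, here is a small Python sketch in which each category is a path in a tree and assets are assigned to categories. The `Taxonomy` class, its methods and the category names are all hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict


class Taxonomy:
    """A tree of categories; each category is identified by its path from the root."""

    def __init__(self):
        self._assets = defaultdict(set)  # category path (tuple) -> asset names

    def assign(self, asset, path):
        """Place an asset in the category identified by path."""
        self._assets[tuple(path)].add(asset)

    def find(self, prefix):
        """Return assets in the category at prefix or in any of its descendants."""
        prefix = tuple(prefix)
        found = set()
        for path, assets in self._assets.items():
            if path[:len(prefix)] == prefix:
                found |= assets
        return sorted(found)


# Illustrative categories and assets; none of these names come from a real catalog.
taxonomy = Taxonomy()
taxonomy.assign("orders_table", ["Sales", "Transactions"])
taxonomy.assign("refunds_query", ["Sales", "Transactions", "Refunds"])
taxonomy.assign("headcount_dataset", ["HR"])

sales_assets = taxonomy.find(["Sales"])
```

The same lookup supports a governance report ("everything under Sales") and a user browsing down from a broad category to a specific asset.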
Differences between Product MDM and Data Catalogs
So, we have at least a couple of areas from Product MDM that we may be able to take lessons from. However, we need to acknowledge that the analogy may eventually break down and that data catalogs are, in some ways, unique.
One difference is that Product MDM is about Product Types, not instances of products. That is, Product MDM is ultimately about the types of things that have to be managed, not the individual things themselves. A data catalog will, in large part, be about individual assets, like schemas, tables, columns, datasets, queries, and so on. That is a big conceptual difference that is likely to have implications.
Another example where there might be a difference is the use of Item Master teams in Product MDM. These are centralized teams that set up the Item Master Record — the key information for a product — in a Product MDM system. They make sure the data is complete and accurate and assign all the correct taxonomies. It is difficult to see how this methodology fits with the data democratization that a data catalog is intended to support.
Conclusion
There are definitely close parallels between a data catalog and Product MDM, and there are some great ideas we can glean from the technology and art of Product MDM. However, there are differences too, and we will still have to slowly think our way through the challenges that are unique to data catalogs in order to achieve the vision that has been set for them.




