Data Catalog, the essential repository of the data-driven company

Published date: July 8, 2021

Defining a “Data Catalog” is tricky, since market leaders and experts expect different uses and features from it. However, all the Data Catalogs on the market seem to combine two features: the management of business metadata, which describes data in a given context (business, solution, technology, etc.), and the management of technical or implementation metadata, which describes data where it is implemented (operational data storage system, digital application, transactional system, analytical system, etc.).

What are the definitions of the Data Catalog?

Often referred to as a “glossary” or “business dictionary”, the early catalog consisted mainly of a description of the concepts, terms, and sometimes definitions associated with a business domain and/or with the initiative that intended to use the data. Launched in the 1980s, “data dictionaries” were the first technologies created to collect, store and manage simple information (type, length) about data held in Database Management Systems (DBMS) such as Oracle. Following the data dictionary trend, many tools appeared in the 1990s, from IBM (Repository Manager/MVS) and from Platinum and Microsoft (Platinum Repository).

These Data Catalogs could be integrated with all or part of the publisher’s other products, and they also provided a form of functional extension. This extension depended on how the data was used, e.g. access control based on a user profile. It worked by enriching the product’s initial metadata with metadata about the use of the product. The term “repository” is often used to designate this type of Data Catalog. Repositories appeared very early in several leading products on the market. They were offered by publishers who distinguished themselves by their ability to handle metadata and who are perceived as pioneers in areas such as business intelligence and data integration (e.g. Business Object Repository, Informatica Metadata Manager, etc.).

Their innovative approaches to metadata control placed more value on the use of the product than on the product itself, and recognized more value in the use of data than in the data itself. This benefited the publishers, from a product innovation perspective, but also (and especially) their customers, through different types of initiatives (360 vision, customization, regulatory compliance, etc.). Combined with these new technologies, the data dictionary thus saw its definition extended: it became a system that indexes business, operational and system metadata.

Example

In particular, we can mention:

• Definitions and descriptions of business data,

• The origins and source of operational data,

• The use of system data to find out how the data is used by the organization’s tools.

This constitutes in large part the first definition of today’s Data Catalog. However, managing and updating the metadata proved difficult: it required time, money, and a clear, organized process with centralized data management. The culture of organizations was arguably also not yet data-oriented enough to take on such work. New technologies have since made it possible to automate metadata updates and to discover new data sources automatically.

Thus, thanks to the solutions currently available on the market and to a strong data culture within companies, we can speak today of a new generation of Data Catalog.
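To make the idea of automated metadata collection more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not how any particular Data Catalog product works: it uses the standard `sqlite3` module with an in-memory database, and the tables `customer` and `invoice` as well as the function name `harvest_technical_metadata` are hypothetical. It reads the same kind of simple information (column name, type) that the early data dictionaries stored.

```python
import sqlite3


def harvest_technical_metadata(conn):
    """Return {table_name: [column descriptions]} for every user table."""
    table_names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    metadata = {}
    for table in table_names:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        metadata[table] = [
            {"column": col[1], "type": col[2], "declared_not_null": bool(col[3])}
            for col in columns
        ]
    return metadata


# Hypothetical source system, created in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "id INTEGER PRIMARY KEY, name VARCHAR(50) NOT NULL, email TEXT)"
)
conn.execute(
    "CREATE TABLE invoice ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)

technical_metadata = harvest_technical_metadata(conn)
for table, columns in technical_metadata.items():
    print(table, columns)
```

Running such a function on a schedule is one simple way to keep technical metadata up to date without manual entry; real solutions typically rely on dedicated connectors for each source system.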

Definition from the market

While the solutions currently on the market offer similar functionalities, many features still differ from one solution to another.

We have listed below six definitions proposed by publishers and market experts:

Most market leaders and experts consider different usage perspectives and characteristics for the Data Catalog, which makes any attempt at a single definition tricky. Nevertheless, when we look at their respective definitions, we notice the emergence of two types of recurring functionalities.

The first one concerns the collection of metadata associated with the use of data in a specific context (business, solution, technological, etc.), a use that does not depend on any form of implementation.

The second one concerns the collection of metadata associated with the use of data in its implementation context (operational data storage system, digital application, transactional system, analytical system, etc.).

Astrakhan’s definition

As we have seen in the previous chapters, there are several definitions of what is now called a “Data Catalog”.

We have therefore decided to write our own definition of the Data Catalog, which we will use for the rest of the document.

The Data Catalog is a data repository that captures the business context for the enterprise.

It can be a single application or an assembly of applications composed of a module for modeling business processes around data, a data integration layer, or even a search engine.

Companies are using it to:

• Make an inventory and organize the data available in their system,

• Centralize and index business terms and technical data,

• Track data and control the data life cycle,

• Link different levels of data modeling,

• Enable data search using business vocabulary.

It can also be used to apply management rules adapted to different categories of data.

Through the use of the Data Catalog, these functionalities enable companies to make the most of the value that data brings.
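The uses listed above can be sketched in a few lines of Python. This is a deliberately simplified illustration, assuming an in-memory catalog: the class names `CatalogEntry` and `DataCatalog`, the asset names, the categories and the management rules are all hypothetical and do not describe any existing product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str                  # technical name of the data asset
    source_system: str         # where the data is implemented
    business_terms: list = field(default_factory=list)  # glossary terms
    category: str = "general"  # used to select a management rule


# Hypothetical management rules, one per data category.
MANAGEMENT_RULES = {
    "personal": "mask before sharing",
    "financial": "retain for ten years",
}


class DataCatalog:
    def __init__(self):
        self.entries = {}     # inventory: asset name -> entry
        self.term_index = {}  # business term -> set of asset names

    def register(self, entry):
        """Add an asset to the inventory and index its business terms."""
        self.entries[entry.name] = entry
        for term in entry.business_terms:
            self.term_index.setdefault(term.lower(), set()).add(entry.name)

    def search(self, business_term):
        """Find assets using business vocabulary, ignoring case."""
        return sorted(self.term_index.get(business_term.lower(), set()))

    def rule_for(self, asset_name):
        """Return the management rule for the asset's category."""
        category = self.entries[asset_name].category
        return MANAGEMENT_RULES.get(category, "no specific rule")


catalog = DataCatalog()
catalog.register(CatalogEntry("crm.customer", "CRM", ["Customer"], "personal"))
catalog.register(CatalogEntry("dwh.dim_customer", "Data warehouse", ["Customer"]))
catalog.register(CatalogEntry("erp.invoice", "ERP", ["Invoice"], "financial"))

print(catalog.search("customer"))
print(catalog.rule_for("crm.customer"))
```

The same business term, “Customer”, leads to assets in two different systems, which is the kind of link between business vocabulary and technical data that the definition above describes.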