What Is a Data Catalog?

January 9, 2021

In This Article

What Is a Data Catalog and Why Do You Need One?
Data Catalog Definition and Analogy
Challenges a Data Catalog Can Address
Data Catalog Users
Data Catalog Use Cases
What Is Needed to Fully Make Use of Data in a Data Catalog?
What Should a Data Catalog Offer?
Why Oracle Cloud Infrastructure Data Catalog?
Conclusion

What Is a Data Catalog and Why Do You Need One?

Simply put, a data catalog is an organized inventory of data assets in the organization. It uses metadata to help organizations manage their data. It also helps data professionals collect, organize, access, and enrich metadata to support data discovery and governance.


Data Catalog Definition and Analogy

We gave a short definition of a data catalog above, as something that uses metadata to help organizations manage their data. But let’s expand upon that with the analogy of a library.

When you go to a library and you need to find a book, you use their catalog to discover whether the book is there, which edition it is, where it’s located, a description—everything you need so that you can decide whether you want it, and if you do, how to go and find it.

That’s what many object stores, databases, and data warehouses offer today.

But now, think back to the analogy of that library and the catalog, and expand the power of that catalog to cover every library in the country. Imagine that you have just one interface and suddenly, you can find every single library in the country that has a copy of the book you’re seeking, and you can find all the details you’d ever want on each one of those books.

That’s what an enterprise data catalog does for all of your data. It gives you a single, overarching view and deeper visibility into all of your data, not just each data store at a time.

You might wonder: why would you need a view like that?

Challenges a Data Catalog Can Address

With more data than ever before, finding the right data has become harder than ever. At the same time, there are also more rules and regulations than ever before, with GDPR being just one of them.

So not only is data access becoming a challenge, but data governance has become a challenge as well. It’s critical to understand the kind of data that you have now, who is moving it, what it’s being used for, and how it needs to be protected. But you also have to avoid putting too many layers and wrappers around your data, because data is useless if it’s too difficult to use.

Unfortunately, there are many challenges with finding and accessing the right data. These include:

Wasted time and effort on finding and accessing data
Data lakes turning into data swamps
No common business vocabulary
Hard to understand structure and variety of “dark data”
Difficult to assess provenance, quality, trustworthiness
No way to capture tribal or missing knowledge
Difficult to reuse knowledge and data assets
Manual and ad-hoc data prep efforts

Data Catalog Users

All of these data management issues frustrate users such as data engineers, data scientists, data stewards, and chief data officers. All of these groups of people want easy access to trusted data. Here are just a few of the challenges that they face:

Data engineers want to know how any changes will affect the system as a whole. They might ask:

What will be the impact of a schema change in our CRM application?
How different are the PeopleSoft and HCM data structures?

Data scientists want easy access to data and they want to know more about the quality of the data. They are looking for information such as:

Where can I find and explore some geo-location data?
How can I easily access the data in the data lake?

Data stewards are charged with running a managed data process. They care about concepts, agreements between stakeholders, and managing the lifecycle of the data itself. They will ask questions such as:

Are we really improving the quality of our operational data?
Have we defined standards for important key data elements?

Chief data officers care about who is doing what in the organization. They’re typically not the ones using a data catalog, but they still want to know answers to questions such as:

Who can access customers’ personal information?
Do we have retention policies defined for all data?

Enter the data catalog.

Data Catalog Use Cases

In the past few years, the concept of a data catalog has become popular because of the increasingly large amounts of data that now have to be managed and accessed. Cloud, big data analytics, AI, and machine learning have started to change the way we need to see, manage, and leverage our data: not just manage it, but fully use and access it.

Using a data catalog the right way means better data usage, which contributes to:

Cost savings
Operational efficiency
Competitive advantages
Better customer experience
Fraud and risk advantage
And so much more

Here are just a few of the use cases for a data catalog. But really, a data catalog can be used in so many ways because fundamentally, it’s about having wider visibility and deeper access to your data.

Self-service analytics. Many data users have trouble finding the right data. And not just finding the right data but understanding whether it’s useful. You might discover a file called customer_info.csv. And you might need a file about customers. But that doesn’t mean it’s the right one because it can be one of 50 such similar files. The file may have many fields and you may not understand what all of those data elements are. You’ll want an easier way to see the business context around it, such as whether it’s a managed resource, from the right data store, or what the relationship is with other data artifacts.

Discovery could also entail understanding the shape and characteristics of data, from something as simple as value distribution and statistical information to something as important and complex as Personally Identifiable Information (PII) or Personal Health Information (PHI).
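To make the idea of profiling concrete, here is a minimal Python sketch of what computing the shape of a single column might look like. Everything in it is an illustrative assumption: the `profile_column` function, the 80 percent threshold, and the email pattern are hypothetical and are not taken from any particular catalog product. A regular expression is only a crude heuristic for spotting possible PII; real classification typically needs far more than this.

```python
import re
import statistics
from collections import Counter

# Hypothetical, deliberately simple pattern for values that look like emails.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def profile_column(values: list[str]) -> dict:
    """Summarize one column of raw string values (illustrative only)."""
    non_empty = [v for v in values if v.strip()]
    profile = {
        "count": len(values),
        "null_ratio": 1 - len(non_empty) / len(values) if values else 0.0,
        "distinct": len(set(non_empty)),
        "top_values": Counter(non_empty).most_common(3),  # value distribution
    }

    # Add basic statistics only when every non-empty value parses as a number.
    try:
        numbers = [float(v) for v in non_empty]
    except ValueError:
        numbers = []
    if numbers:
        profile["min"] = min(numbers)
        profile["max"] = max(numbers)
        profile["mean"] = statistics.mean(numbers)

    # Flag the column as possible PII if most values look like email addresses.
    # The 0.8 threshold is an arbitrary assumption for this sketch.
    email_hits = sum(1 for v in non_empty if EMAIL_PATTERN.match(v))
    profile["possible_pii"] = bool(non_empty) and email_hits / len(non_empty) >= 0.8
    return profile


email_profile = profile_column(["a@example.com", "b@example.com", "", "c@example.org"])
age_profile = profile_column(["34", "41", "29"])
print(email_profile["possible_pii"], age_profile["possible_pii"])
```

A profile like this would be stored alongside the asset so that a user browsing the catalog can judge a file such as customer_info.csv without opening it.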

Audit, compliance, and change management. With ever-increasing government regulations around data, you often need to demonstrate the provenance of data: which source a given data artifact comes from, and how it is transformed before reaching its final target. When looking at a table, report, or file, your data users often want to understand where the data comes from and how it moves through the organization. From a change management perspective, it’s important to see how changes in one part of a data pipeline affect other parts of the system. This is why customers seek detailed data lineage.
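One way to picture data lineage is as a directed graph in which each edge says "this asset feeds that one." The Python sketch below assumes such a graph is already available as a plain dictionary. The asset names and the `downstream_assets` and `upstream_sources` helpers are hypothetical, chosen only for illustration. Walking the graph forward answers the change management question (what is affected if this changes?), and walking it backward answers the provenance question (where did this come from?).

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed as its value.
LINEAGE = {
    "crm.customers": ["staging.customers_clean"],
    "staging.customers_clean": ["warehouse.dim_customer"],
    "warehouse.dim_customer": ["report.churn_dashboard", "report.revenue_by_region"],
    "erp.orders": ["warehouse.fact_orders"],
    "warehouse.fact_orders": ["report.revenue_by_region"],
}


def downstream_assets(asset: str, lineage: dict[str, list[str]]) -> list[str]:
    """Return every asset reachable from `asset`, in breadth-first order."""
    seen = {asset}
    queue = deque([asset])
    impacted = []
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


def upstream_sources(asset: str, lineage: dict[str, list[str]]) -> list[str]:
    """Return every asset that `asset` depends on, by walking reversed edges."""
    reversed_edges: dict[str, list[str]] = {}
    for parent, children in lineage.items():
        for child in children:
            reversed_edges.setdefault(child, []).append(parent)
    return downstream_assets(asset, reversed_edges)


# Impact of a schema change in the CRM table:
print(downstream_assets("crm.customers", LINEAGE))
# Provenance of a report:
print(upstream_sources("report.revenue_by_region", LINEAGE))
```

The `seen` set keeps the traversal from revisiting an asset that is reachable by more than one path, which is common when several pipelines feed the same report.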

Supporting data governance with business glossaries. Most organizations have a vocabulary that everyone agrees on and a consistent understanding that they can use for business concepts. But often, it’s recorded in Excel sheets lying around somewhere—and that’s if the organization is lucky. A data catalog is a much better place where you can store and manage this vital business information.

A data catalog also allows you to link business terms to one another to build a taxonomy. Beyond that, it can record relationships between terms and physical assets such as tables and columns, so users can understand which business concepts are relevant to which technical artifacts. This can be used to classify data assets along business concept lines and then use business concepts instead of technical names for search and discovery. It also increases user trust in what they’re looking at, because they can see everything that’s related to their data, and it’s often a good starting point for data governance.
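The relationship between glossary terms and physical assets can be sketched in a few lines of Python. This is a minimal in-memory illustration under stated assumptions: the `Term` and `Glossary` classes, the term names, and the asset names are all hypothetical and do not reflect how any specific catalog stores its glossary. The point is only to show how a parent link between terms forms a taxonomy and how a search by business concept can return technical assets.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Term:
    name: str
    parent: Optional[str] = None                   # broader term in the taxonomy
    assets: set[str] = field(default_factory=set)  # linked tables and columns


class Glossary:
    def __init__(self) -> None:
        self.terms: dict[str, Term] = {}

    def add_term(self, name: str, parent: Optional[str] = None) -> None:
        if parent is not None and parent not in self.terms:
            raise KeyError(f"unknown parent term: {parent}")
        self.terms[name] = Term(name, parent)

    def link(self, term: str, asset: str) -> None:
        """Associate a business term with a physical asset."""
        self.terms[term].assets.add(asset)

    def narrower_terms(self, name: str) -> set[str]:
        """Return `name` plus every term below it in the taxonomy."""
        result = {name}
        changed = True
        while changed:
            changed = False
            for term in self.terms.values():
                if term.parent in result and term.name not in result:
                    result.add(term.name)
                    changed = True
        return result

    def assets_for(self, name: str) -> set[str]:
        """Search by business concept: assets linked to the term or its children."""
        found: set[str] = set()
        for term_name in self.narrower_terms(name):
            found |= self.terms[term_name].assets
        return found


glossary = Glossary()
glossary.add_term("Customer")
glossary.add_term("Customer Contact", parent="Customer")
glossary.link("Customer", "crm.customers")
glossary.link("Customer Contact", "crm.customers.email")
glossary.link("Customer Contact", "marketing.leads.phone")

print(sorted(glossary.assets_for("Customer")))
```

Searching for the broader term "Customer" also returns assets linked to "Customer Contact," which is the practical benefit of recording the taxonomy rather than a flat list of terms.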

What Is Needed to Fully Make Use of Data in a Data Catalog?

So let’s take a step back and quickly explain metadata to those who might not be entirely familiar with it. What is metadata? There are three kinds of metadata:

Technical metadata: Schemas, tables, columns, file names, report names – anything that is documented in the source system
Business metadata: This is typically the business knowledge that users have about the assets in the organization. This might include business descriptions, comments, annotations, classifications, fitness-for-use, ratings, and more.
Operational metadata: When was this object refreshed? Which ETL job created it? How many times has a table been accessed by users, and by which ones?
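The three kinds of metadata above can be pictured as three parts of a single catalog entry. The Python sketch below is one hypothetical way to model that. The class names, field names, and sample values are assumptions made for illustration, not the schema of any real catalog.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class TechnicalMetadata:
    # Harvested from the source system: where the object lives and its shape.
    source_system: str
    object_name: str
    columns: dict[str, str]  # column name -> data type


@dataclass
class BusinessMetadata:
    # Contributed by people: descriptions, tags, ratings.
    description: str = ""
    tags: set[str] = field(default_factory=set)
    rating: Optional[int] = None


@dataclass
class OperationalMetadata:
    # Recorded as the object is refreshed and used.
    last_refreshed: Optional[datetime] = None
    created_by_job: Optional[str] = None
    access_counts: dict[str, int] = field(default_factory=dict)  # user -> reads

    def record_access(self, user: str) -> None:
        self.access_counts[user] = self.access_counts.get(user, 0) + 1


@dataclass
class CatalogEntry:
    technical: TechnicalMetadata
    business: BusinessMetadata = field(default_factory=BusinessMetadata)
    operational: OperationalMetadata = field(default_factory=OperationalMetadata)


entry = CatalogEntry(
    technical=TechnicalMetadata(
        source_system="crm_db",
        object_name="customers",
        columns={"id": "NUMBER", "email": "VARCHAR2"},
    )
)
entry.business.tags.add("customer")
entry.operational.created_by_job = "nightly_crm_load"
entry.operational.record_access("alice")
entry.operational.record_access("alice")
entry.operational.record_access("bob")

print(entry.operational.access_counts)
```

Keeping the three parts separate mirrors how they are populated: technical metadata is harvested automatically, business metadata is added by subject matter experts, and operational metadata accumulates as the data is used.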

In the past few years, we’ve seen a mini-revolution in how we can use this valuable metadata. Once, metadata was used mostly for audit, lineage, and reporting. But today, technological innovations like serverless processing, graph databases, and especially new or more accessible AI and machine learning techniques are pushing the boundaries and making things possible with metadata that simply weren’t possible at this scale before.

Today, metadata can be used to augment data management: everything from self-service data preparation to role- and content-based access control, automated data onboarding, monitoring and alerting on anomalies, and auto-provisioning and auto-scaling of resources. All of this can now be augmented with the help of metadata.

And the data catalog uses metadata to help you achieve more than ever with your data management.

What Should a Data Catalog Offer?

A good data catalog should offer:

Search and discovery. A data catalog should have flexible searching and filtering options to allow users to quickly find relevant sets of data for data science, analytics, or data engineering. Users should also be able to browse metadata based on a technical hierarchy of data assets. Enabling users to enter technical information, user-defined tags, or business terms also improves the search capabilities.

Harvest metadata from various sources. Make sure your data catalog can harvest technical metadata from a variety of connected data assets, including object storage, self-driving databases, on-premises systems, and much more.

Metadata curation. Provide a way for subject matter experts to contribute business knowledge in the form of an enterprise business glossary, tags, associations, user-defined annotations, classifications, ratings, and more.

Automation and data intelligence. At the data scales that we mentioned, AI and machine learning are often a must. Any and all manual tasks that can be automated should be automated with AI and machine learning techniques on the collected metadata. In addition, AI and machine learning can begin to truly augment capabilities with data, such as providing data recommendations to data catalog users and the users of other services in a modern data platform.

Enterprise-class capabilities. Your data is important, and you need enterprise-class capabilities to use it properly, such as identity and access management and access to the catalog’s main capabilities via REST APIs. This also means that customers and partners can contribute metadata (such as custom harvesters) and expose data catalog capabilities in their own applications via REST.

In addition to all of that, your data catalog should become your de facto system catalog, providing abstraction across all of your persistence layers, such as object stores, Hadoop, databases, and data warehouses, and serving the query services that work across all of your data stores.

And that’s also why a data catalog is no longer a nice-to-have. It’s a necessity.

Why Oracle Cloud Infrastructure Data Catalog?

Every organization should have a strong data catalog. But why do you want Oracle Cloud Infrastructure Data Catalog?

Oracle Cloud Infrastructure Data Catalog is included with all Oracle Cloud Infrastructure subscriptions and helps customers organize and govern their data assets. It is a single collaborative solution for data professionals to not just organize and govern data, but also collect, access, enrich, and activate technical, business, and operational metadata to support self-service data discovery and governance for trusted data assets in Oracle Cloud and beyond.

At a practical level, it lets you:

Harvest technical metadata about data assets on Oracle Cloud Infrastructure, such as Oracle Cloud Infrastructure Object Storage, Oracle Autonomous Database, and Oracle Database
Search and explore appropriate data from a variety of sources through multi-faceted search and filters
Manage a business glossary to capture the business vocabulary of the enterprise
Enrich understanding of available data by capturing tribal knowledge in the form of user-defined tags and annotations
Gain a holistic view of data assets by associating tags and business terms
Integrate capabilities into other apps using REST APIs and SDKs
Secure access with IAM group-based policies

Conclusion

Organizations are striving to be data-driven. They want better, faster analytics, without sacrificing governance. And that’s what is making data management even more important and challenging. A data catalog makes data management easier and helps organizations meet these many demands. Through Oracle Cloud Infrastructure Data Catalog, Oracle has taken steps to help everyone discover and use data in the way they’ve always wanted.
