Metadata — Meet Big Data’s Little Brother

Understand, protect, and leverage metadata in big data systems

Yifeng Jiang
8 min read · Mar 1, 2022

Back in 2009, when I built my first big data system using Hadoop, it had fewer than 10 nodes and processed single-digit terabytes of data. Nowadays, that amount of data can be stored on a laptop. Data is getting bigger. We get excited about the hundreds of terabytes, or even petabytes, of data in our big data and machine learning systems. But what about metadata, the little brother of big data? Are we paying enough attention to it?

Metadata is data that describes data. That sounds confusing, but metadata is the small yet critical set of information that makes big data “make sense”. In this blog, I will describe the role metadata plays in big data systems. To understand, protect, and leverage metadata, I will view it from both system and business perspectives. I will also briefly introduce software and solutions for metadata protection, discovery, and management.

Metadata in big data systems

Metadata is not new in computer systems. In Linux, alongside regular data like text and image files, there is other data about those files, such as their size, ownership, and permissions. This metadata about a file is managed in a data structure known as an inode. Linux filesystems such as ext3 and ext4 maintain a list of these inodes, called the inode table, which the OS loads into memory when queried. From a system perspective, metadata's role in a big data system is similar to that of the inode table, but very different in terms of how it is stored and queried.
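
A quick way to see this inode metadata for yourself is through the stat system call. Here is a minimal Python sketch (the file name is just a placeholder):

```python
import os

# Inode metadata for a file, as exposed by the OS.
# Assumes a file named "example.txt" exists in the current directory.
st = os.stat("example.txt")
print(f"inode: {st.st_ino}")
print(f"size:  {st.st_size} bytes")
print(f"owner: uid={st.st_uid}, gid={st.st_gid}")
print(f"mode:  {oct(st.st_mode)}")   # permission bits
print(f"mtime: {st.st_mtime}")       # last modification time
```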

Distributed storage and the data warehouse are the two most fundamental big data systems. Let's take a closer look at the metadata in each.

Metadata in distributed storage

Distributed filesystems and object stores are the two popular types of distributed storage.

Since many distributed filesystems are POSIX compliant, or close to it, their metadata looks very much like the inode table in Linux filesystems, only much bigger. Often, how this metadata is stored determines the scalability and performance characteristics of the filesystem. For example, the Hadoop Distributed File System (HDFS) stores all its metadata (called the fsimage in Hadoop terms) on a server called the Namenode. Below is a decoded example of HDFS metadata:

HDFS metadata example
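
If you want to decode an fsimage yourself, Hadoop ships an Offline Image Viewer (hdfs oiv). A minimal sketch, assuming you have copied a checkpoint file from the Namenode's metadata directory (the file names below are placeholders):

```python
import subprocess

# Decode an HDFS fsimage checkpoint into human-readable XML using
# the Offline Image Viewer. The input path is a placeholder.
subprocess.run(
    ["hdfs", "oiv", "-p", "XML",
     "-i", "fsimage_0000000000000000042",   # hypothetical checkpoint
     "-o", "fsimage.xml"],
    check=True,
)
```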

The whole metadata set must be loaded into the Namenode's memory to start and run an HDFS instance. This was a design choice made for simplicity when HDFS was implemented, but it has become the limit of HDFS scalability. Because the Namenode can only have so much memory, the maximum number of files a single HDFS instance supports is typically a few hundred million.
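
A commonly cited rule of thumb is that each file, directory, and block consumes roughly 150 bytes of Namenode heap, which makes this ceiling easy to estimate. A back-of-the-envelope sketch, not an exact sizing:

```python
# Rough Namenode heap estimate, assuming ~150 bytes per namespace
# object (file, directory, or block) -- a widely quoted rule of thumb.
BYTES_PER_OBJECT = 150

files = 300_000_000                  # 300 million files
blocks = files * 1.5                 # assume ~1.5 blocks per file
heap_bytes = (files + blocks) * BYTES_PER_OBJECT
print(f"~{heap_bytes / 1024**3:.0f} GiB of Namenode heap")  # ~105 GiB
```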

Other distributed filesystems avoid this limitation by storing their metadata across several or all of the nodes in the filesystem. FlashBlade, which supports both NFS and S3, is an example of this architecture. FlashBlade's metadata is distributed across all the blades (storage nodes) in the system. This is why FlashBlade is able to support an unlimited number of files and deliver very fast metadata operations.

Object stores trade off some filesystem functionality for scalability. Because the number of objects they are required to support is so huge, an object store's metadata must itself be stored in a distributed manner. How it is stored depends on the implementation. Apache Ozone, which is designed as an object store in Hadoop to address the HDFS Namenode limitation, stores its metadata in RocksDB across multiple manager nodes. FlashBlade S3 shares the core distributed data and metadata service with NFS. In addition, it natively implements the S3 protocol, which is one reason why FlashBlade S3 is so fast.

FlashBlade architecture for unified file and object (image source: blog.purestorage.com)

Some object stores, though, especially those built on top of traditional file storage, may be subject to whatever limitations that file storage exposes, including metadata operation performance.
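
Object stores also attach metadata to every object, and let you add your own. A minimal boto3 sketch that works against any S3-compatible store, such as FlashBlade S3 (the endpoint, bucket, and key are placeholders):

```python
import boto3

# Endpoint, bucket, and key below are placeholders for your store.
s3 = boto3.client("s3", endpoint_url="https://flashblade.example.com")

# Write an object with custom user metadata attached.
s3.put_object(
    Bucket="my-bucket", Key="data/part-0000.parquet",
    Body=b"...", Metadata={"source": "etl-job-42"},
)

# HEAD returns system metadata (size, ETag, timestamps) plus the
# user metadata, without reading the object body itself.
head = s3.head_object(Bucket="my-bucket", Key="data/part-0000.parquet")
print(head["ContentLength"], head["ETag"], head["Metadata"])
```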

Metadata in data warehouses

A data warehouse adds a table/partition abstraction on top of data files to support functions like SQL queries. Modern data warehouses are usually built on top of distributed storage. In fact, many data warehouses support different storage backends for their data files. Metadata in a data warehouse describes tables, partitions, and their association with the data files in the backend storage. This metadata is essential for compiling and planning a SQL query in such a data warehouse.

For example, Apache Hive's data files are usually stored in HDFS or S3, while its metadata is stored in a relational database such as Postgres, called the Metastore DB, and managed through the Hive Metastore Service (HMS). Trino does the same. In fact, Trino shares the same HMS with Hive.

Metadata about Trino tables in the Hive metastore
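
You can peek at this metadata directly in the Metastore DB. A minimal sketch with psycopg2, assuming a Postgres-backed metastore with the standard HMS schema (the connection details are placeholders):

```python
import psycopg2

# Connection details are placeholders for your Metastore DB.
conn = psycopg2.connect(host="metastore-db", dbname="hive",
                        user="hive", password="...")
with conn.cursor() as cur:
    # TBLS, DBS, and SDS are standard Hive metastore tables; SDS
    # holds the storage location of each table's data files.
    cur.execute("""
        SELECT d."NAME", t."TBL_NAME", s."LOCATION"
        FROM "TBLS" t
        JOIN "DBS" d ON t."DB_ID" = d."DB_ID"
        JOIN "SDS" s ON t."SD_ID" = s."SD_ID"
        LIMIT 10
    """)
    for db, table, location in cur.fetchall():
        print(db, table, location)
```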

Just like any traditional web+DB service, when there are many tables, partitions, and data files in the data warehouse, the HMS and its backend database can easily become the bottleneck. To address this limitation, a new generation of table formats, such as Apache Iceberg and Apache Hudi, was invented. Instead of putting all the metadata in the Metastore DB, most of it is now stored along with the data files in the distributed storage. For example, when using Trino with Apache Iceberg, Trino consults the HMS only for the location of the table's current metadata file; it then reads that metadata and the manifest files from the underlying distributed storage to find the data files for the query, without having to list directories. This not only solves the known scalability limitations of the HMS, but also adds new capabilities such as SQL updates, schema evolution, and time travel and rollback. Stay tuned for my future blogs on this.
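
Trino exposes some of this Iceberg metadata through hidden metadata tables. A minimal sketch with the Trino Python client (the host, catalog, schema, and table names are placeholders):

```python
import trino

# Connection details and the table name are placeholders.
conn = trino.dbapi.connect(host="trino.example.com", port=8080,
                           user="analyst", catalog="iceberg",
                           schema="warehouse")
cur = conn.cursor()

# Iceberg's "$snapshots" metadata table lists every snapshot of the
# table -- the basis for time travel and rollback.
cur.execute('SELECT snapshot_id, committed_at FROM "orders$snapshots"')
for snapshot_id, committed_at in cur.fetchall():
    print(snapshot_id, committed_at)
```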

Protecting metadata

Now we understand what metadata is and why it is important in distributed storage and data warehouses. It follows that protecting metadata is as important as protecting the actual data files.

I have been working on Hadoop and big data systems for more than a decade. Across the ecosystem, the one component I absolutely do not want to run into trouble with is the distributed storage. I have seen very large HDFS clusters lose data or become unavailable for many hours, costing millions of dollars in business. More than once, the trouble was related to metadata, due to software bugs, operational misses, or both. So my first suggestion is: do not build your own distributed storage if you don't have to. Rely on commercial storage products or services. For those brave enough to build their own HDFS or Ozone cluster, the least we can do to protect metadata is to use RAID for the metadata disks on master or manager nodes, and enable high availability (HA) for those services.

Once the distributed storage is taken care of, protecting the metadata of the data warehouse on top of it is relatively easy. Relational database backup and HA best practices apply to the Hive Metastore DB.
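
For example, a regular pg_dump of the Metastore DB already goes a long way. A minimal sketch (the host, database, and user names are placeholders):

```python
import subprocess
from datetime import date

# Dump the Hive Metastore database; host/db/user are placeholders.
# In production you would also ship this dump off the host.
outfile = f"metastore-{date.today()}.sql"
subprocess.run(
    ["pg_dump", "-h", "metastore-db", "-U", "hive",
     "-d", "hive", "-f", outfile],
    check=True,
)
print(f"Metastore backup written to {outfile}")
```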

Metadata protection in Kubernetes

Except for the storage layer, which is FlashBlade, I run everything, including Spark, Trino, Postgres, and many other big data and MLOps applications, on Kubernetes. I use Portworx to provide persistent volumes to these applications. Thanks to PX-Backup and the DR and data protection capabilities Portworx and FlashBlade provide, protecting metadata (and the big data files) becomes much easier. As shown in the following screenshots, I take a backup of the metadata namespace, where the Metastore DB (Postgres) is running, from the PX-Backup UI. The backup is stored in a FlashBlade bucket called px-backups, which is configured to replicate to Amazon S3 (PX-Backup can also be configured to store backups directly in Amazon S3).

Application-aware Kubernetes backup with PX-Backup
FlashBlade object replication to Amazon S3

With a couple of clicks, I was able to back up my Postgres data (PVs) and all the Kubernetes objects (ConfigMaps, Secrets, etc.) required to restore Postgres in Kubernetes, in a consistent and application-aware way.

Postgres data and metadata backups in FlashBlade S3
Replicated backups in Amazon S3

With these in place, I am able to do a quick restore from the FlashBlade bucket to my on-premises Kubernetes cluster, or a cloud restore from Amazon S3 to Amazon EKS. To extend this into a simple DR solution, I can also configure FlashBlade S3 to replicate the big data files, which are normally stored in separate buckets, to Amazon S3. This protects not only my metadata, but also the big data itself.

Unlock the value of data assets with metadata

Now that we have understood and protected the metadata, what else can we do? Quite a lot. Metadata is critical for data discovery, governance, and collaboration in enterprises. To unlock the value of data assets, we define metadata from a business perspective: bring in metadata from multiple data systems, name it in business language, and organise it in a way that lets people easily discover, govern, and collaborate on the underlying data. Metadata from a business point of view leverages and extends the system metadata described above. It helps answer questions like “what data do I have across my data systems, and what are its quality and relationships?”.

This topic deserves a blog or more of its own; for now I will just briefly introduce the metadata management tool I use. OpenMetadata is an open-source data discovery and collaboration tool. It ingests metadata from various data systems and builds data discovery and collaboration functions on top of it. The screenshots below show the metadata of a Trino table, including its business description, schema, and sample data, in the OpenMetadata UI.

Metadata of a Trino table in OpenMetadata

It also shows lineage between this table and a dashboard built in Apache Superset.

Lineage of a Trino table and a Superset dashboard in OpenMetadata

Functions like these become even more important when there are many data systems and many teams working and collaborating on that data.

OpenMetadata is a relatively new project. It might be a little early to use it in production, but I believe in its bright future: it is founded by people I respect, and it has an elegant architecture.

Summary

Metadata is the little brother of big data. While we work hard on adding more data into the system and training awesome machine learning models, we should not forget that every big data system comes with a little brother we need to understand and protect. Metadata could be the key differentiator of a big data system, product, or even business.

Metadata helps find the treasure in your big data.
