Parquet Metadata Cache

Overview

paimon-cpp can cache the serialized Parquet metadata footer bytes of Parquet data files. ParquetReaderBuilder consults the cache before it opens the Arrow Parquet reader. On a cache miss, paimon-cpp loads the Parquet file metadata, serializes it as a complete metadata footer, and stores those bytes through the public Cache abstraction. On a cache hit, paimon-cpp parses the cached footer bytes into parquet::FileMetaData and passes that metadata to the Parquet reader.

The cache stores serialized metadata footer bytes rather than a parquet::FileMetaData instance. This keeps cache values compact and consistent with manifest cache values: the cache weight tracks the size of the bytes actually cached, while the Parquet library remains responsible for parsing and validating the metadata.
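To illustrate why byte-sized values make weighting straightforward, here is a minimal sketch of a byte-weighted LRU cache. ByteWeightedLruCache and its methods are hypothetical names that are not part of paimon-cpp or its Cache interface; the sketch only assumes that an entry's weight is the length of its serialized footer.

```cpp
#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical byte-weighted LRU cache; not part of paimon-cpp.
// Each entry's weight is the size of its serialized footer in bytes.
class ByteWeightedLruCache {
 public:
  explicit ByteWeightedLruCache(std::size_t capacity_bytes)
      : capacity_bytes_(capacity_bytes) {}

  // Returns a copy of the cached bytes and marks the entry most recently used.
  std::optional<std::string> Get(const std::string& key) {
    auto it = index_.find(key);
    if (it == index_.end()) return std::nullopt;
    entries_.splice(entries_.begin(), entries_, it->second);
    return it->second->second;
  }

  // Inserts or replaces an entry, then evicts least recently used entries
  // until the total weight fits within the capacity again.
  void Put(const std::string& key, std::string footer_bytes) {
    auto it = index_.find(key);
    if (it != index_.end()) {
      weight_bytes_ -= it->second->second.size();
      entries_.erase(it->second);
      index_.erase(it);
    }
    if (footer_bytes.size() > capacity_bytes_) return;  // Too large to cache.
    weight_bytes_ += footer_bytes.size();
    entries_.emplace_front(key, std::move(footer_bytes));
    index_[key] = entries_.begin();
    while (weight_bytes_ > capacity_bytes_) {
      const Entry& victim = entries_.back();
      weight_bytes_ -= victim.second.size();
      index_.erase(victim.first);
      entries_.pop_back();
    }
  }

  std::size_t WeightBytes() const { return weight_bytes_; }
  std::size_t EntryCount() const { return index_.size(); }

 private:
  using Entry = std::pair<std::string, std::string>;  // (key, footer bytes)

  std::size_t capacity_bytes_;
  std::size_t weight_bytes_ = 0;
  std::list<Entry> entries_;  // Front is the most recently used entry.
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
};
```

Because the weight is simply the byte length, no estimate of the in-memory size of a parsed metadata object is needed. The sketch requires C++17 for std::optional.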

This optimization is useful when the same Parquet files are opened repeatedly in the same process, for example repeated get or scan requests over the same snapshot. On a cache hit, the read path avoids reading the Parquet footer bytes from the filesystem again. paimon-cpp still parses the cached footer bytes into parquet::FileMetaData for each reader open. Data pages, page indexes, and column chunks are still read from the file as usual.
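The hit and miss paths described above can be sketched roughly as follows. Everything in this example is a hypothetical stand-in: FooterCache, ParsedMetadata, ParseFooter, and OpenMetadata are illustrative names, whereas paimon-cpp itself works with paimon::Cache and parquet::FileMetaData. A null cache pointer models caching being disabled.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a cache of serialized footer bytes, keyed by URI.
using FooterCache =
    std::unordered_map<std::string, std::shared_ptr<const std::string>>;

// Hypothetical stand-in for parsed metadata such as parquet::FileMetaData.
struct ParsedMetadata {
  std::size_t footer_size = 0;
};

// Stand-in for the parse step, which runs on every open, hit or miss.
ParsedMetadata ParseFooter(const std::string& footer_bytes) {
  return ParsedMetadata{footer_bytes.size()};
}

// Returns parsed metadata for `uri`. `read_footer` stands in for the
// filesystem read; it is invoked only on a miss or when `cache` is null.
ParsedMetadata OpenMetadata(
    FooterCache* cache, const std::string& uri,
    const std::function<std::string(const std::string&)>& read_footer) {
  assert(read_footer && "a footer loader is required");
  if (cache == nullptr) {
    return ParseFooter(read_footer(uri));  // Caching disabled: always read.
  }
  auto it = cache->find(uri);
  if (it == cache->end()) {
    // Miss: read the footer once and store the serialized bytes.
    auto bytes = std::make_shared<const std::string>(read_footer(uri));
    it = cache->emplace(uri, std::move(bytes)).first;
  }
  // Hit or freshly loaded: the bytes are parsed again for this open.
  return ParseFooter(*it->second);
}
```

Opening the same URI twice through the same FooterCache calls read_footer once, while ParseFooter runs on both opens.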

Configuration

Parquet metadata caching is disabled by default. Embedding applications that need it can provide a custom Cache implementation and inject it through ScanContextBuilder or ReadContextBuilder. Parquet reader builders receive the cache from the read context and create cache keys with CacheKind::DATA_FILE_FOOTER internally.

The cache key represents the file footer and is created from the file URI with position -1 and length -1. Callers do not need to construct this key directly; they only need to route CacheKind::DATA_FILE_FOOTER entries to an appropriate cache backend.
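The following sketch shows one way such a key could be modeled. SketchKey, SketchKind, MakeFooterKey, and SketchKeyHash are hypothetical names introduced for illustration, and kOther is an invented placeholder kind; the actual paimon::CacheKey class may be structured differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>

// Hypothetical kinds; only the footer kind mirrors the documented
// CacheKind::DATA_FILE_FOOTER, and kOther is a placeholder.
enum class SketchKind { kDataFileFooter, kOther };

// Hypothetical key: a kind plus a file URI and a byte range.
struct SketchKey {
  SketchKind kind;
  std::string uri;
  std::int64_t position;
  std::int64_t length;

  bool operator==(const SketchKey& other) const {
    return kind == other.kind && uri == other.uri &&
           position == other.position && length == other.length;
  }
};

// A footer key has no meaningful byte range, so position and length are -1.
SketchKey MakeFooterKey(std::string uri) {
  return SketchKey{SketchKind::kDataFileFooter, std::move(uri), -1, -1};
}

// Hash that combines all fields so the key can index an unordered_map.
struct SketchKeyHash {
  std::size_t operator()(const SketchKey& key) const {
    std::size_t seed = std::hash<std::string>{}(key.uri);
    auto mix = [&seed](std::size_t value) {
      seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    };
    mix(std::hash<int>{}(static_cast<int>(key.kind)));
    mix(std::hash<std::int64_t>{}(key.position));
    mix(std::hash<std::int64_t>{}(key.length));
    return seed;
  }
};
```

Two keys built from the same URI compare equal and hash identically, so repeated opens of one file resolve to a single cache entry.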

Example:

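#include <functional>
#include <memory>
#include <utility>

// The paimon-cpp headers that declare Cache, CacheKey, CacheValue,
// ScanContextBuilder, and ReadContextBuilder must also be included.

// Routes footer entries to a dedicated cache and all other entries to a
// default cache.
class RoutingCache : public paimon::Cache {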
 public:
  RoutingCache(std::shared_ptr<paimon::Cache> default_cache,
               std::shared_ptr<paimon::Cache> parquet_metadata_cache)
      : default_cache_(std::move(default_cache)),
        parquet_metadata_cache_(std::move(parquet_metadata_cache)) {}

  paimon::Result<std::shared_ptr<paimon::CacheValue>> Get(
      const std::shared_ptr<paimon::CacheKey>& key,
      std::function<paimon::Result<std::shared_ptr<paimon::CacheValue>>(
          const std::shared_ptr<paimon::CacheKey>&)> supplier) override {
    return Select(key)->Get(key, std::move(supplier));
  }

  // Put(), Invalidate(), InvalidateAll(), and Size() route in the same way.

 private:
  std::shared_ptr<paimon::Cache> Select(
      const std::shared_ptr<paimon::CacheKey>& key) const {
    return key && key->GetKind() == paimon::CacheKind::DATA_FILE_FOOTER
               ? parquet_metadata_cache_
               : default_cache_;
  }

  std::shared_ptr<paimon::Cache> default_cache_;
  std::shared_ptr<paimon::Cache> parquet_metadata_cache_;
};

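// MyDefaultCache and MyParquetMetadataCache stand for application-provided
// Cache implementations.
auto cache = std::make_shared<RoutingCache>(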
    std::make_shared<MyDefaultCache>(),
    std::make_shared<MyParquetMetadataCache>());

paimon::ScanContextBuilder scan_builder(table_path);
scan_builder.WithCache(cache);

paimon::ReadContextBuilder read_builder(table_path);
read_builder.WithCache(cache);

Passing nullptr or omitting WithCache() leaves Parquet metadata caching disabled. If a file URI cannot be obtained, paimon-cpp also bypasses the cache and opens the Parquet file normally.

Future Optimizations

  • Add hit, miss, bypass, and eviction metrics for the Parquet metadata cache.

  • Add single-flight loading for high-concurrency misses on the same Parquet file.

  • Evaluate sharing cached metadata footer bytes with page-index prefetch logic when those read paths can use the same cache abstraction.
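For readers unfamiliar with single-flight loading, the general technique can be sketched as below. This is not a planned design for paimon-cpp: SingleFlightLoader and its Load method are hypothetical names, and the sketch assumes a loader that returns the footer bytes as a std::string.

```cpp
#include <condition_variable>
#include <exception>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical single-flight loader; not part of paimon-cpp.
class SingleFlightLoader {
 public:
  using Loader = std::function<std::string()>;

  // Concurrent callers with the same key share one invocation of `load`.
  // Results are not retained after the call completes; a cache does that.
  std::string Load(const std::string& key, const Loader& load) {
    std::shared_ptr<Call> call;
    bool is_leader = false;
    {
      std::unique_lock<std::mutex> lock(mutex_);
      auto it = in_flight_.find(key);
      if (it == in_flight_.end()) {
        call = std::make_shared<Call>();
        in_flight_.emplace(key, call);
        is_leader = true;  // First caller for this key performs the load.
      } else {
        call = it->second;
        cv_.wait(lock, [&call] { return call->done; });  // Follower waits.
      }
    }
    if (is_leader) {
      try {
        call->value = load();
      } catch (...) {
        call->error = std::current_exception();
      }
      {
        std::lock_guard<std::mutex> lock(mutex_);
        call->done = true;
        in_flight_.erase(key);
      }
      cv_.notify_all();
    }
    if (call->error) std::rethrow_exception(call->error);
    return call->value;
  }

 private:
  struct Call {
    bool done = false;
    std::string value;
    std::exception_ptr error;
  };

  std::mutex mutex_;
  std::condition_variable cv_;
  std::unordered_map<std::string, std::shared_ptr<Call>> in_flight_;
};
```

A loader like this would sit in front of the cache supplier, so that a burst of concurrent misses on one file triggers a single footer read while the remaining callers wait for its result. Failures propagate to every waiting caller through the stored exception.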