Global Index#

Interface#

class GlobalIndexerFactory : public paimon::Factory#

Factory for creating GlobalIndexer instances based on index type identifiers.

Public Functions

~GlobalIndexerFactory() override = default#
virtual Result<std::unique_ptr<GlobalIndexer>> Create(const std::map<std::string, std::string> &options) const = 0#

Creates a GlobalIndexer using the current factory’s implementation and the given options.

Public Static Functions

static Result<std::unique_ptr<GlobalIndexer>> Get(const std::string &identifier, const std::map<std::string, std::string> &options)#

Creates a GlobalIndexer instance by looking up a registered factory using an identifier.

GLOBAL_INDEX_IDENTIFIER_SUFFIX (e.g., “-global”) is automatically appended to the provided identifier to form the full key used for factory lookup. This ensures namespace separation between file and global index types.

Parameters:
  • identifier – The base name of the index type (e.g., “bitmap”).

  • options – Configuration parameters for the indexer.

Returns:

A Result containing a unique pointer to the created GlobalIndexer, or an error if creation fails. The pointer is nullptr if no matching factory is registered.
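
The key construction described for Get() can be sketched as follows. This is a minimal illustration, not the library's implementation: the suffix value is assumed from the “-global” example above, and MakeGlobalIndexKey, FindFactory, and the std::map registry are hypothetical stand-ins.

```cpp
#include <map>
#include <string>

// Assumed value, taken from the "-global" example in the text.
static const char kGlobalIndexIdentifierSuffix[] = "-global";

// Hypothetical helper: builds the full lookup key from a base identifier.
std::string MakeGlobalIndexKey(const std::string& identifier) {
    return identifier + kGlobalIndexIdentifierSuffix;
}

// Hypothetical registry stand-in: maps full keys to a factory description.
// Returns nullptr when no factory is registered under the derived key.
const std::string* FindFactory(const std::map<std::string, std::string>& registry,
                               const std::string& identifier) {
    auto it = registry.find(MakeGlobalIndexKey(identifier));
    return it == registry.end() ? nullptr : &it->second;
}
```

A caller passes only the base name (“bitmap”); the suffixed key never appears in user code.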

Public Static Attributes

static const char GLOBAL_INDEX_IDENTIFIER_SUFFIX[]#

Suffix used to distinguish global index identifiers (e.g., “bitmap-global”).

class GlobalIndexer#

Interface for creating global index readers and writers.

Public Functions

virtual ~GlobalIndexer() = default#
virtual Result<std::shared_ptr<GlobalIndexWriter>> CreateWriter(const std::string &field_name, ::ArrowSchema *arrow_schema, const std::shared_ptr<GlobalIndexFileWriter> &file_writer, const std::shared_ptr<MemoryPool> &pool) const = 0#

Creates a writer for building a global index on a specific field.

Parameters:
  • field_name – Name of the field to be indexed.

  • arrow_schema – Schema of the input Arrow struct array. It must contain the field specified by field_name and may include additional associated fields used during index construction.

  • file_writer – I/O handler for persisting index data to storage.

  • pool – Memory pool for temporary allocations; if nullptr, uses default.

Returns:

A Result containing a shared pointer to the created GlobalIndexWriter, or an error (e.g., if the field is not found or is unsupported, or if initialization fails).

virtual Result<std::shared_ptr<GlobalIndexReader>> CreateReader(::ArrowSchema *arrow_schema, const std::shared_ptr<GlobalIndexFileReader> &file_reader, const std::vector<GlobalIndexIOMeta> &files, const std::shared_ptr<MemoryPool> &pool) const = 0#

Creates a reader for querying a pre-built global index.

Parameters:
  • arrow_schema – Schema of the indexed data; used to interpret predicate literals.

  • file_reader – I/O handler for reading index artifacts from storage.

  • files – List of index file metadata entries produced during writing.

  • pool – Memory pool for temporary allocations; if nullptr, uses default.

Returns:

A Result containing a shared pointer to the created GlobalIndexReader, or an error (e.g., if the index cannot be loaded or is incompatible).

class GlobalIndexReader : public paimon::FunctionVisitor<std::shared_ptr<GlobalIndexResult>>#

Reads and evaluates filter predicates against a global file index.

Derived classes are expected to implement the visitor methods (e.g., VisitEqual, VisitIsNull, etc.) to return index-based results that indicate which rows satisfy the given predicate.

Note

All GlobalIndexResult objects returned by implementations of this class use local row ids that start from 0 — not global row ids in the entire table. The GlobalIndexResult can be converted to global row ids by calling AddOffset().
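
The local-to-global conversion mentioned in the note can be sketched as below. This is an illustration under assumptions: ToGlobalRowIds is a hypothetical stand-in for what AddOffset() does conceptually, and it assumes the offset is the first global row id of the range the reader was created for.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for AddOffset(): shifts every local row id by the
// (assumed) first global row id of the range the reader belongs to.
std::vector<int64_t> ToGlobalRowIds(const std::vector<int64_t>& local_ids,
                                    int64_t range_start) {
    std::vector<int64_t> global_ids;
    global_ids.reserve(local_ids.size());
    for (int64_t id : local_ids) {
        global_ids.push_back(id + range_start);
    }
    return global_ids;
}
```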

Public Functions

virtual Result<std::shared_ptr<ScoredGlobalIndexResult>> VisitVectorSearch(const std::shared_ptr<VectorSearch> &vector_search) = 0#

VisitVectorSearch performs approximate vector similarity search.

Warning

VisitVectorSearch may return an error status when it is invoked on a reader that does not support it (e.g., calling VisitVectorSearch on a BitmapGlobalIndexReader).

virtual Result<std::shared_ptr<GlobalIndexResult>> VisitFullTextSearch(const std::shared_ptr<FullTextSearch> &full_text_search) = 0#

VisitFullTextSearch performs full text search.

virtual bool IsThreadSafe() const = 0#
Returns:

true if the reader is thread-safe; false otherwise.

virtual std::string GetIndexType() const = 0#
Returns:

An identifier representing the index type (e.g., “bitmap”, “lumina”).

class GlobalIndexFileReader#

Abstract interface for reading global index files from storage.

Public Functions

virtual ~GlobalIndexFileReader() = default#
virtual Result<std::unique_ptr<InputStream>> GetInputStream(const std::string &file_path) const = 0#

Opens an input stream for reading the specified global index file.

struct GlobalIndexIOMeta#

Metadata describing a single file entry in a global index.

Public Functions

inline GlobalIndexIOMeta(const std::string &_file_path, int64_t _file_size, int64_t _range_end, const std::shared_ptr<Bytes> &_metadata)#

Public Members

std::string file_path#
int64_t file_size#
int64_t range_end#

The inclusive range end covered by this file (i.e., the last local row id).

std::shared_ptr<Bytes> metadata#

Optional binary metadata associated with the file, such as serialized secondary index structures or inline index bytes.

May be null if no additional metadata is available.

class GlobalIndexResult : public std::enable_shared_from_this<GlobalIndexResult>#

Global index result to get selected global row ids.

Subclassed by paimon::BitmapGlobalIndexResult, paimon::ScoredGlobalIndexResult

Public Functions

virtual ~GlobalIndexResult() = default#
virtual Result<bool> IsEmpty() const = 0#

Checks whether the global index result contains no matching row ids.

Returns:

A Result<bool> where:

  • true indicates the result is empty (no matching rows),

  • false indicates at least one matching row exists,

  • An error is returned only if internal state is corrupted or I/O fails (e.g., during lazy loading of index data).

virtual Result<std::unique_ptr<Iterator>> CreateIterator() const = 0#

Creates a new iterator over the selected global row ids.

Result<std::vector<Range>> ToRanges() const#

Returns non-overlapping, sorted ranges covering all row ids in GlobalIndexResult.
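
What "non-overlapping, sorted ranges" means can be shown with a small sketch. RowRange is a hypothetical stand-in for the library's Range type (assumed inclusive on both ends), and the function below only illustrates the idea on a plain vector of ascending row ids; it is not the library's ToRanges().

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the library's Range type: inclusive [from, to].
struct RowRange {
    int64_t from;
    int64_t to;
};

// Collapses ascending, duplicate-free row ids into sorted, non-overlapping ranges.
std::vector<RowRange> ToRanges(const std::vector<int64_t>& sorted_ids) {
    std::vector<RowRange> ranges;
    for (int64_t id : sorted_ids) {
        if (!ranges.empty() && ranges.back().to + 1 == id) {
            ranges.back().to = id;       // extend the current run
        } else {
            ranges.push_back({id, id});  // start a new run
        }
    }
    return ranges;
}
```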

virtual Result<std::shared_ptr<GlobalIndexResult>> And(const std::shared_ptr<GlobalIndexResult> &other)#

Computes the logical AND (intersection) between this result and another.

virtual Result<std::shared_ptr<GlobalIndexResult>> Or(const std::shared_ptr<GlobalIndexResult> &other)#

Computes the logical OR (union) between this result and another.
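
The set semantics of And() and Or() can be illustrated on plain sorted vectors. This is only a sketch of the semantics: the RowIds alias and the free functions are hypothetical, and the actual results are not necessarily represented this way.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

using RowIds = std::vector<int64_t>;  // ascending, duplicate-free row ids

// Intersection: row ids present in both inputs.
RowIds And(const RowIds& a, const RowIds& b) {
    RowIds out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

// Union: row ids present in at least one input.
RowIds Or(const RowIds& a, const RowIds& b) {
    RowIds out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(out));
    return out;
}
```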

virtual Result<std::shared_ptr<GlobalIndexResult>> AddOffset(int64_t offset) = 0#

Adds the given offset to each row id in the current result and returns the new global index result.

virtual std::string ToString() const = 0#

Public Static Functions

static Result<PAIMON_UNIQUE_PTR<Bytes>> Serialize(const std::shared_ptr<GlobalIndexResult> &global_index_result, const std::shared_ptr<MemoryPool> &pool)#

Serializes a GlobalIndexResult object into a byte array.

Note

This method only supports the following concrete implementations:

Parameters:
  • global_index_result – The GlobalIndexResult instance to serialize (must not be null).

  • pool – Memory pool used to allocate the output byte buffer.

Returns:

A Result containing a unique pointer to the serialized Bytes on success, or an error status on failure.

static Result<std::shared_ptr<GlobalIndexResult>> Deserialize(const char *buffer, size_t length, const std::shared_ptr<MemoryPool> &pool)#

Deserializes a GlobalIndexResult object from a raw byte buffer.

Note

The concrete type of the deserialized object is determined by metadata embedded in the buffer. Currently, only the following types are supported:

Parameters:
  • buffer – Pointer to the serialized byte data (must not be null).

  • length – Size of the buffer in bytes.

  • pool – Memory pool used to allocate internal objects during deserialization.

Returns:

A Result containing a shared pointer to the reconstructed GlobalIndexResult on success, or an error status on failure.

class Iterator#

Iterator interface for traversing selected global row ids.

Subclassed by paimon::BitmapGlobalIndexResult::Iterator

Public Functions

virtual ~Iterator() = default#
virtual bool HasNext() const = 0#

Checks whether more row ids are available.

virtual int64_t Next() = 0#
Returns:

The next global row id. Calling this method also advances the iterator.
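
The HasNext()/Next() contract suggests the usual drain loop. The sketch below uses a hypothetical vector-backed iterator (VectorRowIdIterator) in place of a real GlobalIndexResult::Iterator, purely to show the calling pattern.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in mirroring the HasNext()/Next() contract.
class VectorRowIdIterator {
 public:
    explicit VectorRowIdIterator(std::vector<int64_t> ids) : ids_(std::move(ids)) {}
    bool HasNext() const { return pos_ < ids_.size(); }
    int64_t Next() { return ids_[pos_++]; }  // returns the id, then advances

 private:
    std::vector<int64_t> ids_;
    std::size_t pos_ = 0;
};

// Drains an iterator; Next() is only called after HasNext() returned true.
std::vector<int64_t> Collect(VectorRowIdIterator it) {
    std::vector<int64_t> out;
    while (it.HasNext()) {
        out.push_back(it.Next());
    }
    return out;
}
```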

class BitmapGlobalIndexResult : public paimon::GlobalIndexResult#

Represents a global index query result that lazily materializes its matching row ids as a Roaring bitmap.

The underlying 64-bit Roaring bitmap is not constructed during object creation; instead, it is built on-demand the first time GetBitmap() is called. This design avoids unnecessary computation and memory allocation when the bitmap is not needed (e.g., during early stopping).
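
The lazy materialization pattern can be sketched as follows. This is not the library's code: LazyRowIds is hypothetical, std::set<int64_t> stands in for RoaringBitmap64, and error handling (the real supplier returns a Result) is omitted for brevity.

```cpp
#include <cstdint>
#include <functional>
#include <set>
#include <utility>

// Hypothetical stand-in: std::set<int64_t> plays the role of RoaringBitmap64.
class LazyRowIds {
 public:
    using Supplier = std::function<std::set<int64_t>()>;
    explicit LazyRowIds(Supplier supplier) : supplier_(std::move(supplier)) {}

    // Builds the set on the first call and caches it. Not thread-safe.
    const std::set<int64_t>& Get() const {
        if (!loaded_) {
            cache_ = supplier_();
            loaded_ = true;
        }
        return cache_;
    }

 private:
    Supplier supplier_;
    mutable bool loaded_ = false;
    mutable std::set<int64_t> cache_;
};
```

If Get() is never called, the supplier never runs, which is the saving the design aims for.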

Public Types

using BitmapSupplier = std::function<Result<RoaringBitmap64>()>#

Public Functions

inline explicit BitmapGlobalIndexResult(BitmapSupplier bitmap_supplier)#
virtual Result<std::unique_ptr<GlobalIndexResult::Iterator>> CreateIterator() const override#

Creates a new iterator over the selected global row ids.

virtual Result<std::shared_ptr<GlobalIndexResult>> And(const std::shared_ptr<GlobalIndexResult> &other) override#

Computes the logical AND (intersection) between this result and another.

virtual Result<std::shared_ptr<GlobalIndexResult>> Or(const std::shared_ptr<GlobalIndexResult> &other) override#

Computes the logical OR (union) between this result and another.

virtual Result<bool> IsEmpty() const override#

Checks whether the global index result contains no matching row ids.

Returns:

A Result<bool> where:

  • true indicates the result is empty (no matching rows),

  • false indicates at least one matching row exists,

  • An error is returned only if internal state is corrupted or I/O fails (e.g., during lazy loading of index data).

virtual Result<std::shared_ptr<GlobalIndexResult>> AddOffset(int64_t offset) override#

Adds the given offset to each row id in the current result and returns the new global index result.

virtual std::string ToString() const override#
Result<const RoaringBitmap64*> GetBitmap() const#

Note

Lazy initialization: The bitmap is constructed only on the first call to this method. Subsequent calls return the cached instance. Construction may involve non-trivial CPU/IO cost (e.g., reading index files or merging bitmaps), so avoid calling this if the bitmap is not actually required. Not thread-safe.

Returns:

A non-owning, const pointer to the bitmap. The returned pointer is valid as long as this BitmapGlobalIndexResult object is alive. The caller must not modify the bitmap.

Public Static Functions

static std::shared_ptr<BitmapGlobalIndexResult> FromRanges(const std::vector<Range> &ranges)#

Creates BitmapGlobalIndexResult for all row ids in the given ranges.

Note

Overlapping or unsorted ranges are accepted.
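
Because the result is a set of row ids, overlapping or unsorted input ranges collapse naturally. The sketch below shows this with hypothetical stand-ins (RowRange for Range, assumed inclusive; std::set for the bitmap); it is an illustration, not the FromRanges() implementation.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical stand-in for the library's Range type: inclusive [from, to].
struct RowRange {
    int64_t from;
    int64_t to;
};

// Inserts every row id of every range; duplicates and order do not matter.
std::set<int64_t> RowIdsFromRanges(const std::vector<RowRange>& ranges) {
    std::set<int64_t> ids;
    for (const RowRange& r : ranges) {
        for (int64_t id = r.from; id <= r.to; ++id) {
            ids.insert(id);
        }
    }
    return ids;
}
```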

class Iterator : public paimon::GlobalIndexResult::Iterator#

Public Functions

inline Iterator(const RoaringBitmap64 *bitmap, RoaringBitmap64::Iterator &&iter)#
inline virtual bool HasNext() const override#

Checks whether more row ids are available.

inline virtual int64_t Next() override#
Returns:

The next global row id. Calling this method also advances the iterator.

class BitmapScoredGlobalIndexResult : public paimon::ScoredGlobalIndexResult#

Represents a scored global index result that combines a Roaring bitmap of candidate row ids with an array of associated relevance scores.

Important ordering note: as with any ScoredGlobalIndexResult, the results are NOT sorted by score. Instead, both the bitmap and the score vector are ordered by ascending row id. This design enables efficient merging and set operations while preserving the row id-to-score mapping.

Public Functions

inline BitmapScoredGlobalIndexResult(RoaringBitmap64 &&bitmap, std::vector<float> &&scores)#
virtual Result<std::unique_ptr<GlobalIndexResult::Iterator>> CreateIterator() const override#

Creates a new iterator over the selected global row ids.

virtual Result<std::unique_ptr<ScoredGlobalIndexResult::ScoredIterator>> CreateScoredIterator() const override#

Creates a new iterator for traversing the scored results.

virtual Result<std::shared_ptr<GlobalIndexResult>> And(const std::shared_ptr<GlobalIndexResult> &other) override#

Computes the logical AND (intersection) between this result and another.

virtual Result<std::shared_ptr<GlobalIndexResult>> Or(const std::shared_ptr<GlobalIndexResult> &other) override#

Computes the logical OR (union) between this result and another.

virtual Result<std::shared_ptr<GlobalIndexResult>> AddOffset(int64_t offset) override#

Adds the given offset to each row id in the current result and returns the new global index result.

virtual Result<bool> IsEmpty() const override#

Checks whether the global index result contains no matching row ids.

Returns:

A Result<bool> where:

  • true indicates the result is empty (no matching rows),

  • false indicates at least one matching row exists,

  • An error is returned only if internal state is corrupted or I/O fails (e.g., during lazy loading of index data).

virtual std::string ToString() const override#
Result<const RoaringBitmap64*> GetBitmap() const#
Returns:

A non-owning, const pointer to the bitmap. The row ids in the bitmap are stored in ascending order (as guaranteed by Roaring64 iteration).

const std::vector<float> &GetScores() const#
Returns:

A const reference to a vector of float scores, where the i-th element corresponds to the i-th row id when iterating the bitmap in ascending row id order.
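
The positional alignment between GetBitmap() and GetScores() can be sketched as below. ZipScores is a hypothetical helper, and std::set<int64_t> stands in for the bitmap because it also iterates in ascending order.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Pairs the i-th row id (in ascending order) with the i-th score.
std::vector<std::pair<int64_t, float>> ZipScores(const std::set<int64_t>& row_ids,
                                                 const std::vector<float>& scores) {
    std::vector<std::pair<int64_t, float>> out;
    if (row_ids.size() != scores.size()) {
        return out;  // mismatched input: nothing can be paired safely
    }
    std::size_t i = 0;
    for (int64_t id : row_ids) {
        out.emplace_back(id, scores[i++]);
    }
    return out;
}
```

A caller that needs the best matches first would sort the resulting pairs by score, since the natural order is by row id.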

class ScoredIterator : public paimon::ScoredGlobalIndexResult::ScoredIterator#

Public Functions

inline ScoredIterator(const RoaringBitmap64 *bitmap, RoaringBitmap64::Iterator &&iter, const float *scores)#
inline virtual bool HasNext() const override#

Checks whether more row ids are available.

inline virtual std::pair<int64_t, float> NextWithScore() override#

Retrieves the next (row_id, score) pair and advances the iterator.

Note

The sequence is ordered by row_id, not by score.

Returns:

A pair where:

  • first: the global row id (returned in ascending order),

  • second: the associated score computed by the index.

class GlobalIndexScan#

Represents a logical scan over a global index for a table.

Public Functions

virtual ~GlobalIndexScan() = default#
virtual Result<std::shared_ptr<RowRangeGlobalIndexScanner>> CreateRangeScan(const Range &range) = 0#

Creates a scanner for the global index over the specified row id range.

This method instantiates a low-level scanner that can evaluate predicates and retrieve matching row ids from the global index data corresponding to the given row id range.

Parameters:

range – The inclusive row id range [start, end] for which to create the scanner. The range must be fully covered by existing global index data (from GetRowRangeList()).

Returns:

A Result containing a range-level scanner, or an error if parsing the index metadata fails.

virtual Result<std::vector<Range>> GetRowRangeList() = 0#

Returns row id ranges covered by this global index (sorted and non-overlapping ranges).

Each Range represents a contiguous segment of row ids for which global index data exists. This allows the query engine to parallelize scanning and be aware of ranges that are not covered by any global index.

Returns:

A Result containing sorted and non-overlapping Range objects.
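
A query engine that wants to know which rows have no index coverage could derive the gaps from the returned list. The sketch below is an assumption-laden illustration: RowRange is a hypothetical inclusive stand-in for Range, and total_rows (the table's row count) is assumed to be known to the caller.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the library's Range type: inclusive [from, to].
struct RowRange {
    int64_t from;
    int64_t to;
};

// Returns the parts of [0, total_rows) that no covered range includes.
// `covered` must be sorted and non-overlapping.
std::vector<RowRange> UncoveredRanges(const std::vector<RowRange>& covered,
                                      int64_t total_rows) {
    std::vector<RowRange> gaps;
    int64_t next = 0;  // first row id not yet accounted for
    for (const RowRange& r : covered) {
        if (r.from > next) {
            gaps.push_back({next, r.from - 1});
        }
        next = r.to + 1;
    }
    if (next < total_rows) {
        gaps.push_back({next, total_rows - 1});
    }
    return gaps;
}
```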

Public Static Functions

static Result<std::unique_ptr<GlobalIndexScan>> Create(const std::string &table_path, const std::optional<int64_t> &snapshot_id, const std::optional<std::vector<std::map<std::string, std::string>>> &partitions, const std::map<std::string, std::string> &options, const std::shared_ptr<FileSystem> &file_system, const std::shared_ptr<MemoryPool> &pool)#

Creates a GlobalIndexScan instance for the specified table and context.

Parameters:
  • table_path – Root directory of the table.

  • snapshot_id – Optional snapshot id to read from; if not provided, uses the latest.

  • partitions – Optional list of specific partitions to restrict the scan scope. Each map represents one partition (e.g., {“dt”: “2024-06-01”}). If omitted, scans all partitions.

  • options – Index-specific configuration.

  • file_system – File system for accessing index files. If not provided (nullptr), it is inferred from the FILE_SYSTEM key in the options parameter.

  • pool – Memory pool for temporary allocations; if nullptr, uses default.

Returns:

A Result containing a unique pointer to the created scanner, or an error if initialization fails (e.g., I/O error).

static Result<std::unique_ptr<GlobalIndexScan>> Create(const std::string &root_path, const std::optional<int64_t> &snapshot_id, const std::shared_ptr<Predicate> &partition_filters, const std::map<std::string, std::string> &options, const std::shared_ptr<FileSystem> &file_system, const std::shared_ptr<MemoryPool> &memory_pool)#

Creates a GlobalIndexScan instance for the specified table and context.

Parameters:

partition_filters – Optional predicate that restricts the scan to specific partitions.

class RowRangeGlobalIndexScanner#

Interface for scanning global index data at the range level.

Public Functions

virtual ~RowRangeGlobalIndexScanner() = default#
virtual Result<std::shared_ptr<GlobalIndexReader>> CreateReader(const std::string &field_name, const std::string &index_type) const = 0#

Creates a GlobalIndexReader for a specific field and index type within this range.

This reader provides low-level access to the serialized index data for the given column (field_name) and index kind (index_type, such as “bitmap”).

Note

All GlobalIndexResult objects returned by GlobalIndexReader use local row ids that start from 0 — not global row ids in the entire table.

Parameters:
  • field_name – Name of the indexed column.

  • index_type – Type of the global index (e.g., “bitmap”, “lumina”).

Returns:

A Result that is:

  • Successful with a non-null reader if the index exists and loads correctly;

  • Successful with a null pointer if no index was built for the given field and type;

  • An error only if loading fails (e.g., file corruption, I/O error, unsupported format).

virtual Result<std::vector<std::shared_ptr<GlobalIndexReader>>> CreateReaders(const std::string &field_name) const = 0#

Creates several GlobalIndexReaders for a specific field within this range.

Parameters:

field_name – Name of the indexed column.

Returns:

A Result that is:

  • Successful with several readers if the indexes exist and load correctly;

  • Successful with an empty vector if no index was built for the given field;

  • An error if loading fails (e.g., file corruption, I/O error, unsupported format).

class IndexedSplit : public paimon::Split#

Indexed split for global index reading operation.

Public Functions

virtual std::shared_ptr<DataSplit> GetDataSplit() const = 0#
Returns:

The underlying physical data split containing actual data file details.

virtual const std::vector<Range> &RowRanges() const = 0#
Returns:

A list of row intervals [start, end] indicating which rows are relevant (e.g., rows that passed predicate pushdown).

virtual const std::vector<float> &Scores() const = 0#
Returns:

A score for each individual row included in RowRanges(), in the order they appear when traversing the ranges.
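
The relationship between RowRanges() and Scores() can be sketched as below. This is a hedged illustration: RowRange is a hypothetical inclusive stand-in for Range, PairRowsWithScores is not a library function, and the sketch assumes exactly one score per row.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the library's Range type: inclusive [from, to].
struct RowRange {
    int64_t from;
    int64_t to;
};

// Walks the ranges in order and attaches the next score to each row.
std::vector<std::pair<int64_t, float>> PairRowsWithScores(
    const std::vector<RowRange>& ranges, const std::vector<float>& scores) {
    std::vector<std::pair<int64_t, float>> out;
    std::size_t next_score = 0;
    for (const RowRange& r : ranges) {
        for (int64_t row = r.from; row <= r.to; ++row) {
            if (next_score >= scores.size()) {
                return {};  // fewer scores than rows: treat as invalid input
            }
            out.emplace_back(row, scores[next_score++]);
        }
    }
    return out;
}
```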

class GlobalIndexWriteTask#

Writes a range-level global index for a specific data split and field.

Public Functions

GlobalIndexWriteTask() = delete#
~GlobalIndexWriteTask() = delete#

Public Static Functions

static Result<std::shared_ptr<CommitMessage>> WriteIndex(const std::string &table_path, const std::string &field_name, const std::string &index_type, const std::shared_ptr<IndexedSplit> &indexed_split, const std::map<std::string, std::string> &options, const std::shared_ptr<MemoryPool> &pool, const std::shared_ptr<FileSystem> &file_system = nullptr)#

Builds and writes a global index for the specified data range.

Parameters:
  • table_path – Path to the table root directory where index files are stored.

  • field_name – Name of the indexed column (must be present in the table schema).

  • index_type – Type of global index to build (e.g., “bitmap”, “lumina”).

  • indexed_split – The indexed split containing the actual data (e.g., a Parquet file). The range must be fully contained within the data covered by the given split.

  • options – Index-specific configuration (e.g., false positive rate for bloom filters).

  • pool – Memory pool for temporary allocations during index construction. If nullptr, the system’s default memory pool will be used.

  • file_system – Specifies the file system for file operations. If nullptr, the default file system is used.

Returns:

A Result containing a shared pointer to the CommitMessage with index metadata, or an error if indexing fails (e.g., unsupported type, I/O error).

class GlobalIndexWriter#

Abstract interface for building a global index from Arrow data batches.

Public Functions

virtual ~GlobalIndexWriter() = default#
virtual Status AddBatch(::ArrowArray *arrow_array) = 0#

Builds index structures from a batch of columnar data.

Parameters:

arrow_array – A valid C ArrowArray pointer representing a struct array. Must not be nullptr, and must conform to the expected schema.

Returns:

Status::OK() on success; otherwise, an error indicating, for example, malformed input, I/O failure, or an unsupported type.

virtual Result<std::vector<GlobalIndexIOMeta>> Finish() = 0#

Finalizes the index build process and returns metadata for persisted index.

class GlobalIndexFileWriter#

Abstract interface for writing global index files to storage.

Public Functions

virtual ~GlobalIndexFileWriter() = default#
virtual Result<std::string> NewFileName(const std::string &prefix) const = 0#

Generates a unique file name for a new index file using the given prefix.

Note

This function may be called multiple times if the index consists of multiple files.

virtual Result<std::unique_ptr<OutputStream>> NewOutputStream(const std::string &file_name) const = 0#

Opens a new output stream for writing index data to the specified file.

virtual Result<int64_t> GetFileSize(const std::string &file_name) const = 0#

Returns the size of the file with the given file name.

virtual std::string ToPath(const std::string &file_name) const = 0#

Returns the index file path for the given file name.