Global Index#

Interface#

class GlobalIndexFileReader#

Abstract interface for reading global index files from storage.

Public Functions

virtual ~GlobalIndexFileReader() = default#
virtual Result<std::unique_ptr<InputStream>> GetInputStream(const std::string &file_name) const = 0#

Opens an input stream for reading the specified global index file.

class GlobalIndexFileWriter#

Abstract interface for writing global index files to storage.

Public Functions

virtual ~GlobalIndexFileWriter() = default#
virtual Result<std::string> NewFileName(const std::string &prefix) const = 0#

Generates a unique file name for a new index file using the given prefix.

Note

This function may be called multiple times if the index consists of multiple files.

virtual Result<std::unique_ptr<OutputStream>> NewOutputStream(const std::string &file_name) const = 0#

Opens a new output stream for writing index data to the specified file.

virtual Result<int64_t> GetFileSize(const std::string &file_name) const = 0#

Returns the size of the file with the given name.
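The note on NewFileName() implies that each call should yield a distinct name, since one index may span several files. A minimal sketch of one way to satisfy that, assuming a simple counter-based scheme; `UniqueNameGenerator` and the `.index` extension are hypothetical and not part of the library:

```cpp
#include <atomic>
#include <cstdint>
#include <string>

// Hypothetical helper (not a library type): returns a distinct file name on
// every call, mirroring how NewFileName() may be invoked once per file when
// an index consists of multiple files.
class UniqueNameGenerator {
 public:
  std::string NewFileName(const std::string& prefix) {
    // fetch_add returns the previous value, so numbering starts at 0.
    const uint64_t id = counter_.fetch_add(1, std::memory_order_relaxed);
    return prefix + "-" + std::to_string(id) + ".index";
  }

 private:
  std::atomic<uint64_t> counter_{0};
};
```

A real implementation would more likely derive names from something collision-resistant across processes (for example a UUID); the counter is used here only to keep the sketch deterministic.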

class GlobalIndexEvaluator#

Abstract base class for evaluating predicates against a global index.

Public Functions

virtual ~GlobalIndexEvaluator() = default#
virtual Result<std::optional<std::shared_ptr<GlobalIndexResult>>> Evaluate(const std::shared_ptr<Predicate> &predicate) = 0#

Evaluates a predicate against the global index.

Note

Top-K predicates are not handled by this method. Use GlobalIndexReader::VisitTopK() for Top-K specific index evaluation.

Parameters:

predicate – The filter predicate to evaluate.

Returns:

A Result containing:

  • std::nullopt if the predicate cannot be evaluated by this index (e.g., field has no index),

  • A std::shared_ptr<GlobalIndexResult> if evaluation succeeds. The GlobalIndexResult indicates the matching rows (e.g., via row ID bitmaps).
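The distinction between "cannot be evaluated" and "evaluated, no rows match" matters to callers. The following sketch shows how a caller might branch on a successful Evaluate() outcome. `FakeIndexResult`, `EvalOutcome`, and `DescribeOutcome` are hypothetical stand-ins, and the outer Result<> error layer is omitted to keep the example self-contained:

```cpp
#include <cstdint>
#include <memory>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for GlobalIndexResult: a plain list of matching row ids.
struct FakeIndexResult {
  std::vector<int64_t> row_ids;
};

using EvalOutcome = std::optional<std::shared_ptr<FakeIndexResult>>;

// Describes what a caller could do next for each outcome.
std::string DescribeOutcome(const EvalOutcome& outcome) {
  if (!outcome.has_value()) {
    // The index cannot answer this predicate: fall back to reading the data.
    return "fallback-to-full-scan";
  }
  const std::shared_ptr<FakeIndexResult>& result = outcome.value();
  // The null check is purely defensive for this sketch.
  if (result == nullptr || result->row_ids.empty()) {
    // The index answered and nothing matches: no rows need to be read.
    return "skip-all-rows";
  }
  return "read-" + std::to_string(result->row_ids.size()) + "-rows";
}
```

The key point is that std::nullopt must not be treated as an empty result; doing so would silently drop rows that an unindexed predicate might match.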

class GlobalIndexReader : public paimon::FunctionVisitor<std::shared_ptr<GlobalIndexResult>>#

Reads and evaluates filter predicates against a global file index.

GlobalIndexReader is an implementation of the FunctionVisitor interface specialized to produce std::shared_ptr<GlobalIndexResult> objects.

Derived classes are expected to implement the visitor methods (e.g., VisitEqual, VisitIsNull, etc.) to return index-based results that indicate which rows satisfy the given predicate.

Public Types

using TopKPreFilter = std::function<bool(int64_t)>#

TopKPreFilter: A lightweight pre-filtering function applied before similarity scoring.

It operates solely on row_id and is typically driven by another global index, such as a bitmap or range index. This filter enables early pruning of irrelevant candidates (e.g., “only consider rows with label X”), significantly reducing the search space. It returns true to include the row in the Top-K computation and false to exclude it.

Note

Must be thread-safe.
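One way to meet the thread-safety requirement is to capture only immutable state. The sketch below builds a pre-filter from a fixed allow-list of row ids; `MakeAllowListFilter` is a hypothetical helper, and in practice the allow-list would more likely come from a bitmap index result than from a hash set:

```cpp
#include <cstdint>
#include <functional>
#include <memory>
#include <unordered_set>
#include <utility>

// Same shape as the TopKPreFilter alias documented above.
using TopKPreFilter = std::function<bool(int64_t)>;

// Builds a pre-filter from a fixed set of allowed row ids. The set is never
// modified after construction, so concurrent calls to the returned function
// only perform read-only lookups.
TopKPreFilter MakeAllowListFilter(std::unordered_set<int64_t> allowed_row_ids) {
  auto allowed = std::make_shared<const std::unordered_set<int64_t>>(
      std::move(allowed_row_ids));
  return [allowed](int64_t row_id) { return allowed->count(row_id) > 0; };
}
```

Holding the set through a shared pointer to const keeps copies of the std::function cheap and makes accidental mutation a compile error.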

Public Functions

virtual Result<std::shared_ptr<TopKGlobalIndexResult>> VisitTopK(int32_t k, const std::vector<float> &query, TopKPreFilter filter, const std::shared_ptr<Predicate> &predicate) = 0#

VisitTopK performs approximate top-k similarity search.

Note

All fields referenced in the predicate must have been materialized in the index during build to ensure availability.

Note

VisitTopK is thread-safe; the other VisitXXX methods are not.

Parameters:
  • k – Number of top results to return.

  • query – The query vector (must match the dimensionality of the indexed vectors).

  • filter – A pre-filter based on row_id, implemented by leveraging other global index structures (e.g., bitmap index) for efficient candidate pruning.

  • predicate – A runtime filtering condition that may involve graph traversal of structured attributes. Using this parameter often yields better filtering accuracy because during index construction, the underlying graph was built with explicit consideration of field connectivity (e.g., relationships between attributes). As a result, predicates can leverage this pre-established semantic structure to perform more meaningful and context-aware filtering at query time.
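To make the roles of k, query, and filter concrete, here is a sketch that substitutes an exact brute-force scan for the approximate search that VisitTopK performs. It is not the library's algorithm. The dot-product score, the use of the vector position as the row id, and the names `BruteForceTopK` and `ScoredRow` are all assumptions of this example, and the predicate parameter is left out:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using TopKPreFilter = std::function<bool(int64_t)>;
using ScoredRow = std::pair<int64_t, float>;  // (row_id, score)

// Exact top-k by dot product; the row id is the position in `vectors`.
// Returns at most k rows, sorted by descending score.
std::vector<ScoredRow> BruteForceTopK(const std::vector<std::vector<float>>& vectors,
                                      int32_t k, const std::vector<float>& query,
                                      const TopKPreFilter& filter) {
  std::vector<ScoredRow> scored;
  for (std::size_t row = 0; row < vectors.size(); ++row) {
    const auto row_id = static_cast<int64_t>(row);
    // Apply the pre-filter first so excluded rows are never scored.
    if (filter && !filter(row_id)) continue;
    // Skip vectors whose dimensionality does not match the query.
    if (vectors[row].size() != query.size()) continue;
    float score = 0.0f;
    for (std::size_t d = 0; d < query.size(); ++d) {
      score += vectors[row][d] * query[d];
    }
    scored.emplace_back(row_id, score);
  }
  const std::size_t wanted = k > 0 ? static_cast<std::size_t>(k) : 0;
  const std::size_t keep = std::min(scored.size(), wanted);
  std::partial_sort(scored.begin(),
                    scored.begin() + static_cast<std::ptrdiff_t>(keep), scored.end(),
                    [](const ScoredRow& a, const ScoredRow& b) { return a.second > b.second; });
  scored.resize(keep);
  return scored;
}
```

Because the filter runs before scoring, every excluded row saves a full vector comparison, which is the pruning effect the pre-filter is meant to provide.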

struct GlobalIndexIOMeta#

Metadata describing a single file entry in a global index.

Public Functions

inline GlobalIndexIOMeta(const std::string &_file_name, int64_t _file_size, const Range &_row_id_range, const std::shared_ptr<Bytes> &_metadata)#

Public Members

std::string file_name#
int64_t file_size#
Range row_id_range#

The inclusive range of row IDs covered by this file (i.e., [from, to]).

std::shared_ptr<Bytes> metadata#

Optional binary metadata associated with the file, such as serialized secondary index structures or inline index bytes.

May be null if no additional metadata is available.

class GlobalIndexResult : public std::enable_shared_from_this<GlobalIndexResult>#

Result of a global index evaluation, giving access to the selected global row ids.

Subclassed by paimon::BitmapGlobalIndexResult, paimon::TopKGlobalIndexResult

Public Functions

virtual ~GlobalIndexResult() = default#
virtual Result<bool> IsEmpty() const = 0#

Checks whether the global index result contains no matching row IDs.

Returns:

A Result<bool> where:

  • true indicates the result is empty (no matching rows),

  • false indicates at least one matching row exists,

  • An error is returned only if internal state is corrupted or I/O fails (e.g., during lazy loading of index data).

virtual Result<std::unique_ptr<Iterator>> CreateIterator() const = 0#

Creates a new iterator over the selected global row ids.

virtual Result<std::shared_ptr<GlobalIndexResult>> And(const std::shared_ptr<GlobalIndexResult> &other)#

Computes the logical AND (intersection) between this result and another.

virtual Result<std::shared_ptr<GlobalIndexResult>> Or(const std::shared_ptr<GlobalIndexResult> &other)#

Computes the logical OR (union) between this result and another.
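The semantics of And() and Or() can be illustrated on plain sorted row-id lists. This sketch uses sorted vectors as a stand-in for the bitmap-backed results; `RowIds`, `IntersectRowIds`, and `UnionRowIds` are hypothetical names, and a real bitmap implementation would combine compressed containers rather than individual ids:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical stand-in for a result: sorted, unique global row ids.
using RowIds = std::vector<int64_t>;

// Logical AND: row ids present in both inputs.
RowIds IntersectRowIds(const RowIds& lhs, const RowIds& rhs) {
  RowIds out;
  std::set_intersection(lhs.begin(), lhs.end(), rhs.begin(), rhs.end(),
                        std::back_inserter(out));
  return out;
}

// Logical OR: row ids present in at least one input.
RowIds UnionRowIds(const RowIds& lhs, const RowIds& rhs) {
  RowIds out;
  std::set_union(lhs.begin(), lhs.end(), rhs.begin(), rhs.end(),
                 std::back_inserter(out));
  return out;
}
```

Both standard algorithms require their inputs to be sorted, which is why the stand-in type is documented as sorted and unique.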

virtual std::string ToString() const = 0#

Public Static Functions

static Result<PAIMON_UNIQUE_PTR<Bytes>> Serialize(const std::shared_ptr<GlobalIndexResult> &global_index_result, const std::shared_ptr<MemoryPool> &pool)#

Serializes a GlobalIndexResult object into a byte array.

Note

This method only supports the following concrete implementations:

  • BitmapTopKGlobalIndexResult

  • BitmapGlobalIndexResult

Parameters:
  • global_index_result – The GlobalIndexResult instance to serialize (must not be null).

  • pool – Memory pool used to allocate the output byte buffer.

Returns:

A Result containing a unique pointer to the serialized Bytes on success, or an error status on failure.

static Result<std::shared_ptr<GlobalIndexResult>> Deserialize(const char *buffer, size_t length, const std::shared_ptr<MemoryPool> &pool)#

Deserializes a GlobalIndexResult object from a raw byte buffer.

Note

The concrete type of the deserialized object is determined by metadata embedded in the buffer. Currently, only the following types are supported:

  • BitmapTopKGlobalIndexResult

  • BitmapGlobalIndexResult

Parameters:
  • buffer – Pointer to the serialized byte data (must not be null).

  • length – Size of the buffer in bytes.

  • pool – Memory pool used to allocate internal objects during deserialization.

Returns:

A Result containing a shared pointer to the reconstructed GlobalIndexResult on success, or an error status on failure.
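The note above says the concrete type is recovered from metadata embedded in the buffer. The library's actual byte layout is not described here, so the following is only a sketch of the general idea, under an assumed layout of a one-byte type tag followed by raw int64_t row ids in native byte order. `ResultKind`, `DecodedResult`, `SerializeTagged`, and `DeserializeTagged` are all hypothetical:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Hypothetical type tags; not the library's real encoding.
enum class ResultKind : uint8_t { kBitmap = 1, kBitmapTopK = 2 };

struct DecodedResult {
  ResultKind kind;
  std::vector<int64_t> row_ids;
};

// Assumed layout for this sketch: [1-byte kind][row ids as raw int64_t...].
std::vector<char> SerializeTagged(ResultKind kind, const std::vector<int64_t>& row_ids) {
  std::vector<char> buffer(1 + row_ids.size() * sizeof(int64_t));
  buffer[0] = static_cast<char>(kind);
  if (!row_ids.empty()) {
    std::memcpy(buffer.data() + 1, row_ids.data(), row_ids.size() * sizeof(int64_t));
  }
  return buffer;
}

// Returns std::nullopt for a null or empty buffer, an unknown tag, or a
// payload whose size is not a multiple of sizeof(int64_t).
std::optional<DecodedResult> DeserializeTagged(const char* buffer, std::size_t length) {
  if (buffer == nullptr || length < 1) return std::nullopt;
  const auto tag = static_cast<uint8_t>(buffer[0]);
  if (tag != 1 && tag != 2) return std::nullopt;
  const std::size_t payload = length - 1;
  if (payload % sizeof(int64_t) != 0) return std::nullopt;
  DecodedResult decoded{static_cast<ResultKind>(tag), {}};
  decoded.row_ids.resize(payload / sizeof(int64_t));
  if (payload > 0) std::memcpy(decoded.row_ids.data(), buffer + 1, payload);
  return decoded;
}
```

Validating the tag and the payload size before copying mirrors the documented behavior of returning an error status for buffers that cannot be reconstructed.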

class Iterator#

Iterator interface for traversing selected global row ids.

Subclassed by paimon::BitmapGlobalIndexResult::Iterator

Public Functions

virtual ~Iterator() = default#
virtual bool HasNext() const = 0#

Checks whether more row ids are available.

virtual int64_t Next() = 0#
Returns:

The next global row id. Calling this method also advances the iterator.
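The HasNext()/Next() pair is meant to be used in a standard drain loop. A minimal sketch, using a hypothetical vector-backed iterator with the same method shapes (`VectorRowIdIterator` and `Drain` are not library names):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical iterator with the same HasNext()/Next() shape as the
// interface above, backed by an in-memory vector.
class VectorRowIdIterator {
 public:
  explicit VectorRowIdIterator(std::vector<int64_t> row_ids)
      : row_ids_(std::move(row_ids)) {}

  bool HasNext() const { return pos_ < row_ids_.size(); }

  // Precondition: HasNext() is true.
  int64_t Next() { return row_ids_[pos_++]; }

 private:
  std::vector<int64_t> row_ids_;
  std::size_t pos_ = 0;
};

// Collects all remaining row ids by calling Next() only while HasNext() holds.
template <typename Iter>
std::vector<int64_t> Drain(Iter& it) {
  std::vector<int64_t> out;
  while (it.HasNext()) {
    out.push_back(it.Next());
  }
  return out;
}
```

Next() has no way to signal exhaustion through its return type, so checking HasNext() before every call is the only safe pattern.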

class GlobalIndexScan#

Represents a logical scan over a global index for a table.

Public Functions

virtual ~GlobalIndexScan() = default#
virtual Result<std::shared_ptr<RowRangeGlobalIndexScanner>> CreateRangeScan(const Range &range) = 0#

Creates a scanner for the global index over the specified row ID range.

This method instantiates a low-level scanner that can evaluate predicates and retrieve matching row IDs from the global index data corresponding to the given row ID range.

Parameters:

range – The inclusive row ID range [start, end] for which to create the scanner. The range must be fully covered by existing global index data (from GetRowRangeList()).

Returns:

A Result containing a range-level scanner, or an error if parsing the index metadata fails.

virtual Result<std::vector<Range>> GetRowRangeList() = 0#

Returns row ID ranges covered by this global index (sorted and non-overlapping ranges).

Each Range represents a contiguous segment of row IDs for which global index data exists. This allows the query engine to parallelize scanning and be aware of ranges that are not covered by any global index.

Returns:

A Result containing sorted and non-overlapping Range objects.
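Since the returned ranges are sorted and non-overlapping, a query engine can find the row IDs that no global index covers with a single pass. A sketch under the assumption that ranges are inclusive [from, to]; `InclusiveRange` and `FindUncovered` are hypothetical names, and the code does not guard against overflow when `to` equals the maximum int64_t value:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for Range: inclusive bounds [from, to].
struct InclusiveRange {
  int64_t from;
  int64_t to;
};

// Given sorted, non-overlapping covered ranges, returns the parts of
// [total.from, total.to] that no covered range includes.
std::vector<InclusiveRange> FindUncovered(const InclusiveRange& total,
                                          const std::vector<InclusiveRange>& covered) {
  std::vector<InclusiveRange> gaps;
  int64_t next = total.from;  // first row id not yet accounted for
  for (const InclusiveRange& r : covered) {
    if (r.to < next) continue;     // lies entirely before the cursor
    if (r.from > total.to) break;  // lies entirely after the total range
    if (r.from > next) gaps.push_back({next, r.from - 1});
    next = r.to + 1;
    if (next > total.to) break;
  }
  if (next <= total.to) gaps.push_back({next, total.to});
  return gaps;
}
```

The covered ranges can be handed to CreateRangeScan() independently, while the gaps returned here would have to be served by a regular scan.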

Public Static Functions

static Result<std::unique_ptr<GlobalIndexScan>> Create(const std::string &table_path, const std::optional<int64_t> &snapshot_id, const std::optional<std::vector<std::map<std::string, std::string>>> &partitions, const std::map<std::string, std::string> &options, const std::shared_ptr<FileSystem> &file_system, const std::shared_ptr<MemoryPool> &pool)#

Creates a GlobalIndexScan instance for the specified table and context.

Parameters:
  • table_path – Root directory of the table.

  • snapshot_id – Optional snapshot ID to read from; if not provided, uses the latest.

  • partitions – Optional list of partition specs to restrict the scan scope. Each map represents one partition (e.g., {"dt": "2024-06-01"}). If omitted, scans all partitions.

  • options – Index-specific configuration.

  • file_system – File system for accessing index files. If not provided (nullptr), it is inferred from the FILE_SYSTEM key in the options parameter.

  • pool – Memory pool for temporary allocations; if nullptr, uses default.

Returns:

A Result containing a unique pointer to the created scanner, or an error if initialization fails (e.g., I/O error).

class GlobalIndexWriter#

Abstract interface for building a global index from Arrow data batches.

Public Functions

virtual ~GlobalIndexWriter() = default#
virtual Status AddBatch(::ArrowArray *arrow_array) = 0#

Builds index structures from a batch of columnar data.

Parameters:

arrow_array – A valid C ArrowArray pointer representing a struct array. Must not be nullptr, and must conform to the expected schema.

Returns:

Status::OK() on success; otherwise, an error indicating, for example, malformed input, an I/O failure, or an unsupported type.

virtual Result<std::vector<GlobalIndexIOMeta>> Finish() = 0#

Finalizes the index build process and returns metadata for the persisted index files.

class GlobalIndexerFactory : public paimon::Factory#

Factory for creating GlobalIndexer instances based on index type identifiers.

Public Functions

~GlobalIndexerFactory() override = default#
virtual Result<std::unique_ptr<GlobalIndexer>> Create(const std::map<std::string, std::string> &options) const = 0#

Creates a GlobalIndexer using the current factory’s implementation and the given options.

Public Static Functions

static Result<std::unique_ptr<GlobalIndexer>> Get(const std::string &identifier, const std::map<std::string, std::string> &options)#

Creates a GlobalIndexer instance by looking up a registered factory using an identifier.

GLOBAL_INDEX_IDENTIFIER_SUFFIX (e.g., “-global”) is automatically appended to the provided identifier to form the full key used for factory lookup. This ensures namespace separation between file and global index types.

Parameters:
  • identifier – The base name of the index type (e.g., “bitmap”).

  • options – Configuration parameters for the indexer.

Returns:

A Result containing a unique pointer to the created GlobalIndexer, or an error if creation fails. The pointer is nullptr if no matching factory is registered.
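The lookup rule, base identifier plus suffix, can be sketched with an ordinary map as the registry. `FakeIndexer`, `FakeFactory`, `kGlobalSuffix`, and `GetIndexer` are hypothetical; the suffix value is assumed from the “-global” example above, and the real registry and Result<> wrapper are not modeled:

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins for GlobalIndexer and its factory.
struct FakeIndexer {
  std::string kind;
};
using FakeFactory = std::function<std::unique_ptr<FakeIndexer>()>;

// Assumed value, taken from the "-global" example in the text.
const char kGlobalSuffix[] = "-global";

// Looks up `identifier + suffix` in the registry and invokes the factory.
// Returns nullptr when nothing is registered under the full key.
std::unique_ptr<FakeIndexer> GetIndexer(const std::map<std::string, FakeFactory>& registry,
                                        const std::string& identifier) {
  const auto it = registry.find(identifier + kGlobalSuffix);
  if (it == registry.end()) return nullptr;
  return it->second();
}
```

Callers pass the base name only; passing an identifier that already carries the suffix would produce a doubled key such as "bitmap-global-global" and find nothing.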

Public Static Attributes

static const char GLOBAL_INDEX_IDENTIFIER_SUFFIX[]#

Suffix used to distinguish global index identifiers (e.g., “bitmap-global”).

class GlobalIndexer#

Interface for creating global index readers and writers.

Public Functions

virtual ~GlobalIndexer() = default#
virtual Result<std::shared_ptr<GlobalIndexWriter>> CreateWriter(const std::string &field_name, ::ArrowSchema *arrow_schema, const std::shared_ptr<GlobalIndexFileWriter> &file_writer, const std::shared_ptr<MemoryPool> &pool) const = 0#

Creates a writer for building a global index on a specific field.

Parameters:
  • field_name – Name of the field to be indexed.

  • arrow_schema – Schema of the input Arrow struct array. It must contain the field specified by field_name and may include additional associated fields used during index construction.

  • file_writer – I/O handler for persisting index data to storage.

  • pool – Memory pool for temporary allocations; if nullptr, uses default.

Returns:

A Result containing a shared pointer to the created GlobalIndexWriter, or an error if, for example, the field is not found, the field type is unsupported, or initialization fails.

virtual Result<std::shared_ptr<GlobalIndexReader>> CreateReader(::ArrowSchema *arrow_schema, const std::shared_ptr<GlobalIndexFileReader> &file_reader, const std::vector<GlobalIndexIOMeta> &files, const std::shared_ptr<MemoryPool> &pool) const = 0#

Creates a reader for querying a pre-built global index.

Parameters:
  • arrow_schema – Schema of the indexed data; used to interpret predicate literals.

  • file_reader – I/O handler for reading index artifacts from storage.

  • files – List of index file metadata entries produced during writing.

  • pool – Memory pool for temporary allocations; if nullptr, uses default.

Returns:

A Result containing a shared pointer to the created GlobalIndexReader, or an error if, for example, the index cannot be loaded or is incompatible.

class RowRangeGlobalIndexScanner#

Interface for scanning global index data at the range level.

Public Functions

virtual ~RowRangeGlobalIndexScanner() = default#
virtual Result<std::shared_ptr<GlobalIndexEvaluator>> CreateIndexEvaluator() const = 0#

Creates a GlobalIndexEvaluator tailored to this range’s index layout.

The returned evaluator can be used to assess whether a given predicate can be answered using the global index data of this range (e.g., via bitmap intersection).

Returns:

A Result containing a shared pointer to the evaluator, or an error if the index metadata is invalid or unsupported.

virtual Result<std::shared_ptr<GlobalIndexReader>> CreateReader(const std::string &field_name, const std::string &index_type) const = 0#

Creates a GlobalIndexReader for a specific field and index type within this range.

This reader provides low-level access to the serialized index data for the given column (field_name) and index kind (index_type, such as “bitmap”).

Parameters:
  • field_name – Name of the indexed column.

  • index_type – Type of the global index (e.g., “bitmap”, “lumina”).

Returns:

A Result that is:

  • Successful with a non-null reader if the index exists and loads correctly;

  • Successful with a null pointer if no index was built for the given field and type;

  • An error only if loading fails (e.g., file corruption, I/O error, unsupported format).

class RowRangeGlobalIndexWriter#

Writes a range-level global index for a specific data split and field.

Public Functions

RowRangeGlobalIndexWriter() = delete#
~RowRangeGlobalIndexWriter() = delete#

Public Static Functions

static Result<std::shared_ptr<CommitMessage>> WriteIndex(const std::string &table_path, const std::string &field_name, const std::string &index_type, const std::shared_ptr<IndexedSplit> &indexed_split, const std::map<std::string, std::string> &options, const std::shared_ptr<MemoryPool> &pool)#

Builds and writes a global index for the specified data range.

Parameters:
  • table_path – Path to the table root directory where index files are stored.

  • field_name – Name of the indexed column (must be present in the table schema).

  • index_type – Type of global index to build (e.g., “bitmap”, “lumina”).

  • indexed_split – The indexed split containing the actual data (e.g., a Parquet file). The range must be fully contained within the data covered by the given split.

  • options – Index-specific configuration (e.g., false positive rate for bloom filters).

  • pool – Memory pool for temporary allocations during index construction.

Returns:

A Result containing a shared pointer to the CommitMessage with index metadata, or an error if indexing fails (e.g., unsupported type, I/O error).