Global Index
Interface
-
class GlobalIndexerFactory : public paimon::Factory#
Factory for creating GlobalIndexer instances based on index type identifiers.
Public Functions
-
~GlobalIndexerFactory() override = default#
-
virtual Result<std::unique_ptr<GlobalIndexer>> Create(const std::map<std::string, std::string> &options) const = 0#
Creates a GlobalIndexer using the current factory’s implementation and the given options.
Public Static Functions
-
static Result<std::unique_ptr<GlobalIndexer>> Get(const std::string &identifier, const std::map<std::string, std::string> &options)#
Creates a GlobalIndexer instance by looking up a registered factory using an identifier.
GLOBAL_INDEX_IDENTIFIER_SUFFIX (e.g., “-global”) is automatically appended to the provided identifier to form the full key used for factory lookup. This ensures namespace separation between file and global index types.
- Parameters:
identifier – The base name of the index type (e.g., “bitmap”).
options – Configuration parameters for the indexer.
- Returns:
A Result containing a unique pointer to the created GlobalIndexer, or an error if creation fails; nullptr if no matching factory is registered.
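The suffix-based lookup described above can be sketched as follows. This is a minimal illustration, not the paimon implementation: DemoIndexer, DemoCreator, FullKey, GetDemoIndexer and MakeDemoRegistry are hypothetical names, and the suffix value is taken from the “-global” example in the text.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins; these are not the paimon types.
struct DemoIndexer {
    std::string kind;
};
using DemoCreator = std::function<std::unique_ptr<DemoIndexer>()>;

// Suffix value assumed from the "-global" example in the text.
const std::string kDemoSuffix = "-global";

// Builds the full registry key from a base identifier.
std::string FullKey(const std::string& identifier) {
    assert(!identifier.empty());  // an empty base name would yield the bare suffix
    return identifier + kDemoSuffix;
}

// Looks up a creator under the suffixed key; returns nullptr when none is registered.
std::unique_ptr<DemoIndexer> GetDemoIndexer(const std::map<std::string, DemoCreator>& registry,
                                            const std::string& identifier) {
    auto it = registry.find(FullKey(identifier));
    if (it == registry.end()) {
        return nullptr;
    }
    return it->second();
}

// Example registry holding a file-level and a global entry for the same base name.
std::map<std::string, DemoCreator> MakeDemoRegistry() {
    std::map<std::string, DemoCreator> registry;
    registry["bitmap"] = [] {
        return std::make_unique<DemoIndexer>(DemoIndexer{"file-bitmap"});
    };
    registry["bitmap-global"] = [] {
        return std::make_unique<DemoIndexer>(DemoIndexer{"global-bitmap"});
    };
    return registry;
}
```

With this registry, a lookup for “bitmap” resolves to the “bitmap-global” entry, so the file-level entry with the same base name is never returned.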
Public Static Attributes
-
static const char GLOBAL_INDEX_IDENTIFIER_SUFFIX[]#
Suffix used to distinguish global index identifiers (e.g., “bitmap-global”).
-
class GlobalIndexer#
Interface for creating global index readers and writers.
Public Functions
-
virtual ~GlobalIndexer() = default#
Creates a writer for building a global index on a specific field.
- Parameters:
field_name – Name of the field to be indexed.
arrow_schema – Schema of the input Arrow struct array. It must contain the field specified by field_name and may include additional associated fields used during index construction.
file_writer – I/O handler for persisting index data to storage.
pool – Memory pool for temporary allocations; if nullptr, uses default.
- Returns:
A Result containing a shared pointer to the created GlobalIndexWriter, or an error (e.g., the field is not found or unsupported, or initialization fails).
Creates a reader for querying a pre-built global index.
- Parameters:
arrow_schema – Schema of the indexed data; used to interpret predicate literals.
file_reader – I/O handler for reading index artifacts from storage.
files – List of index file metadata entries produced during writing.
pool – Memory pool for temporary allocations; if nullptr, uses default.
- Returns:
A Result containing a shared pointer to the created GlobalIndexReader, or an error (e.g., the index cannot be loaded or is incompatible).
-
class GlobalIndexReader : public paimon::FunctionVisitor<std::shared_ptr<GlobalIndexResult>>#
Reads and evaluates filter predicates against a global file index.
Derived classes are expected to implement the visitor methods (e.g., VisitEqual, VisitIsNull, etc.) to return index-based results that indicate which rows satisfy the given predicate.
Note
All GlobalIndexResult objects returned by implementations of this class use local row ids that start from 0, not global row ids in the entire table. The GlobalIndexResult can be converted to global row ids by calling AddOffset().
Public Functions
VisitVectorSearch performs approximate vector similarity search.
Warning
VisitVectorSearch may return an error status when it is invoked on a reader that does not support it (e.g., a BitmapGlobalIndexReader calling VisitVectorSearch).
VisitFullTextSearch performs full text search.
-
virtual bool IsThreadSafe() const = 0#
- Returns:
true if the reader is thread-safe; false otherwise.
-
virtual std::string GetIndexType() const = 0#
- Returns:
An identifier representing the index type (e.g., “bitmap”, “lumina”).
-
class GlobalIndexFileReader#
Abstract interface for reading global index files from storage.
Public Functions
-
virtual ~GlobalIndexFileReader() = default#
-
virtual Result<std::unique_ptr<InputStream>> GetInputStream(const std::string &file_path) const = 0#
Opens an input stream for reading the specified global index file.
-
struct GlobalIndexIOMeta#
Metadata describing a single file entry in a global index.
Public Functions
-
class GlobalIndexResult : public std::enable_shared_from_this<GlobalIndexResult>#
Global index result to get selected global row ids.
Subclassed by paimon::BitmapGlobalIndexResult, paimon::ScoredGlobalIndexResult
Public Functions
-
virtual ~GlobalIndexResult() = default#
-
virtual Result<bool> IsEmpty() const = 0#
Checks whether the global index result contains no matching row ids.
- Returns:
A Result<bool> where:
true indicates the result is empty (no matching rows),
false indicates at least one matching row exists.
An error is returned only if internal state is corrupted or I/O fails (e.g., during lazy loading of index data).
-
virtual Result<std::unique_ptr<Iterator>> CreateIterator() const = 0#
Creates a new iterator over the selected global row ids.
-
Result<std::vector<Range>> ToRanges() const#
Returns non-overlapping, sorted ranges covering all row ids in this GlobalIndexResult.
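As an illustration of what “non-overlapping, sorted ranges” means, the sketch below collapses ascending row ids into inclusive ranges. DemoRange and ToDemoRanges are hypothetical names; the real Range type and the internals of ToRanges() are not shown in this reference, and inclusive bounds are assumed from the [start, end] notation used elsewhere on this page.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical inclusive range [first, second]; not the paimon Range type.
using DemoRange = std::pair<int64_t, int64_t>;

// Collapses ascending, duplicate-free row ids into sorted, non-overlapping inclusive ranges.
std::vector<DemoRange> ToDemoRanges(const std::vector<int64_t>& sorted_ids) {
    std::vector<DemoRange> ranges;
    for (int64_t id : sorted_ids) {
        if (!ranges.empty() && ranges.back().second + 1 == id) {
            ranges.back().second = id;  // extends the current run
        } else {
            // Input must be strictly ascending for the output to stay sorted.
            assert(ranges.empty() || ranges.back().second < id);
            ranges.emplace_back(id, id);  // starts a new run
        }
    }
    return ranges;
}
```

Row ids 0, 1, 2, 5, 7, 8 would collapse to [0, 2], [5, 5], [7, 8].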
Computes the logical AND (intersection) between current result and another.
Computes the logical OR (union) between this result and another.
-
virtual Result<std::shared_ptr<GlobalIndexResult>> AddOffset(int64_t offset) = 0#
Adds the given offset to each row id in current result and returns the new global index result.
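A minimal sketch of the local-to-global conversion, using a plain vector of ids instead of a GlobalIndexResult. AddDemoOffset is a hypothetical helper, and treating the offset as the first global row id of the scanned range is an assumption, not something this reference states.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Shifts local row ids (which start from 0) by `offset` and returns a new vector,
// leaving the input untouched.
std::vector<int64_t> AddDemoOffset(const std::vector<int64_t>& local_ids, int64_t offset) {
    std::vector<int64_t> global_ids;
    global_ids.reserve(local_ids.size());
    for (int64_t id : local_ids) {
        assert(id >= 0);  // local row ids start from 0
        global_ids.push_back(id + offset);
    }
    return global_ids;
}
```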
-
virtual std::string ToString() const = 0#
Public Static Functions
Serializes a GlobalIndexResult object into a byte array.
Note
This method only supports the following concrete implementations:
- Parameters:
global_index_result – The GlobalIndexResult instance to serialize (must not be null).
pool – Memory pool used to allocate the output byte buffer.
- Returns:
A Result containing a unique pointer to the serialized Bytes on success, or an error status on failure.
Deserializes a GlobalIndexResult object from a raw byte buffer.
Note
The concrete type of the deserialized object is determined by metadata embedded in the buffer. Currently, only the following types are supported:
- Parameters:
buffer – Pointer to the serialized byte data (must not be null).
length – Size of the buffer in bytes.
pool – Memory pool used to allocate internal objects during deserialization.
- Returns:
A Result containing a shared pointer to the reconstructed GlobalIndexResult on success, or an error status on failure.
-
class Iterator#
Iterator interface for traversing selected global row ids.
Subclassed by paimon::BitmapGlobalIndexResult::Iterator
-
class BitmapGlobalIndexResult : public paimon::GlobalIndexResult#
Represents a global index query result that lazily materializes its matching row ids as a Roaring bitmap.
The underlying 64-bit Roaring bitmap is not constructed during object creation; instead, it is built on demand the first time GetBitmap() is called. This design avoids unnecessary computation and memory allocation when the bitmap is not needed (e.g., during early stopping).
Public Types
-
using BitmapSupplier = std::function<Result<RoaringBitmap64>()>#
Public Functions
-
inline explicit BitmapGlobalIndexResult(BitmapSupplier bitmap_supplier)#
-
virtual Result<std::unique_ptr<GlobalIndexResult::Iterator>> CreateIterator() const override#
Creates a new iterator over the selected global row ids.
Computes the logical AND (intersection) between current result and another.
Computes the logical OR (union) between this result and another.
-
virtual Result<bool> IsEmpty() const override#
Checks whether the global index result contains no matching row ids.
- Returns:
A Result<bool> where:
true indicates the result is empty (no matching rows),
false indicates at least one matching row exists.
An error is returned only if internal state is corrupted or I/O fails (e.g., during lazy loading of index data).
-
virtual Result<std::shared_ptr<GlobalIndexResult>> AddOffset(int64_t offset) override#
Adds the given offset to each row id in current result and returns the new global index result.
-
virtual std::string ToString() const override#
-
Result<const RoaringBitmap64*> GetBitmap() const#
Note
Lazy initialization: The bitmap is constructed only on the first call to this method. Subsequent calls return the cached instance. Construction may involve non-trivial CPU/IO cost (e.g., reading index files or merging bitmaps), so avoid calling this if the bitmap is not actually required. Not thread-safe.
- Returns:
A non-owning, const pointer to the bitmap. The returned pointer is valid as long as this BitmapGlobalIndexResult object is alive. The caller must not modify the bitmap.
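The lazy, cached materialization described in the note can be sketched like this. It assumes a std::set as a stand-in for RoaringBitmap64 and a supplier that cannot fail; LazyDemoResult and DemoBitmap are hypothetical names, and the real class returns a Result rather than a bare pointer.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>
#include <set>
#include <utility>

// Hypothetical stand-in: std::set plays the role of the 64-bit Roaring bitmap.
using DemoBitmap = std::set<int64_t>;

class LazyDemoResult {
 public:
    explicit LazyDemoResult(std::function<DemoBitmap()> supplier)
        : supplier_(std::move(supplier)) {
        assert(supplier_ != nullptr);
    }

    // Builds the bitmap on the first call and returns the cached instance afterwards.
    // Like the documented method, this sketch is not thread-safe.
    const DemoBitmap* GetBitmap() const {
        if (!cached_) {
            cached_ = std::make_unique<DemoBitmap>(supplier_());
        }
        return cached_.get();
    }

 private:
    std::function<DemoBitmap()> supplier_;
    mutable std::unique_ptr<DemoBitmap> cached_;
};
```

The supplier runs at most once, and the returned pointer stays valid for as long as the owning object is alive.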
Public Static Functions
-
static std::shared_ptr<BitmapGlobalIndexResult> FromRanges(const std::vector<Range> &ranges)#
Creates a BitmapGlobalIndexResult for all row ids in the given ranges.
Note
Overlapping or unsorted ranges are accepted.
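Why overlapping or unsorted input is harmless can be seen in a small sketch that expands ranges into a set of ids: set insertion deduplicates and orders the result. DemoRange and IdsFromDemoRanges are hypothetical names, std::set stands in for the Roaring bitmap, and inclusive bounds are an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Hypothetical inclusive range [first, second]; not the paimon Range type.
using DemoRange = std::pair<int64_t, int64_t>;

// Expands every range into individual row ids; duplicates from overlapping
// ranges disappear because the ids are stored in a set.
std::set<int64_t> IdsFromDemoRanges(const std::vector<DemoRange>& ranges) {
    std::set<int64_t> ids;
    for (const DemoRange& range : ranges) {
        assert(range.first <= range.second);
        // Break on equality instead of looping while id <= end, which would
        // overflow when the end is the largest int64_t value.
        for (int64_t id = range.first;; ++id) {
            ids.insert(id);
            if (id == range.second) break;
        }
    }
    return ids;
}
```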
-
class Iterator : public paimon::GlobalIndexResult::Iterator#
-
class BitmapScoredGlobalIndexResult : public paimon::ScoredGlobalIndexResult#
Represents a scored global index result that combines a Roaring bitmap of candidate row ids with an array of associated relevance scores.
Important Ordering Note: Inheriting from ScoredGlobalIndexResult, the results are NOT sorted by score. Instead, both the bitmap and the score vector are ordered by ascending row id. This design enables efficient merging and set operations while preserving row id-to-score mapping.
Public Functions
-
inline BitmapScoredGlobalIndexResult(RoaringBitmap64 &&bitmap, std::vector<float> &&scores)#
-
virtual Result<std::unique_ptr<GlobalIndexResult::Iterator>> CreateIterator() const override#
Creates a new iterator over the selected global row ids.
-
virtual Result<std::unique_ptr<ScoredGlobalIndexResult::ScoredIterator>> CreateScoredIterator() const override#
Creates a new iterator for traversing the scored results.
Computes the logical AND (intersection) between current result and another.
Computes the logical OR (union) between this result and another.
-
virtual Result<std::shared_ptr<GlobalIndexResult>> AddOffset(int64_t offset) override#
Adds the given offset to each row id in current result and returns the new global index result.
-
virtual Result<bool> IsEmpty() const override#
Checks whether the global index result contains no matching row ids.
- Returns:
A Result<bool> where:
true indicates the result is empty (no matching rows),
false indicates at least one matching row exists.
An error is returned only if internal state is corrupted or I/O fails (e.g., during lazy loading of index data).
-
virtual std::string ToString() const override#
-
Result<const RoaringBitmap64*> GetBitmap() const#
- Returns:
A non-owning, const pointer to the bitmap. The row ids in the bitmap are stored in ascending order (as guaranteed by Roaring64 iteration).
-
const std::vector<float> &GetScores() const#
- Returns:
A const reference to a vector of float scores, where the i-th element corresponds to the i-th row id when iterating the bitmap in ascending row id order.
-
class ScoredIterator : public paimon::ScoredGlobalIndexResult::ScoredIterator#
Public Functions
-
inline ScoredIterator(const RoaringBitmap64 *bitmap, RoaringBitmap64::Iterator &&iter, const float *scores)#
-
inline virtual bool HasNext() const override#
Checks whether more row ids are available.
-
inline virtual std::pair<int64_t, float> NextWithScore() override#
Retrieves the next (row_id, score) pair and advances the iterator.
Note
The sequence is ordered by row_id, not by score.
- Returns:
A pair where:
first: the global row id (returned in ascending order),
second: the associated score computed by the index.
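Because pairs arrive in row id order, a caller that wants the best matches has to rank them itself. The sketch below assumes the pairs have already been drained from the iterator into two parallel vectors, and that a higher score means a better match (which may not hold for distance-based metrics). TopKByScore is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Pairs the i-th ascending row id with the i-th score, then keeps the k
// highest-scoring pairs, best first.
std::vector<std::pair<int64_t, float>> TopKByScore(const std::vector<int64_t>& ascending_ids,
                                                   const std::vector<float>& scores,
                                                   std::size_t k) {
    assert(ascending_ids.size() == scores.size());
    std::vector<std::pair<int64_t, float>> pairs;
    pairs.reserve(ascending_ids.size());
    for (std::size_t i = 0; i < ascending_ids.size(); ++i) {
        pairs.emplace_back(ascending_ids[i], scores[i]);
    }
    k = std::min(k, pairs.size());
    std::partial_sort(pairs.begin(), pairs.begin() + static_cast<std::ptrdiff_t>(k), pairs.end(),
                      [](const auto& a, const auto& b) { return a.second > b.second; });
    pairs.resize(k);
    return pairs;
}
```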
-
class GlobalIndexScan#
Represents a logical scan over a global index for a table.
Public Functions
-
virtual ~GlobalIndexScan() = default#
-
virtual Result<std::shared_ptr<RowRangeGlobalIndexScanner>> CreateRangeScan(const Range &range) = 0#
Creates a scanner for the global index over the specified row id range.
This method instantiates a low-level scanner that can evaluate predicates and retrieve matching row ids from the global index data corresponding to the given row id range.
- Parameters:
range – The inclusive row id range [start, end] for which to create the scanner. The range must be fully covered by existing global index data (from GetRowRangeList()).
- Returns:
A Result containing a range-level scanner, or an error if parsing the index metadata fails.
-
virtual Result<std::vector<Range>> GetRowRangeList() = 0#
Returns row id ranges covered by this global index (sorted and non-overlapping ranges).
Each Range represents a contiguous segment of row ids for which global index data exists. This allows the query engine to parallelize scanning and be aware of ranges that are not covered by any global index.
- Returns:
A Result containing sorted and non-overlapping Range objects.
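One way a query engine might use the covered ranges is to derive the gaps that have no global index. The sketch below is illustrative only: DemoRange and UncoveredRanges are hypothetical names, ranges are assumed inclusive, and total_rows is an input the caller would have to obtain elsewhere.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical inclusive range [first, second]; not the paimon Range type.
using DemoRange = std::pair<int64_t, int64_t>;

// Given sorted, non-overlapping covered ranges, returns the gaps inside
// [0, total_rows - 1] that no range covers.
std::vector<DemoRange> UncoveredRanges(const std::vector<DemoRange>& covered,
                                       int64_t total_rows) {
    std::vector<DemoRange> gaps;
    int64_t cursor = 0;  // first row id not yet accounted for
    for (const DemoRange& range : covered) {
        assert(range.first >= cursor && range.first <= range.second);
        if (range.first > cursor) {
            gaps.emplace_back(cursor, range.first - 1);
        }
        cursor = range.second + 1;
    }
    if (cursor < total_rows) {
        gaps.emplace_back(cursor, total_rows - 1);
    }
    return gaps;
}
```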
Public Static Functions
Creates a GlobalIndexScan instance for the specified table and context.
- Parameters:
table_path – Root directory of the table.
snapshot_id – Optional snapshot id to read from; if not provided, uses the latest.
partitions – Optional list of specific partitions to restrict the scan scope. Each map represents one partition (e.g., {“dt”: “2024-06-01”}). If omitted, scans all partitions.
options – Index-specific configuration.
file_system – File system for accessing index files. If not provided (nullptr), it is inferred from the FILE_SYSTEM key in the options parameter.
pool – Memory pool for temporary allocations; if nullptr, uses default.
- Returns:
A Result containing a unique pointer to the created scanner, or an error if initialization fails (e.g., I/O error).
Creates a GlobalIndexScan instance for the specified table and context.
- Parameters:
partition_filters – Optional specific partition predicates.
-
class RowRangeGlobalIndexScanner#
Interface for scanning global index data at the range level.
Public Functions
-
virtual ~RowRangeGlobalIndexScanner() = default#
-
virtual Result<std::shared_ptr<GlobalIndexReader>> CreateReader(const std::string &field_name, const std::string &index_type) const = 0#
Creates a GlobalIndexReader for a specific field and index type within this range.
This reader provides low-level access to the serialized index data for the given column (field_name) and index kind (index_type, such as “bitmap”).
Note
All GlobalIndexResult objects returned by GlobalIndexReader use local row ids that start from 0, not global row ids in the entire table.
- Parameters:
field_name – Name of the indexed column.
index_type – Type of the global index (e.g., “bitmap”, “lumina”).
- Returns:
A Result that is:
Successful with a non-null reader if the index exists and loads correctly;
Successful with a null pointer if no index was built for the given field and type;
An error only if loading fails (e.g., file corruption, I/O error, unsupported format).
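The three outcomes listed above call for three distinct code paths on the caller side. The sketch below uses a simplified stand-in for Result; DemoReader, DemoResult, ReaderOutcome and Classify are hypothetical names, and falling back to a plain scan when no index exists is an assumed policy, not one this reference prescribes.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-ins; the real Result and GlobalIndexReader types differ.
struct DemoReader {
    std::string index_type;
};
struct DemoResult {
    bool ok;
    std::shared_ptr<DemoReader> value;
    std::string error;
};

enum class ReaderOutcome { kUseIndex, kFallBackToScan, kFail };

// Maps the three documented outcomes onto what a caller might do next.
ReaderOutcome Classify(const DemoResult& result) {
    assert(result.ok || result.value == nullptr);  // a failed result carries no reader
    if (!result.ok) {
        return ReaderOutcome::kFail;  // loading failed: surface the error
    }
    if (result.value == nullptr) {
        return ReaderOutcome::kFallBackToScan;  // success, but no index was built
    }
    return ReaderOutcome::kUseIndex;  // success with a usable reader
}
```

The point of the sketch is that a null reader inside a successful result is not an error and should not be reported as one.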
-
virtual Result<std::vector<std::shared_ptr<GlobalIndexReader>>> CreateReaders(const std::string &field_name) const = 0#
Creates the GlobalIndexReader instances available for a specific field within this range.
- Parameters:
field_name – Name of the indexed column.
- Returns:
A Result that is:
Successful with one or more readers if the indexes exist and load correctly;
Successful with an empty vector if no index was built for the given field;
An error if loading fails (e.g., file corruption, I/O error, unsupported format).
-
class IndexedSplit : public paimon::Split#
Indexed split for global index reading operation.
Public Functions
-
virtual std::shared_ptr<DataSplit> GetDataSplit() const = 0#
- Returns:
The underlying physical data split containing actual data file details.
-
virtual const std::vector<Range> &RowRanges() const = 0#
- Returns:
A list of row intervals [start, end] indicating which rows are relevant (e.g., passed predicate pushdown).
-
virtual const std::vector<float> &Scores() const = 0#
- Returns:
A score for each individual row included in RowRanges(), in the order they appear when traversing the ranges.
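The alignment between Scores() and RowRanges() can be sketched by walking the ranges and consuming one score per row. DemoRange and PairRowsWithScores are hypothetical names, and the ranges are assumed inclusive, as the [start, end] notation above suggests.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical inclusive range [first, second]; not the paimon Range type.
using DemoRange = std::pair<int64_t, int64_t>;

// Walks the ranges in order and attaches the next unused score to each row id.
std::vector<std::pair<int64_t, float>> PairRowsWithScores(const std::vector<DemoRange>& ranges,
                                                          const std::vector<float>& scores) {
    std::vector<std::pair<int64_t, float>> rows;
    rows.reserve(scores.size());
    std::size_t next_score = 0;
    for (const DemoRange& range : ranges) {
        assert(range.first <= range.second);
        for (int64_t id = range.first;; ++id) {
            assert(next_score < scores.size());  // exactly one score per row
            rows.emplace_back(id, scores[next_score++]);
            if (id == range.second) break;
        }
    }
    assert(next_score == scores.size());  // no score may be left over
    return rows;
}
```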
-
class GlobalIndexWriteTask#
Writes a range-level global index for a specific data split and field.
Public Static Functions
Builds and writes a global index for the specified data range.
- Parameters:
table_path – Path to the table root directory where index files are stored.
field_name – Name of the indexed column (must be present in the table schema).
index_type – Type of global index to build (e.g., “bitmap”, “lumina”).
index_split – The indexed split containing the actual data (e.g., Parquet file). The range must be fully contained within the data covered by the given split.
options – Index-specific configuration (e.g., false positive rate for bloom filters).
pool – Memory pool for temporary allocations during index construction. If nullptr, the system’s default memory pool will be used.
file_system – Specifies the file system for file operations. If nullptr, the default file system is used.
- Returns:
A Result containing a shared pointer to the CommitMessage with index metadata, or an error if indexing fails (e.g., unsupported type, I/O error).
-
class GlobalIndexWriter#
Abstract interface for building a global index from Arrow data batches.
Public Functions
-
virtual ~GlobalIndexWriter() = default#
-
virtual Status AddBatch(::ArrowArray *arrow_array) = 0#
Builds index structures from a batch of columnar data.
- Parameters:
arrow_array – A valid C ArrowArray pointer representing a struct array. Must not be nullptr, and must conform to the expected schema.
- Returns:
Status::OK() on success; otherwise, an error indicating a problem such as malformed input, I/O failure, or an unsupported type.
-
virtual Result<std::vector<GlobalIndexIOMeta>> Finish() = 0#
Finalizes the index build process and returns metadata for persisted index.
-
class GlobalIndexFileWriter#
Abstract interface for writing global index files to storage.
Public Functions
-
virtual ~GlobalIndexFileWriter() = default#
-
virtual Result<std::string> NewFileName(const std::string &prefix) const = 0#
Generates a unique file name for a new index file using the given prefix.
Note
This function may be called multiple times if the index consists of multiple files.
-
virtual Result<std::unique_ptr<OutputStream>> NewOutputStream(const std::string &file_name) const = 0#
Opens a new output stream for writing index data to the specified file.
-
virtual Result<int64_t> GetFileSize(const std::string &file_name) const = 0#
Returns the size of the file with the given file name.
-
virtual std::string ToPath(const std::string &file_name) const = 0#
Returns the index file path for the given file name.