Read#
Interface#
-
class TableRead#
Given a Split or a list of Split instances, generates a reader for batch reading.
Public Functions
-
virtual ~TableRead() = default#
Creates a BatchReader instance for reading data.
This method creates a BatchReader that is responsible for reading data from the provided splits.
- Parameters:
splits – A vector of shared pointers to Split instances representing the data to be read.
- Returns:
A Result containing a unique pointer to the BatchReader instance.
Creates a BatchReader instance for a single split.
- Parameters:
split – A shared pointer to the Split instance that defines the data to be read.
- Returns:
A Result containing a unique pointer to the BatchReader instance.
Public Static Functions
-
static Result<std::unique_ptr<TableRead>> Create(std::unique_ptr<ReadContext> context)#
Create an instance of TableRead.
- Parameters:
context – A unique pointer to the ReadContext used for read operations.
- Returns:
A Result containing a unique pointer to the TableRead instance.
-
class ReadContextBuilder#
ReadContextBuilder is used to build a ReadContext with input validation.
Public Functions
-
explicit ReadContextBuilder(const std::string &path)#
Constructs a ReadContextBuilder with required parameters.
- Parameters:
path – The root path of the table.
-
~ReadContextBuilder()#
-
ReadContextBuilder &SetReadSchema(const std::vector<std::string> &read_field_names)#
Set the schema fields to read from the table.
If not set, all fields from the table schema will be read. This is useful for projection pushdown to reduce I/O and improve performance by reading only the required columns.
Note
Currently only top-level field selection is supported. Future versions may support nested field selection using ArrowSchema for more granular projection.
- Parameters:
read_field_names – Vector of field names to read from the table.
- Returns:
Reference to this builder for method chaining.
-
ReadContextBuilder &SetOptions(const std::map<std::string, std::string> &options)#
Set a map of configuration options, for entries that are not defined in the table schema or whose values you want to overwrite.
Note
Calling this method clears any options previously added by AddOption().
- Parameters:
options – The configuration options map.
- Returns:
Reference to this builder for method chaining.
-
ReadContextBuilder &AddOption(const std::string &key, const std::string &value)#
Add a single configuration option which is not defined in the table schema or whose value you want to overwrite.
If you want to add multiple options, call AddOption() multiple times or use SetOptions() instead.
- Parameters:
key – The option key.
value – The option value.
- Returns:
Reference to this builder for method chaining.
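The interaction between AddOption() and SetOptions() is easy to get wrong, so a minimal sketch may help. The class below is a hypothetical stand-in (OptionsBuilderSketch is not part of the library), and the option keys are placeholders. It models only the documented rule that SetOptions() clears entries added earlier by AddOption(). It also assumes that AddOption() accumulates entries and overwrites an existing key.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in, NOT the Paimon class. It models only the documented
// interaction between AddOption() and SetOptions().
class OptionsBuilderSketch {
 public:
  // Adds one entry and keeps earlier ones (overwrite on same key is assumed).
  OptionsBuilderSketch& AddOption(const std::string& key,
                                  const std::string& value) {
    options_[key] = value;
    return *this;
  }

  // Replaces the whole map, discarding entries added earlier by AddOption().
  OptionsBuilderSketch& SetOptions(
      const std::map<std::string, std::string>& options) {
    options_ = options;
    return *this;
  }

  const std::map<std::string, std::string>& options() const {
    return options_;
  }

 private:
  std::map<std::string, std::string> options_;
};

// AddOption() first, SetOptions() second: "key.a" is discarded.
inline std::map<std::string, std::string> LostOptionExample() {
  OptionsBuilderSketch builder;
  builder.AddOption("key.a", "1").SetOptions({{"key.b", "2"}});
  assert(builder.options().count("key.a") == 0);
  return builder.options();
}

// SetOptions() first, AddOption() second: both entries survive.
inline std::map<std::string, std::string> KeptOptionExample() {
  OptionsBuilderSketch builder;
  builder.SetOptions({{"key.b", "2"}}).AddOption("key.a", "1");
  return builder.options();
}
```

Under these assumptions, calling SetOptions() before any AddOption() calls is the order that preserves every entry.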
Set a predicate for filtering data during reading.
The predicate is used for both partition pruning and data filtering. It can significantly improve performance by reducing the amount of data that needs to be read and processed.
- Parameters:
predicate – Shared pointer to the predicate for data filtering.
- Returns:
Reference to this builder for method chaining.
-
ReadContextBuilder &EnablePredicateFilter(bool enabled)#
Set whether to apply precise, predicate-based filtering to the data returned by the format reader.
- Parameters:
enabled – Whether to enable precise filtering (default: false)
- Returns:
Reference to this builder for method chaining.
-
ReadContextBuilder &EnablePrefetch(bool enabled)#
Enable or disable prefetching of data batches from individual files.
When enabled, the reader will prefetch multiple batches in parallel to improve throughput by overlapping I/O with computation. This is particularly beneficial for high-latency storage systems.
- Parameters:
enabled – Whether to enable prefetching (default: false)
- Returns:
Reference to this builder for method chaining.
-
ReadContextBuilder &SetPrefetchBatchCount(uint32_t batch_count)#
Set the total number of batches to prefetch across all files.
This controls the memory usage and parallelism of the prefetching mechanism. Higher values can improve throughput but consume more memory.
- Parameters:
batch_count – Total number of batches to prefetch (default: 600)
- Returns:
Reference to this builder for method chaining.
-
ReadContextBuilder &SetPrefetchMaxParallelNum(uint32_t parallel_num)#
Set the maximum number of parallel prefetch operations.
This limits the number of concurrent I/O operations to prevent overwhelming the storage system or consuming excessive system resources.
- Parameters:
parallel_num – Maximum parallel prefetch operations (default: 3)
- Returns:
Reference to this builder for method chaining.
-
ReadContextBuilder &EnableMultiThreadRowToBatch(bool enabled)#
Enable or disable multi-threaded row-to-batch conversion in merge-on-read scenarios.
When enabled, multiple threads are used to convert row data to batch format during merge operations, which can improve performance for CPU-intensive merge operations.
- Parameters:
enabled – Whether to enable multi-threaded conversion (default: false)
- Returns:
Reference to this builder for method chaining.
-
ReadContextBuilder &SetRowToBatchThreadNumber(uint32_t thread_number)#
Set the number of threads for row-to-batch conversion in merge-on-read scenarios.
This controls the parallelism of row-to-batch conversion during merge operations. Higher values can improve performance but may affect result ordering.
Note
If thread_number > 1, Arrow batches from the reader may not be in primary key order.
- Parameters:
thread_number – Number of conversion threads (default: 1)
- Returns:
Reference to this builder for method chaining.
Set a custom memory pool for memory management.
Note
If not set, the default system memory pool will be used.
- Parameters:
memory_pool – The memory pool to use.
- Returns:
Reference to this builder for method chaining.
Set a custom executor for task execution.
Note
If not set, the default system executor will be used.
- Parameters:
executor – The executor to use.
- Returns:
Reference to this builder for method chaining.
-
ReadContextBuilder &SetTableSchema(const std::string &table_schema)#
Set the table schema as a string to avoid schema loading I/O operations.
This optimization allows the reader to use a pre-loaded schema instead of reading it from the table metadata, which can improve performance especially in scenarios with many small read operations.
Note
The user must ensure that the schema string is valid and matches the table.
Note
If not set, the schema will be loaded from the table path.
- Parameters:
table_schema – String representation of the table schema.
- Returns:
Reference to this builder for method chaining.
-
ReadContextBuilder &WithBranch(const std::string &branch)#
Set the specific branch to read from in a versioned table.
Paimon supports branching for data versioning and time travel queries. This method allows reading from a specific branch instead of the main branch.
Note
Default branch is “main” if not specified.
- Parameters:
branch – Name of the branch to read from.
- Returns:
Reference to this builder for method chaining.
-
ReadContextBuilder &WithFileSystemSchemeToIdentifierMap(const std::map<std::string, std::string> &fs_scheme_to_identifier_map)#
Set the file system scheme to identifier mapping for custom file system configurations.
This allows using different file system implementations for different URI schemes.
Note
If not set, the default file system (configured in Options::FILE_SYSTEM) is used.
- Parameters:
fs_scheme_to_identifier_map – Map from URI scheme to file system identifier.
- Returns:
Reference to this builder for method chaining.
-
Result<std::unique_ptr<ReadContext>> Finish()#
Build and return a ReadContext instance with input validation.
- Returns:
Result containing the constructed ReadContext or an error status.
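Because every setter returns a reference to the builder, a configuration can be written as one chained expression that ends with Finish(). The sketch below illustrates that pattern with hypothetical stand-ins (ReadContextSketch and ReadContextBuilderSketch are not library types) so that it can be compiled on its own. It copies a few documented setter names and defaults, uses a placeholder table path, and simplifies Finish() to return the context by value with no validation, whereas the documented Finish() returns a Result wrapping a unique pointer.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins, NOT the Paimon classes. They mirror a few documented
// setters and defaults so the chaining pattern can be run in isolation.
struct ReadContextSketch {
  std::string path;
  std::string branch = "main";             // documented default branch
  std::vector<std::string> read_schema;    // "not set" modeled as empty (assumption)
  bool enable_prefetch = false;            // documented default
  uint32_t prefetch_batch_count = 600;     // documented default
  uint32_t prefetch_max_parallel_num = 3;  // documented default
};

class ReadContextBuilderSketch {
 public:
  explicit ReadContextBuilderSketch(const std::string& path) {
    ctx_.path = path;
  }

  ReadContextBuilderSketch& SetReadSchema(
      const std::vector<std::string>& read_field_names) {
    ctx_.read_schema = read_field_names;
    return *this;
  }

  ReadContextBuilderSketch& EnablePrefetch(bool enabled) {
    ctx_.enable_prefetch = enabled;
    return *this;
  }

  ReadContextBuilderSketch& SetPrefetchMaxParallelNum(uint32_t parallel_num) {
    ctx_.prefetch_max_parallel_num = parallel_num;
    return *this;
  }

  // Simplification: returns by value and performs no validation.
  ReadContextSketch Finish() const { return ctx_; }

 private:
  ReadContextSketch ctx_;
};

// Builds a context that projects two columns and turns prefetching on.
inline ReadContextSketch BuildExample() {
  ReadContextSketch ctx =
      ReadContextBuilderSketch("/tmp/example_table")  // placeholder path
          .SetReadSchema({"id", "name"})
          .EnablePrefetch(true)
          .SetPrefetchMaxParallelNum(4)
          .Finish();
  // Settings that were never touched keep their documented defaults.
  assert(ctx.branch == "main");
  return ctx;
}
```

With the real builder, the value returned by Finish() would need to be checked for an error status before the ReadContext is handed to TableRead::Create().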
-
class ReadContext#
ReadContext holds the configuration for read operations.
Do not construct this class directly; use ReadContextBuilder to build a ReadContext, which performs input validation.
Public Functions
-
~ReadContext()#
-
inline const std::string &GetPath() const#
-
inline const std::string &GetBranch() const#
-
inline const std::map<std::string, std::string> &GetFileSystemSchemeToIdentifierMap() const#
-
inline const std::map<std::string, std::string> &GetOptions() const#
-
inline const std::vector<std::string> &GetReadSchema() const#
-
inline bool EnablePredicateFilter() const#
-
inline bool EnablePrefetch() const#
-
inline uint32_t GetPrefetchBatchCount() const#
-
inline uint32_t GetPrefetchMaxParallelNum() const#
-
inline bool EnableMultiThreadRowToBatch() const#
-
inline uint32_t GetRowToBatchThreadNumber() const#
-
inline const std::optional<std::string> &GetSpecificTableSchema()#
-
inline std::shared_ptr<MemoryPool> GetMemoryPool() const#