Memory Format#
Paimon Java uses a row-level
abstraction, InternalRow, as the default data format interface.
Row-level interfaces are generally more intuitive for programming and can be
easily integrated into data processing pipelines such as filtering and merge sorting.
However, this approach introduces conversion and access overhead,
making it difficult to fully leverage the additional performance benefits provided
by modern CPU SIMD vectorization.
Considering that the C++ implementation focuses more on end-to-end performance, and that underlying data file formats (e.g., ORC, Parquet) are predominantly columnar, providing a columnar-centric data abstraction can minimize overhead during data flow, integrate more naturally with vectorized execution engines, and ultimately deliver superior overall performance.
Why Apache Arrow#
Apache Arrow is currently the most widely adopted in-memory columnar format and has strong native support for Parquet and ORC. It is well-integrated across open-source engines including Spark, Pandas, Drill, Impala, and Velox.
Overall, using Apache Arrow as the in-memory format for Paimon C++ allows us to:
Maximize columnar performance and efficient data access patterns.
Seamlessly integrate with the Apache Arrow ecosystem and tooling.
Benefit from mature interoperability with popular data systems and formats.
Versioning and Dependency Concerns#
One important consideration is that Arrow is an active open-source project with a broad and evolving C++ surface area that spans data formats, compute kernels, and file I/O. Due to frequent releases and a large API surface, different Arrow C++ SDK versions can introduce API incompatibilities.
If Paimon C++ directly depends on the full Arrow C++ SDK, it may conflict with existing Arrow C++ dependencies in other compute engines, raising integration costs and increasing long-term maintenance complexity.
Adopting the Arrow C Data Interface#
To leverage Arrow’s performance and ecosystem benefits while avoiding tight coupling to specific Arrow C++ SDK versions, we use the Arrow C Data Interface as the default in-memory format for Paimon C++.
Key advantages:
Version-neutral: The C Data Interface is designed to be stable and forward-compatible across Arrow versions.
Compiler-neutral: C language interfaces avoid ABI friction commonly seen with C++ compilers and standard libraries.
Broad interoperability: The C Data Interface is supported by Arrow-based systems and enables zero-copy or minimal-copy interchange of columnar data.
Design Principles#
Columnar-first abstraction: Paimon C++ will represent in-memory data using columnar buffers and schemas compatible with the Arrow C Data Interface, minimizing transformation overhead.
Minimal dependency footprint: Prefer stable C interfaces and lightweight utility layers; avoid linking against the full Arrow C++ SDK unless strictly necessary and well-isolated.
Vectorization-aware execution: Structure data layouts to align with SIMD processing (e.g., contiguous buffers, clear null bitmaps and type-specific arrays), enabling efficient filtering, projection, and aggregation.
Interoperability: Ensure that data produced and consumed by Paimon C++ can be handed off to Arrow-compatible engines and libraries without expensive conversions.
Compatibility with columnar storage: Maintain efficient paths from columnar file formats (Parquet, ORC) to in-memory columnar representations, minimizing decoding and marshaling overhead.
Implementation Outline#
Schema and buffers#
Represent schemas and arrays using Arrow C Data Interface types (e.g.,
ArrowSchema,ArrowArray) with clear ownership and lifecycle.Support nested types (structs, lists, maps) and common primitives (integers, floats, decimals, timestamps).
Memory management#
Define consistent ownership semantics for buffers and child arrays.
Employ reference counting or explicit release callbacks aligned with the Arrow C conventions.
Nullability and validity#
Use standard validity bitmaps for nullability and adhere to Arrow’s canonical buffer organization (validity, offsets, data, etc.).
Conversion boundaries#
Provide adapters to * Read from columnar file formats (Parquet/ORC) into Arrow-compatible
ArrowArraystructures. * Export/import data to other Arrow-compatible engines with zero or minimal copies.Keep these adapters independent from heavy Arrow C++ SDK dependencies.