# DataStream API

## Overview
The Fluss DataStream Connector for Apache Flink provides a Flink DataStream source implementation for reading data from Fluss tables and a Flink DataStream sink implementation for writing data to Fluss tables. It allows you to seamlessly integrate Fluss tables with Flink's DataStream API, enabling you to process data from Fluss in your Flink applications.
Key features of the Fluss DataStream Connector include:
- Reading from both primary key tables and log tables
- Support for projection pushdown to select specific fields
- Flexible offset initialization strategies
- Custom de/serialization schemas for converting between Fluss records and your data types
- Writing to both primary key tables and log tables
- Support for different operation types (`INSERT`, `UPDATE`, `DELETE`)
- Configurable sink behavior with custom options
- Automatic handling of upserts for primary key tables
## Dependency

To use the Fluss DataStream Connector with the Flink DataStream API, add the following dependency to your `pom.xml` file, matching the Flink version you are using. The Fluss DataStream Connector is available for Flink 1.18, 1.19, and 1.20.
**Flink 1.20**

```xml
<!-- https://mvnrepository.com/artifact/com.alibaba.fluss/fluss-flink-1.20 -->
<dependency>
    <groupId>com.alibaba.fluss</groupId>
    <artifactId>fluss-flink-1.20</artifactId>
    <version>0.8-SNAPSHOT</version>
</dependency>
```

**Flink 1.19**

```xml
<!-- https://mvnrepository.com/artifact/com.alibaba.fluss/fluss-flink-1.19 -->
<dependency>
    <groupId>com.alibaba.fluss</groupId>
    <artifactId>fluss-flink-1.19</artifactId>
    <version>0.8-SNAPSHOT</version>
</dependency>
```

**Flink 1.18**

```xml
<!-- https://mvnrepository.com/artifact/com.alibaba.fluss/fluss-flink-1.18 -->
<dependency>
    <groupId>com.alibaba.fluss</groupId>
    <artifactId>fluss-flink-1.18</artifactId>
    <version>0.8-SNAPSHOT</version>
</dependency>
```
## DataStream Source

### Initialization

The main entry point for the Fluss DataStream Source API is the `FlussSource` class. You create a `FlussSource` instance using the builder pattern, which allows for step-by-step configuration of the source connector.
```java
// Create a FlussSource using the builder pattern
FlussSource<Order> flussSource = FlussSource.<Order>builder()
        .setBootstrapServers("localhost:9123")
        .setDatabase("mydb")
        .setTable("orders")
        .setProjectedFields("orderId", "amount")
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setScanPartitionDiscoveryIntervalMs(1000L)
        .setDeserializationSchema(new OrderDeserializationSchema())
        .build();

DataStreamSource<Order> stream =
        env.fromSource(flussSource, WatermarkStrategy.noWatermarks(), "Fluss Orders Source");

stream.print();
```
### Configuration Options

The `FlussSourceBuilder` provides several methods for configuring the source connector:

#### Required Parameters

- `setBootstrapServers(String bootstrapServers)`: Sets the bootstrap servers for the Fluss source connection
- `setDatabase(String database)`: Sets the database name for the Fluss source
- `setTable(String table)`: Sets the table name for the Fluss source
- `setDeserializationSchema(FlussDeserializationSchema<T> schema)`: Sets the deserialization schema for converting Fluss records to output records

#### Optional Parameters

- `setProjectedFields(String... projectedFieldNames)`: Sets the fields to project from the table (if not specified, all fields are included)
- `setScanPartitionDiscoveryIntervalMs(long intervalMs)`: Sets the interval for discovering new partitions (default: from configuration)
- `setStartingOffsets(OffsetsInitializer initializer)`: Sets the strategy for determining starting offsets (default: `OffsetsInitializer.full()`)
- `setFlussConfig(Configuration flussConf)`: Sets custom Fluss configuration properties (see the sketch after this list)
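For example, a minimal sketch of passing custom client settings through `setFlussConfig`. The option key below is an illustrative placeholder rather than a verified Fluss option (consult the Fluss configuration reference for real keys), and the `setString` call assumes Fluss's `Configuration` exposes a Flink-style `setString(String, String)` method:

```java
// Build a Fluss Configuration with custom client properties.
// NOTE: "client.some.option" is a placeholder key, not a real Fluss option.
Configuration flussConf = new Configuration();
flussConf.setString("client.some.option", "some-value");

FlussSource<Order> source = FlussSource.<Order>builder()
        .setBootstrapServers("localhost:9123")
        .setDatabase("mydb")
        .setTable("orders")
        .setDeserializationSchema(new OrderDeserializationSchema())
        .setFlussConfig(flussConf)
        .build();
```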
### Offset Initializers

The `OffsetsInitializer` interface provides several factory methods for creating different types of initializers:

- `OffsetsInitializer.earliest()`: Initializes offsets to the earliest available offsets of each bucket
- `OffsetsInitializer.latest()`: Initializes offsets to the latest offsets of each bucket
- `OffsetsInitializer.full()`: Performs a full snapshot of the table upon first startup:
  - For log tables: reads from the earliest log offset (equivalent to `earliest()`)
  - For primary key tables: reads the latest snapshot, which materializes all changes on the table
- `OffsetsInitializer.timestamp(long timestamp)`: Initializes offsets based on a given timestamp
Example:

```java
// Start reading from the earliest available offsets
FlussSource<Order> earliestSource = FlussSource.<Order>builder()
        .setStartingOffsets(OffsetsInitializer.earliest())
        // other configuration...
        .build();

// Start reading from the latest offsets
FlussSource<Order> latestSource = FlussSource.<Order>builder()
        .setStartingOffsets(OffsetsInitializer.latest())
        // other configuration...
        .build();

// Start reading from a specific timestamp (here: one hour ago)
FlussSource<Order> timestampSource = FlussSource.<Order>builder()
        .setStartingOffsets(OffsetsInitializer.timestamp(System.currentTimeMillis() - 3600 * 1000))
        // other configuration...
        .build();
```
### Deserialization Schemas

The `FlussDeserializationSchema` interface is used to convert Fluss records to your desired output type. Fluss provides some built-in implementations:

- `RowDataDeserializationSchema`: Converts Fluss records to Flink's `RowData` objects
- `JsonStringDeserializationSchema`: Converts Fluss records to JSON strings (see the sketch after this list)
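For instance, a minimal sketch that reads records as raw JSON strings with the built-in `JsonStringDeserializationSchema` (assuming its no-argument constructor):

```java
// Read each Fluss record as a JSON string, e.g. for quick inspection
FlussSource<String> jsonSource = FlussSource.<String>builder()
        .setBootstrapServers("localhost:9123")
        .setDatabase("mydb")
        .setTable("orders")
        .setDeserializationSchema(new JsonStringDeserializationSchema())
        .build();

DataStreamSource<String> jsonStream =
        env.fromSource(jsonSource, WatermarkStrategy.noWatermarks(), "Fluss JSON Source");
```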
You can also implement your own deserialization schema by implementing the `FlussDeserializationSchema` interface:
```java
public class OrderDeserializationSchema implements FlussDeserializationSchema<Order> {

    @Override
    public void open(InitializationContext context) throws Exception {
        // Initialization code if needed
    }

    @Override
    public Order deserialize(LogRecord record) throws Exception {
        InternalRow row = record.getRow();
        // Extract fields from the row
        long orderId = row.getLong(0);
        long itemId = row.getLong(1);
        int amount = row.getInt(2);
        String address = row.getString(3).toString();
        // Create and return your custom object
        return new Order(orderId, itemId, amount, address);
    }

    @Override
    public TypeInformation<Order> getProducedType(RowType rowSchema) {
        return TypeInformation.of(Order.class);
    }
}
```
### Examples

#### Reading from a Primary Key Table

When reading from a primary key table, the Fluss DataStream Connector automatically handles updates to the data. For each update, it emits both the before and after versions of the record with the appropriate `RowKind` (`INSERT`, `UPDATE_BEFORE`, `UPDATE_AFTER`, `DELETE`).
```java
// Create a FlussSource for a primary key table
FlussSource<RowData> flussSource = FlussSource.<RowData>builder()
        .setBootstrapServers("localhost:9123")
        .setDatabase("mydb")
        .setTable("orders_pk")
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setDeserializationSchema(new RowDataDeserializationSchema())
        .build();

// Create a DataStream from the FlussSource
DataStreamSource<RowData> stream = env.fromSource(
        flussSource,
        WatermarkStrategy.noWatermarks(),
        "Fluss PK Source");

// Process the stream to handle the different row kinds
// (INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE events)
```
Note: If you are mapping from `RowData` to your own POJOs, you might want to include the row kind (the change operation) in each record, as sketched below.
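A minimal sketch of such a mapping, assuming the `Order` POJO from the sink section below (with a `RowKind` field and a matching five-argument constructor):

```java
// Carry the change type along when converting RowData to a POJO, so
// downstream operators can distinguish inserts, updates, and deletes.
// Assumes a constructor Order(long, long, int, String, RowKind).
DataStream<Order> orders = stream
        .map(row -> new Order(
                row.getLong(0),              // orderId
                row.getLong(1),              // itemId
                row.getInt(2),               // amount
                row.getString(3).toString(), // address
                row.getRowKind()))           // INSERT / UPDATE_BEFORE / UPDATE_AFTER / DELETE
        .returns(Order.class);
```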
#### Reading from a Log Table

When reading from a log table, all records are emitted with `RowKind.INSERT`, since log tables only support appends.
```java
// Create a FlussSource for a log table
FlussSource<RowData> flussSource = FlussSource.<RowData>builder()
        .setBootstrapServers("localhost:9123")
        .setDatabase("mydb")
        .setTable("orders_log")
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setDeserializationSchema(new RowDataDeserializationSchema())
        .build();

// Create a DataStream from the FlussSource
DataStreamSource<RowData> stream = env.fromSource(
        flussSource,
        WatermarkStrategy.noWatermarks(),
        "Fluss Log Source");
```
#### Using Projection Pushdown
Projection pushdown allows you to select only the fields you need, which can improve performance by reducing the amount of data transferred.
```java
// Create a FlussSource with projection pushdown
FlussSource<OrderPartial> flussSource = FlussSource.<OrderPartial>builder()
        .setBootstrapServers("localhost:9123")
        .setDatabase("mydb")
        .setTable("orders")
        .setProjectedFields("orderId", "amount") // Only select these fields
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setDeserializationSchema(new OrderPartialDeserializationSchema())
        .build();

// Create a DataStream from the FlussSource
DataStreamSource<OrderPartial> stream = env.fromSource(
        flussSource,
        WatermarkStrategy.noWatermarks(),
        "Fluss Source with Projection");
```
In this example, `OrderPartial` is a class that contains only the `orderId` and `amount` fields, and `OrderPartialDeserializationSchema` is a deserialization schema that knows how to convert the projected fields to `OrderPartial` objects.
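A minimal sketch of what these two classes might look like; both are illustrative rather than part of the connector, and the schema assumes the projected row exposes the selected fields in projection order:

```java
public class OrderPartial implements Serializable {
    private static final long serialVersionUID = 1L;

    public final long orderId;
    public final int amount;

    public OrderPartial(long orderId, int amount) {
        this.orderId = orderId;
        this.amount = amount;
    }
}

public class OrderPartialDeserializationSchema
        implements FlussDeserializationSchema<OrderPartial> {

    @Override
    public void open(InitializationContext context) throws Exception {}

    @Override
    public OrderPartial deserialize(LogRecord record) throws Exception {
        // The row contains only the projected fields, in projection order:
        // field 0 = orderId, field 1 = amount
        InternalRow row = record.getRow();
        return new OrderPartial(row.getLong(0), row.getInt(1));
    }

    @Override
    public TypeInformation<OrderPartial> getProducedType(RowType rowSchema) {
        return TypeInformation.of(OrderPartial.class);
    }
}
```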
## DataStream Sink

### Initialization

The main entry point for the Fluss DataStream Sink API is the `FlussSink` class. You create a `FlussSink` instance using the `FlussSinkBuilder`, which allows for step-by-step configuration of the sink connector.
```java
FlussSink<RowData> flussSink = FlussSink.<RowData>builder()
        .setBootstrapServers("localhost:9123")
        .setDatabase("mydb")
        .setTable("orders")
        .setSerializationSchema(new RowDataSerializationSchema(false, true))
        .build();

stream.sinkTo(flussSink).name("Fluss Sink");
```
### Configuration Options

The `FlussSinkBuilder` provides several methods for configuring the sink connector:

#### Required Parameters

- `setBootstrapServers(String bootstrapServers)`: Sets the bootstrap servers for the Fluss sink connection
- `setDatabase(String database)`: Sets the database name for the Fluss sink
- `setTable(String table)`: Sets the table name for the Fluss sink
- `setSerializationSchema(FlussSerializationSchema<T> schema)`: Sets the serialization schema for converting input records to Fluss records
#### Optional Parameters

- `setShuffleByBucketId(boolean shuffleByBucketId)`: Sets whether to shuffle data by bucket ID (default: `true`; see the sketch below)
- `setOption(String key, String value)`: Sets a single configuration option
- `setOptions(Map<String, String> options)`: Sets multiple configuration options at once
Note: A `FlussSerializationSchema` needs to propagate the operation type of each incoming record downstream. See `RowDataSerializationSchema` as an example.
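For example, a minimal sketch that disables shuffling by bucket ID, e.g. when the upstream stream is already partitioned the way you want (it reuses the illustrative `OrderSerializationSchema` defined later on this page):

```java
FlussSink<Order> sink = FlussSink.<Order>builder()
        .setBootstrapServers("localhost:9123")
        .setDatabase("mydb")
        .setTable("orders")
        .setShuffleByBucketId(false) // keep the upstream partitioning as-is
        .setSerializationSchema(new OrderSerializationSchema())
        .build();
```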
### Examples

#### Writing to a Primary Key Table
When writing to a primary key table, the Fluss DataStream Connector automatically handles upserts based on the primary key.
```java
// Create a FlussSink for a primary key table
FlussSink<Order> flussSink = FlussSink.<Order>builder()
        .setBootstrapServers("localhost:9123")
        .setDatabase("mydb")
        .setTable("orders_pk")
        .setSerializationSchema(new OrderSerializationSchema())
        .build();

// Add the sink to your DataStream
dataStream.sinkTo(flussSink);
```
#### Writing to a Log Table
When writing to a log table, all records are appended.
```java
// Create a FlussSink for a log table
FlussSink<Order> flussSink = FlussSink.<Order>builder()
        .setBootstrapServers("localhost:9123")
        .setDatabase("mydb")
        .setTable("orders_log")
        .setSerializationSchema(new OrderSerializationSchema())
        .build();

// Add the sink to your DataStream
dataStream.sinkTo(flussSink);
```
#### Setting Custom Configuration Options
You can set custom configuration options for the Fluss sink.
```java
// Create a FlussSink with a single custom configuration option
FlussSink<Order> flussSink = FlussSink.<Order>builder()
        .setBootstrapServers("localhost:9123")
        .setDatabase("mydb")
        .setTable("orders")
        .setOption("custom.key", "custom.value")
        .setSerializationSchema(new OrderSerializationSchema())
        .build();

// Or set multiple options at once
Map<String, String> options = new HashMap<>();
options.put("option1", "value1");
options.put("option2", "value2");

FlussSink<Order> flussSinkWithOptions = FlussSink.<Order>builder()
        .setBootstrapServers("localhost:9123")
        .setDatabase("mydb")
        .setTable("orders")
        .setOptions(options)
        .setSerializationSchema(new OrderSerializationSchema())
        .build();
```
### Serialization Schemas

The `FlussSerializationSchema` interface is used to convert your data objects to Fluss's internal row format for writing to Fluss tables. Fluss provides built-in implementations:

- `RowDataSerializationSchema`: Converts Flink's `RowData` objects to Fluss rows
- `JsonStringSerializationSchema`: Converts JSON strings to Fluss rows
When configuring a Fluss sink, you set the serialization schema using the `setSerializationSchema()` method on the sink builder.
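For instance, given an upstream `DataStream<String> jsonStream` of JSON objects, a minimal sketch using the built-in `JsonStringSerializationSchema` (assuming its no-argument constructor) looks like this:

```java
// Write JSON strings directly to a Fluss table
FlussSink<String> jsonSink = FlussSink.<String>builder()
        .setBootstrapServers("localhost:9123")
        .setDatabase("mydb")
        .setTable("orders")
        .setSerializationSchema(new JsonStringSerializationSchema())
        .build();

jsonStream.sinkTo(jsonSink).name("Fluss JSON Sink");
```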
You can implement your own serialization schema by implementing the `FlussSerializationSchema` interface:
```java
private static class Order implements Serializable {
    private static final long serialVersionUID = 1L;

    private final long orderId;
    private final long itemId;
    private final int amount;
    private final String address;
    private final RowKind rowKind; // holds the row operation

    ...
}

private static class OrderSerializationSchema
        implements FlussSerializationSchema<Order> {
    private static final long serialVersionUID = 1L;

    @Override
    public void open(InitializationContext context) throws Exception {}

    @Override
    public RowWithOp serialize(Order value) throws Exception {
        GenericRow row = new GenericRow(4);
        row.setField(0, value.orderId);
        row.setField(1, value.itemId);
        row.setField(2, value.amount);
        row.setField(3, BinaryString.fromString(value.address));

        RowKind rowKind = value.rowKind;
        switch (rowKind) {
            case INSERT:
            case UPDATE_AFTER:
                return new RowWithOp(row, OperationType.UPSERT);
            case UPDATE_BEFORE:
            case DELETE:
                return new RowWithOp(row, OperationType.DELETE);
            default:
                throw new IllegalArgumentException("Unsupported row kind: " + rowKind);
        }
    }
}
```
For `RowData` input, you can use the built-in `RowDataSerializationSchema`. Its constructor exposes two configuration flags:

- `isAppendOnly`: Whether the schema operates in append-only mode (only `INSERT` operations)
- `ignoreDelete`: Whether to ignore `DELETE` and `UPDATE_BEFORE` operations
```java
// Create a serialization schema for append-only operations
RowDataSerializationSchema appendOnlySchema = new RowDataSerializationSchema(true, false);

// Create a serialization schema that handles all operation types
RowDataSerializationSchema upsertSchema = new RowDataSerializationSchema(false, false);

// Create a serialization schema that ignores DELETE operations
RowDataSerializationSchema ignoreDeleteSchema = new RowDataSerializationSchema(false, true);
```
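Putting it all together, a minimal end-to-end sketch that copies one Fluss table into another through Flink, reusing only classes shown on this page (the table names and environment setup are illustrative):

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Read RowData records from the source table
FlussSource<RowData> source = FlussSource.<RowData>builder()
        .setBootstrapServers("localhost:9123")
        .setDatabase("mydb")
        .setTable("orders")
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setDeserializationSchema(new RowDataDeserializationSchema())
        .build();

// Write the same records to a second table, propagating all operation types
FlussSink<RowData> sink = FlussSink.<RowData>builder()
        .setBootstrapServers("localhost:9123")
        .setDatabase("mydb")
        .setTable("orders_copy")
        .setSerializationSchema(new RowDataSerializationSchema(false, false))
        .build();

env.fromSource(source, WatermarkStrategy.noWatermarks(), "Fluss Source")
        .sinkTo(sink)
        .name("Fluss Sink");

env.execute("Fluss-to-Fluss pipeline");
```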