Server Configuration
All configuration options can be set in the Fluss configuration file conf/server.yaml.
The configuration is parsed and evaluated when the Fluss processes are started. Changes to the configuration file require restarting the relevant processes.
Configuration options are organized in key: value format, such as:
conf/server.yaml
default.bucket.number: 8
default.replication.factor: 3
remote.data.dir: /home/fluss/data
remote.fs.write-buffer-size: 10mb
auto-partition.check.interval: 5min
Server configuration refers to the set of options that specify the running parameters of a server. These settings can only be configured at cluster startup and cannot be modified dynamically while the Fluss cluster is running.
Common
Option | Type | Default | Description |
---|---|---|---|
default.bucket.number | Integer | 1 | The default number of buckets for a table in a Fluss cluster. It is a cluster-level parameter; all tables in the cluster that do not specify a bucket number will use this value. |
default.replication.factor | Integer | 1 | The default replication factor for the log of a table in a Fluss cluster. It is a cluster-level parameter; all tables in the cluster that do not specify a replication factor will use this value. |
remote.data.dir | String | (None) | The directory used for storing kv snapshot data files and remote logs for log tiered storage, in a filesystem supported by Fluss. |
remote.fs.write-buffer-size | MemorySize | 4kb | The default size of the write buffer for writing local files to remote file systems. |
plugin.classloader.parent-first-patterns.default | String | java., com.alibaba.fluss., javax.annotation., org.slf4j, org.apache.log4j, org.apache.logging, org.apache.commons.logging, ch.qos.logback | A (semicolon-separated) list of patterns that specifies which classes should always be resolved through the plugin parent ClassLoader first. A pattern is a simple prefix that is checked against the fully qualified class name. This setting should generally not be modified. |
auto-partition.check.interval | Duration | 10min | The interval of the auto partition check. The default value is 10 minutes. |
CoordinatorServer
Option | Type | Default | Description |
---|---|---|---|
coordinator.host | String | (None) | The config parameter defining the network address to connect to for communication with the coordinator server. If the coordinator server is used as a bootstrap server (discover all the servers in the cluster), the value of this config option should be a static hostname or address. |
coordinator.port | String | 9123 | The config parameter defining the network port to connect to for communication with the coordinator server. Like 'coordinator.host', if the coordinator server is used as a bootstrap server (discover all the servers in the cluster), the value of this config option should be a static port. Otherwise, the value can be set to "0" for a dynamic service name resolution. The value accepts a list of ports (“50100,50101”), ranges (“50100-50200”) or a combination of both. |
coordinator.io-pool.size | Integer | 1 | The size of the IO thread pool used to run blocking operations for the coordinator server. This includes discarding unnecessary snapshot files. Increase this value if you experience slow cleanup of unnecessary snapshot files. The default value is 1. |
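For example, a coordinator server might be configured in conf/server.yaml as follows (the hostname and pool size are illustrative placeholders, not recommendations):
conf/server.yaml
coordinator.host: coordinator-1.example.com
coordinator.port: 9123
coordinator.io-pool.size: 4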
TabletServer
Option | Type | Default | Description |
---|---|---|---|
tablet-server.host | String | (None) | The external address of the network interface where the TabletServer is exposed. Because different TabletServers need different values for this option, it is usually specified in an additional non-shared TabletServer-specific config file. |
tablet-server.port | String | 0 | The external RPC port where the TabletServer is exposed. |
tablet-server.id | Integer | (None) | The id for the tablet server. |
data.dir | String | /tmp/fluss-data | This configuration controls the directory where Fluss will store its data. The default value is /tmp/fluss-data. |
server.writer-id.expiration-time | Duration | 7d | The time that the tablet server will wait without receiving any write request from a client before expiring the related writer id status. The default value is 7 days. |
server.writer-id.expiration-check-interval | Duration | 10min | The interval at which to remove writer ids that have expired due to 'server.writer-id.expiration-time' passing. The default value is 10 minutes. |
server.background.threads | Integer | 10 | The number of threads to use for various background processing tasks. The default value is 10. |
server.buffer.memory-size | MemorySize | 256mb | The total bytes of memory the server can use, e.g., for buffering write-ahead-log rows. |
server.buffer.page-size | MemorySize | 128kb | Size of every page in memory buffers ('server.buffer.memory-size'). |
server.buffer.per-request-memory-size | MemorySize | 16mb | The minimum number of bytes that will be allocated by the writer, rounded down to the closest multiple of server.buffer.page-size. It must be greater than or equal to server.buffer.page-size. This option allows allocating memory in batches to achieve better CPU-cache friendliness due to contiguous segments. |
server.buffer.wait-timeout | Duration | 2^(63)-1ns | Defines how long the buffer pool will block when waiting for segments. |
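For example, a per-server snippet in a TabletServer-specific conf/server.yaml might look as follows (the hostname, id, and directory are illustrative placeholders):
conf/server.yaml
tablet-server.host: tablet-server-1.example.com
tablet-server.id: 1
data.dir: /home/fluss/tablet-data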
ZooKeeper
Option | Type | Default | Description |
---|---|---|---|
zookeeper.address | String | (None) | The ZooKeeper address to use, when running Fluss with ZooKeeper. |
zookeeper.path.root | String | /fluss | The root path under which Fluss stores its entries in ZooKeeper. |
zookeeper.client.session-timeout | Duration | 60s | Defines the session timeout for the ZooKeeper session. |
zookeeper.client.connection-timeout | Duration | 15s | Defines the connection timeout for ZooKeeper. |
zookeeper.client.retry-wait | Duration | 5s | Defines the pause between consecutive connection retries. |
zookeeper.client.max-retry-attempts | Integer | 3 | Defines the number of connection retries before the client gives up. |
zookeeper.client.tolerate-suspended-connections | Boolean | false | Defines whether a suspended ZooKeeper connection will be treated as an error that causes the leader information to be invalidated or not. In case you set this option to true, Fluss will wait until a ZooKeeper connection is marked as lost before it revokes the leadership of components. This makes Fluss more resilient against temporary connection instabilities, at the cost of being more likely to run into timing issues with ZooKeeper. |
zookeeper.client.ensemble-tracker | Boolean | true | Defines whether Curator should enable the ensemble tracker. This can be useful in certain scenarios in which CuratorFramework is accessing ZK clusters via a load balancer or Virtual IPs. The default Curator EnsembleTracking logic watches CuratorEventType.GET_CONFIG events and changes the ZooKeeper connection string. This is not the desired behaviour when ZooKeeper is running behind Virtual IPs. Under certain configurations EnsembleTracking can lead to setting a ZooKeeper connection string with unresolvable hostnames. |
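For example, assuming zookeeper.address accepts a comma-separated connection string, a ZooKeeper section might look as follows (hostnames are placeholders):
conf/server.yaml
zookeeper.address: zk-1:2181,zk-2:2181,zk-3:2181
zookeeper.path.root: /fluss
zookeeper.client.session-timeout: 60s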
Netty
Option | Type | Default | Description |
---|---|---|---|
netty.server.num-network-threads | Integer | 3 | The number of threads that the server uses for receiving requests from the network and sending responses to the network. |
netty.server.num-worker-threads | Integer | 8 | The number of threads that the server uses for processing requests, which may include disk and remote I/O. |
netty.server.max-queued-requests | Integer | 500 | The number of queued requests allowed for worker threads, before blocking the I/O threads. |
netty.connection.max-idle-time | Duration | 10min | Close idle connections after the duration specified by this config. |
netty.client.num-network-threads | Integer | 1 | The number of threads that the client uses for sending requests to the network and receiving responses from the network. The default value is 1. |
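For example, a sketch of raising the request-processing capacity of a busy server (the values are illustrative, not tuning advice):
conf/server.yaml
netty.server.num-network-threads: 5
netty.server.num-worker-threads: 16
netty.server.max-queued-requests: 1000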
Log
Option | Type | Default | Description |
---|---|---|---|
log.segment.file-size | MemorySize | 1024m | This configuration controls the segment file size for the log. Retention and cleaning are always done a file at a time, so a larger segment size means fewer files but less granular control over retention. |
log.index.file-size | MemorySize | 10m | This configuration controls the size of the index that maps offsets to file positions. We preallocate this index file and shrink it only after log rolls. You generally should not need to change this setting. |
log.index.interval-size | MemorySize | 4k | This setting controls how frequently Fluss adds an index entry to its offset index. The default setting ensures that we index a message roughly every 4096 bytes. More indexing allows reads to jump closer to the exact position in the log but makes the index larger. You probably don't need to change this. |
log.file-preallocate | Boolean | false | True if we should preallocate the file on disk when creating a new log segment. |
log.flush.interval-messages | Long | Long.MAX_VALUE | This setting allows specifying an interval at which we will force an fsync of data written to the log. For example, if this was set to 1, we would fsync after every message; if it were 5, we would fsync after every five messages. |
log.replica.high-watermark.checkpoint-interval | Duration | 5s | The frequency with which the high watermark is saved out to disk. The default setting is 5 seconds. |
log.replica.max-lag-time | Duration | 30s | If a follower replica hasn't sent any fetch log requests or hasn't caught up to the leader's log end offset for at least this time, the leader will remove the follower replica from the ISR. |
log.replica.write-operation-purge-number | Integer | 1000 | The purge number (in number of requests) of the write operation manager. The default value is 1000. |
log.replica.fetch-operation-purge-number | Integer | 1000 | The purge number (in number of requests) of the fetch log operation manager. The default value is 1000. |
log.replica.fetcher-number | Integer | 1 | Number of fetcher threads used to replicate log records from each source tablet server. The total number of fetchers on each tablet server is bound by this parameter multiplied by the number of tablet servers in the cluster. Increasing this value can increase the degree of I/O parallelism in the follower and leader tablet server at the cost of higher CPU and memory utilization. |
log.replica.fetch.backoff-interval | Duration | 1s | The amount of time to sleep when a fetch bucket error occurs. |
log.replica.fetch.max-bytes | MemorySize | 16mb | The maximum amount of data the server should return for a fetch request from a follower. Records are fetched in batches, and if the first record batch in the first non-empty bucket of the fetch is larger than this value, the record batch will still be returned to ensure that the fetch can make progress. As such, this is not an absolute maximum. Note that the fetcher performs multiple fetches in parallel. |
log.replica.fetch.max-bytes-for-bucket | MemorySize | 1mb | The maximum amount of data the server should return for a table bucket in a fetch request from a follower. Records are fetched in batches, and the maximum size is configured by this option. |
log.replica.fetch.min-bytes | MemorySize | 1b | The minimum number of bytes expected for each fetch log request from the follower before responding. If not enough bytes are available, the server waits up to 'log.replica.fetch.wait-max-time' before returning. |
log.replica.fetch.wait-max-time | Duration | 500ms | The maximum time to wait for enough bytes to be available before responding to a fetch log request from the follower. This value should always be less than 'log.replica.max-lag-time' to prevent frequent shrinking of the ISR for low-throughput tables. |
log.replica.min-in-sync-replicas-number | Integer | 1 | When a producer sets acks to all (-1), this configuration specifies the minimum number of replicas that must acknowledge a write for the write to be considered successful. If this minimum cannot be met, the producer will raise an exception (NotEnoughReplicas). When used together, this config and 'acks' allow you to enforce greater durability guarantees. A typical scenario would be to create a table with a replication factor of 3, set this conf to 2, and produce with acks = -1. This ensures that the producer raises an exception if a majority of replicas do not receive a write. |
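For example, the durability scenario described for 'log.replica.min-in-sync-replicas-number' above could be sketched on the server side as follows (the producer additionally sets acks = -1 on the client side):
conf/server.yaml
default.replication.factor: 3
log.replica.min-in-sync-replicas-number: 2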
Log Tiered Storage
Option | Type | Default | Description |
---|---|---|---|
remote.log.task-interval-duration | Duration | 1min | Interval at which the remote log manager runs scheduled tasks like copying segments, cleaning up remote log segments, deleting local log segments, etc. If the value is set to 0s, remote log storage is disabled. |
remote.log.index-file-cache-size | MemorySize | 1gb | The total size of the space allocated to store index files fetched from remote storage in the local storage. |
remote.log-manager.thread-pool-size | Integer | 4 | Size of the thread pool used in scheduling tasks to copy segments, fetch remote log indexes and clean up remote log segments. |
remote.log.data-transfer-thread-num | Integer | 4 | The number of threads the server uses to transfer (download and upload) remote log files, which can be data files, index files, and remote log metadata files. |
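For example, a sketch of a tiered storage setup (the directory and thread count are illustrative; per 'remote.log.task-interval-duration' above, setting it to 0s disables remote log storage):
conf/server.yaml
remote.data.dir: /home/fluss/remote-data
remote.log.task-interval-duration: 1min
remote.log-manager.thread-pool-size: 8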
Kv
Option | Type | Default | Description |
---|---|---|---|
kv.snapshot.interval | Duration | 10min | The interval at which to perform periodic snapshots of kv data. The default setting is 10 minutes. |
kv.snapshot.scheduler-thread-num | Integer | 1 | The number of threads the server uses to schedule snapshots of kv data for all the replicas in the server. |
kv.snapshot.transfer-thread-num | Integer | 4 | The number of threads the server uses to transfer (download and upload) kv snapshot files. |
kv.snapshot.num-retained | Integer | 1 | The maximum number of completed snapshots to retain. |
kv.rocksdb.thread.num | Integer | 2 | The maximum number of concurrent background flush and compaction jobs (per bucket of table). The default value is '2'. |
kv.rocksdb.files.open | Integer | -1 | The maximum number of open files (per bucket of table) that can be used by the DB, '-1' means no limit. The default value is '-1'. |
kv.rocksdb.log.max-file-size | MemorySize | 25mb | The maximum size of RocksDB's file used for information logging. If a log file becomes larger than this, a new file will be created. If 0, all logs will be written to one log file. The default maximum file size is '25MB'. |
kv.rocksdb.log.file-num | Integer | 4 | The maximum number of files RocksDB should keep for information logging (Default setting: 4). |
kv.rocksdb.log.dir | String | (None) | The directory for RocksDB's information logging files. If empty (Fluss default setting), log files will be in the same directory as the Fluss log. If non-empty, this directory will be used and the data directory's absolute path will be used as the prefix of the log file name. If this option is set to a non-existing location, e.g. '/dev/null', RocksDB will then create the log under its own database folder as before. |
kv.rocksdb.log.level | Enum | INFO_LEVEL | The specified information logging level for RocksDB. Candidate log levels are 'DEBUG_LEVEL', 'INFO_LEVEL', 'WARN_LEVEL', 'ERROR_LEVEL', 'FATAL_LEVEL', 'HEADER_LEVEL', and 'NUM_INFO_LOG_LEVELS'. If unset, Fluss will use 'INFO_LEVEL'. Note: RocksDB info logs will not be written to Fluss's tablet server logs and there is no rolling strategy, unless you configure 'kv.rocksdb.log.dir', 'kv.rocksdb.log.max-file-size' and 'kv.rocksdb.log.file-num' accordingly. Without a rolling strategy, increased log levels may lead to uncontrolled disk space usage! There is no need to modify the RocksDB log level unless troubleshooting RocksDB. |
kv.rocksdb.write-batch-size | MemorySize | 2mb | The max size of the consumed memory for RocksDB batch write; flushing will be based only on item count if this config is set to 0. |
kv.rocksdb.compaction.style | Enum | LEVEL | The specified compaction style for the DB. Candidate compaction styles are LEVEL, FIFO, UNIVERSAL, or NONE; Fluss chooses 'LEVEL' as the default style. |
kv.rocksdb.compaction.level.use-dynamic-size | Boolean | false | If true, RocksDB will pick the target size of each level dynamically. From an empty DB, RocksDB would make the last level the base level, which means merging L0 data into the last level, until it exceeds max_bytes_for_level_base, and then repeat this process for the second-to-last level and so on. The default value is 'false'. For more information, please refer to RocksDB's doc: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction#level_compaction_dynamic_level_bytes-is-true. |
kv.rocksdb.compression.per.level | Enum | LZ4,LZ4,LZ4,LZ4,LZ4,ZSTD,ZSTD | A comma-separated list of compression types. Different levels can have different compression policies. In many cases, lower levels use fast compression algorithms, while higher levels with more data use slower but more effective compression algorithms. The N-th element in the list corresponds to the compression type of level N-1. When 'kv.rocksdb.compaction.level.use-dynamic-size' is true, compression_per_level[0] still determines L0, but other elements are based on the base level and may not match the level seen in the info log. Note: if the list size is smaller than the number of levels, the undefined lower levels use the last compression type in the list. The optional values include NO, SNAPPY, LZ4, ZSTD. For more information about compression types, please refer to the doc https://github.com/facebook/rocksdb/wiki/Compression. The default value 'LZ4,LZ4,LZ4,LZ4,LZ4,ZSTD,ZSTD' indicates that LZ4 compression is used from level 0 to level 4, and the ZSTD compression algorithm is used from level 5 to level 6. LZ4 is a lightweight compression algorithm, so it usually strikes a good balance between space and CPU usage. ZSTD saves more space than LZ4, but it is more CPU-intensive. Choose the compression mode according to the machine's CPU and I/O resources; the default value is for scenarios where CPU resources are adequate. If you find the I/O pressure of the system is not high when writing a lot of data, but CPU resources are inadequate, you can exchange I/O resources for CPU resources and change the mode to 'NO,NO,NO,LZ4,LZ4,ZSTD,ZSTD'. |
kv.rocksdb.compaction.level.target-file-size-base | MemorySize | 64mb | The target file size for compaction, which determines a level-1 file size. The default value is '64MB'. |
kv.rocksdb.compaction.level.max-size-level-base | MemorySize | 256mb | The upper-bound of the total size of level base files in bytes. The default value is '256MB'. |
kv.rocksdb.writebuffer.size | MemorySize | 64mb | The amount of data built up in memory (backed by an unsorted log on disk) before converting to sorted on-disk files. The default writebuffer size is '64MB'. |
kv.rocksdb.writebuffer.count | Integer | 2 | The maximum number of write buffers that are built up in memory. The default value is '2'. |
kv.rocksdb.writebuffer.number-to-merge | Integer | 1 | The minimum number of write buffers that will be merged together before writing to storage. The default value is '1'. |
kv.rocksdb.block.blocksize | MemorySize | 4kb | The approximate size (in bytes) of user data packed per block. The default blocksize is '4KB'. |
kv.rocksdb.block.cache-size | MemorySize | 8mb | The amount of the cache for data blocks in RocksDB. The default block-cache size is '8MB'. |
kv.rocksdb.use-bloom-filter | Boolean | true | If true, every newly created SST file will contain a Bloom filter. It is enabled by default. |
kv.rocksdb.bloom-filter.bits-per-key | Double | 10.0 | Bits per key that the bloom filter will use; this only takes effect when the bloom filter is used. The default value is 10.0. |
kv.rocksdb.bloom-filter.block-based-mode | Boolean | false | If true, RocksDB will use a block-based filter instead of a full filter; this only takes effect when the bloom filter is used. The default value is 'false'. |
kv.recover.log-record-batch.max-size | MemorySize | 16mb | The max fetch size for fetching log records to apply to kv during kv recovery. |
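For example, the CPU-for-I/O trade-off described under 'kv.rocksdb.compression.per.level' could be applied as follows (the thread count is an illustrative addition, not a recommendation):
conf/server.yaml
kv.rocksdb.compression.per.level: NO,NO,NO,LZ4,LZ4,ZSTD,ZSTD
kv.rocksdb.thread.num: 4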
Metrics
Option | Type | Default | Description |
---|---|---|---|
metrics.reporters | List | (None) | An optional list of reporter names. If configured, only reporters whose name matches one in the list will be started. |
metrics.reporter.prometheus.port | String | 9249 | The port the Prometheus reporter listens on. In order to be able to run several instances of the reporter on one host (e.g. when one TabletServer is colocated with the CoordinatorServer) it is advisable to use a port range like 9250-9260. |
metrics.reporter.jmx.port | String | (None) | The port for the JMXServer that JMX clients can connect to. If not set, the JMXServer won't start. In order to be able to run several instances of the reporter on one host (e.g. when one TabletServer is colocated with the CoordinatorServer) it is advisable to use a port range like 9990-9999. |
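For example, assuming the reporter names match the 'prometheus' and 'jmx' keys above, both reporters could be enabled with port ranges as recommended (values are illustrative):
conf/server.yaml
metrics.reporters: prometheus,jmx
metrics.reporter.prometheus.port: 9250-9260
metrics.reporter.jmx.port: 9990-9999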
Lakehouse
Option | Type | Default | Description |
---|---|---|---|
datalake.format | Enum | (None) | The datalake format used by Fluss as lakehouse storage, such as Paimon, Iceberg, Hudi. Currently, only Paimon is supported. |
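For example, assuming the enum accepts the lowercase value, Paimon could be selected as the lakehouse storage:
conf/server.yaml
datalake.format: paimon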