
Monitor Metrics

Fluss has a built-in metrics system that measures the behavior of the cluster and its tables, such as the number of active CoordinatorServers, the number of tables, the bytes written, and the number of records written.

Fluss supports different metric types: Counters, Gauges, Histograms, and Meters.

  • Gauge: Provides a value of any type at a point in time.
  • Counter: Used to count values by incrementing and decrementing.
  • Histogram: Measures the statistical distribution of a set of values, including the min, max, mean, standard deviation, and percentiles.
  • Meter: Measures the rate of events over time (throughput). Some reporters export the meter's rate as a gauge.
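
To make the difference between the four types concrete, here is a minimal sketch of their semantics. The class and method names below are hypothetical illustrations, not the Fluss metrics API, and the meter reports a simple average rate since creation rather than whatever windowing Fluss may use.

```python
import statistics
import time


class Counter:
    """Counts values by incrementing and decrementing."""

    def __init__(self):
        self.count = 0

    def inc(self, n=1):
        self.count += n

    def dec(self, n=1):
        self.count -= n


class Gauge:
    """Reports a value of any type, computed at the time it is read."""

    def __init__(self, supplier):
        self._supplier = supplier

    def value(self):
        return self._supplier()


class Histogram:
    """Summarizes the distribution of the recorded values."""

    def __init__(self):
        self._values = []

    def update(self, value):
        self._values.append(value)

    def summary(self):
        return {
            "min": min(self._values),
            "max": max(self._values),
            "mean": statistics.mean(self._values),
        }


class Meter:
    """Counts events and reports an average rate in events per second."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the rate can be checked deterministically.
        self._clock = clock
        self._start = clock()
        self.count = 0

    def mark(self, n=1):
        self.count += n

    def rate(self):
        elapsed = self._clock() - self._start
        return self.count / elapsed if elapsed > 0 else 0.0
```

A gauge wraps a callback and is evaluated on read, while a counter and a meter accumulate state between reads; that is why "PerSecond" metrics in the tables below are meters and "Count" or "Size" metrics are gauges.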

The Fluss client also provides built-in metrics that measure write and read operations against the Fluss cluster. These can be bridged to Flink through the standard Flink connector metrics.

Scope

Every metric is assigned an identifier and a set of key-value pairs under which the metric will be reported.

The identifier is delimited by metrics.scope.delimiter. Currently, metrics.scope.delimiter is not configurable; it is determined by the metric reporter. Taking Prometheus as an example, the scope is delimited by _, so a scope looks like A_B_C. Fluss metrics always begin with fluss, as in fluss_A_B_C.

The key-value pairs are called variables and are used to filter metrics. There are no restrictions on the number or order of variables. Variables are case-sensitive.
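
As an illustration of filtering by variables, the sketch below selects reported metrics whose variables contain every requested key-value pair. The variable names (database, table), their values, and the helper names are assumptions made for the example; they are not taken from the Fluss documentation.

```python
def matches(variables, selector):
    """Return True when every key-value pair in selector appears in variables.

    The comparison is exact, so keys and values are case-sensitive, and the
    order in which variables were attached does not matter.
    """
    return all(variables.get(key) == value for key, value in selector.items())


def filter_metrics(reported, selector):
    """Keep the (identifier, variables) pairs that match the selector."""
    return [entry for entry in reported if matches(entry[1], selector)]


# Hypothetical reported metrics: the same identifier, different variables.
reported = [
    ("fluss_tabletserver_table_bytesInPerSecond", {"database": "shop", "table": "orders"}),
    ("fluss_tabletserver_table_bytesInPerSecond", {"table": "users", "database": "shop"}),
]

orders_only = filter_metrics(reported, {"table": "orders"})
```

Because matching is case-sensitive, a selector such as {"Table": "orders"} would select nothing in this example.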

Reporter

For information on how to set up Fluss's metric reporters please take a look at the Metric Reporters page.

Metrics List

By default, Fluss provides cluster state metrics, table state metrics, and bridging to Flink connector standard metrics. This section is a reference of all these metrics.

The tables below generally feature 5 columns:

  • The "Scope" column describes which scope format is used to generate the system scope. For example, if the cell contains tabletserver then the scope format for fluss_tabletserver is used. If the cell contains multiple values, separated by a slash, then the metrics are reported multiple times for different entities, like for both tabletserver and coordinator.

  • The (optional) "Infix" column describes which infix is appended to the scope.

  • The "Metrics" column lists the names of all metrics that are registered for the given scope and infix.

  • The "Description" column provides information as to what a given metric is measuring.

  • The "Type" column describes which metric type is used for the measurement.

Thus, in order to infer the metric identifier:

  1. Start with the prefix "fluss".
  2. Append the scope format based on the "Scope" column.
  3. Append the value in the "Infix" column if present, joining the parts with the metrics.scope.delimiter.
  4. Append the metric name.

For example, a metric reported to Prometheus looks like fluss_tabletserver_status_JVM_CPU_load.
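
The steps above can be sketched as a small helper. The function name and parameters are hypothetical. It assumes the Prometheus delimiter _, treats an empty or "-" infix as absent, and expands a slash-separated scope cell into one identifier per entity.

```python
def metric_identifiers(scope, metric, infix=None, delimiter="_", prefix="fluss"):
    """Build one identifier per entity listed in a "Scope" cell."""
    identifiers = []
    for entity in scope.split("/"):
        parts = [prefix, entity]
        # Assumption: "-" in the "Infix" column means that no infix is used.
        if infix not in (None, "", "-"):
            parts.append(infix)
        parts.append(metric)
        identifiers.append(delimiter.join(parts))
    return identifiers


cpu_load = metric_identifiers("coordinator/tabletserver", "load", infix="status_JVM_CPU")
```

Here cpu_load holds two identifiers, one for the coordinator and one for the tablet server, and the second matches the Prometheus example given above.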

CPU

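| Scope | Infix | Metrics | Description | Type |
|---|---|---|---|---|
| coordinator/tabletserver | status_JVM_CPU | load | The recent CPU usage of the JVM. | Gauge |
| coordinator/tabletserver | status_JVM_CPU | time | The CPU time used by the JVM. | Gauge |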

Memory

The memory-related metrics require Oracle's memory management (also included in OpenJDK's Hotspot implementation) to be in place. Some metrics might not be exposed when using other JVM implementations (e.g. IBM's J9).

| Scope | Infix | Metrics | Description | Type |
|---|---|---|---|---|
| coordinator/tabletserver | status_JVM_memory | heap_used | The amount of heap memory currently used (in bytes). | Gauge |
| coordinator/tabletserver | status_JVM_memory | heap_committed | The amount of heap memory guaranteed to be available to the JVM (in bytes). | Gauge |
| coordinator/tabletserver | status_JVM_memory | heap_max | The maximum amount of heap memory that can be used for memory management (in bytes). This value is not necessarily equal to the maximum value specified through -Xmx or the equivalent Fluss configuration parameter. Some GC algorithms allocate heap memory that is not available to user code and is therefore not exposed through the heap metrics. | Gauge |
| coordinator/tabletserver | status_JVM_memory | nonHeap_used | The amount of non-heap memory currently used (in bytes). | Gauge |
| coordinator/tabletserver | status_JVM_memory | nonHeap_committed | The amount of non-heap memory guaranteed to be available to the JVM (in bytes). | Gauge |
| coordinator/tabletserver | status_JVM_memory | nonHeap_max | The maximum amount of non-heap memory that can be used for memory management (in bytes). | Gauge |
| coordinator/tabletserver | status_JVM_memory | metaspace_used | The amount of memory currently used in the Metaspace memory pool (in bytes). | Gauge |
| coordinator/tabletserver | status_JVM_memory | metaspace_committed | The amount of memory guaranteed to be available to the JVM in the Metaspace memory pool (in bytes). | Gauge |
| coordinator/tabletserver | status_JVM_memory | metaspace_max | The maximum amount of memory that can be used in the Metaspace memory pool (in bytes). | Gauge |
| coordinator/tabletserver | status_JVM_memory | direct_count | The number of buffers in the direct buffer pool. | Gauge |
| coordinator/tabletserver | status_JVM_memory | direct_memoryUsed | The amount of memory used by the JVM for the direct buffer pool (in bytes). | Gauge |
| coordinator/tabletserver | status_JVM_memory | direct_totalCapacity | The total capacity of all buffers in the direct buffer pool (in bytes). | Gauge |
| coordinator/tabletserver | status_JVM_memory | mapped_count | The number of buffers in the mapped buffer pool. | Gauge |
| coordinator/tabletserver | status_JVM_memory | mapped_memoryUsed | The amount of memory used by the JVM for the mapped buffer pool (in bytes). | Gauge |
| coordinator/tabletserver | status_JVM_memory | mapped_totalCapacity | The total capacity of all buffers in the mapped buffer pool (in bytes). | Gauge |

Threads

| Scope | Infix | Metrics | Description | Type |
|---|---|---|---|---|
| coordinator/tabletserver | status_JVM_threads | count | The total number of live threads. | Gauge |

GarbageCollection

| Scope | Infix | Metrics | Description | Type |
|---|---|---|---|---|
| coordinator/tabletserver | status_JVM_GC | <Collector/all>_count | The total number of collections that have occurred for the given (or all) collector. | Gauge |
| coordinator/tabletserver | status_JVM_GC | <Collector/all>_time | The total time spent performing garbage collection for the given (or all) collector. | Gauge |
| coordinator/tabletserver | status_JVM_GC | <Collector/all>_timeMsPerSecond | The time (in milliseconds) spent garbage collecting per second for the given (or all) collector. | Gauge |

Coordinator Server

| Scope | Infix | Metrics | Description | Type |
|---|---|---|---|---|
| coordinator | - | activeCoordinatorCount | The number of active CoordinatorServers in this cluster. | Gauge |
| coordinator | - | activeTabletServerCount | The number of active TabletServers in this cluster. | Gauge |
| coordinator | - | offlineBucketCount | The total number of offline buckets in this cluster. | Gauge |
| coordinator | - | tableCount | The total number of tables in this cluster. | Gauge |
| coordinator | - | bucketCount | The total number of buckets in this cluster. | Gauge |

Tablet Server

| Scope | Infix | Metrics | Description | Type |
|---|---|---|---|---|
| tabletServer | - | replicationBytesInPerSecond | The bytes of data written into follower replicas for data sync per second. | Meter |
| tabletServer | - | replicationBytesOutPerSecond | The bytes of data read from leader replicas for data sync per second. | Meter |
| tabletServer | - | leaderCount | The total number of leader replicas in this TabletServer. | Gauge |
| tabletServer | - | replicaCount | The total number of replicas (including follower replicas) in this TabletServer. | Gauge |
| tabletServer | - | writerIdCount | The number of writer ids. | Gauge |
| tabletServer | - | delayedOperationsSize | The delayed operations size in this TabletServer. | Gauge |

Request

| Scope | Infix | Metrics | Description | Type |
|---|---|---|---|---|
| coordinator | request | requestQueueSize | The CoordinatorServer node network waiting queue size. | Gauge |
| tabletserver | request | requestQueueSize | The TabletServer node network waiting queue size. | Gauge |
| tabletserver | request | requestPerSecond | The total number of requests processed per second by the TabletServer node. | Meter |
| tabletserver | request | requestErrorPerSecond | The total number of error requests processed per second by the TabletServer node. | Meter |
| tabletserver | request | totalTimeMs | The total time it takes for the current TabletServer node to process a request. | Histogram |
| tabletserver | request | requestProcessTimeMs | The time the current TabletServer node spends processing a request. | Histogram |
| tabletserver | request | requestQueueTimeMs | The time a request spends waiting in the network waiting queue of this TabletServer node. | Histogram |
| client | request | bytesInPerSecond | The bytes returned from another server per second. | Gauge |
| client | request | bytesOutPerSecond | The bytes sent from the client to another server per second. | Meter |
| client | request | requestsPerSecond | The number of requests sent from this client to another server per second. | Meter |
| client | request | responsesPerSecond | The number of responses returned from another server per second. | Meter |
| client | request | requestLatencyMs | The request latency. | Gauge |
| client | request | requestsInFlight | The number of in-flight requests sent from the client to another server. | Gauge |

Table/Bucket

| Scope | Infix | Metrics | Description | Type |
|---|---|---|---|---|
| tabletServer | table | messagesInPerSecond | The number of messages written per second to this table. | Meter |
| tabletServer | table | bytesInPerSecond | The number of bytes written per second to this table. | Meter |
| tabletServer | table | bytesOutPerSecond | The number of bytes read per second from this table. | Meter |
| tabletServer | table | totalProduceLogRequestsPerSecond | The number of produce log requests to write log to this table per second. | Meter |
| tabletServer | table | failedProduceLogRequestsPerSecond | The number of failed produce log requests to write log to this table per second. | Meter |
| tabletServer | table | totalFetchLogRequestsPerSecond | The number of fetch log requests to read log from this table per second. | Meter |
| tabletServer | table | failedFetchLogRequestsPerSecond | The number of failed fetch log requests to read log from this table per second. | Meter |
| tabletServer | table | totalPutKvRequestsPerSecond | The number of put kv requests to put kv to this table per second. | Meter |
| tabletServer | table | failedPutKvRequestsPerSecond | The number of failed put kv requests to put kv to this table per second. | Meter |
| tabletServer | table | totalLookupRequestsPerSecond | The number of lookup requests to look up a value by key from this table per second. | Meter |
| tabletServer | table | failedLookupRequestsPerSecond | The number of failed lookup requests to look up a value by key from this table per second. | Meter |
| tabletServer | table | remoteLogCopyBytesPerSecond | The bytes of log data copied to remote storage per second. | Meter |
| tabletServer | table | remoteLogCopyRequestsPerSecond | The number of remote log copy requests to copy local log to remote storage per second. | Meter |
| tabletServer | table | remoteLogCopyErrorPerSecond | The number of failed remote log copy requests to copy local log to remote storage per second. | Meter |
| tabletServer | table | remoteLogDeleteRequestsPerSecond | The number of delete remote log requests to delete remote log after the log TTL per second. | Meter |
| tabletServer | table | remoteLogDeleteErrorPerSecond | The number of failed delete remote log requests to delete remote log after the log TTL per second. | Meter |
| tabletServer | table_bucket | inSyncReplicasCount | The in-sync replicas count of this table bucket. | Gauge |
| tabletServer | table_bucket | underMinIsr | If this bucket is under min isr, this value is 1, otherwise 0. | Gauge |
| tabletServer | table_bucket | atMinIsr | If this bucket is at min isr, this value is 1, otherwise 0. | Gauge |
| tabletServer | table_bucket | isrExpandsPerSecond | The number of isr expands per second. | Meter |
| tabletServer | table_bucket | isrShrinksPerSecond | The number of isr shrinks per second. | Meter |
| tabletServer | table_bucket | failedIsrUpdatesPerSecond | The number of failed isr updates per second. | Meter |
| tabletServer | table_bucket_log | numSegments | The number of segments in local storage for this table bucket. | Gauge |
| tabletServer | table_bucket_log | endOffset | The end offset in local storage for this table bucket. | Gauge |
| tabletServer | table_bucket_log | size | The total log size in local storage for this table bucket. | Gauge |
| tabletServer | table_bucket_log | flushPerSecond | The log flush count per second. | Meter |
| tabletServer | table_bucket_log | flushLatencyMs | The log flush latency in ms. | Histogram |
| tabletServer | table_bucket_remoteLog | numSegments | The number of segments in remote storage for this table bucket. | Gauge |
| tabletServer | table_bucket_remoteLog | endOffset | The end offset in remote storage for this table bucket. | Gauge |
| tabletServer | table_bucket_remoteLog | size | The total log size in remote storage for this table bucket. | Gauge |
| tabletServer | table_bucket_kv_snapshot | latestSnapshotSize | The latest kv snapshot size in bytes for this table bucket. | Gauge |
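
The underMinIsr and atMinIsr gauges are 0/1 flags. A minimal sketch of how such flags could be derived from inSyncReplicasCount follows. The function name and the min_isr parameter are hypothetical, and the exact comparisons (strictly below for "under", equal for "at") are an assumption based on the descriptions above, not a statement of how Fluss computes them.

```python
def isr_flags(in_sync_replicas_count, min_isr):
    """Derive 0/1 gauge values from the in-sync replica count of one bucket."""
    return {
        # Assumption: "under" means strictly fewer in-sync replicas than min isr.
        "underMinIsr": 1 if in_sync_replicas_count < min_isr else 0,
        # Assumption: "at" means exactly min isr in-sync replicas.
        "atMinIsr": 1 if in_sync_replicas_count == min_isr else 0,
    }


healthy = isr_flags(in_sync_replicas_count=3, min_isr=2)
degraded = isr_flags(in_sync_replicas_count=1, min_isr=2)
```

Exposing the state as 0 or 1 makes it easy to sum the gauge across buckets to count how many are currently under or at min isr.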

When Flink is used to read and write, Fluss implements some key standard Flink connector metrics to measure source latency and sink output; see FLIP-33: Standardize Connector Metrics. The implemented Flink source and sink metrics are listed here.

For details on how to use Flink metrics, see the Flink metrics documentation.

Source Metrics

| Metrics Name | Level | Description | Type |
|---|---|---|---|
| currentEmitEventTimeLag | Flink Source Operator | Time difference between sending the record out of source and file creation. | Gauge |
| currentFetchEventTimeLag | Flink Source Operator | Time difference between reading the data file and file creation. | Gauge |
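
FLIP-33 defines these two lags relative to a record's event time: the fetch lag is the fetch time minus the event time, and the emit lag is the emit time minus the event time. The sketch below shows that arithmetic only; the function and parameter names are hypothetical, and all timestamps are assumed to be epoch milliseconds.

```python
def event_time_lags(event_time_ms, fetch_time_ms, emit_time_ms):
    """Compute the two FLIP-33 source lag gauges for a single record."""
    return {
        "currentFetchEventTimeLag": fetch_time_ms - event_time_ms,
        "currentEmitEventTimeLag": emit_time_ms - event_time_ms,
    }


lags = event_time_lags(event_time_ms=1_000, fetch_time_ms=1_250, emit_time_ms=1_300)
```

The gap between the two values approximates the time a record spends inside the source between being fetched and being emitted.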

Sink Metrics

| Metrics Name | Level | Description | Type |
|---|---|---|---|
| numBytesOut | Table | The total number of output bytes. | Counter |
| numBytesOutPerSecond | Table | The output bytes per second. | Meter |
| numRecordsOut | Table | The total number of output records. | Counter |
| numRecordsOutPerSecond | Table | The output records per second. | Meter |
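
Each sink counter has a matching per-second meter. As a rough illustration of how the two relate, the sketch below derives a rate from two samples of a cumulative counter. The function name and sample values are hypothetical, and Flink's own meters may use a different averaging window.

```python
def rate_per_second(prev_count, curr_count, prev_ts_s, curr_ts_s):
    """Average increase per second of a cumulative counter between two samples."""
    elapsed = curr_ts_s - prev_ts_s
    if elapsed <= 0:
        # No time has passed (or the samples are out of order): report no rate.
        return 0.0
    return (curr_count - prev_count) / elapsed


# Two hypothetical samples of numBytesOut taken 10 seconds apart.
bytes_out_per_second = rate_per_second(4_000, 9_000, prev_ts_s=100.0, curr_ts_s=110.0)
```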

Grafana template

A Grafana template for monitoring Fluss will be provided soon.