Monitor Metrics
Fluss has a built-in metrics system for measuring the behavior of clusters and tables, such as the number of active CoordinatorServers, the number of tables, the bytes written, and the number of records written.
Fluss supports different metric types: Counters, Gauges, Histograms, and Meters.
Gauge
: Provides a value of any type at a point in time.
Counter
: Used to count values by incrementing and decrementing.
Histogram
: Measures the statistical distribution of a set of values, including the min, max, mean, standard deviation, and percentiles.
Meter
: Measures the rate at which events occur. The meter's rate is exported as a gauge.
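To make the four types concrete, the following is a minimal Python sketch of their semantics. It is illustrative only: the class names mirror the type names above, but the implementation (including the whole-lifetime average in `Meter` and the injectable clock) is an assumption and not Fluss's actual code.

```python
import statistics
import time


class Counter:
    """Counts values by incrementing and decrementing."""

    def __init__(self):
        self.count = 0

    def inc(self, n=1):
        self.count += n

    def dec(self, n=1):
        self.count -= n


class Gauge:
    """Reports whatever the supplier callback returns at the moment it is read."""

    def __init__(self, supplier):
        self._supplier = supplier

    def value(self):
        return self._supplier()


class Histogram:
    """Summarizes the distribution of all recorded values."""

    def __init__(self):
        self._values = []

    def update(self, value):
        self._values.append(value)

    def snapshot(self):
        if not self._values:
            return {}
        values = sorted(self._values)
        return {
            "min": values[0],
            "max": values[-1],
            "mean": statistics.fmean(values),
            "stddev": statistics.pstdev(values),
            "p50": statistics.median(values),
        }


class Meter:
    """Counts events and exposes their average rate per second since creation."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self.count = 0

    def mark(self, n=1):
        self.count += n

    def rate(self):
        elapsed = self._clock() - self._start
        return self.count / elapsed if elapsed > 0 else 0.0
```

A real meter usually averages over a sliding window rather than the whole lifetime; the simpler form is used here only to show that a meter's reported value is a rate rather than a running total.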
The Fluss client also provides built-in metrics that measure write and read operations against the Fluss cluster. These can be bridged to Flink through the Flink connector standard metrics.
Scope
Every metric is assigned an identifier and a set of key-value pairs under which the metric will be reported.
The identifier is delimited by `metrics.scope.delimiter`. Currently, `metrics.scope.delimiter` is not configurable; it is determined by the metric reporter. Taking Prometheus as an example, the scope is delimited by `_`, so a scope looks like `A_B_C`. Because Fluss metrics always begin with `fluss`, the full identifier is `fluss_A_B_C`.
The key-value pairs are called variables and are used to filter metrics. There are no restrictions on the number or order of variables. Variables are case-sensitive.
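As a rough illustration of how an identifier and its variables fit together, the sketch below renders one sample the way a Prometheus-style reporter might: scope components joined with `_` behind the `fluss` prefix, and variables attached as labels. The function name `render_sample` and the variable names `table` and `bucket` are hypothetical and chosen only for this example.

```python
def render_sample(scope_parts, variables, value, delimiter="_"):
    """Render one metric sample as `identifier{key="value",...} value`."""
    identifier = delimiter.join(["fluss", *scope_parts])
    # Variables carry no ordering guarantee, so sort the keys for stable output.
    labels = ",".join(f'{key}="{variables[key]}"' for key in sorted(variables))
    if labels:
        return f"{identifier}{{{labels}}} {value}"
    return f"{identifier} {value}"


print(render_sample(["A", "B", "C"], {"table": "orders", "bucket": "0"}, 42))
```

Because variables are case-sensitive, `table` and `Table` would be treated as two different keys when filtering.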
Reporter
For information on how to set up Fluss's metric reporters, see the Metric Reporters page.
Metrics List
By default, Fluss provides cluster state metrics, table state metrics, and bridging to Flink connector standard metrics. This section is a reference of all these metrics.
The tables below generally feature 5 columns:
- The "Scope" column describes which scope format is used to generate the system scope. For example, if the cell contains `tabletserver`, then the scope format for `fluss_tabletserver` is used. If the cell contains multiple values separated by a slash, the metrics are reported multiple times for different entities, for example for both `tabletserver` and `coordinator`.
- The (optional) "Infix" column describes which infix is appended to the scope.
- The "Metrics" column lists the names of all metrics that are registered for the given scope and infix.
- The "Description" column provides information as to what a given metric is measuring.
- The "Type" column describes which metric type is used for the measurement.
Thus, in order to infer the metric identifier:
- Start with the `fluss_` prefix.
- Take the scope format based on the "Scope" column.
- Append the value in the "Infix" column if present, accounting for the `metrics.scope.delimiter` setting.
- Append the metric name.
For example, one metric reported to Prometheus looks like `fluss_tabletserver_status_JVM_CPU_load`.
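The steps above can be sketched as a small helper. This is a minimal illustration that assumes the `_` delimiter used by the Prometheus reporter and treats `-` in the "Infix" column as "no infix"; the function name `metric_identifiers` is hypothetical.

```python
def metric_identifiers(scope_cell, infix, metric_name, delimiter="_"):
    """Return one identifier per scope listed in a "Scope" table cell."""
    identifiers = []
    # A cell such as "coordinator/tabletserver" means the metric is
    # reported once for each listed entity.
    for scope in scope_cell.split("/"):
        parts = ["fluss", scope.strip()]
        if infix and infix != "-":
            parts.append(infix)
        parts.append(metric_name)
        identifiers.append(delimiter.join(parts))
    return identifiers
```

For the first row of the CPU table, `metric_identifiers("coordinator/tabletserver", "status_JVM_CPU", "load")` yields `fluss_coordinator_status_JVM_CPU_load` and `fluss_tabletserver_status_JVM_CPU_load`.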
CPU
Scope | Infix | Metrics | Description | Type |
---|---|---|---|---|
coordinator/tabletserver | status_JVM_CPU | load | The recent CPU usage of the JVM. | Gauge |
coordinator/tabletserver | status_JVM_CPU | time | The CPU time used by the JVM. | Gauge |
Memory
The memory-related metrics require Oracle's memory management (also included in OpenJDK's Hotspot implementation) to be in place. Some metrics might not be exposed when using other JVM implementations (e.g. IBM's J9).
Scope | Infix | Metrics | Description | Type |
---|---|---|---|---|
coordinator/tabletserver | status_JVM_memory | heap_used | The amount of heap memory currently used (in bytes). | Gauge |
coordinator/tabletserver | status_JVM_memory | heap_committed | The amount of heap memory guaranteed to be available to the JVM (in bytes). | Gauge |
coordinator/tabletserver | status_JVM_memory | heap_max | The maximum amount of heap memory that can be used for memory management (in bytes). This value is not necessarily equal to the maximum value specified through -Xmx or the equivalent Fluss configuration parameter. Some GC algorithms allocate heap memory that is not available to user code and is therefore not exposed through the heap metrics. | Gauge |
coordinator/tabletserver | status_JVM_memory | nonHeap_used | The amount of non-heap memory currently used (in bytes). | Gauge |
coordinator/tabletserver | status_JVM_memory | nonHeap_committed | The amount of non-heap memory guaranteed to be available to the JVM (in bytes). | Gauge |
coordinator/tabletserver | status_JVM_memory | nonHeap_max | The maximum amount of non-heap memory that can be used for memory management (in bytes). | Gauge |
coordinator/tabletserver | status_JVM_memory | metaspace_used | The amount of memory currently used in the Metaspace memory pool (in bytes). | Gauge |
coordinator/tabletserver | status_JVM_memory | metaspace_committed | The amount of memory guaranteed to be available to the JVM in the Metaspace memory pool (in bytes). | Gauge |
coordinator/tabletserver | status_JVM_memory | metaspace_max | The maximum amount of memory that can be used in the Metaspace memory pool (in bytes). | Gauge |
coordinator/tabletserver | status_JVM_memory | direct_count | The number of buffers in the direct buffer pool. | Gauge |
coordinator/tabletserver | status_JVM_memory | direct_memoryUsed | The amount of memory used by the JVM for the direct buffer pool (in bytes). | Gauge |
coordinator/tabletserver | status_JVM_memory | direct_totalCapacity | The total capacity of all buffers in the direct buffer pool (in bytes). | Gauge |
coordinator/tabletserver | status_JVM_memory | mapped_count | The number of buffers in the mapped buffer pool. | Gauge |
coordinator/tabletserver | status_JVM_memory | mapped_memoryUsed | The amount of memory used by the JVM for the mapped buffer pool (in bytes). | Gauge |
coordinator/tabletserver | status_JVM_memory | mapped_totalCapacity | The total capacity of all buffers in the mapped buffer pool (in bytes). | Gauge |
Threads
Scope | Infix | Metrics | Description | Type |
---|---|---|---|---|
coordinator/tabletserver | status_JVM_threads | count | The total number of live threads. | Gauge |
GarbageCollection
Scope | Infix | Metrics | Description | Type |
---|---|---|---|---|
coordinator/tabletserver | status_JVM_GC | <Collector/all>_count | The total number of collections that have occurred for the given (or all) collector. | Gauge |
coordinator/tabletserver | status_JVM_GC | <Collector/all>_time | The total time spent performing garbage collection for the given (or all) collector. | Gauge |
coordinator/tabletserver | status_JVM_GC | <Collector/all>_timeMsPerSecond | The time (in milliseconds) spent garbage collecting per second for the given (or all) collector. | Gauge |
Coordinator Server
Scope | Infix | Metrics | Description | Type |
---|---|---|---|---|
coordinator | - | activeCoordinatorCount | The number of active CoordinatorServers in this cluster. | Gauge |
coordinator | - | activeTabletServerCount | The number of active TabletServers in this cluster. | Gauge |
coordinator | - | offlineBucketCount | The total number of offline buckets in this cluster. | Gauge |
coordinator | - | tableCount | The total number of tables in this cluster. | Gauge |
coordinator | - | bucketCount | The total number of buckets in this cluster. | Gauge |
Tablet Server
Scope | Infix | Metrics | Description | Type |
---|---|---|---|---|
tabletserver | - | replicationBytesInPerSecond | The number of bytes per second written to follower replicas for data synchronization. | Meter |
tabletserver | - | replicationBytesOutPerSecond | The number of bytes per second read from leader replicas for data synchronization. | Meter |
tabletserver | - | leaderCount | The total number of leader replicas in this TabletServer. | Gauge |
tabletserver | - | replicaCount | The total number of replicas (including follower replicas) in this TabletServer. | Gauge |
tabletserver | - | writerIdCount | The number of writer IDs. | Gauge |
tabletserver | - | delayedOperationsSize | The delayed operations size in this TabletServer. | Gauge |
Request
Scope | Infix | Metrics | Description | Type |
---|---|---|---|---|
coordinator | request | requestQueueSize | The size of the network waiting queue on the CoordinatorServer node. | Gauge |
tabletserver | request | requestQueueSize | The size of the network waiting queue on the TabletServer node. | Gauge |
tabletserver | request | requestPerSecond | The total number of requests processed per second by the TabletServer node. | Meter |
tabletserver | request | requestErrorPerSecond | The total number of requests per second that resulted in an error on the TabletServer node. | Meter |
tabletserver | request | totalTimeMs | The total time it takes for the current TabletServer node to process a request. | Histogram |
tabletserver | request | requestProcessTimeMs | The time the current TabletServer node spends processing a request. | Histogram |
tabletserver | request | requestQueueTimeMs | The time a request spends waiting in the network waiting queue of this TabletServer node. | Histogram |
client | request | bytesInPerSecond | The number of bytes returned from other servers per second. | Gauge |
client | request | bytesOutPerSecond | The number of bytes sent from the client to other servers per second. | Meter |
client | request | requestsPerSecond | The number of requests sent from this client to other servers per second. | Meter |
client | request | responsesPerSecond | The number of responses returned from other servers per second. | Meter |
client | request | requestLatencyMs | The request latency. | Gauge |
client | request | requestsInFlight | The number of in-flight requests sent from the client to other servers. | Gauge |
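The two TabletServer request meters are often combined into an error ratio for alerting. The helper below is a minimal sketch of that calculation; its name and the choice to report `0.0` for an idle server are assumptions, not something Fluss provides.

```python
def request_error_ratio(request_per_second, request_error_per_second):
    """Return the fraction of processed requests that ended in an error."""
    if request_per_second <= 0:
        # No traffic: report 0.0 rather than dividing by zero.
        return 0.0
    return request_error_per_second / request_per_second
```

For instance, 200 requests per second with 5 errors per second gives a ratio of 0.025.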
Table/Bucket
Scope | Infix | Metrics | Description | Type |
---|---|---|---|---|
tabletserver | table | messagesInPerSecond | The number of messages written per second to this table. | Meter |
tabletserver | table | bytesInPerSecond | The number of bytes written per second to this table. | Meter |
tabletserver | table | bytesOutPerSecond | The number of bytes read per second from this table. | Meter |
tabletserver | table | totalProduceLogRequestsPerSecond | The number of produce log requests per second that write log data to this table. | Meter |
tabletserver | table | failedProduceLogRequestsPerSecond | The number of failed produce log requests per second that write log data to this table. | Meter |
tabletserver | table | totalFetchLogRequestsPerSecond | The number of fetch log requests per second that read log data from this table. | Meter |
tabletserver | table | failedFetchLogRequestsPerSecond | The number of failed fetch log requests per second that read log data from this table. | Meter |
tabletserver | table | totalPutKvRequestsPerSecond | The number of put kv requests per second that write key-value data to this table. | Meter |
tabletserver | table | failedPutKvRequestsPerSecond | The number of failed put kv requests per second that write key-value data to this table. | Meter |
tabletserver | table | totalLookupRequestsPerSecond | The number of lookup requests per second that look up values by key in this table. | Meter |
tabletserver | table | failedLookupRequestsPerSecond | The number of failed lookup requests per second that look up values by key in this table. | Meter |
tabletserver | table | remoteLogCopyBytesPerSecond | The number of bytes of log data copied to remote storage per second. | Meter |
tabletserver | table | remoteLogCopyRequestsPerSecond | The number of remote log copy requests per second that copy the local log to remote storage. | Meter |
tabletserver | table | remoteLogCopyErrorPerSecond | The number of failed remote log copy requests per second that copy the local log to remote storage. | Meter |
tabletserver | table | remoteLogDeleteRequestsPerSecond | The number of requests per second that delete remote log data after the log TTL. | Meter |
tabletserver | table | remoteLogDeleteErrorPerSecond | The number of failed requests per second that delete remote log data after the log TTL. | Meter |
tabletserver | table_bucket | inSyncReplicasCount | The number of in-sync replicas of this table bucket. | Gauge |
tabletserver | table_bucket | underMinIsr | If this bucket is under the min ISR, this value is 1; otherwise it is 0. | Gauge |
tabletserver | table_bucket | atMinIsr | If this bucket is at the min ISR, this value is 1; otherwise it is 0. | Gauge |
tabletserver | table_bucket | isrExpandsPerSecond | The number of ISR expansions per second. | Meter |
tabletserver | table_bucket | isrShrinksPerSecond | The number of ISR shrinks per second. | Meter |
tabletserver | table_bucket | failedIsrUpdatesPerSecond | The number of failed ISR updates per second. | Meter |
tabletserver | table_bucket_log | numSegments | The number of segments in local storage for this table bucket. | Gauge |
tabletserver | table_bucket_log | endOffset | The end offset in local storage for this table bucket. | Gauge |
tabletserver | table_bucket_log | size | The total log size in local storage for this table bucket. | Gauge |
tabletserver | table_bucket_log | flushPerSecond | The number of log flushes per second. | Meter |
tabletserver | table_bucket_log | flushLatencyMs | The log flush latency in ms. | Histogram |
tabletserver | table_bucket_remoteLog | numSegments | The number of segments in remote storage for this table bucket. | Gauge |
tabletserver | table_bucket_remoteLog | endOffset | The end offset in remote storage for this table bucket. | Gauge |
tabletserver | table_bucket_remoteLog | size | The total log size in remote storage for this table bucket. | Gauge |
tabletserver | table_bucket_kv_snapshot | latestSnapshotSize | The latest kv snapshot size in bytes for this table bucket. | Gauge |
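The `underMinIsr` and `atMinIsr` gauges are typically read together. The following sketch maps the two 0/1 values of a table bucket to a coarse health label; the label names and the function `isr_state` are hypothetical and not part of Fluss.

```python
def isr_state(under_min_isr, at_min_isr):
    """Classify a table bucket from its underMinIsr and atMinIsr gauge values."""
    if under_min_isr == 1:
        return "under-min-isr"  # the bucket is below the min ISR
    if at_min_isr == 1:
        return "at-min-isr"     # the bucket sits exactly at the min ISR
    return "healthy"
```

Checking `underMinIsr` first keeps the more severe condition from being masked if both gauges were ever reported as 1 during a transition.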
Flink connector standard metrics
When Flink is used to read from and write to Fluss, Fluss implements some key standard Flink connector metrics to measure source latency and sink output; see FLIP-33: Standardize Connector Metrics. The implemented Flink source and sink metrics are listed here.
For details on how to use Flink metrics, see Flink metrics.
Source Metrics
Metrics Name | Level | Description | Type |
---|---|---|---|
currentEmitEventTimeLag | Flink Source Operator | Time difference between sending the record out of source and file creation. | Gauge |
currentFetchEventTimeLag | Flink Source Operator | Time difference between reading the data file and file creation. | Gauge |
Sink Metrics
Metrics Name | Level | Description | Type |
---|---|---|---|
numBytesOut | Table | The total number of output bytes. | Counter |
numBytesOutPerSecond | Table | The output bytes per second. | Meter |
numRecordsOut | Table | The total number of output records. | Counter |
numRecordsOutPerSecond | Table | The output records per second. | Meter |
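Each `PerSecond` meter in this table reports the rate of change of the counter with the matching name. As a minimal sketch of that relationship, the helper below derives an average rate from two counter samples. The function name and the `(timestamp_seconds, value)` sample shape are assumptions for illustration, and Flink's own meter implementation may average over a different window.

```python
def rate_per_second(earlier, later):
    """Average per-second rate between two (timestamp_seconds, counter_value) samples."""
    (t0, v0), (t1, v1) = earlier, later
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / (t1 - t0)
```

For example, if `numBytesOut` reads 100 at second 0 and 600 at second 10, the average output rate over that interval is 50 bytes per second.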
Grafana template
A Grafana template for monitoring Fluss will be provided soon.