Data Cleanup
Note
The data cleanup feature supports only append tables; cleanup of index manifests, changelog, and statistics is not supported. Do not use this feature with primary key tables.
This document describes three cleanup capabilities: orphan file cleanup, partition expiration, and snapshot expiration.
Orphan File Cleanup
Description
Orphan file cleanup currently supports only append tables.
It runs as an independent task: construct a CleanContext and launch an OrphanFilesCleaner.
Detailed Steps
1. List all Paimon-specific subdirectories under the table directory (e.g., manifest/, snapshot/, f1=10/bucket-0, …).
2. Based on the subdirectories from step 1, enumerate all Paimon files in the table directory.
3. Using snapshot information, determine all in-use manifest files and data files.
4. Compute the set of files that appear in step 2 but not in step 3 (i.e., orphan files). Among those, delete files whose modification time is earlier than older_than_ms.
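The set difference and age filter in the last step can be sketched as follows. This is a minimal illustration in Python, not the library's API: the function name and the path-to-mtime dictionary are hypothetical, and only older_than_ms comes from the text above.

```python
from typing import Dict, List, Set


def find_deletable_orphans(
    all_files: Dict[str, int],  # hypothetical: path -> modification time in ms (step 2)
    used_files: Set[str],       # hypothetical: paths referenced by snapshots (step 3)
    older_than_ms: int,
) -> List[str]:
    """Return orphan files whose modification time is earlier than older_than_ms."""
    # Orphans are files that exist on disk but are not referenced by any snapshot.
    orphans = set(all_files) - used_files
    # Only orphans older than the threshold are deleted; newer ones may belong
    # to a write that has not been committed yet.
    return sorted(path for path in orphans if all_files[path] < older_than_ms)
```

With such a filter, a recently written file that no snapshot references yet is left alone, while an unreferenced file older than the threshold is returned for deletion.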
Performance Considerations
Steps 1, 2, and 4 use an executor to parallelize I/O operations.
Orphan file cleanup may take a long time; you can pass an executor with more threads to accelerate the process.
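To illustrate why a larger executor helps, the sketch below lists several directories concurrently using Python's standard ThreadPoolExecutor. The function name is hypothetical and the code uses the local filesystem only; it shows the general pattern of fanning out I/O over an executor, not the cleaner's actual implementation.

```python
import os
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Dict, Iterable


def list_files_parallel(directories: Iterable[str], executor: Executor) -> Dict[str, int]:
    """Return path -> modification time (ms) for regular files directly under each directory."""

    def list_one(directory: str) -> Dict[str, int]:
        result = {}
        with os.scandir(directory) as entries:
            for entry in entries:
                if entry.is_file():
                    result[entry.path] = entry.stat().st_mtime_ns // 1_000_000
        return result

    merged: Dict[str, int] = {}
    # Each directory listing is an independent I/O task, so more worker
    # threads allow more listings to be in flight at once.
    for partial in executor.map(list_one, directories):
        merged.update(partial)
    return merged
```

A caller would create the pool once, for example `ThreadPoolExecutor(max_workers=16)`, and pass it in; the thread count is the tuning knob referred to above.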
TODO
Support cleanup of index manifests, changelog, statistics, etc.
Expiring Partitions
Description
Executed within the Commit task. First construct a FileStoreCommit.
DropPartition uses a mark-delete strategy. Calling DropPartition will send an Overwrite message, marking all data files in the specified partition as DELETE.
A new snapshot is committed afterward. Actual deletion of data files occurs as snapshots expire.
Detailed Steps
1. Build a ScanFilter for the partition and use the latest snapshot to scan the partition.
2. Iterate over the scanned data file list (ManifestEntries) and rewrite each entry’s type to DELETE.
3. Commit using the rewritten ManifestEntries. If the commit fails, retry a limited number of times.
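The mark-delete flow above can be sketched in a few lines of Python. Everything here is illustrative: the ManifestEntry dataclass, the scan_latest and try_commit callbacks, and the default of three retries are hypothetical stand-ins, and re-scanning on every attempt is an assumption rather than documented behavior.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class ManifestEntry:  # hypothetical, simplified entry
    kind: str         # "ADD" or "DELETE"
    partition: str
    file_name: str


def mark_deleted(entries: List[ManifestEntry]) -> List[ManifestEntry]:
    """Rewrite every scanned entry's type to DELETE; no file is touched."""
    return [replace(entry, kind="DELETE") for entry in entries]


def drop_partition(
    scan_latest: Callable[[str], List[ManifestEntry]],
    try_commit: Callable[[List[ManifestEntry]], bool],
    partition: str,
    max_retries: int = 3,
) -> bool:
    """Scan, mark, and commit; retry a limited number of times on failure."""
    for _ in range(max_retries):
        # Assumption: re-scan on each attempt so a retry sees the latest snapshot.
        entries = scan_latest(partition)
        if try_commit(mark_deleted(entries)):
            return True
    return False
```

Because only entry types change, the data files stay on disk until snapshot expiration removes them, which is what makes the operation a mark-delete.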
Expiring Snapshots
Description
Executed within the Commit task. First construct a FileStoreCommit.
The following optional configuration parameters control snapshot expiration:
- snapshot.num-retained.min: The minimum number of completed snapshots to retain (>= 1). Default: 10.
- snapshot.num-retained.max: The maximum number of completed snapshots to retain (>= snapshot.num-retained.min). Default: int32 max value.
- snapshot.time-retained: The maximum age of completed snapshots to retain. Default: 1 hour.
- snapshot.expire.limit: The maximum number of snapshots allowed to expire at a time. Default: 10.
The snapshot expiration interface deletes data files according to the expiration policy and returns the number of snapshots deleted.
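One plausible way the four parameters could combine into an expiration range is sketched below. This is an assumption-laden illustration, not the documented algorithm: the ExpireConfig class, the commit_time_ms dictionary, and the precedence shown (count limits first, then snapshot age) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ExpireConfig:  # hypothetical container for the four options above
    num_retained_min: int = 10
    num_retained_max: int = 2**31 - 1
    time_retained_ms: int = 60 * 60 * 1000
    expire_limit: int = 10


def expire_range(
    earliest_id: int,
    latest_id: int,
    commit_time_ms: Dict[int, int],  # hypothetical: snapshot id -> commit time in ms
    now_ms: int,
    config: ExpireConfig,
) -> Tuple[int, int]:
    """Return the half-open range [begin_inclusive, end_exclusive) to expire."""
    # Snapshots below this id exceed num-retained.max and expire regardless of age.
    age_check_from = max(latest_id - config.num_retained_max + 1, earliest_id)
    # Never cut into the newest num-retained.min snapshots.
    end = latest_id - config.num_retained_min + 1
    # Never expire more than expire.limit snapshots in one call.
    end = min(end, earliest_id + config.expire_limit)
    end = max(end, earliest_id)  # empty range when nothing can expire
    for snapshot_id in range(age_check_from, end):
        # Stop at the first snapshot still younger than time-retained.
        if now_ms - commit_time_ms[snapshot_id] <= config.time_retained_ms:
            return earliest_id, snapshot_id
    return earliest_id, end
```

Under these assumptions, the number of snapshots expired by one call is the length of the returned range, which is capped by snapshot.expire.limit.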
Detailed Steps
1. Use snapshot_manager to find earliest_snapshot_id and latest_snapshot_id.
2. Based on earliest/latest_snapshot_id and the expire config, determine the range of snapshots to clean.
Note
Consumer subscription management (consumer manager) is not currently supported. Users must ensure that snapshots to be expired are not in use.
3. Verify that the snapshot range is continuous, which is normally the case. If a snapshot is missing, assume that the earlier snapshots were already cleaned and that any files they left behind are orphaned remnants of I/O exceptions; those files are out of scope for this cleanup.
4. Clean data files for the updated expiration range.
   - To decide whether a file from a snapshot should be deleted, check if it was marked DELETE in the delta of the subsequent snapshot.
   - For an expiration range [begin, end), iterate over (begin, end] and delete data files whose type is DELETE in each snapshot.DeltaManifestList().
   - If a file underwent multiple ADD and DELETE operations, deletion follows the operation order:
     - ADD then DELETE → the file is deleted.
     - DELETE then ADD → the initial DELETE does not apply (file did not exist yet); the subsequent ADD ensures the file remains.
5. Clean meta files:
   - Preserve manifests used by the snapshot that bounds the cleanup range (end_exclusive_id), since that snapshot is retained.
   - Delete manifest files used by snapshots from begin_inclusive_id to end_exclusive_id (exclusive) and delete the snapshot files themselves.
6. Rewrite EarliestHint to end_exclusive_id.
7. Return the number of snapshots deleted, i.e., end_exclusive_id - begin_inclusive_id.
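The ADD/DELETE ordering rule from the data-file cleanup step can be sketched as a single ordered pass over delta entries. The function name and the (kind, file_name) tuple representation are hypothetical; the sketch only demonstrates that the last operation on a file decides its fate.

```python
from typing import Iterable, Set, Tuple


def data_files_to_delete(delta_entries: Iterable[Tuple[str, str]]) -> Set[str]:
    """Replay (kind, file_name) entries in operation order and return files to delete."""
    to_delete: Set[str] = set()
    for kind, file_name in delta_entries:
        if kind == "ADD":
            # A later ADD cancels an earlier DELETE: the file must remain.
            to_delete.discard(file_name)
        elif kind == "DELETE":
            # A DELETE (after an ADD or on its own) schedules the file for deletion.
            to_delete.add(file_name)
        else:
            raise ValueError(f"unknown entry kind: {kind}")
    return to_delete
```

Replaying the entries in order, instead of collecting every DELETE entry blindly, is what keeps a file that was deleted and then re-added from being removed.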
Performance Considerations
Step 4 uses an executor to parallelize file deletions.
If deletion is slow, pass an executor with more threads to accelerate the process.
TODO
- Preserve tag (savepoint) data via tagManager.
- Delete changelog files.
- Remove empty directories.