Iceberg Table Maintenance

Over time, Iceberg tables accumulate small data files, stale snapshots, orphaned files, and fragmented manifests — all of which slow down query planning and waste storage. SeaweedFS keeps the tables behind its built-in Iceberg REST catalog healthy with an automated maintenance worker, so you don’t need a separate compaction service or external orchestration. Any engine that queries through the catalog — Spark, Trino, Dremio, DuckDB, and more — benefits automatically.

When to use it

  • Streaming or frequent small writes — many tiny Parquet files pile up and slow scans; compaction merges them into larger ones.
  • Merge-on-read tables — position delete files accumulate and reads spend time reconciling them; delete-file rewriting consolidates them.
  • Long-lived tables — old snapshots, manifest fragmentation, and orphaned files from interrupted commits build up and need periodic cleanup.
  • You don’t want to run a separate compaction service — let SeaweedFS handle maintenance in-cluster instead of standing up external orchestration.

How to use it

Maintenance is automatic and threshold-driven — there’s nothing to schedule by hand:

  1. A detection scan runs periodically (hourly by default) across your table buckets and flags tables that have crossed configurable thresholds (too many small files, snapshots, manifests, and so on).
  2. The admin server schedules the maintenance jobs, which run on dedicated worker nodes so they stay off the query path.
  3. Each operation commits a new snapshot through the catalog. If the table head moved during planning (another writer committed first), the operation aborts without modifying the table.
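
The commit rule in step 3 is a form of optimistic concurrency: plan against the snapshot that was current when planning started, and commit only if that snapshot is still the table head. The sketch below illustrates the idea with a hypothetical in-memory `Table` class and `run_operation` helper; neither is a SeaweedFS API, and the real worker commits through the catalog rather than through an object like this.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Table:
    """Hypothetical in-memory stand-in for a catalog table (not a SeaweedFS type)."""
    head: int = 1  # id of the current snapshot
    snapshots: List[int] = field(default_factory=lambda: [1])

    def commit(self, base: int) -> bool:
        """Compare-and-swap: accept the commit only if it was planned on the current head."""
        if base != self.head:
            return False  # head moved; reject and leave the table untouched
        self.head += 1
        self.snapshots.append(self.head)
        return True


def run_operation(table: Table, plan: Callable[[Table], None]) -> bool:
    """Plan a maintenance operation, then commit it only if the head is unchanged."""
    base = table.head  # snapshot the plan is based on
    plan(table)        # planning takes time; other writers may commit meanwhile
    return table.commit(base)
```

If a writer commits while `plan` is running, `run_operation` returns `False` and the table keeps the writer's snapshot. The Iceberg REST protocol can express this kind of check as a commit requirement such as `assert-ref-snapshot-id`; whether the maintenance worker uses exactly that requirement is covered in the technical reference, not here.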

Tune the thresholds (target file size, minimum input files, snapshot retention, orphan age, etc.) to match your write patterns.
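
To make the effect of these thresholds concrete, here is a minimal sketch of how a detection scan might decide which operations a table needs. Every name and default value in it (`Thresholds`, `TableStats`, `flag_operations`, the operation labels, the numbers) is an assumption made for illustration; the actual setting names and defaults are listed in the technical reference.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Thresholds:
    # Illustrative names and defaults only; these are not SeaweedFS settings.
    target_file_size_bytes: int = 256 * 1024 * 1024
    min_input_files: int = 5
    max_snapshots: int = 100
    max_manifests: int = 50


@dataclass
class TableStats:
    """Hypothetical per-table statistics gathered by a detection scan."""
    data_file_sizes: List[int]
    snapshot_count: int
    manifest_count: int


def flag_operations(stats: TableStats, t: Thresholds) -> List[str]:
    """Return the maintenance operations whose thresholds the table has crossed."""
    ops: List[str] = []
    # Files below the target size are candidates for compaction.
    small = [s for s in stats.data_file_sizes if s < t.target_file_size_bytes]
    if len(small) >= t.min_input_files:
        ops.append("compact_data_files")
    if stats.snapshot_count > t.max_snapshots:
        ops.append("expire_snapshots")
    if stats.manifest_count > t.max_manifests:
        ops.append("rewrite_manifests")
    return ops
```

Under these assumed defaults, a streaming table with eight 1 MiB files would be flagged for compaction, while a table holding a single 300 MiB file would not. Lowering `min_input_files` makes compaction trigger sooner at the cost of more frequent rewrites; raising it does the opposite.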

Benefits

  • No separate compaction service — maintenance is built into the catalog; nothing extra to deploy or orchestrate.
  • Faster queries — fewer, larger data files and manifests cut scan and query-planning overhead.
  • Lower storage waste — expired snapshots and orphaned files are reclaimed automatically.
  • Off the query path — runs on dedicated workers, so it never competes with your engines.
  • Concurrency-safe — each operation commits through the catalog’s normal metadata path and stays consistent with concurrent reads and writes.
  • Engine-agnostic — any engine that reads through the SeaweedFS Iceberg REST catalog (Spark, Trino, Dremio, DuckDB, …) benefits with no changes.

Want the internals — the five maintenance operations, the full threshold/configuration reference, and concurrency-safety details? See the Iceberg Table Maintenance technical reference.