OptimizationSpec:
+ https://iceberg.apache.org/docs/1.4.2/spark-procedures/
A table that has many files
- should have those files aggregated
+ Given data
Datum(0,label_0,0,2025-05-13,2025-05-13 16:37:59.688)
Datum(1,label_1,1,2025-05-12,2025-05-13 16:37:59.888)
Datum(2,label_2,2,2025-05-11,2025-05-13 16:38:00.088)
...
+ And 20000 rows are initially written to table 'polaris.my_namespace.OptimizationSpec'
+ When we execute the SQL:
CALL system.rewrite_data_files(
table => "polaris.my_namespace.OptimizationSpec",
options => map('min-input-files','2'))
+ Then the files added to the original 4 are:
/tmp/polaris/my_namespace/OptimizationSpec/data/00000-235-9cfc76b5-601d-47f0-84a6-f179131c5861-0-00001.parquet
+ And there are no files deleted from the subsequent 5
+ And these new files contain all the data
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
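The file counts in the scenario above (4 original files, 1 added, 5 present afterwards) can be illustrated with a toy model. This is a minimal sketch, not Iceberg's implementation: it ignores target file sizes and partitioning, and the `DataFile` class, the file paths, and the even split of 5000 rows per file are assumptions made for illustration (the log gives only the totals).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str   # hypothetical path, not taken from the log
    rows: int


def rewrite_data_files(live_files, min_input_files=2):
    """Toy compaction: merge the live files into one when there are enough of them.

    Returns (new_live_files, added_files). Nothing is physically deleted here;
    the old files are only dropped from the live set.
    """
    if len(live_files) < min_input_files:
        return list(live_files), []
    merged = DataFile("data/compacted-00001.parquet",
                      sum(f.rows for f in live_files))
    return [merged], [merged]


# Assumed layout: 4 files holding 20000 rows in total, split evenly.
original = [DataFile(f"data/part-{i:05d}.parquet", 5000) for i in range(4)]
live, added = rewrite_data_files(original, min_input_files=2)

# The old files are still on disk next to the new one.
on_disk = original + added
```

In this model the single new file carries all 20000 rows, while the four original files remain on disk until a later cleanup step removes them.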
- should have snapshots removed when expired
+ Given there are already 5 files for table polaris.my_namespace.OptimizationSpec
+ When we execute the SQL:
CALL system.expire_snapshots(
table => "polaris.my_namespace.OptimizationSpec",
older_than => TIMESTAMP '2025-05-13 16:38:07.184',
stream_results => true)
+ Then old files have been removed and only 1 remains
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
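Why expiring snapshots leaves a single file can be sketched with a small reachability model. This is a hedged illustration rather than Iceberg's code: the snapshot timestamps and file paths are hypothetical, only the `older_than` cutoff comes from the log, and the `retain_last=1` default mirrors the documented default of the procedure.

```python
from datetime import datetime


def expire_snapshots(snapshots, older_than, retain_last=1):
    """Toy expiry over a list of (timestamp, set_of_file_paths) snapshots.

    Keeps snapshots at or after the cutoff plus the newest `retain_last`,
    then reports files no longer referenced by any kept snapshot.
    """
    ordered = sorted(snapshots, key=lambda s: s[0])
    first_protected = len(ordered) - retain_last
    kept = [s for i, s in enumerate(ordered)
            if i >= first_protected or s[0] >= older_than]
    reachable = set().union(*(files for _, files in kept))
    all_files = set().union(*(files for _, files in ordered))
    return kept, all_files - reachable


# Hypothetical history: one write of 4 files, then a compaction into 1 file.
write_files = {f"data/part-{i:05d}.parquet" for i in range(4)}
compacted = {"data/compacted-00001.parquet"}
history = [
    (datetime(2025, 5, 13, 16, 38, 0), write_files),   # assumed timestamp
    (datetime(2025, 5, 13, 16, 38, 5), compacted),     # assumed timestamp
]

cutoff = datetime(2025, 5, 13, 16, 38, 7, 184000)       # older_than from the log
kept, removed = expire_snapshots(history, cutoff)
remaining = (write_files | compacted) - removed
```

Both snapshots are older than the cutoff, but the newest one is retained, so only the four pre-compaction files become unreachable and the compacted file is the one that remains.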