Catalog-Aware Table Commits #5229
jackye1995
started this conversation in
Ideas
One thing is unique about Iceberg compared to the other table formats (Delta, Lance, Paimon, Hudi): a commit in Iceberg MUST go through a catalog.
This is convenient because the catalog has full visibility into all activity on a table, regardless of how the table is used. A table can never be updated by some process without the catalog being aware of it, as would happen if the commit went directly to storage.
When designing Lance Namespace, we introduced an experimental idea of storage-managed vs implementation-managed tables, based on the Iceberg design. A storage-managed Lance table is committed directly against storage, whereas an implementation-managed Lance table acts like an Iceberg table: the namespace implementation decides how to commit and how to determine the source of truth for which version of the Lance table is the latest, and that can vary across implementations (e.g. by storing a latest_manifest_location property). That way, similar to Iceberg, every change to a table has to go through UpdateTable, giving much more control to the data infra team, which is typically the table admin and owner.
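To make the implementation-managed idea concrete, here is a minimal sketch of a namespace that tracks the latest manifest as a table property and only accepts an update when the caller's view is current. Everything except the names UpdateTable and latest_manifest_location is an assumption made for illustration: the InMemoryNamespace class, the update_table signature, and the CommitConflict error are hypothetical and are not part of the Lance Namespace spec.

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when the caller's view of the latest manifest is stale."""


@dataclass
class InMemoryNamespace:
    # Hypothetical stand-in for a namespace implementation's table registry:
    # table id -> dict of table properties.
    tables: dict = field(default_factory=dict)

    def update_table(self, table_id, expected_location, new_location):
        """Compare-and-swap on the property that records the latest manifest."""
        props = self.tables.setdefault(table_id, {})
        current = props.get("latest_manifest_location")
        if current != expected_location:
            # Someone else committed first; the caller must refresh and retry.
            raise CommitConflict(
                f"expected {expected_location!r}, found {current!r}"
            )
        props["latest_manifest_location"] = new_location
        return new_location
```

Because the swap happens inside the namespace implementation, no writer can advance the table without the namespace seeing it, which is the property borrowed from Iceberg.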
Over the past few months of bouncing the idea off a few groups, I have received 2 general pieces of feedback:
So here is the refreshed design I am proposing:
CommitTable, which will accept a Lance transaction to be committed against the underlying table. The operation should be able to accept all the transaction types we support: https://lance.org/format/table/transaction/#transaction-types. The expectation is that the transaction is sent to the operation to be committed, and the implementation will commit following the Lance dataset storage commit protocol https://lance.org/format/table/transaction/#commit-protocol such that the new version is still numerically ordered in the _versions directory.

Move the conflict_resolver.rs logic into the format specification, so we can precisely define the expected behavior when the server sees a pair of conflicting commits. The server can choose not to follow it exactly (e.g. some pairs of conflicts are not handled and simply fail), but if the server chooses to handle a conflict, it should do so according to the spec to avoid a wrong commit outcome.

Now, to go back to the question of how to ensure ALL commits go through a namespace implementation: I think the implementation will be able to do that through access control.
During the credentials vending phase, the access granted to a writer of the Lance table should NOT include write access to the _versions directories, in both the main root location and any branch dataset location that should be controlled. This ensures that no one can commit to the table without going through the centralized CommitTable endpoint, and CommitTable will be the one that does the actual commit.
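The access rule above can be sketched as a simple path check. This is only an illustration of the rule, not of how credentials are actually vended: a real implementation would express it in the storage system's own policy language. The function name writer_may_write and the example paths (including the branch layout) are assumptions made for this sketch.

```python
from pathlib import PurePosixPath

# Directory name taken from the proposal; writers must never write into it.
PROTECTED_DIR = "_versions"


def writer_may_write(table_root: str, object_path: str) -> bool:
    """Return True if a vended writer credential may write object_path.

    Writes are allowed anywhere under the table root except inside any
    _versions directory, whether at the main root or under a nested
    (e.g. branch) dataset location.
    """
    root = PurePosixPath(table_root)
    path = PurePosixPath(object_path)
    try:
        relative = path.relative_to(root)
    except ValueError:
        # The object is outside the table entirely.
        return False
    return PROTECTED_DIR not in relative.parts
```

With a rule like this, a writer can still upload data files, but only the CommitTable endpoint holds credentials that can create a new entry under _versions.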
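Returning to the CommitTable operation proposed above, the server-side flow could look roughly like the following in-memory sketch: check the incoming transaction against every commit that landed after the version the writer read, then claim the next version number with a put-if-absent write so that versions stay numerically ordered. The Transaction fields, the VersionsDir class, and the "overlapping fragments" conflict rule are all placeholders assumed for illustration; the real rules depend on the transaction types and are what the conflict_resolver.rs logic defines.

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when a transaction cannot be committed."""


@dataclass(frozen=True)
class Transaction:
    read_version: int             # version the writer based its work on
    touched_fragments: frozenset  # hypothetical conflict footprint


@dataclass
class VersionsDir:
    """In-memory stand-in for a table's _versions directory."""
    manifests: dict = field(default_factory=dict)  # version -> Transaction

    def latest(self) -> int:
        return max(self.manifests, default=0)

    def put_if_absent(self, version: int, txn: Transaction) -> bool:
        # Models an atomic "create only if it does not exist" write.
        if version in self.manifests:
            return False
        self.manifests[version] = txn
        return True


def commit_table(versions: VersionsDir, txn: Transaction, max_retries: int = 3) -> int:
    """Commit txn as the next version, or raise CommitConflict."""
    for _ in range(max_retries):
        latest = versions.latest()
        # Compare against every commit made after the writer's read version.
        for v in range(txn.read_version + 1, latest + 1):
            other = versions.manifests[v]
            if other.touched_fragments & txn.touched_fragments:
                raise CommitConflict(f"conflicts with version {v}")
        # Claim the next number; if another commit won the race, re-check.
        if versions.put_if_absent(latest + 1, txn):
            return latest + 1
    raise CommitConflict("too many concurrent commits")
```

A stale writer whose changes do not overlap with later commits is simply placed at the next version, while an overlapping one fails. That is the kind of pairwise behavior the specification would need to pin down.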