Feature request
Which Delta project/connector is this regarding?
- Spark
- Standalone
- Flink
- Kernel
- Other (fill in here)
Overview
Enable Delta Lake to delegate table scan planning (ServerSidePlanning) - including file discovery and credential provisioning - to external catalog implementations, allowing catalogs to inject temporary credentials and optimize file listing for large tables with fine-grained access control (FGAC).
Motivation
Current Limitations
Today, Delta Lake's Spark connector performs all table scan planning on the driver:
- Driver reads the transaction log to discover data files
- Driver lists all files matching the query predicate
- Driver distributes file paths to executors for reading
- Executors use credentials to read the table data directly from storage
This architecture works when it is acceptable to give the engine direct access to storage. However, in many scenarios it is not. For example, a table may have row-level or column-level access policies that depend on the user attempting to access it. Unless the engine is fully trusted, it should not be given access to the raw storage location of the table. In such scenarios, the catalog may want to take the query details, plan the query on the server, and provide only the specific files that need to be accessed rather than the entire table directory.
Proposed Solution
Introduce a ServerSidePlanning interface that allows catalog implementations to:
- Provide scan plans - Return list of data files to read, optionally filtered/optimized by catalog
- Inject temporary credentials - Provide short-lived, scoped credentials for accessing specific files
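To make the shape of this concrete, here is a rough Scala sketch of what the planning interface and its result could look like. All names, fields, and signatures below are illustrative assumptions for discussion; the actual interface is defined in the PRs listed under Implementation Progress.

```scala
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// Hypothetical shape of the scan plan returned by the catalog.
case class ServerSidePlannedFile(path: String, size: Long)
case class TemporaryCredentials(config: Map[String, String]) // short-lived, scoped storage credentials
case class ServerSideScanPlan(
    files: Seq[ServerSidePlannedFile],
    credentials: Option[TemporaryCredentials])

// Catalog implementations plug in here: discover the files to read (after
// applying access-control policies) and optionally vend temporary credentials.
trait ServerSidePlanningClient {
  def planScan(
      tableName: String,
      filters: Seq[Filter],           // pushed-down predicates (catalog-agnostic Spark Filters)
      projection: Option[StructType]  // pushed-down column projection
  ): ServerSideScanPlan
}
```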
High-Level Flow
- User Query: `SELECT * FROM catalog.schema.table WHERE col > 10`
- DeltaCatalog.loadTable()
  - Checks if the catalog supports ServerSidePlanning
  - If yes: Calls catalog.planTableScan(table, filters)
  - If no: Falls back to standard Delta scan planning
- Catalog Implementation
  - Reads Delta transaction log (or uses cached metadata)
  - Applies access control policies
  - Generates temporary credentials
  - Returns: `{ files: [...], credentials: {...} }`
- New DSv2 Table implementation
  - Receives scan plan from catalog
  - Injects temporary credentials into Hadoop configuration
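As an illustration of the final step, injecting catalog-vended temporary credentials into the Hadoop configuration used by the readers could look roughly like the sketch below. The S3A keys are the standard Hadoop ones; the credential key names, the actual injection mechanism, and multi-cloud coverage are assumptions here and are part of the credential tasks tracked below.

```scala
import org.apache.hadoop.conf.Configuration

// Hypothetical sketch: copy catalog-vended temporary credentials into a Hadoop
// configuration so executors can read the planned files without static keys.
def injectTemporaryCredentials(
    hadoopConf: Configuration,
    credentials: Map[String, String]): Configuration = {
  val conf = new Configuration(hadoopConf)
  credentials.get("accessKeyId").foreach(conf.set("fs.s3a.access.key", _))
  credentials.get("secretAccessKey").foreach(conf.set("fs.s3a.secret.key", _))
  credentials.get("sessionToken").foreach(conf.set("fs.s3a.session.token", _))
  // Use the temporary-credentials provider so the session token is honored.
  conf.set("fs.s3a.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
  conf
}
```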
Implementation Progress
Below is a breakdown of the implementation tasks:
Status Legend:
- ✅ Merged
- 👀 In Review
- ☑️ Waiting to Merge
- 🔄 In Progress
- 📝 Planned
Phase 1: Core Infrastructure & Generic Pushdown Support
| Task | PR Link | Status | Notes |
|---|---|---|---|
| ServerSidePlanningClient interface and ServerSidePlannedTable DSv2 implementation • Define ServerSidePlanningClient interface for remote scan planning • Implement ServerSidePlannedTable DSv2 table that uses server-provided scan plans • Core infrastructure for ServerSide planned file discovery | #5621 | ✅ | |
| Integrate DeltaCatalog with ServerSidePlanning • Integrate ServerSidePlanning into DeltaCatalog.loadTable() • Factory pattern with decision logic for when to use ServerSidePlanning • Tests for full query execution through DeltaCatalog | #5622 | ✅ | |
| Metadata abstraction and factory pattern • Define metadata trait for encapsulating catalog-specific information • Implement metadata for Unity Catalog, default catalogs, and test catalogs • Factory pattern for building planning clients from metadata | #5671 | 👀 | See the factory sketch below this table |
| Filter pushdown infrastructure • Add filter parameter to ServerSidePlanningClient.planScan() interface • Use Spark's Filter type as catalog-agnostic representation • Update TestServerSidePlanningClient to accept and capture filters for verification • Tests validating filters are passed through to planning client correctly | #5672 | 👀 | |
| Projection pushdown infrastructure • Add projection parameter to ServerSidePlanningClient.planScan() interface • Use Spark's StructType as catalog-agnostic representation • Update TestServerSidePlanningClient to accept and capture projection for verification • Tests validating projection is passed through to planning client correctly | | 📝 | |
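The metadata abstraction and factory described above (see #5671) could be shaped roughly as follows; the trait members and decision logic are assumptions for illustration only.

```scala
// Stub for the planning interface sketched under Proposed Solution.
trait ServerSidePlanningClient

// Hypothetical metadata trait encapsulating catalog-specific information.
trait ServerSidePlanningCatalogMetadata {
  def catalogName: String               // e.g. a Unity Catalog, default, or test catalog
  def planningEndpoint: Option[String]  // where scan-planning requests are sent, if any
}

object ServerSidePlanningClientFactory {
  // Build a planning client from catalog-specific metadata. When this returns
  // None, DeltaCatalog falls back to standard Delta scan planning.
  def buildClient(metadata: ServerSidePlanningCatalogMetadata): Option[ServerSidePlanningClient] =
    metadata.planningEndpoint.map { endpoint =>
      // e.g. an HTTP-based client for the catalog's ServerSidePlanning endpoint,
      // or a TestServerSidePlanningClient in tests
      ???
    }
}
```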
Phase 2: Catalog Integration & Advanced Features
| Task | PR Link | Status | Notes |
|---|---|---|---|
| Credential injection and test infrastructure • Inject temporary credentials into Hadoop configuration on executors • Tests for validating credential flow | | 📝 | |
| Add catalog server test infrastructure • HTTP server for testing catalog operations • Servlet with scan planning endpoint support • Adapter for integrating with test catalog | | 📝 | |
| Add reference catalog implementation • Catalog planning client making HTTP requests to ServerSidePlanning endpoint • Parse server's scan planning response • Integration tests with test server | | 📝 | |
| Server-side credential vending • Define credential structure for temporary credentials (S3, Azure, GCS) • Extend scan plan to include optional credentials • Extract credentials from ServerSidePlanning response | | 📝 | |
| Filter support with catalog-specific converters • Implement SupportsPushDownFilters in ServerSidePlannedScanBuilder with residual filter handling | | 📝 | See the pushdown sketch below this table |
| Projection pushdown support with catalog-specific converters • Implement SupportsPushDownRequiredColumns in ServerSidePlannedScanBuilder | | 📝 | |
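For the filter pushdown task above, a minimal sketch of a DSv2 ScanBuilder with conservative residual handling follows. It assumes ServerSidePlannedScanBuilder simply records the pushed filters and reports them all back as residuals so Spark re-evaluates them after the catalog-driven file pruning; the real catalog-specific converters and residual logic belong to the task itself.

```scala
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder, SupportsPushDownFilters}
import org.apache.spark.sql.sources.Filter

// Hypothetical sketch only: record pushed filters and hand them to the
// planning client when the scan is built.
class ServerSidePlannedScanBuilder extends ScanBuilder with SupportsPushDownFilters {

  private var pushed: Array[Filter] = Array.empty

  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    pushed = filters
    // Conservative residual handling: the catalog may use the filters only for
    // file pruning, so every filter is still evaluated by Spark after the scan.
    filters
  }

  override def pushedFilters(): Array[Filter] = pushed

  override def build(): Scan = {
    // Pass `pushed` to the planning client (e.g. planScan(..., pushed, ...))
    // and wrap the returned file list in a Scan over those files.
    ???
  }
}
```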
Follow-ups / Future Work
| Description | Issue | Status |
|---|---|---|
| Test special characters in table/catalog/schema names (hyphens, etc.) - Add test coverage for edge cases in identifier handling | TBD | 📝 |
| Support additional authentication mechanisms such as OAuth | TBD | 📝 |
| Performance analysis and improvements | TBD | 📝 |
| Metrics and observability | TBD | 📝 |
Willingness to contribute
The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?
- Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
- Yes. I can contribute this feature independently.
- No. I cannot contribute this feature at this time.