[Feature Request][Spark] Support ServerSide Table Scan Planning for Fine-Grained Access Control #5623

@murali-db

Description

Feature request

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Overview

Enable Delta Lake to delegate table scan planning (ServerSidePlanning), including file discovery and credential provisioning, to external catalog implementations. This allows catalogs to inject temporary credentials and to optimize file listing for large tables governed by fine-grained access control (FGAC).

Motivation

Current Limitations

Today, Delta Lake's Spark connector performs all table scan planning on the driver:

  1. Driver reads the transaction log to discover data files
  2. Driver lists all files matching the query predicate
  3. Driver distributes file paths to executors for reading
  4. Executors use credentials to read the table data directly from storage

This architecture works when it is acceptable to give the engine direct access to storage. However, in many scenarios it is not, for example when a table has row-level or column-level access policies that depend on the user running the query. Unless the engine is fully trusted, it should not be given access to the table's raw storage location. In such scenarios, the catalog may want to take the query details, plan the query on the server side, and return only the specific files that need to be accessed rather than exposing the entire table directory.

Proposed Solution

Introduce a ServerSidePlanning interface (sketched below) that allows catalog implementations to:

  1. Provide scan plans - Return list of data files to read, optionally filtered/optimized by catalog
  2. Inject temporary credentials - Provide short-lived, scoped credentials for accessing specific files
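
A minimal sketch of what such an interface could look like is below. The ScanPlan, ScanFile, and TemporaryCredentials shapes are illustrative assumptions rather than a settled API; only the ServerSidePlanningClient / planScan() names come from the implementation tasks tracked further down.

```scala
// Hypothetical sketch only: the case class shapes below are assumptions.
import org.apache.spark.sql.connector.catalog.Identifier

/** Temporary, scoped credentials returned by the catalog (e.g. for S3, Azure, or GCS). */
case class TemporaryCredentials(properties: Map[String, String])

/** A single data file that the catalog allows this query to read. */
case class ScanFile(path: String, sizeInBytes: Long, partitionValues: Map[String, String])

/** The catalog's answer to a planning request: files to read, plus optional credentials. */
case class ScanPlan(files: Seq[ScanFile], credentials: Option[TemporaryCredentials])

/** Implemented by catalogs that can plan table scans on the server side. */
trait ServerSidePlanningClient {
  /** Plan a scan of `tableIdent`, returning only the files the current user may read. */
  def planScan(tableIdent: Identifier): ScanPlan
}
```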

High-Level Flow

  1. User Query: SELECT * FROM catalog.schema.table WHERE col > 10
  2. DeltaCatalog.loadTable() (decision logic sketched after this flow)
    • Checks if the catalog supports ServerSidePlanning
    • If yes: Calls catalog.planTableScan(table, filters)
    • If no: Falls back to standard Delta scan planning
  3. Catalog Implementation
    • Reads Delta transaction log (or uses cached metadata)
    • Applies access control policies
    • Generates temporary credentials
    • Returns: { files: [...], credentials: {...} }
  4. New DSv2 Table implementation
    • Receives scan plan from catalog
    • Injects temporary credentials into Hadoop configuration
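
A rough sketch of the decision in step 2, assuming the client interface sketched earlier. The class and helper names here (PlanningAwareDeltaCatalog, serverSidePlanningClientFor, buildServerSidePlannedTable) are placeholders, not the actual DeltaCatalog code:

```scala
// Illustrative only: how loadTable() might branch between server-side planning
// and the standard Delta code path.
import org.apache.spark.sql.connector.catalog.{DelegatingCatalogExtension, Identifier, Table}

class PlanningAwareDeltaCatalog extends DelegatingCatalogExtension {

  override def loadTable(ident: Identifier): Table = {
    serverSidePlanningClientFor(ident) match {
      case Some(client) =>
        // The catalog plans the scan: it applies FGAC policies, prunes files,
        // and optionally vends temporary credentials.
        val plan = client.planScan(ident)
        // Wrap the server-provided plan in the ServerSidePlanned DSv2 table.
        buildServerSidePlannedTable(ident, plan)
      case None =>
        // No server-side planning available: fall back to standard Delta scan planning.
        super.loadTable(ident)
    }
  }

  /** Placeholder: decide whether this catalog/table supports server-side planning. */
  private def serverSidePlanningClientFor(ident: Identifier): Option[ServerSidePlanningClient] = None

  /** Placeholder: build the DSv2 table backed by the server-provided scan plan. */
  private def buildServerSidePlannedTable(ident: Identifier, plan: ScanPlan): Table = ???
}
```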

Implementation Progress

Below is a breakdown of the implementation tasks:

Status Legend:

  • ✅ Merged
  • 👀 In Review
  • ☑️ Waiting to Merge
  • 🔄 In Progress
  • 📝 Planned

Phase 1: Core Infrastructure & Generic Pushdown Support

ServerSidePlanningClient interface and ServerSidePlannedTable DSv2 implementation (PR #5621)
  • Define ServerSidePlanningClient interface for remote scan planning
  • Implement ServerSidePlannedTable DSv2 table that uses server-provided scan plans
  • Core infrastructure for server-side planned file discovery

Integrate DeltaCatalog with ServerSidePlanning (PR #5622)
  • Integrate ServerSidePlanning into DeltaCatalog.loadTable()
  • Factory pattern with decision logic for when to use ServerSidePlanning
  • Tests for full query execution through DeltaCatalog

Metadata abstraction and factory pattern (PR #5671, 👀 In Review)
  • Define metadata trait for encapsulating catalog-specific information
  • Implement metadata for Unity Catalog, default catalogs, and test catalogs
  • Factory pattern for building planning clients from metadata

Filter pushdown infrastructure (PR #5672, 👀 In Review)
  • Add filter parameter to the ServerSidePlanningClient.planScan() interface
  • Use Spark's Filter type as a catalog-agnostic representation
  • Update TestServerSidePlanningClient to accept and capture filters for verification
  • Tests validating that filters are passed through to the planning client correctly

Projection pushdown infrastructure (📝 Planned)
  • Add projection parameter to the ServerSidePlanningClient.planScan() interface
  • Use Spark's StructType as a catalog-agnostic representation
  • Update TestServerSidePlanningClient to accept and capture the projection for verification
  • Tests validating that the projection is passed through to the planning client correctly
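
A possible shape for the pushdown-aware planScan() described in the last two entries, reusing the ScanPlan sketch from earlier; the parameter names are illustrative, and the Filter/StructType choices follow the wording of the tasks above:

```scala
// Hypothetical extension of the planning-client interface with filter and
// projection pushdown; not the final signature.
import org.apache.spark.sql.connector.catalog.Identifier
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

trait ServerSidePlanningClient {
  /**
   * Plan a scan of `tableIdent`.
   *
   * @param filters    catalog-agnostic predicates (Spark `Filter`s) the server may use to
   *                   prune files; predicates it cannot evaluate stay behind as residual filters
   * @param projection the columns actually required by the query, as a Spark `StructType`
   */
  def planScan(
      tableIdent: Identifier,
      filters: Seq[Filter],
      projection: Option[StructType]): ScanPlan
}
```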

Phase 2: Catalog Integration & Advanced Features

Credential injection and test infrastructure (📝 Planned)
  • Inject temporary credentials into the Hadoop configuration on executors (one possible approach is sketched after this list)
  • Tests for validating the credential flow

Add catalog server test infrastructure (📝 Planned)
  • HTTP server for testing catalog operations
  • Servlet with scan planning endpoint support
  • Adapter for integrating with the test catalog

Add reference catalog implementation (📝 Planned)
  • Catalog planning client making HTTP requests to the ServerSidePlanning endpoint
  • Parse the server's scan planning response
  • Integration tests with the test server

Server-side credential vending (📝 Planned)
  • Define credential structure for temporary credentials (S3, Azure, GCS)
  • Extend the scan plan to include optional credentials
  • Extract credentials from the ServerSidePlanning response

Filter support with catalog-specific converters (📝 Planned)
  • Implement SupportsPushDownFilters in ServerSidePlannedScanBuilder with residual filter handling

Projection pushdown support with catalog-specific converters (📝 Planned)
  • Implement SupportsPushDownRequiredColumns in ServerSidePlannedScanBuilder
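
As one illustration of the credential-related items above, S3 session credentials returned with a scan plan could be copied into the Hadoop configuration using the standard S3A settings. The property names read from the TemporaryCredentials sketch are assumptions, and Azure/GCS would need their own equivalents:

```scala
// Illustrative only: apply catalog-vended S3 session credentials to the Hadoop
// configuration that executors use for reads.
import org.apache.hadoop.conf.Configuration

def injectS3TemporaryCredentials(hadoopConf: Configuration, creds: TemporaryCredentials): Unit = {
  // Standard Hadoop S3A settings for short-lived session credentials.
  hadoopConf.set("fs.s3a.aws.credentials.provider",
    "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
  hadoopConf.set("fs.s3a.access.key", creds.properties("accessKeyId"))     // assumed key name
  hadoopConf.set("fs.s3a.secret.key", creds.properties("secretAccessKey")) // assumed key name
  hadoopConf.set("fs.s3a.session.token", creds.properties("sessionToken")) // assumed key name
}
```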

Follow-ups / Future Work

  • Test special characters in table/catalog/schema names (hyphens, etc.) - add test coverage for edge cases in identifier handling (issue: TBD, 📝 Planned)
  • Support additional authentication mechanisms such as OAuth (issue: TBD, 📝 Planned)
  • Performance analysis and improvements (issue: TBD, 📝 Planned)
  • Metrics and observability (issue: TBD, 📝 Planned)

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • Yes. I can contribute this feature independently.
  • No. I cannot contribute this feature at this time.
