Skip to content

Conversation

@adriangb
Copy link
Contributor

@adriangb adriangb commented Jun 27, 2025

Adds predicate adaptation to fill in missing struct fields with null, i.e. struct_col.field_missing_in_file -> null.

@adriangb
Copy link
Contributor Author

Marking as draft until I have time do a review myself. @kosiew feel free to take a look if you have time but I expect this needs work before it's ready.

@adriangb adriangb marked this pull request as draft June 27, 2025 17:25
@adriangb adriangb self-assigned this Jun 27, 2025
@github-actions github-actions bot added physical-expr Changes to the physical-expr crates datasource Changes to the datasource crate labels Jun 27, 2025
Cargo.toml Outdated
datafusion-macros = { path = "datafusion/macros", version = "48.0.0" }
datafusion-optimizer = { path = "datafusion/optimizer", version = "48.0.0", default-features = false }
datafusion-physical-expr = { path = "datafusion/physical-expr", version = "48.0.0", default-features = false }
datafusion-physical-expr-adapter = { path = "datafusion/physical-expr-adapter", version = "48.0.0", default-features = false }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to make a new package because physical-expr does not have access to functions-nested

@adriangb adriangb requested a review from kosiew June 27, 2025 19:03
@adriangb adriangb marked this pull request as ready for review June 27, 2025 20:35
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @adriangb -- I am not sure about the approach of using struct and get_field

@kosiew do you perhaps have some time to review this PR as well?

};
use datafusion_physical_expr_common::physical_expr::PhysicalExpr;

/// Build a struct expression by recursively extracting and rewriting fields from a source struct.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add an example of what this is doing?

It seems like if the table schema is like {a: int, b:int} and the file only has {a: int} this function will build up an expression like

struct(a, source.a, b, null)

Something about relying on struct and field extraction feels very wrong to me. Among other things now this casting may not be consistent with what happens when someone tries to call CAST(..) manually

Is it the case that build_struct_expr would be unecessary if cast had native support for casting structs to structs and filling in nulls 🤔

How about this for an alternate idea:

  1. Add a cast wrapper in datafusion with the same signature as arrow-rs's cast
  2. If the arguments were both Struct types, apply the special DataFusion casting logic(fill in missing fields with nulls)
  3. otherwise just call into the existing arrow cast kernel

That would

  1. Avoid the need to call the struct and get_field functions
  2. Allow other parts of the system, like coercsion, and SQL, to have the same semantics

We have talked about adding cast for evolving structs in arrow-rs for a while but it seems the consensus was to put it in DataFusion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something about relying on struct and field extraction feels very wrong to me. Among other things now this casting may not be consistent with what happens when someone tries to call CAST(..) manually

Does it have to match? These seem like two similar but not exactly identical use cases to me.

How about this for an alternate idea:

That seems reasonable to me.

Avoid the need to call the struct and get_field functions

I am probably missing something but thinking about it to me it didn't seem like those calls would be much of a problem. Either way a new struct column is going to have to be built and the existing struct column is going to have to be read from disk in its entirety.

But the pushback made me think that maybe we should be doing something else here: we should be inspecting the expression more deeply and in the case of col.field if field is not present in the physical schema we replace the whole get_field(col, 'field') expression with null instead of get_col(struct('other', get_field(col, 'other'), 'field', null))) which is basically what is happening now. Then we avoid reading the column altogether. We can do something similar with casts. I think this is complimentary to your proposal because the fallthrough case for select struct_col can call cast if needed like any other column, and our internal cast implementation can do the "smart things" with structs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something about relying on struct and field extraction feels very wrong to me. Among other things now this casting may not be consistent with what happens when someone tries to call CAST(..) manually

Does it have to match? These seem like two similar but not exactly identical use cases to me.

I worry that if they don't match we'll have potentially divergent behavior which could be confusing to users.

Also another use case is when doing stuff like coercing structs (so for example you could compare two struct columns for equality or UNION them together)

I am probably missing something but thinking about it to me it didn't seem like those calls would be much of a problem. Either way a new struct column is going to have to be built and the existing struct column is going to have to be read from disk in its entirety.

Yeah, I don't think it will be that much faster (maybe it would save some virtual function calls or something). What I am really concerned about is having consistent cast behavior

But the pushback made me think that maybe we should be doing something else here: we should be inspecting the expression more deeply and in the case of col.field if field is not present in the physical schema we replace the whole get_field(col, 'field') expression with null instead of get_col(struct('other', get_field(col, 'other'), 'field', null))) which is basically what is happening now. Then we avoid reading the column altogether. We can do something similar with casts. I think this is complimentary to your proposal because the fallthrough case for select struct_col can call cast if needed like any other column, and our internal cast implementation can do the "smart things" with structs.

👍

@adriangb
Copy link
Contributor Author

adriangb commented Jul 1, 2025

@alamb I've reworked this as per discussion in #16589 (comment). This now leaves the actual casting work up the cast functions, which means that we can do the work there and it will just trickle through to here. I do still recommend moving forward with this PR because if we ever wanted to reference functions-nested or similar in this predicate adaptation code we'll have to create this new module anyway. There is a world where we don't offer this rewrite by default in these APIs (which would force the whole column to be read and cast) and instead make a hook and plug it in via SessionContext or something like that. But I think that sounds more complicated and the downsides of having the new module are not that great (in fact there's pros for compile time, etc.).

Copy link
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some observations that stood out to me.

Continuing to review...

Ok(Transformed::no(expr))
}

fn try_rewrite_struct_field_access(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new try_rewrite_struct_field_access handles missing struct fields by returning Null. I tried adding a test case for nested structs (e.g., a.b.c) to ensure recursive behavior but it failed.

#[test]
    fn test_rewrite_nested_struct_missing_field() {
        let physical_schema = Schema::new(vec![Field::new(
            "nested",
            DataType::Struct(
                vec![Field::new(
                    "a",
                    DataType::Struct(vec![Field::new("b", DataType::Utf8, true)].into()),
                    true,
                )]
                .into(),
            ),
            true,
        )]);

        let logical_schema = Schema::new(vec![Field::new(
            "nested",
            DataType::Struct(
                vec![Field::new(
                    "a",
                    DataType::Struct(
                        vec![
                            Field::new("b", DataType::Utf8, true),
                            Field::new("c", DataType::Int32, true),
                        ]
                        .into(),
                    ),
                    true,
                )]
                .into(),
            ),
            true,
        )]);

        let rewriter = PhysicalExprSchemaRewriter::new(&physical_schema, &logical_schema);

        let column_expr = Arc::new(Column::new("nested", 0));

        let result = rewriter.rewrite(column_expr).unwrap();

        let expected = Arc::new(CastExpr::new(
            Arc::new(Column::new("nested", 0)),
            DataType::Struct(
                vec![Field::new(
                    "a",
                    DataType::Struct(
                        vec![
                            Field::new("b", DataType::Utf8, true),
                            Field::new("c", DataType::Int32, true),
                        ]
                        .into(),
                    ),
                    true,
                )]
                .into(),
            ),
            None,
        )) as Arc<dyn PhysicalExpr>;

        assert_eq!(result.to_string(), expected.to_string());
    }
called `Result::unwrap()` on an `Err` value: Execution("Cannot cast column 'nested' from 'Struct(a Struct(b Utf8))' (physical data type) to 'Struct(a Struct(b Utf8, c Int32))' (logical data type)")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's expected: the point is that we need to implement struct casting as part of the cast operator in general, not as part of this PR. That's the point @alamb made in #16589 (comment). Is there any reason why we haven't done that since you've basically implemented it for SchemaAdapter? I'd think it's the same code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically: I think we need to incorporate your logic from #1637 into CastExpr. Somewhat related to apache/arrow-rs#6735

Copy link
Contributor

@kosiew kosiew Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aaa..... 🤔
The PR Title - implement predicate adaptation for nested structs and

Functionality equivalent to https://github.com/apache/datafusion/pull/16371 but for https://github.com/apache/datafusion/pull/16461.

gave me the expectation that this PR implements nested struct adaption already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. It used to. I've updated the title and PR description

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can file a ticket to make sure this is properly tracked

@adriangb adriangb changed the title feat: implement predicate adaptation for nested structs feat: implement predicate adaptation missing fields of structs Jul 2, 2025
}

#[test]
fn test_rewrite_mulit_column_expr_with_type_cast() {
Copy link
Contributor

@kosiew kosiew Jul 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason for replacing this multi column test with the single column test_rewrite_column_with_type_cast?

}

#[test]
fn test_rewrite_no_change_needed() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason for removing this test?

@adriangb adriangb force-pushed the nested-expr branch 2 times, most recently from f214376 to 5d5daa9 Compare August 4, 2025 16:08
@github-actions github-actions bot added the core Core DataFusion crate label Aug 4, 2025
@adriangb adriangb requested review from alamb and kosiew August 4, 2025 19:30
@adriangb
Copy link
Contributor Author

adriangb commented Aug 4, 2025

@kosiew @alamb this is ready for another round of review

@adriangb
Copy link
Contributor Author

adriangb commented Aug 6, 2025

Nevermind need to fix the merge. Sorry for the ping.

Looks good actually, just import churn.

@alamb
Copy link
Contributor

alamb commented Aug 6, 2025

I will put this on my review queue for tomorrow

@alamb alamb changed the title feat: implement predicate adaptation missing fields of structs feat: add datafusion-physical-adapter, implement predicate adaptation missing fields of structs Aug 7, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @adriangb -- this PR makes sense to me

It would be great to unify this struct coercion/casting logic eventually

This handles cases such as `lit(ScalarValue::Int32(123)) = int64_column` by rewriting it to `lit(ScalarValue::Int32(123)) = cast(int64_column, 'Int32')`
(note: this does not attempt to then simplify such expressions, that is done by shared simplifiers).

## Overview
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend putting this information on the doc comments of PhysicalExprSchemaRewriter itself as it will be more likely someone will find it (either in docs.rs or their IDE) than this readme

The high level overview is cool but I recommend keeping the README relatively brief

Ok(Transformed::no(expr))
}

fn try_rewrite_struct_field_access(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can file a ticket to make sure this is properly tracked

None => return Ok(None),
};

if get_field_expr.name() != "get_field" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to check the function name. Isn't the check for downcast_ref().is_none sufficient?

Also, it seems to me that this rewrite is specific to GetFieldFunc so I think it might make sense, long term, to put the function alongside the GetFieldFunc somehow 🤔

Maybe we could add a trait to PhysicalExpr like rewrite_for_schema or something 🤔

Copy link
Contributor Author

@adriangb adriangb Aug 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen / used this pattern a lot. I added 50b1e91 which should simplify it not just for this PR but any other use of matching a specific function from a PhysicalExpr as well.

None => return Ok(None),
};

let field_name = match lit.value() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 548 to 558
let result = adapter.rewrite(column_expr);
assert!(result.is_err());
let error_msg = result.unwrap_err().to_string();
assert_contains!(error_msg, "Cannot cast column 'data'");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can make this simpler by using unwrap_err I think

Suggested change
let result = adapter.rewrite(column_expr);
assert!(result.is_err());
let error_msg = result.unwrap_err().to_string();
assert_contains!(error_msg, "Cannot cast column 'data'");
let error_msg = adapter.rewrite(column_expr).unwrap_err().to_string();
assert_contains!(error_msg, "Cannot cast column 'data'");

Similarly for other ones below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! 16ed5a0

adriangb and others added 6 commits August 10, 2025 11:01
Add back three tests that were removed during the refactoring:
- test_rewrite_mulit_column_expr_with_type_cast: Tests complex multi-column expressions
- test_rewrite_no_change_needed: Tests that expressions are unchanged when no transformation needed
- test_adapt_batches: Example showing how to use the rewriter with RecordBatches

These tests provide important coverage and documentation of the API.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
This commit:
- Reverts the API change from trait-based to struct-based design
- Keeps the new datafusion-physical-expr-adapter crate to avoid circular dependencies
- Keeps the try_rewrite_struct_field_access functionality (the main feature)
- Restores PhysicalExprAdapter and PhysicalExprAdapterFactory traits
- Keeps all tests using the trait-based API

The only structural change that remains is moving the code to a new crate,
which is necessary to avoid circular dependencies.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@adriangb adriangb mentioned this pull request Aug 10, 2025
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Aug 10, 2025
@adriangb
Copy link
Contributor Author

adriangb commented Aug 10, 2025

@alamb I've also:

Once you confirm those are okay I think we can merge this. I still have concerns about the lack of full support for nested structs but (1) I think partial support is better than no support, we can add it later and (2) I would like to make the breaking change of moving the code ASAP while these public APIs are still brand new and have few users

@alamb
Copy link
Contributor

alamb commented Aug 12, 2025

@kosiew is this PR ok with you?

@kosiew
Copy link
Contributor

kosiew commented Aug 15, 2025

@alamb, @adriangb,

Yes, PR is ok with me.

@adriangb adriangb merged commit e3d3257 into apache:main Aug 15, 2025
30 checks passed
@adriangb adriangb deleted the nested-expr branch August 15, 2025 13:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate datasource Changes to the datasource crate documentation Improvements or additions to documentation physical-expr Changes to the physical-expr crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants