-
Notifications
You must be signed in to change notification settings - Fork 1.8k
feat: Add constant column extraction and rewriting for projections in ParquetOpener #19136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: Add constant column extraction and rewriting for projections in ParquetOpener #19136
Conversation
|
So quick! |
adriangb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will create some conflicts with #19128 but this is small and self contained so I see no reason to not proceed with it now and I can deal with the conflicts later... would help to get a review over there and associated PRs 🎣
| if !constant_columns.is_empty() { | ||
| predicate = predicate | ||
| .map(|expr| { | ||
| if is_dynamic_physical_expr(&expr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this clause? What breaks if we remove it? I'd think that rewriting the dynamic expression would work - it would try to rewrite it's children, which shouldn't cause any issues. Once snapshot is called the produces expression should have the remapped children.
| projection.try_map_exprs(|expr| rewrite_physical_expr_with_constants(expr, constants)) | ||
| } | ||
|
|
||
| fn rewrite_physical_expr_with_constants( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks very similar to https://github.com/apache/datafusion/pull/19128/files#diff-6bad7e4ee6dbc3a498e3fee746f2c3c18bdcf237d7cd12226e392f9b9c3d2fbe, we should be able to use it for partition values as well 😄
|
@Weijun-H let me know if you want to merge or chat about the feedback, I don't want to merge something that conflicts and then have this good work go stale |
This PR does some refactoring of `PhysicalExprAdapter` and `PhysicalExprSimplifier` that I found necessary and/or beneficial while working on #19111. ## Changes made ### Replace `PhysicalExprAdapter::with_partition_values` with `replace_columns_with_literals` This is a nice improvement because it: 1. Makes the `PhysicalExprAdapter` trait that users might need to implement simpler (less boilerplate for users). 2. Decouples these two transformations so that we can replace partition values and then apply a projection without having to also do the schema mapping (it would be from the logical schema to the logical schema, confusing and a waste of compute). I ran into this need in #19111. I think there may be other ways of doing it (e.g. piping in the expected output schema from ParquetSource) but it felt nicer this way and I expect other places may also need the decoupled transformation. 3. I think we can use it in the future to implement #19089 (edit: evidently I was right, see identical function in #19136). 4. It's less lines of code 😄 This will require any users calling `PhysicalExprAdapter` directly to change their code, I can add an entry to the upgrade guide. ### Remove partition pruning logic from `FilePruner` and deprecate now unused `PrunableStatistics` and `CompositePruningStatistics`. Since we replace partition values with literals we no longer need these structures, they get handled like any other literals. This is a good chunk of code / complexity that we can bin off. ### Use `TableSchema` instead of `SchemaRef` + `Vec<FieldRef>` in `ParquetOpener` `TableSchema` is basically `SchemaRef` + `Vec<FieldRef>` already and since `ParquetSource` has a `TableSchema` its less code and less clones to propagate that into `ParquetOpener` --------- Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Rationale for this change
Use file/group statistics to detect constant (including all-NULL) columns so we can avoid reading/decoding useless data and simplify predicates for earlier pruning.
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
No