Skip to content

Conversation

@silundong
Copy link
Contributor

@silundong silundong commented Nov 6, 2025

Copy link
Contributor

@mihaibudiu mihaibudiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clearly, this could use some comments to help the review.
The code looks pretty good and compact (for what it achieves), and I expect it won't require a heroic effort to review.

I have a question about the D relations in the paper: how is the invariant that the left node of a correlate has distinct rows (required for the recursive algorithm)?
Could this be enforced somehow by the type system, by using a wrapper to represent such relations?

protected final ImmutableBitSet requiredColumns;
protected final JoinRelType joinType;
protected final ImmutableList<RelHint> hints;
protected final RexNode condition;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like a breaking change
can it be done in a backwards-compatible way?

An alternative is to introduce a new kind of RelNode, e.g, CorrelateWithCondition, and deprecate the existing Correlate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also my concern. When removing SOME/IN subqueries, the condition need to be retained in Mark type Correlate (it cannot be pulled up or pushed down). So I temporarily added the condition to Correlate and implemented some constraints to ensure the original behavior isn't affected. Should I create a new operator?

lateDecorrelate, topDownGeneralDecorrelate);
}

public RelOptFixture withTopDownGeneralDecorrelate(final boolean topDownGeneralDecorrelate) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this means "use the new algorithm instead of the traditional one"

int markIndex = join.getRowType().getFieldCount() - 1;
ImmutableBitSet projectColumns = RelOptUtil.InputFinder.bits(project.getProjects(), null);
ImmutableBitSet filterColumns = RelOptUtil.InputFinder.bits(filter.getCondition());
// Proj <- does not project marker
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this mean - no result of the project depends on this column

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The premise for simplifying MARK to SEMI/ANTI is that the parent nodes no longer use the marker column except for Filter. Therefore, I adopted the Project-Fitler-Join matching pattern to detect that there are no references to marker in the Project.

ImmutableBitSet projectColumns = RelOptUtil.InputFinder.bits(project.getProjects(), null);
ImmutableBitSet filterColumns = RelOptUtil.InputFinder.bits(filter.getCondition());
// Proj <- does not project marker
// Filter <- use marker in condition
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

condition depends on marker?

return;
}

// After decompose the filter condition by AND, there are only two cases to simplify:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after expressing the filter condition as a conjunction

List<RexNode> conjunctions = conjunctions(condition);
for (RexNode expr : conjunctions) {
if (!expr.isA(SqlKind.IS_DISTINCT_FROM) && !expr.isA(SqlKind.IS_NOT_DISTINCT_FROM)) {
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it matter what the arguments of these comparisons are?
what if they are e.g., disjunctions?

Copy link
Contributor Author

@silundong silundong Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When attempting to transform the combination of Filter(condition=not(marker)) - Join(type=mark) (preserving rows where the join condition results in FALSE) into Join(type=anti) (preserving rows where the join condition results in NULL and FALSE), it is essential to ensure that the marker doesn't contain NULL. Perhaps using !Strong.isStrong(expr) would be more comprehensive here?

SqlQuantifyOperator op = (SqlQuantifyOperator) e.op;
externalOperator = RelOptUtil.op(op.comparisonKind, SqlStdOperatorTable.EQUALS);
} else {
externalOperator = SqlStdOperatorTable.EQUALS;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be safe you should check the kind here too.
is this where the SINGLE join would be used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not use SINGLE join. For the rewrite of scalar subqueries, I still reused the original logic to generate a left join and an aggregate with single_value function (if necessary).

@silundong
Copy link
Contributor Author

Thank you for your review. Regarding the D relations, in the code it is the dedupFreeVarsNode, which is generated by calling the generateDedupFreeVarsNode after decorrelating the left input of the Correlate, and builder.distinct is used there to ensure it is duplicate free.
This is still a draft. Should this work continue? If so, I will continue to refine the code and comments.

@mihaibudiu
Copy link
Contributor

builder.distinct is used there to ensure it is duplicate free.

builder.distinct ensures that it is distinct, but in general the functions that do the rewrite do not know that there is a distinct somewhere deep in the tree. My question is whether there should be some information that promises that the rel node passed to one of these functions is guaranteed to be distinct. Maybe the fact that the functions are private is enough, since all callers guarantee this.

@mihaibudiu
Copy link
Contributor

This is still a draft. Should this work continue?

I think this is great. If it holds the promise to solve many of the current decorrelator limits, we should adopt it.
Do you see any downsides?

As I commented elsewhere, one problem I foresee is that the UNNEST array operation cannot be represented without a Correlate node in general, so a new Rel node may need to be introduced to be able to handle such plans. But this can also be done later.

@silundong
Copy link
Contributor Author

silundong commented Nov 8, 2025

Perhaps wrapping the D (dedupFreeVarsNode) to make it immutable after creation would be better. Did I understand you correctly?

Do you see any downsides?

The current draft implements the general decorrelation approach. For some very simple cases, the paper also proposes a simpler decorrelation that would yield cleaner plans. It seems that matching some fixed patterns using rules would suffice. I think the simple decorrelation approach can be considered later.

@mihaibudiu
Copy link
Contributor

@silundong this is great work, I hope you can advance it.
I think that covering all cases is not necessary in the first implementation, but having an extensible framework where new operators can be handled will enable more cases to be handled later.

@silundong
Copy link
Contributor Author

Thank you! I'll continue to advance and address the comments. I should be able to complete the updates in the next few days.
This PR doesn't include the enumerable implementation for mark join yet, I feel this PR is already quite large in scope. Should I complete it in this pr, or would it be better to create a new JIRA issue for it? I'm happy to work on it.

@mihaibudiu
Copy link
Contributor

No, I think that splitting this into multiple PRs will make it manageable for the reviewers; even reviewing this first piece will be non-trivial. Ideally the infra changes, which add all the new classes are separate, and then additional PRs can be added to handle various operators.

@iwanttobepowerful
Copy link
Contributor

After some consideration, I believe it’s necessary to split this issue into multiple subtasks.
To address the problem of decorrelating boolean context IN or existential subqueries directly into SEMI/ANTI joins for CALCITE-3373, we can adopt an approach similar to that of Umbra (https://umbra-db.com/interface/ the implementation system described in related papers). The steps would be: first convert some IN or existential subqueries into Mark Joins—given Mark Join’s strong expressiveness—then optimize them into SEMI/ANTI joins via rules, and finally hand them over to RelDecorrelator for processing.
I think introducing Mark Join will be highly useful for both the existing RelDecorrelator and the proposed TopDownGeneralDecorrelator we intend to implement. Furthermore, this approach of introducing Mark Join is much more elegant than the handling method in #4211. Additionally, Mark Join will only serve as an internal representation within Calcite: it will either be simplified into SEMI/ANTI joins or, if simplification is not feasible, converted into Calcite’s existing literal_agg expression through transformation rules. This process will be completely transparent to Calcite users.
In my opinion, it is more cost-effective to first introduce Mark Join, and then consider whether to implement the new TopDownGeneralDecorrelator framework at a later stage.

@iwanttobepowerful
Copy link
Contributor

WITH 
  dept(dname, deptno, loc) AS (
 VALUES 
   ('ACCOUNTING', 10, 'NEW YORK'),
   ('RESEARCH', 20, 'DALLAS'),
   ('SALES', 30, 'CHICAGO'),
   ('OPERATIONS', 40, 'BOSTON')
  ),
  emp(empno, ename, job, mgr, hiredate, sal, comm, deptno) AS (
 VALUES 
   (7369, 'SMITH', 'CLERK', 7902, '1980-12-17'::date, 800.00, NULL::numeric, 20),
   (7499, 'ALLEN', 'SALESMAN', 7698, '1981-02-20'::date, 1600.00, 300.00, 30),
   (7521, 'WARD', 'SALESMAN', 7698, '1981-02-22'::date, 1250.00, 500.00, 30),
   (7566, 'JONES', 'MANAGER', 7839, '1981-02-04'::date, 2975.00, NULL, 20),
   (7654, 'MARTIN', 'SALESMAN', 7698, '1981-09-28'::date, 1250.00, 1400.00, 30),
   (7698, 'BLAKE', 'MANAGER', 7839, '1981-01-05'::date, 2850.00, NULL, 30),
   (7782, 'CLARK', 'MANAGER', 7839, '1981-06-09'::date, 2450.00, NULL, 10),
   (7788, 'SCOTT', 'ANALYST', 7566, '1987-04-19'::date, 3000.00, NULL, 20),
   (7839, 'KING', 'PRESIDENT', NULL, '1981-11-17'::date, 5000.00, NULL, 10),
   (7844, 'TURNER', 'SALESMAN', 7698, '1981-09-08'::date, 1500.00, 0.00, 30),
   (7876, 'ADAMS', 'CLERK', 7788, '1987-05-23'::date, 1100.00, NULL, 20),
   (7900, 'JAMES', 'CLERK', 7698, '1981-12-03'::date, 950.00, NULL, 30),
   (7902, 'FORD', 'ANALYST', 7566, '1981-12-03'::date, 3000.00, NULL, 20),
   (7934, 'MILLER', 'CLERK', 7782, '1982-01-23'::date, 1300.00, NULL, 10)
  ),
  bonus(ENAME, JOB, SAL, COMM) AS (
 VALUES 
 ('ALLEN', 'SALESMAN', 1600.00, 300.00),
 ('WARD', 'SALESMAN', 1250.00, 500.00)
  )
select count(*) as c
from emp as e
where sal + 100 not in (
  select deptno
  from dept
  where dname = e.ename);

Input the above SQL statement into umbra-db interface, click Execute, and check the optimization steps one by one. I believe it will be an excellent reference for us.
Clipboard_Screenshot_1763644354
Clipboard_Screenshot_1763645122

.withOperandSupplier(b1 ->
b1.operand(Project.class).oneInput(b2 ->
b2.operand(Filter.class).oneInput(b3 ->
b3.operand(Join.class).predicate(join -> join.getJoinType() == JoinRelType.MARK)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also support the Correlate scenario.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After decorrelation, Correlate will convert to Join, and its condition will change. Therefore, it should not be applied before decorrelation, nor should it match Correlate.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my comment above, the Umbra system initially plans to simplify filters and mark joins (correlated) into anti joins (correlated) through the expression simplification phase. I believe we can do the same.
As you mentioned, the join conditions will change after decorrelation—and this is certainly true. However, this is the responsibility of the decorrelation phase, which does not prevent us from supporting both joins and correlated operations here.
Please point out any misunderstandings on my part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me clarify my understanding:
Firstly, the Umbra system's decorrelation should be completely based on the paper's algorithm. From the perspective of the paper algorithm, when decorrelate from Correlate→Join, the only change to the condition is the addition of natural join conditions for D attributes (IS NOT DISTINCT FROM), which ensures no NULL values are produced. Therefore, it is sufficient to evaluate the condition of the dependent join before decorrelation. This logic also applies to the TopDownGeneralDecorrelator in the current PR, as it is fully based on the paper's implementation. For TopDownGeneralDecorrelator, simplifying the MARK before decorrelation versus after decorrelation appears to make no difference. There is no necessity to match Correlate.

Secondly, if I understand correctly, you pointed out that this rule needs to match Correlate so that it can be adapted later in RelDecorrelator. To my knowledge, the logic of RelDecorrelator does not guarantee that the condition change from Correlate→Join is limited to adding IS NOT DISTINCT FROM. Therefore, pre-simplifying Mark is unsafe for RelDecorrelator (if RelDecorrelator supports to handle MARK Correlate in future).

So, for TopDownGeneralDecorrelator, there is no need to match Correlate; for RelDecorrelator, matching Correlate is unsafe.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

select empno from emp where empno not in 
(select empno from emp as emp_b where emp.ename = emp_b.ename);
// new
LogicalProject(EMPNO=[$0])
  LogicalFilter(condition=[NOT($9)])
    LogicalJoin(condition=[AND(=($0, $9), IS NOT DISTINCT FROM($1, $10))], joinType=[mark])
      LogicalTableScan(table=[[CATALOG, SALES, EMP]])
      LogicalProject(EMPNO=[$0], ENAME0=[$10])
        LogicalJoin(condition=[=($10, $1)], joinType=[inner])
          LogicalTableScan(table=[[CATALOG, SALES, EMP_B]])
          LogicalAggregate(group=[{0}])
            LogicalProject(ENAME=[$1])
              LogicalTableScan(table=[[CATALOG, SALES, EMP]])

The plan above is the one generated by your implementation; however, the more concise plan is the one below.
Could it be that this optimization was overlooked somewhere?

Project [empno#27]
+- Join LeftAnti, (((empno#27 = empno#57) OR isnull((empno#27 = empno#57))) AND (ename#28 = ename#58))
   :- Project [empno#27, ename#28]
   :  +- Relation spark_catalog.default.emp[...] parquet
   +- Project [empno#57, ename#58]
      +- Relation spark_catalog.default.emp[...] parquet

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest commit implements D-elimination optimizations, which produce better plans for scenarios where subquery correlations occur under equality conditions. You can see the improvements in the test cases :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

select empno from emp where empno not in 
(select empno from emp as emp_b where emp.ename = emp_b.ename);
// new
LogicalProject(EMPNO=[$0])
  LogicalFilter(condition=[NOT($9)])
    LogicalJoin(condition=[AND(=($0, $9), IS NOT DISTINCT FROM($1, $10))], joinType=[mark])
      LogicalTableScan(table=[[CATALOG, SALES, EMP]])
      LogicalProject(EMPNO=[$0], ENAME0=[$10])
        LogicalJoin(condition=[=($10, $1)], joinType=[inner])
          LogicalTableScan(table=[[CATALOG, SALES, EMP_B]])
          LogicalAggregate(group=[{0}])
            LogicalProject(ENAME=[$1])
              LogicalTableScan(table=[[CATALOG, SALES, EMP]])

The plan above is the one generated by your implementation; however, the more concise plan is the one below. Could it be that this optimization was overlooked somewhere?

Project [empno#27]
+- Join LeftAnti, (((empno#27 = empno#57) OR isnull((empno#27 = empno#57))) AND (ename#28 = ename#58))
   :- Project [empno#27, ename#28]
   :  +- Relation spark_catalog.default.emp[...] parquet
   +- Project [empno#57, ename#58]
      +- Relation spark_catalog.default.emp[...] parquet

I feel that your line of thinking might be heading in the wrong direction. In my humble understanding, the purpose of decorrelation is to use a general algorithm to remove the correlation, not to obtain the optimal plan (or a certain desired form) directly from the decorrelation process. Currently, Calcite also has corresponding rules to convert queries into semi/anti joins, which feels like it could be a follow-up optimization item.
I think for the first version, we should focus our energy on implementing the core algorithm into code. Perhaps you could file a JIRA ticket as an optimization item to be addressed after this PR. Otherwise, this PR might become too bloated and difficult to review.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are some scenarios where simplifying a filter + mark join (correlated) into a semi/anti join (correlated) is straightforward. This optimization can be accomplished before entering the decorrelation phase, without the need to defer it to the decorrelation stage itself.
Perhaps there are differences in the scope of problems we intend to address by introducing mark joins.
My idea is that introducing mark joins can elegantly resolve CALCITE-3373.
Perhaps your primary consideration for introducing mark joins is to adopt the decorrelation framework proposed in the paper?
I think mark joins are independent of the decorrelation framework described in the paper.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get the basic infra to do this merged, and then we can iterate.
We can mark this decorrelator as experimental until it can do everything the old one can, and after that switch.
The paper is modular: each operator is a different independent procedure. So I expect the implementation can be similar.

Code reviewers are more than welcome for this PR.

The design part of this discussion should be in Jira.

@Override public void onMatch(RelOptRuleCall call) {
final Project project = call.rel(0);
final Filter filter = call.rel(1);
final Join join = call.rel(2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also support the Correlate scenario.

@silundong silundong marked this pull request as ready for review November 21, 2025 06:41
@silundong silundong requested a review from mihaibudiu November 21, 2025 06:41
@mihaibudiu
Copy link
Contributor

I will do my best to review this, but I expect it won't be easy or very fast. Thank you.

@silundong
Copy link
Contributor Author

This commit completes the comments, introduces CorrelatePlus to support the condition attribute, and implements optimizations for D elimination. The test cases look good. I think the PR is now ready for review. @mihaibudiu PTAL, Thank you!


sql(sql)
.withRule(
CoreRules.PROJECT_SUBQUERY_REMOVE_ENABLE_MARK_JOIN,
Copy link
Contributor

@iwanttobepowerful iwanttobepowerful Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add more test cases for PROJECT_SUBQUERY_REMOVE_ENABLE_MARK_JOIN?
For example:

SELECT dept.deptno, EXISTS ( SELECT 1 FROM emp e WHERE e.deptno = dept.deptno ) AS has_employees FROM dept

or

SELECT emp.deptno, emp.deptno IN (SELECT deptno FROM dept where name IN('Sales', 'Engineering')) FROM emp

or

SELECT emp.deptno, emp.deptno IN (SELECT dept.deptno FROM dept where dept.deptno < emp.empno ) FROM emp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will update in next commit.

@silundong
Copy link
Contributor Author

The commit includes:

  1. Optimize the HepProgram configuration before and after general decorrelation.
  2. Fix a correlation detection bug in general decorrelation based on @iwanttobepowerful finding in CALCITE-7303. Thanks for the good catch.
  3. Add test cases.

/** Rule that converts sub-queries from filter expressions into
* {@link Correlate} instances. It will rewrite SOME/EXISTS/IN to a MARK type Correlate. */
public static final SubQueryRemoveRule FILTER_SUBQUERY_REMOVE_ENABLE_MARK_JOIN =
SubQueryRemoveRule.Config.FILTER_ENABLE_MARK_JOIN.toRule();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to keep the rules related to SubQueryRemove together, and their names could be unified as well. Would "FILTER_SUB_QUERY_REMOVE_MARK_JOIN" be better? "ENABLE" seems a bit odd.

import static org.apache.calcite.plan.RelOptUtil.conjunctions;

/**
* Rule to simplify a mark join to semi join or anti join.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can there be a brief comment in the Javadoc?


sql(sql)
.withRule(
CoreRules.FILTER_SUBQUERY_REMOVE_ENABLE_MARK_JOIN,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indent with 4 spaces.

@iwanttobepowerful
Copy link
Contributor

2. finding in CALCITE-7303.

I identified this bug because the implementation of RelDecorrelator.java fails to cover certain scenarios. Unexpectedly, it also exposed an issue in your implementation. The more I delve into it, the more I find that RelDecorrelator and TopDownGeneralDecorrelator share striking similarities in their core design philosophies.

Copy link
Contributor

@mihaibudiu mihaibudiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks great and is very modular.
I think mostly documentation could be improved.
We should mark this as experimental, and once it's available I will try it over our entire test suite to validate correctness.

* @see CoreRules#FILTER_SUBQUERY_REMOVE_ENABLE_MARK_JOIN
* @see CoreRules#PROJECT_SUBQUERY_REMOVE_ENABLE_MARK_JOIN
*/
public abstract class CorrelatePlus extends Correlate {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not call it ConditionalCorrelate?

correlationId, requiredColumns, joinType, condition);
}

@Override public Correlate copy(RelTraitSet traitSet,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an unfortunate problem with the design of the copy method; I had the same problem when adding ASOF JOIN.
An alternative would be the CorrelatePlus not to extend Correlate.
I think that this shows that the design of the copy API is wrong, and this method should not exist.

}
builder.join(JoinRelType.LEFT, leftJoinConditions);

// rewrite COUNT to case when
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not seem to rewrite count, but rather to project the result produced by count

// the Sort with ORDER BY and LIMIT or OFFSET have to be changed during rewriting because
// now the limit has to be enforced per value of the outer bindings instead of globally.
// It can be rewritten using ROW_NUMBER() window function and filtering on it,
// see section 4.4 in paper
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to cite both papers, the original one does not contain the sort, this algorithm is only in the second paper


if (!leftHasCorrelation && !join.getJoinType().generatesNullsOnRight()
&& join.getJoinType().projectsRight()) {
// there is no need to push down domain D to left side when both are satisfied:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when both following conditions are satisfied

// there is no need to push down domain D to left side when both are satisfied:
// 1. there is no correlation on left side
// 2. join type will not generate NULL values on right side and will project right
// the left side will start a decorrelation independently
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, the left size...

@silundong
Copy link
Contributor Author

Thank you for the detailed review! I will update in the next few days.

@silundong
Copy link
Contributor Author

This commit mainly improves code style and comments. PTAL, thank you!

Copy link
Member

@xiedeyantu xiedeyantu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous comments I raised have all been addressed. Since I'm not very familiar with this module, we might want to wait for Mihai's final review.

Copy link
Contributor

@mihaibudiu mihaibudiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one question about implementing the MARK_JOIN in the enumerable convention.
If you decide to do that, maybe we can also run some quidem tests using the new decorrelator?
The plans are very hard to evaluate by a person for correctness.

* LogicalTableScan(table=[[CATALOG, SALES, DEPT]])
* </pre></blockquote>
*
* <p> If the marker is used on only conjunctive predicates the optimizer will try to translate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean "if the marker is used only in"?
So the output of MARK_JOIN has an extra nullable boolean field in addition to the fields from the left and the right inputs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to have LEFT_MARK_JOIN and RIGHT_MARK_JOIN?
Why is a single MARK_JOIN sufficient?
Maybe you should call it LEFT_MARK_JOIN to be clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it should be "used only in".

The output of MARK JOIN is as described in the first sentence of the comment: An MARK JOIN will keep all rows from the left side and creates a new attribute to mark a tuple as having join partners from right side or not. It will not output the fields from the right side.

I looked at the SEMI/ANTI above — I think they implicitly represent (LEFT) SEMI/ANTI, and therefore define MARK directly. Should I stay consistent with that, or explicitly name it LEFT MARK? In my option, RIGHT MARK/SEMI/ANTI are variants of LEFT MARK/SEMI/ANTI to support join commutativity. Should I define them in this ticket?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you expect to have a RIGHT_MARK, then I think you should call this one a LEFT_MARK, even if you are not going to add the RIGHT_MARK in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I will update it to LEFT MARK.

@Override public RelNode convert(RelNode rel) {
@Override public @Nullable RelNode convert(RelNode rel) {
Join join = (Join) rel;
if (!Bug.TODO_FIXED && join.getJoinType() == JoinRelType.MARK) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we merge this PR without this implementation to validate actual computations?
From the description it does not look like having a native mark join in this convention is actually too much work.

@silundong
Copy link
Contributor Author

Is it a good time to create a new ticket to implement LEFT MARK JOIN in the Enumerable convention? Should this PR wait for that implementation before being merged? Please let me know your preference. :)

@asolimando
Copy link
Member

Is it a good time to create a new ticket to implement LEFT MARK JOIN in the Enumerable convention? Should this PR wait for that implementation before being merged? Please let me know your preference. :)

The PR is extremely helpful, if that's the only blocker (and IIRC the use of this new decorrelator is totally optional) I wouldn't block on that

@mihaibudiu
Copy link
Contributor

mihaibudiu commented Dec 4, 2025

Is it a good time to create a new ticket to implement LEFT MARK JOIN in the Enumerable convention? Should this PR wait for that implementation before being merged? Please let me know your preference. :)

As @asolimando says, this PR is already big enough, so you may as well add it in a separate PR, but then it means you can file the issue now about doing it.

@mihaibudiu
Copy link
Contributor

I have already approved, so from my point of view feel free to merge after the conflicts are resolved and the commits are squashed. If anyone else plans to submit a review, please let us know to wait before merging.

@silundong silundong force-pushed the subquery_corelated branch 2 times, most recently from 9a7b1d7 to 05b697f Compare December 5, 2025 02:02
@sonarqubecloud
Copy link

sonarqubecloud bot commented Dec 5, 2025

@silundong
Copy link
Contributor Author

This PR looks ready to me. Would it be possible to merge it to advance CALCITE-7315? Thank you! @mihaibudiu @asolimando @xiedeyantu

@mihaibudiu
Copy link
Contributor

Can you write some documentation someplace (could be JavaDoc or JIRA) about how one should proceed to replace the old decorrelator with this one?

Do you have a plan to migrate other tests to also use this decorrelator somehow?

@silundong
Copy link
Contributor Author

Can you write some documentation someplace (could be JavaDoc or JIRA) about how one should proceed to replace the old decorrelator with this one?

Sure, I will add some JavaDoc in TopDownDecorrelator.java.

From my observation, the most commonly used tests are RelOptRulesTest and the Quidem tests. RelOptRulesTest can already switch the decorrelator via configuration. Quidem tests involves the entire SQL execution. After completing the LEFT MARK join under the Enumerable convention, may be we can integrate the new decorrelator into the Quidem tests and make the switch configurable in the iq files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants