[CALCITE-7031] Implement the general decorrelation algorithm (Neumann & Kemper) #4619

silundong · 2025-11-06T06:15:55Z

mihaibudiu

Clearly, this could use some comments to help the review.
The code looks pretty good and compact (for what it achieves), and I expect it won't require a heroic effort to review.

I have a question about the D relations in the paper: how is the invariant that the left node of a correlate has distinct rows (required for the recursive algorithm) maintained?
Could this be enforced somehow by the type system, by using a wrapper to represent such relations?

mihaibudiu · 2025-11-06T17:09:52Z

core/src/main/java/org/apache/calcite/rel/core/Correlate.java

  protected final ImmutableBitSet requiredColumns;
  protected final JoinRelType joinType;
  protected final ImmutableList<RelHint> hints;
+  protected final RexNode condition;


this looks like a breaking change
can it be done in a backwards-compatible way?

An alternative is to introduce a new kind of RelNode, e.g., CorrelateWithCondition, and deprecate the existing Correlate.

This is also my concern. When removing SOME/IN subqueries, the condition needs to be retained in the MARK-type Correlate (it cannot be pulled up or pushed down). So I temporarily added the condition to Correlate and implemented some constraints to ensure the original behavior isn't affected. Should I create a new operator?

core/src/main/java/org/apache/calcite/rel/core/JoinRelType.java

core/src/main/java/org/apache/calcite/rel/rules/CoreRules.java

mihaibudiu · 2025-11-06T17:33:27Z

testkit/src/main/java/org/apache/calcite/test/RelOptFixture.java

+        lateDecorrelate, topDownGeneralDecorrelate);
+  }
+
+  public RelOptFixture withTopDownGeneralDecorrelate(final boolean topDownGeneralDecorrelate) {


I assume this means "use the new algorithm instead of the traditional one"

mihaibudiu · 2025-11-06T23:18:57Z

core/src/main/java/org/apache/calcite/rel/rules/MarkToSemiOrAntiJoinRule.java

+    int markIndex = join.getRowType().getFieldCount() - 1;
+    ImmutableBitSet projectColumns = RelOptUtil.InputFinder.bits(project.getProjects(), null);
+    ImmutableBitSet filterColumns = RelOptUtil.InputFinder.bits(filter.getCondition());
+    // Proj       <- does not project marker


what does this mean - no result of the project depends on this column

The premise for simplifying MARK to SEMI/ANTI is that the parent nodes no longer use the marker column except for Filter. Therefore, I adopted the Project-Filter-Join matching pattern to detect that there are no references to the marker in the Project.
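To make the premise concrete, here is a minimal sketch of the column-usage check. It uses java.util.BitSet instead of Calcite's ImmutableBitSet so that it stays self-contained, and the class and method names (MarkerUsageSketch, markerOnlyUsedByFilter) are hypothetical, not taken from the PR.

```java
import java.util.BitSet;

/** Hypothetical helper for illustration only; not part of Calcite or this PR. */
final class MarkerUsageSketch {
  /** Builds a BitSet with the given column indexes set. */
  static BitSet columns(int... indexes) {
    BitSet bits = new BitSet();
    for (int i : indexes) {
      bits.set(i);
    }
    return bits;
  }

  /**
   * Returns true when the marker column is read by the Filter condition
   * but by none of the Project expressions, which is the precondition
   * described above for simplifying MARK to SEMI/ANTI.
   */
  static boolean markerOnlyUsedByFilter(BitSet projectColumns,
      BitSet filterColumns, int markIndex) {
    return !projectColumns.get(markIndex) && filterColumns.get(markIndex);
  }
}
```

With a marker at index 9, a Project that reads columns {0, 1} and a Filter that reads {9} would qualify; a Project that reads {0, 9} would not.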

mihaibudiu · 2025-11-06T23:19:10Z

core/src/main/java/org/apache/calcite/rel/rules/MarkToSemiOrAntiJoinRule.java

+    ImmutableBitSet projectColumns = RelOptUtil.InputFinder.bits(project.getProjects(), null);
+    ImmutableBitSet filterColumns = RelOptUtil.InputFinder.bits(filter.getCondition());
+    // Proj       <- does not project marker
+    //  Filter    <- use marker in condition


condition depends on marker?

mihaibudiu · 2025-11-06T23:20:35Z

core/src/main/java/org/apache/calcite/rel/rules/MarkToSemiOrAntiJoinRule.java

+      return;
+    }
+
+    // After decompose the filter condition by AND, there are only two cases to simplify:


after expressing the filter condition as a conjunction

mihaibudiu · 2025-11-06T23:30:29Z

core/src/main/java/org/apache/calcite/rel/rules/MarkToSemiOrAntiJoinRule.java

+    List<RexNode> conjunctions = conjunctions(condition);
+    for (RexNode expr : conjunctions) {
+      if (!expr.isA(SqlKind.IS_DISTINCT_FROM) && !expr.isA(SqlKind.IS_NOT_DISTINCT_FROM)) {
+        return false;


does it matter what the arguments of these comparisons are?
what if they are e.g., disjunctions?

When attempting to transform the combination of Filter(condition=not(marker)) - Join(type=mark) (preserving rows where the join condition results in FALSE) into Join(type=anti) (preserving rows where the join condition results in NULL and FALSE), it is essential to ensure that the marker doesn't contain NULL. Perhaps using !Strong.isStrong(expr) would be more comprehensive here?
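The NULL concern can be illustrated with a small sketch of SQL three-valued logic, where a Boolean of null stands for UNKNOWN. The class and method names are hypothetical; this only models which left rows survive in each plan shape, not how Calcite evaluates them.

```java
/** Hypothetical illustration; a null Boolean models SQL UNKNOWN. */
final class NotMarkerVersusAntiSketch {
  /**
   * Filter(condition=NOT(marker)) over a mark join: the row passes only
   * when NOT(marker) is TRUE, that is, when the marker is FALSE.
   */
  static boolean filterNotMarkerKeeps(Boolean marker) {
    return marker != null && !marker;
  }

  /**
   * ANTI join: the left row is kept unless some right row makes the
   * condition TRUE, so both FALSE and UNKNOWN markers keep the row.
   */
  static boolean antiJoinKeeps(Boolean marker) {
    return marker == null || !marker;
  }
}
```

The two predicates agree for TRUE and FALSE and differ only for a NULL marker, which is why the rewrite needs a guarantee that the marker cannot be NULL.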

mihaibudiu · 2025-11-07T01:29:55Z

core/src/main/java/org/apache/calcite/rel/rules/SubQueryRemoveRule.java

+      SqlQuantifyOperator op = (SqlQuantifyOperator) e.op;
+      externalOperator = RelOptUtil.op(op.comparisonKind, SqlStdOperatorTable.EQUALS);
+    } else {
+      externalOperator = SqlStdOperatorTable.EQUALS;


To be safe you should check the kind here too.
is this where the SINGLE join would be used?

I did not use SINGLE join. For the rewrite of scalar subqueries, I still reused the original logic to generate a left join and an aggregate with single_value function (if necessary).

core/src/main/java/org/apache/calcite/sql2rel/TopDownGeneralDecorrelator.java

silundong · 2025-11-07T09:54:16Z

Thank you for your review. Regarding the D relations, in the code it is the dedupFreeVarsNode, which is generated by calling the generateDedupFreeVarsNode after decorrelating the left input of the Correlate, and builder.distinct is used there to ensure it is duplicate free.
This is still a draft. Should this work continue? If so, I will continue to refine the code and comments.

mihaibudiu · 2025-11-07T17:27:02Z

builder.distinct is used there to ensure it is duplicate free.

builder.distinct ensures that it is distinct, but in general the functions that do the rewrite do not know that there is a distinct somewhere deep in the tree. My question is whether there should be some information that promises that the rel node passed to one of these functions is guaranteed to be distinct. Maybe the fact that the functions are private is enough, since all callers guarantee this.
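One possible shape for such a promise is a wrapper type whose only factory deduplicates, so that a method signature taking the wrapper documents the invariant. The sketch below works on plain Java lists rather than RelNodes, and DistinctRows is a hypothetical name, not a class from this PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;

/** Hypothetical wrapper: instances can only be created through of(), which deduplicates. */
final class DistinctRows<T> {
  private final List<T> rows;

  private DistinctRows(List<T> rows) {
    this.rows = rows;
  }

  /** Removes duplicates (keeping first occurrences) and freezes the result. */
  static <T> DistinctRows<T> of(List<T> input) {
    List<T> deduped = new ArrayList<>(new LinkedHashSet<>(input));
    return new DistinctRows<>(Collections.unmodifiableList(deduped));
  }

  /** The duplicate-free, read-only rows. */
  List<T> rows() {
    return rows;
  }
}
```

A rewrite function that accepts DistinctRows<T> instead of List<T> cannot be handed a relation that skipped deduplication; whether the extra type is worth it when all callers are private is the open question above.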

mihaibudiu · 2025-11-07T17:29:12Z

This is still a draft. Should this work continue?

I think this is great. If it holds the promise to solve many of the current decorrelator limits, we should adopt it.
Do you see any downsides?

As I commented elsewhere, one problem I foresee is that the UNNEST array operation cannot be represented without a Correlate node in general, so a new Rel node may need to be introduced to be able to handle such plans. But this can also be done later.

silundong · 2025-11-08T08:19:35Z

Perhaps wrapping the D (dedupFreeVarsNode) to make it immutable after creation would be better. Did I understand you correctly?

Do you see any downsides?

The current draft implements the general decorrelation approach. For some very simple cases, the paper also proposes a simpler decorrelation that would yield cleaner plans. It seems that matching some fixed patterns using rules would suffice. I think the simple decorrelation approach can be considered later.

mihaibudiu · 2025-11-19T16:40:45Z

@silundong this is great work, I hope you can advance it.
I think that covering all cases is not necessary in the first implementation, but having an extensible framework where new operators can be handled will enable more cases to be handled later.

silundong · 2025-11-20T01:27:25Z

Thank you! I'll continue to advance and address the comments. I should be able to complete the updates in the next few days.
This PR doesn't include the enumerable implementation for mark join yet, and I feel this PR is already quite large in scope. Should I complete it in this PR, or would it be better to create a new JIRA issue for it? I'm happy to work on it.

mihaibudiu · 2025-11-20T01:33:31Z

No, I think that splitting this into multiple PRs will make it manageable for the reviewers; even reviewing this first piece will be non-trivial. Ideally the infra changes, which add all the new classes, are separate, and then additional PRs can be added to handle various operators.

iwanttobepowerful · 2025-11-20T13:18:06Z

After some consideration, I believe it’s necessary to split this issue into multiple subtasks.
To address the problem of decorrelating boolean context IN or existential subqueries directly into SEMI/ANTI joins for CALCITE-3373, we can adopt an approach similar to that of Umbra (https://umbra-db.com/interface/), the system that implements the approach described in the related papers. The steps would be: first convert some IN or existential subqueries into Mark Joins, given Mark Join’s strong expressiveness; then optimize them into SEMI/ANTI joins via rules; and finally hand them over to RelDecorrelator for processing.
I think introducing Mark Join will be highly useful for both the existing RelDecorrelator and the proposed TopDownGeneralDecorrelator we intend to implement. Furthermore, this approach of introducing Mark Join is much more elegant than the handling method in #4211. Additionally, Mark Join will only serve as an internal representation within Calcite: it will either be simplified into SEMI/ANTI joins or, if simplification is not feasible, converted into Calcite’s existing literal_agg expression through transformation rules. This process will be completely transparent to Calcite users.
In my opinion, it is more cost-effective to first introduce Mark Join, and then consider whether to implement the new TopDownGeneralDecorrelator framework at a later stage.

iwanttobepowerful · 2025-11-20T13:26:31Z

WITH 
  dept(dname, deptno, loc) AS (
 VALUES 
   ('ACCOUNTING', 10, 'NEW YORK'),
   ('RESEARCH', 20, 'DALLAS'),
   ('SALES', 30, 'CHICAGO'),
   ('OPERATIONS', 40, 'BOSTON')
  ),
  emp(empno, ename, job, mgr, hiredate, sal, comm, deptno) AS (
 VALUES 
   (7369, 'SMITH', 'CLERK', 7902, '1980-12-17'::date, 800.00, NULL::numeric, 20),
   (7499, 'ALLEN', 'SALESMAN', 7698, '1981-02-20'::date, 1600.00, 300.00, 30),
   (7521, 'WARD', 'SALESMAN', 7698, '1981-02-22'::date, 1250.00, 500.00, 30),
   (7566, 'JONES', 'MANAGER', 7839, '1981-02-04'::date, 2975.00, NULL, 20),
   (7654, 'MARTIN', 'SALESMAN', 7698, '1981-09-28'::date, 1250.00, 1400.00, 30),
   (7698, 'BLAKE', 'MANAGER', 7839, '1981-01-05'::date, 2850.00, NULL, 30),
   (7782, 'CLARK', 'MANAGER', 7839, '1981-06-09'::date, 2450.00, NULL, 10),
   (7788, 'SCOTT', 'ANALYST', 7566, '1987-04-19'::date, 3000.00, NULL, 20),
   (7839, 'KING', 'PRESIDENT', NULL, '1981-11-17'::date, 5000.00, NULL, 10),
   (7844, 'TURNER', 'SALESMAN', 7698, '1981-09-08'::date, 1500.00, 0.00, 30),
   (7876, 'ADAMS', 'CLERK', 7788, '1987-05-23'::date, 1100.00, NULL, 20),
   (7900, 'JAMES', 'CLERK', 7698, '1981-12-03'::date, 950.00, NULL, 30),
   (7902, 'FORD', 'ANALYST', 7566, '1981-12-03'::date, 3000.00, NULL, 20),
   (7934, 'MILLER', 'CLERK', 7782, '1982-01-23'::date, 1300.00, NULL, 10)
  ),
  bonus(ENAME, JOB, SAL, COMM) AS (
 VALUES 
 ('ALLEN', 'SALESMAN', 1600.00, 300.00),
 ('WARD', 'SALESMAN', 1250.00, 500.00)
  )
select count(*) as c
from emp as e
where sal + 100 not in (
  select deptno
  from dept
  where dname = e.ename);

Enter the above SQL statement into the umbra-db interface, click Execute, and check the optimization steps one by one. I believe it will be an excellent reference for us.

iwanttobepowerful · 2025-11-20T13:41:24Z

core/src/main/java/org/apache/calcite/rel/rules/MarkToSemiOrAntiJoinRule.java

+        .withOperandSupplier(b1 ->
+            b1.operand(Project.class).oneInput(b2 ->
+                b2.operand(Filter.class).oneInput(b3 ->
+                    b3.operand(Join.class).predicate(join -> join.getJoinType() == JoinRelType.MARK)


We should also support the Correlate scenario.

After decorrelation, Correlate will convert to Join, and its condition will change. Therefore, it should not be applied before decorrelation, nor should it match Correlate.

As noted in my comment above, the Umbra system first simplifies filters and mark joins (correlated) into anti joins (correlated) during the expression simplification phase. I believe we can do the same.
As you mentioned, the join conditions will change after decorrelation—and this is certainly true. However, this is the responsibility of the decorrelation phase, which does not prevent us from supporting both joins and correlated operations here.
Please point out any misunderstandings on my part.

Let me clarify my understanding:
Firstly, the Umbra system's decorrelation should be completely based on the paper's algorithm. From the perspective of the paper's algorithm, when decorrelating from Correlate→Join, the only change to the condition is the addition of natural join conditions for the D attributes (IS NOT DISTINCT FROM), which ensures no NULL values are produced. Therefore, it is sufficient to evaluate the condition of the dependent join before decorrelation. This logic also applies to the TopDownGeneralDecorrelator in the current PR, as it is fully based on the paper's implementation. For TopDownGeneralDecorrelator, simplifying the MARK before decorrelation versus after decorrelation appears to make no difference. There is no necessity to match Correlate.

Secondly, if I understand correctly, you pointed out that this rule needs to match Correlate so that it can be adapted later in RelDecorrelator. To my knowledge, the logic of RelDecorrelator does not guarantee that the condition change from Correlate→Join is limited to adding IS NOT DISTINCT FROM. Therefore, pre-simplifying Mark is unsafe for RelDecorrelator (if RelDecorrelator supports MARK Correlate in the future).

So, for TopDownGeneralDecorrelator, there is no need to match Correlate; for RelDecorrelator, matching Correlate is unsafe.
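The argument relies on IS NOT DISTINCT FROM never evaluating to NULL, unlike plain equality. A minimal sketch of the two comparisons, with a null Boolean standing for UNKNOWN and hypothetical names:

```java
import java.util.Objects;

/** Hypothetical illustration of SQL "=" versus "IS NOT DISTINCT FROM". */
final class NullSafeEqualitySketch {
  /** SQL "a = b": UNKNOWN (null) when either operand is NULL. */
  static Boolean sqlEquals(Object a, Object b) {
    if (a == null || b == null) {
      return null;
    }
    return a.equals(b);
  }

  /** SQL "a IS NOT DISTINCT FROM b": always TRUE or FALSE; NULL matches NULL. */
  static boolean isNotDistinctFrom(Object a, Object b) {
    return Objects.equals(a, b);
  }
}
```

Because the second form is two-valued, AND-ing it onto an existing condition can turn a TRUE result into FALSE but cannot introduce a new NULL result.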

select empno from emp where empno not in (select empno from emp as emp_b where emp.ename = emp_b.ename);

// new
LogicalProject(EMPNO=[$0])
  LogicalFilter(condition=[NOT($9)])
    LogicalJoin(condition=[AND(=($0, $9), IS NOT DISTINCT FROM($1, $10))], joinType=[mark])
      LogicalTableScan(table=[[CATALOG, SALES, EMP]])
      LogicalProject(EMPNO=[$0], ENAME0=[$10])
        LogicalJoin(condition=[=($10, $1)], joinType=[inner])
          LogicalTableScan(table=[[CATALOG, SALES, EMP_B]])
          LogicalAggregate(group=[{0}])
            LogicalProject(ENAME=[$1])
              LogicalTableScan(table=[[CATALOG, SALES, EMP]])

The plan above is the one generated by your implementation; however, the more concise plan is the one below.
Could it be that this optimization was overlooked somewhere?

Project [empno#27]
+- Join LeftAnti, (((empno#27 = empno#57) OR isnull((empno#27 = empno#57))) AND (ename#28 = ename#58))
   :- Project [empno#27, ename#28]
   :  +- Relation spark_catalog.default.emp[...] parquet
   +- Project [empno#57, ename#58]
      +- Relation spark_catalog.default.emp[...] parquet

The latest commit implements D-elimination optimizations, which produce better plans for scenarios where subquery correlations occur under equality conditions. You can see the improvements in the test cases :)

I feel that your line of thinking might be heading in the wrong direction. In my humble understanding, the purpose of decorrelation is to use a general algorithm to remove the correlation, not to obtain the optimal plan (or a certain desired form) directly from the decorrelation process. Currently, Calcite also has corresponding rules to convert queries into semi/anti joins, which feels like it could be a follow-up optimization item.
I think for the first version, we should focus our energy on implementing the core algorithm into code. Perhaps you could file a JIRA ticket as an optimization item to be addressed after this PR. Otherwise, this PR might become too bloated and difficult to review.

I think there are some scenarios where simplifying a filter + mark join (correlated) into a semi/anti join (correlated) is straightforward. This optimization can be accomplished before entering the decorrelation phase, without the need to defer it to the decorrelation stage itself.
Perhaps there are differences in the scope of problems we intend to address by introducing mark joins.
My idea is that introducing mark joins can elegantly resolve CALCITE-3373.
Perhaps your primary consideration for introducing mark joins is to adopt the decorrelation framework proposed in the paper?
I think mark joins are independent of the decorrelation framework described in the paper.

Let's get the basic infra to do this merged, and then we can iterate.
We can mark this decorrelator as experimental until it can do everything the old one can, and after that switch.
The paper is modular: each operator is a different independent procedure. So I expect the implementation can be similar.

Code reviewers are more than welcome for this PR.

The design part of this discussion should be in Jira.

iwanttobepowerful · 2025-11-20T13:42:01Z

core/src/main/java/org/apache/calcite/rel/rules/MarkToSemiOrAntiJoinRule.java

+  @Override public void onMatch(RelOptRuleCall call) {
+    final Project project = call.rel(0);
+    final Filter filter = call.rel(1);
+    final Join join = call.rel(2);


We should also support the Correlate scenario.

mihaibudiu · 2025-11-21T06:47:48Z

I will do my best to review this, but I expect it won't be easy or very fast. Thank you.

silundong · 2025-11-21T06:52:43Z

This commit completes the comments, introduces CorrelatePlus to support the condition attribute, and implements optimizations for D elimination. The test cases look good. I think the PR is now ready for review. @mihaibudiu PTAL, Thank you!

iwanttobepowerful · 2025-11-21T17:38:24Z

core/src/test/java/org/apache/calcite/test/RelOptRulesTest.java

+
+    sql(sql)
+        .withRule(
+            CoreRules.PROJECT_SUBQUERY_REMOVE_ENABLE_MARK_JOIN,


Could we add more test cases for PROJECT_SUBQUERY_REMOVE_ENABLE_MARK_JOIN?
For example:

SELECT dept.deptno, EXISTS ( SELECT 1 FROM emp e WHERE e.deptno = dept.deptno ) AS has_employees FROM dept

or

SELECT emp.deptno, emp.deptno IN (SELECT deptno FROM dept where name IN('Sales', 'Engineering')) FROM emp

or

SELECT emp.deptno, emp.deptno IN (SELECT dept.deptno FROM dept where dept.deptno < emp.empno ) FROM emp

Sure, I will update in next commit.

silundong · 2025-11-24T06:13:56Z

The commit includes:

1. Optimize the HepProgram configuration before and after general decorrelation.
2. Fix a correlation detection bug in general decorrelation based on @iwanttobepowerful's finding in CALCITE-7303. Thanks for the good catch.
3. Add test cases.

xiedeyantu · 2025-11-24T06:48:25Z

core/src/main/java/org/apache/calcite/rel/rules/CoreRules.java

+  /** Rule that converts sub-queries from filter expressions into
+   * {@link Correlate} instances. It will rewrite SOME/EXISTS/IN to a MARK type Correlate. */
+  public static final SubQueryRemoveRule FILTER_SUBQUERY_REMOVE_ENABLE_MARK_JOIN =
+      SubQueryRemoveRule.Config.FILTER_ENABLE_MARK_JOIN.toRule();


It would be better to keep the rules related to SubQueryRemove together, and their names could be unified as well. Would "FILTER_SUB_QUERY_REMOVE_MARK_JOIN" be better? "ENABLE" seems a bit odd.

xiedeyantu · 2025-11-24T06:49:39Z

core/src/main/java/org/apache/calcite/rel/rules/MarkToSemiOrAntiJoinRule.java

+import static org.apache.calcite.plan.RelOptUtil.conjunctions;
+
+/**
+ * Rule to simplify a mark join to semi join or anti join.


Can there be a brief comment in the Javadoc?

xiedeyantu · 2025-11-24T07:28:09Z

core/src/test/java/org/apache/calcite/test/RelOptRulesTest.java

+
+    sql(sql)
+            .withRule(
+                    CoreRules.FILTER_SUBQUERY_REMOVE_ENABLE_MARK_JOIN,


Indent with 4 spaces.

core/src/test/java/org/apache/calcite/test/RelOptRulesTest.java

core/src/main/java/org/apache/calcite/sql2rel/TopDownGeneralDecorrelator.java

iwanttobepowerful · 2025-11-24T09:23:33Z

2. finding in CALCITE-7303.

I identified this bug because the implementation of RelDecorrelator.java fails to cover certain scenarios. Unexpectedly, it also exposed an issue in your implementation. The more I delve into it, the more I find that RelDecorrelator and TopDownGeneralDecorrelator share striking similarities in their core design philosophies.

mihaibudiu

The code looks great and is very modular.
I think mostly documentation could be improved.
We should mark this as experimental, and once it's available I will try it over our entire test suite to validate correctness.

mihaibudiu · 2025-11-24T22:22:31Z

core/src/main/java/org/apache/calcite/rel/core/CorrelatePlus.java

+ * @see CoreRules#FILTER_SUBQUERY_REMOVE_ENABLE_MARK_JOIN
+ * @see CoreRules#PROJECT_SUBQUERY_REMOVE_ENABLE_MARK_JOIN
+ */
+public abstract class CorrelatePlus extends Correlate {


why not call it ConditionalCorrelate?

core/src/main/java/org/apache/calcite/rel/core/JoinRelType.java

mihaibudiu · 2025-11-24T22:26:06Z

core/src/main/java/org/apache/calcite/rel/logical/LogicalConditionalCorrelate.java

+        correlationId, requiredColumns, joinType, condition);
+  }
+
+  @Override public Correlate copy(RelTraitSet traitSet,


this is an unfortunate problem with the design of the copy method; I had the same problem when adding ASOF JOIN.
An alternative would be for CorrelatePlus not to extend Correlate.
I think that this shows that the design of the copy API is wrong, and this method should not exist.

core/src/main/java/org/apache/calcite/rel/rules/MarkToSemiOrAntiJoinRule.java

core/src/main/java/org/apache/calcite/sql2rel/TopDownGeneralDecorrelator.java

mihaibudiu · 2025-11-25T22:51:20Z

core/src/main/java/org/apache/calcite/sql2rel/TopDownGeneralDecorrelator.java

+      }
+      builder.join(JoinRelType.LEFT, leftJoinConditions);
+
+      // rewrite COUNT to case when


this does not seem to rewrite count, but rather to project the result produced by count

mihaibudiu · 2025-11-25T22:55:05Z

core/src/main/java/org/apache/calcite/sql2rel/TopDownGeneralDecorrelator.java

+      // the Sort with ORDER BY and LIMIT or OFFSET have to be changed during rewriting because
+      // now the limit has to be enforced per value of the outer bindings instead of globally.
+      // It can be rewritten using ROW_NUMBER() window function and filtering on it,
+      // see section 4.4 in paper


You need to cite both papers, the original one does not contain the sort, this algorithm is only in the second paper
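The per-binding limit that the quoted comment describes can be sketched on plain Java collections: partition by the outer binding, order within each partition, and keep rows whose row number is at most the limit. This is only an illustration of the ROW_NUMBER idea under the assumption of a simple LIMIT without OFFSET; the class and method names are hypothetical and it is not the decorrelator's code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical illustration of enforcing LIMIT per outer binding. */
final class PerBindingLimitSketch {
  static <T, K> List<T> limitPerBinding(List<T> rows, Function<T, K> bindingOf,
      Comparator<T> order, int limit) {
    // PARTITION BY the outer binding value.
    Map<K, List<T>> partitions = new LinkedHashMap<>();
    for (T row : rows) {
      partitions.computeIfAbsent(bindingOf.apply(row), k -> new ArrayList<>()).add(row);
    }
    List<T> out = new ArrayList<>();
    for (List<T> partition : partitions.values()) {
      partition.sort(order); // ORDER BY inside the partition
      int rowNumber = 0;     // ROW_NUMBER() restarts for every partition
      for (T row : partition) {
        rowNumber++;
        if (rowNumber <= limit) { // filter on the row number
          out.add(row);
        }
      }
    }
    return out;
  }
}
```

A global limit of 2 would return two rows in total, whereas this version returns up to two rows for each distinct binding.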

mihaibudiu · 2025-11-25T22:59:59Z

core/src/main/java/org/apache/calcite/sql2rel/TopDownGeneralDecorrelator.java

+
+    if (!leftHasCorrelation && !join.getJoinType().generatesNullsOnRight()
+        && join.getJoinType().projectsRight()) {
+      // there is no need to push down domain D to left side when both are satisfied:


when both following conditions are satisfied

mihaibudiu · 2025-11-25T23:00:33Z

core/src/main/java/org/apache/calcite/sql2rel/TopDownGeneralDecorrelator.java

+      // there is no need to push down domain D to left side when both are satisfied:
+      // 1. there is no correlation on left side
+      // 2. join type will not generate NULL values on right side and will project right
+      // the left side will start a decorrelation independently


In this case, the left side...

silundong · 2025-11-26T06:54:11Z

Thank you for the detailed review! I will update in the next few days.

silundong · 2025-12-01T10:13:01Z

This commit mainly improves code style and comments. PTAL, thank you!

xiedeyantu

The previous comments I raised have all been addressed. Since I'm not very familiar with this module, we might want to wait for Mihai's final review.

mihaibudiu

Left one question about implementing the MARK_JOIN in the enumerable convention.
If you decide to do that, maybe we can also run some quidem tests using the new decorrelator?
The plans are very hard to evaluate by a person for correctness.

mihaibudiu · 2025-12-04T01:29:05Z

core/src/main/java/org/apache/calcite/rel/core/JoinRelType.java

+   *       LogicalTableScan(table=[[CATALOG, SALES, DEPT]])
+   * </pre></blockquote>
+   *
+   * <p> If the marker is used on only conjunctive predicates the optimizer will try to translate


do you mean "if the marker is used only in"?
So the output of MARK_JOIN has an extra nullable boolean field in addition to the fields from the left and the right inputs?

Do you want to have LEFT_MARK_JOIN and RIGHT_MARK_JOIN?
Why is a single MARK_JOIN sufficient?
Maybe you should call it LEFT_MARK_JOIN to be clear.

Yes, it should be "used only in".

The output of MARK JOIN is as described in the first sentence of the comment: a MARK JOIN keeps all rows from the left side and creates a new attribute to mark a tuple as having join partners from the right side or not. It does not output the fields from the right side.

I looked at the SEMI/ANTI above. I think they implicitly represent (LEFT) SEMI/ANTI, and so I defined MARK directly. Should I stay consistent with that, or explicitly name it LEFT MARK? In my opinion, RIGHT MARK/SEMI/ANTI are variants of LEFT MARK/SEMI/ANTI to support join commutativity. Should I define them in this ticket?
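As a rough model of that description, the sketch below computes a left mark join over in-memory lists: every left row appears exactly once, followed by a nullable marker, and no right-side fields are emitted. A null Boolean stands for UNKNOWN, and all names are hypothetical rather than taken from the PR.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

/** Hypothetical illustration of left mark join semantics. */
final class LeftMarkJoinSketch {
  /** Returns one (leftRow, marker) pair per left row, in input order. */
  static <L, R> List<Map.Entry<L, Boolean>> leftMarkJoin(List<L> left, List<R> right,
      BiFunction<L, R, Boolean> condition) {
    List<Map.Entry<L, Boolean>> out = new ArrayList<>();
    for (L l : left) {
      Boolean marker = Boolean.FALSE; // no join partner seen yet
      for (R r : right) {
        Boolean c = condition.apply(l, r);
        if (c == null) {
          marker = null; // UNKNOWN, unless a later right row matches
        } else if (c) {
          marker = Boolean.TRUE; // a definite join partner exists
          break;
        }
      }
      out.add(new AbstractMap.SimpleImmutableEntry<>(l, marker));
    }
    return out;
  }

  /** SQL-style equality: UNKNOWN (null) when either operand is NULL. */
  static Boolean sqlEquals(Integer a, Integer b) {
    if (a == null || b == null) {
      return null;
    }
    return a.equals(b);
  }
}
```

The marker is TRUE when some right row satisfies the condition, NULL when none does but at least one comparison is UNKNOWN, and FALSE otherwise (including an empty right side).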

If you expect to have a RIGHT_MARK, then I think you should call this one a LEFT_MARK, even if you are not going to add the RIGHT_MARK in this PR.

You are right, I will update it to LEFT MARK.

mihaibudiu · 2025-12-04T01:35:04Z

core/src/main/java/org/apache/calcite/adapter/enumerable/EnumerableJoinRule.java

-  @Override public RelNode convert(RelNode rel) {
+  @Override public @Nullable RelNode convert(RelNode rel) {
    Join join = (Join) rel;
+    if (!Bug.TODO_FIXED && join.getJoinType() == JoinRelType.MARK) {


Can we merge this PR without this implementation to validate actual computations?
From the description it does not look like having a native mark join in this convention is actually too much work.

silundong · 2025-12-04T10:27:34Z

Is it a good time to create a new ticket to implement LEFT MARK JOIN in the Enumerable convention? Should this PR wait for that implementation before being merged? Please let me know your preference. :)

asolimando · 2025-12-04T16:03:18Z

Is it a good time to create a new ticket to implement LEFT MARK JOIN in the Enumerable convention? Should this PR wait for that implementation before being merged? Please let me know your preference. :)

The PR is extremely helpful, if that's the only blocker (and IIRC the use of this new decorrelator is totally optional) I wouldn't block on that

mihaibudiu · 2025-12-04T22:43:57Z

Is it a good time to create a new ticket to implement LEFT MARK JOIN in the Enumerable convention? Should this PR wait for that implementation before being merged? Please let me know your preference. :)

As @asolimando says, this PR is already big enough, so you may as well add it in a separate PR, but then it means you can file the issue now about doing it.

mihaibudiu · 2025-12-04T22:44:56Z

I have already approved, so from my point of view feel free to merge after the conflicts are resolved and the commits are squashed. If anyone else plans to submit a review, please let us know to wait before merging.

sonarqubecloud · 2025-12-05T05:05:05Z

Quality Gate passed

Issues
11 New issues
0 Accepted issues

Measures
0 Security Hotspots
87.5% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

silundong · 2025-12-11T07:43:22Z

This PR looks ready to me. Would it be possible to merge it to advance CALCITE-7315? Thank you! @mihaibudiu @asolimando @xiedeyantu

mihaibudiu · 2025-12-11T07:48:42Z

Can you write some documentation someplace (could be JavaDoc or JIRA) about how one should proceed to replace the old decorrelator with this one?

Do you have a plan to migrate other tests to also use this decorrelator somehow?

silundong · 2025-12-11T09:35:21Z

Can you write some documentation someplace (could be JavaDoc or JIRA) about how one should proceed to replace the old decorrelator with this one?

Sure, I will add some JavaDoc in TopDownGeneralDecorrelator.java.

From my observation, the most commonly used tests are RelOptRulesTest and the Quidem tests. RelOptRulesTest can already switch the decorrelator via configuration. Quidem tests involve the entire SQL execution. After completing the LEFT MARK join under the Enumerable convention, maybe we can integrate the new decorrelator into the Quidem tests and make the switch configurable in the iq files.

mihaibudiu reviewed Nov 7, 2025

iwanttobepowerful reviewed Nov 20, 2025

silundong marked this pull request as ready for review November 21, 2025 06:41

silundong requested a review from mihaibudiu November 21, 2025 06:41

iwanttobepowerful reviewed Nov 21, 2025

xiedeyantu reviewed Nov 24, 2025

mihaibudiu approved these changes Nov 25, 2025

silundong requested review from mihaibudiu and xiedeyantu December 1, 2025 10:13

xiedeyantu reviewed Dec 1, 2025

mihaibudiu approved these changes Dec 4, 2025

silundong force-pushed the subquery_corelated branch 2 times, most recently from 9a7b1d7 to 05b697f Compare December 5, 2025 02:02

[CALCITE-7031] Implement the general decorrelation algorithm (Neumann…

f3c4204

… & Kemper)

silundong force-pushed the subquery_corelated branch from 05b697f to f3c4204 Compare December 5, 2025 04:41
