@siddharth16396 siddharth16396 commented Jun 6, 2025

Description

This PR implements the feature request in #18335

TL;DR: While Vitess already supports launching schema changes shard by shard, there is currently no way to complete those migrations shard by shard. This PR addresses that gap.

This enhancement facilitates scenarios where teams aim to safely test the impact of schema changes, such as adding an index, on a specific shard before deploying it across the entire keyspace.

Example:
Consider a keyspace with multiple shards (e.g., X shards). The objective is to:

  • Introduce a new index to a table.
  • Evaluate the query performance improvements resulting from the index.
  • Analyze the resource impact (CPU, memory, IO) of the index on the shard.

To limit the blast radius, the schema change should be:

  • Applied and tested on a single shard.
  • Monitored for metrics and performance before a broader rollout.

This feature enables shard-level control and observability, allowing for targeted testing and validation.

The code changes in this PR add shard-specific completion to online DDL migrations:

  • Enhance CompleteMigration to support shard-specific arguments
  • Update test cases to validate shard-based online DDL migrations
  • Modify query execution in tabletserver to include shard details
  • Adjust SQL parser mappings to incorporate shard-specific changes
  • Refactor OnlineDDL executor logic for better shard-based migration handling

This update improves the granularity of migration completion, allowing
shard-specific operations for postponed migrations.
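As a rough illustration of what shard-specific completion looks like from a client's side, the sketch below assembles the statement form used in the testing notes of this PR (alter vitess_migration '<uuid>' complete vitess_shards '<shards>'). The helper name completeStatement is hypothetical, and joining several shards with commas is an assumption made by analogy with shard-by-shard launch; only the single-shard form appears in this PR's notes.

```go
package main

import (
	"fmt"
	"strings"
)

// completeStatement builds the completion statement shown in the testing notes.
// With no shards it yields the keyspace-wide form; with shards it appends the
// vitess_shards clause. Hypothetical helper: it does no quoting or validation,
// and the comma-separated multi-shard form is an assumption.
func completeStatement(uuid string, shards ...string) string {
	stmt := fmt.Sprintf("alter vitess_migration '%s' complete", uuid)
	if len(shards) > 0 {
		stmt += fmt.Sprintf(" vitess_shards '%s'", strings.Join(shards, ","))
	}
	return stmt
}

func main() {
	// UUID and shard name taken from the testing notes in this PR.
	fmt.Println(completeStatement("9d73442c_41d2_11f0_82af_02001707a39a", "aa-"))
}
```

The resulting string could then be passed to vtctldclient ApplySchema --sql, as in the testing notes.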

Vitess #general Slack discussion with Shlomi: https://vitess.slack.com/archives/C0PQY0PTK/p1737524865276269

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

Testing Notes

  1. Ran ApplySchema with --postpone-completion to add a sample table called "Persons":
[root@/home/udocker/db #]vtctldclient ApplySchema --ddl-strategy="vitess --postpone-completion" --sql "CREATE TABLE Persons ( PersonID int,LastName varchar(255),FirstName varchar(255),Address varchar(255),City varchar(255));" arif_test
9d73442c_41d2_11f0_82af_02001707a39a

  2. Ran OnlineDDL show on the keyspace:
[root@/home/udocker/db #]vtctldclient  OnlineDDL show arif_test 9d73442c_41d2_11f0_82af_02001707a39a --json
{
  "migrations": [
    {
      "uuid": "9d73442c_41d2_11f0_82af_02001707a39a",
      "keyspace": "arif_test",
      "shard": "-55",
      "schema": "arif_test",
      "table": "Persons",
      "migration_statement": "create table Persons (\n\tPersonID int,\n\tLastName varchar(255),\n\tFirstName varchar(255),\n\tAddress varchar(255),\n\tCity varchar(255)\n)",
      "strategy": "VITESS",
      "options": "--postpone-completion",
      "added_at": {
        "seconds": "1749103328",
        "nanoseconds": 0
      },
      "requested_at": {
        "seconds": "1749103329",
        "nanoseconds": 0
      },
      "ready_at": null,
      "started_at": null,
      "liveness_timestamp": null,
      "completed_at": null,
      "cleaned_up_at": null,
      "status": "QUEUED",
      "log_path": "",
      "artifacts": "",
      "retries": "0",
      "tablet": {
        "cell": "xxx",
        "uid": xxxx
      },
      "tablet_failure": false,
      "progress": 0,
      "migration_context": "vtctl:9c2b8bae-41d2-11f0-82af-02001707a39a",
      "ddl_action": "create",
      "message": "",
      "eta_seconds": "-1",
      "rows_copied": "0",
      "table_rows": "0",
      "added_unique_keys": 0,
      "removed_unique_keys": 0,
      "log_file": "",
      "artifact_retention": {
        "seconds": "86400",
        "nanos": 0
      },
      "postpone_completion": true,
      "removed_unique_key_names": "",
      "dropped_no_default_column_names": "",
      "expanded_column_names": "",
      "revertible_notes": "",
      "allow_concurrent": false,
      "reverted_uuid": "",
      "is_view": false,
      "ready_to_complete": true,
      "vitess_liveness_indicator": "0",
      "user_throttle_ratio": 0,
      "special_plan": "",
      "last_throttled_at": null,
      "component_throttled": "",
      "cancelled_at": null,
      "postpone_launch": false,
      "stage": "",
      "cutover_attempts": 0,
      "is_immediate_operation": true,
      "reviewed_at": {
        "seconds": "1749103330",
        "nanoseconds": 0
      },
      "ready_to_complete_at": {
        "seconds": "1749103330",
        "nanoseconds": 0
      },
      "removed_foreign_key_names": ""
    },
    {
      "uuid": "9d73442c_41d2_11f0_82af_02001707a39a",
      "keyspace": "arif_test",
      "shard": "55-aa",
      "schema": "arif_test",
      "table": "Persons",
      "migration_statement": "create table Persons (\n\tPersonID int,\n\tLastName varchar(255),\n\tFirstName varchar(255),\n\tAddress varchar(255),\n\tCity varchar(255)\n)",
      "strategy": "VITESS",
      "options": "--postpone-completion",
      "added_at": {
        "seconds": "1749103328",
        "nanoseconds": 0
      },
      "requested_at": {
        "seconds": "1749103329",
        "nanoseconds": 0
      },
      "ready_at": null,
      "started_at": null,
      "liveness_timestamp": null,
      "completed_at": null,
      "cleaned_up_at": null,
      "status": "QUEUED",
      "log_path": "",
      "artifacts": "",
      "retries": "0",
      "tablet": {
        "cell": "xxx",
        "uid": xxxx
      },
      "tablet_failure": false,
      "progress": 0,
      "migration_context": "vtctl:9c2b8bae-41d2-11f0-82af-02001707a39a",
      "ddl_action": "create",
      "message": "",
      "eta_seconds": "-1",
      "rows_copied": "0",
      "table_rows": "0",
      "added_unique_keys": 0,
      "removed_unique_keys": 0,
      "log_file": "",
      "artifact_retention": {
        "seconds": "86400",
        "nanos": 0
      },
      "postpone_completion": true,
      "removed_unique_key_names": "",
      "dropped_no_default_column_names": "",
      "expanded_column_names": "",
      "revertible_notes": "",
      "allow_concurrent": false,
      "reverted_uuid": "",
      "is_view": false,
      "ready_to_complete": true,
      "vitess_liveness_indicator": "0",
      "user_throttle_ratio": 0,
      "special_plan": "",
      "last_throttled_at": null,
      "component_throttled": "",
      "cancelled_at": null,
      "postpone_launch": false,
      "stage": "",
      "cutover_attempts": 0,
      "is_immediate_operation": true,
      "reviewed_at": {
        "seconds": "1749103330",
        "nanoseconds": 0
      },
      "ready_to_complete_at": {
        "seconds": "1749103330",
        "nanoseconds": 0
      },
      "removed_foreign_key_names": ""
    },
    {
      "uuid": "9d73442c_41d2_11f0_82af_02001707a39a",
      "keyspace": "arif_test",
      "shard": "aa-",
      "schema": "arif_test",
      "table": "Persons",
      "migration_statement": "create table Persons (\n\tPersonID int,\n\tLastName varchar(255),\n\tFirstName varchar(255),\n\tAddress varchar(255),\n\tCity varchar(255)\n)",
      "strategy": "VITESS",
      "options": "--postpone-completion",
      "added_at": {
        "seconds": "1749103328",
        "nanoseconds": 0
      },
      "requested_at": {
        "seconds": "1749103329",
        "nanoseconds": 0
      },
      "ready_at": null,
      "started_at": null,
      "liveness_timestamp": null,
      "completed_at": null,
      "cleaned_up_at": null,
      "status": "QUEUED",
      "log_path": "",
      "artifacts": "",
      "retries": "0",
      "tablet": {
        "cell": "xxx",
        "uid": xxx
      },
      "tablet_failure": false,
      "progress": 0,
      "migration_context": "vtctl:9c2b8bae-41d2-11f0-82af-02001707a39a",
      "ddl_action": "create",
      "message": "",
      "eta_seconds": "-1",
      "rows_copied": "0",
      "table_rows": "0",
      "added_unique_keys": 0,
      "removed_unique_keys": 0,
      "log_file": "",
      "artifact_retention": {
        "seconds": "86400",
        "nanos": 0
      },
      "postpone_completion": true,
      "removed_unique_key_names": "",
      "dropped_no_default_column_names": "",
      "expanded_column_names": "",
      "revertible_notes": "",
      "allow_concurrent": false,
      "reverted_uuid": "",
      "is_view": false,
      "ready_to_complete": true,
      "vitess_liveness_indicator": "0",
      "user_throttle_ratio": 0,
      "special_plan": "",
      "last_throttled_at": null,
      "component_throttled": "",
      "cancelled_at": null,
      "postpone_launch": false,
      "stage": "",
      "cutover_attempts": 0,
      "is_immediate_operation": true,
      "reviewed_at": {
        "seconds": "1749103330",
        "nanoseconds": 0
      },
      "ready_to_complete_at": {
        "seconds": "1749103330",
        "nanoseconds": 0
      },
      "removed_foreign_key_names": ""
    }
  ]
}
  3. Ran the command to complete the migration on a specific shard only:
[root@/home/udocker/db #]vtctldclient ApplySchema  --sql "alter vitess_migration '9d73442c_41d2_11f0_82af_02001707a39a' complete vitess_shards 'aa-'" arif_test
  4. Exec'd into a different shard and verified that table "Persons" does not exist:
mysql> use arif_test;
mysql> show tables;
+----------------------------------------------------------+
| Tables_in_arif_test                                      |
+----------------------------------------------------------+
| _vt_vrp_95399e6c412f11f0948f0200173485a5_20250604104017_ |
| employees                                                |
| testTable                                                |
+----------------------------------------------------------+
3 rows in set (0.00 sec)
  5. Exec'd into the shard we applied completion on and confirmed that table "Persons" exists:
mysql> show tables;
+----------------------------------------------------------+
| Tables_in_arif_test                                      |
+----------------------------------------------------------+
| Persons                                                  |
| _vt_vrp_95399e6c412f11f0948f0200173485a5_20250604104017_ |
| employees                                                |
| testTable                                                |
+----------------------------------------------------------+
4 rows in set (0.00 sec)
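The per-shard state in the OnlineDDL show --json output above can also be checked programmatically. The following is a minimal sketch, assuming the JSON field names shown in that output (shard, status, postpone_completion, ready_to_complete); the type and function names are hypothetical, and the sample payload in main is trimmed, made-up data rather than real output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// migration mirrors a small subset of the fields in the JSON output above.
type migration struct {
	UUID               string `json:"uuid"`
	Shard              string `json:"shard"`
	Status             string `json:"status"`
	PostponeCompletion bool   `json:"postpone_completion"`
	ReadyToComplete    bool   `json:"ready_to_complete"`
}

type showOutput struct {
	Migrations []migration `json:"migrations"`
}

// pendingShards returns the shards whose migration is still postponed
// and is marked ready to complete.
func pendingShards(data []byte) ([]string, error) {
	var out showOutput
	if err := json.Unmarshal(data, &out); err != nil {
		return nil, err
	}
	var shards []string
	for _, m := range out.Migrations {
		if m.PostponeCompletion && m.ReadyToComplete {
			shards = append(shards, m.Shard)
		}
	}
	return shards, nil
}

func main() {
	// Hypothetical, trimmed sample; real output carries many more fields.
	sample := []byte(`{"migrations":[
	  {"uuid":"u","shard":"-55","status":"QUEUED","postpone_completion":true,"ready_to_complete":true},
	  {"uuid":"u","shard":"aa-","status":"COMPLETE","postpone_completion":false,"ready_to_complete":false}
	]}`)
	shards, err := pendingShards(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(shards)
}
```

Note that the output pasted above has its tablet uid values redacted (xxxx), so it would need valid values before it could be parsed this way.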

vitess-bot bot commented Jun 6, 2025

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot added the NeedsBackportReason, NeedsDescriptionUpdate, NeedsIssue, and NeedsWebsiteDocsUpdate labels on Jun 6, 2025
@github-actions added this to the v23.0.0 milestone on Jun 6, 2025
@siddharth16396 force-pushed the complete_ddl_shard_by_shard branch from 90ed323 to d79d7ac on June 6, 2025 08:48
@siddharth16396 changed the title from "Feature(onlineddl): Add shard-specific completion to online alter" to "Feature(onlineddl): Add shard-specific completion to online ddl" on Jun 6, 2025
@deepthi added the "Component: Online DDL" and "Type: Feature" labels and removed the NeedsDescriptionUpdate, NeedsIssue, and NeedsBackportReason labels on Jun 9, 2025
codecov bot commented Jun 9, 2025

Codecov Report

Attention: Patch coverage is 0% with 7 lines in your changes missing coverage. Please review.

Project coverage is 67.50%. Comparing base (d429368) to head (c87e90f).
Report is 1 commits behind head on main.

Files with missing lines | Patch % | Lines
go/vt/vttablet/onlineddl/executor.go | 0.00% | 6 Missing ⚠️
go/vt/vttablet/tabletserver/query_executor.go | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18331      +/-   ##
==========================================
- Coverage   67.51%   67.50%   -0.02%     
==========================================
  Files        1607     1607              
  Lines      262684   262688       +4     
==========================================
- Hits       177343   177315      -28     
- Misses      85341    85373      +32     


@GrahamCampbell
Contributor

Doesn't this get much more complex when planning scatter queries that use select *?

@GrahamCampbell
Contributor

I guess if we limit the class of changes that are allowed, it could work out. Your use case of changing only indexes would fit into this.

siddharth16396 commented Jun 10, 2025

Thanks for the note, @GrahamCampbell

You're right that scatter queries using SELECT * can run into issues if the schema is not consistent across shards. That said, Vitess already allows schema changes to be launched only on specific shards today, using ApplySQLSchema with --postpone-launch and then selectively launching on those shards. So this PR doesn't introduce that inconsistency; it just extends support to completing migrations at the shard level, which currently isn't possible.

At our organisation, we have an internal mandate to do safe, staged schema changes. This means:

  • We always roll out schema changes shard by shard, starting with a small percentage (5%), then scaling to 25%, 50%, and finally 100%.
  • Our services are built to ignore unknown fields returned by selects, and schema changes are always deployed before service code changes. This way we maintain forward compatibility even if the schema is ahead of the app on some shards, and I believe everyone should be doing this (if they are not already).

In terms of correctness, I believe it's the responsibility of the operator or database admin to plan and execute shard-by-shard rollouts safely. Enforcing consistency or rollout strategy checks at the Vitess code level would add unnecessary complexity and may not fit all use cases.

This PR gives operators the control and flexibility, while assuming they will apply it responsibly.
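The staged rollout described above (5%, then 25%, 50%, and 100% of shards) can be sketched as a simple batching function. This is an illustration only, not part of the PR: the function name, the rounding rule (round up, so the first stage covers at least one shard), and the shard names in main are all assumptions.

```go
package main

import "fmt"

// stageBatches splits shards into the batch to complete at each stage.
// percents are cumulative targets (for example 5, 25, 50, 100) and are
// assumed to be ascending. Hypothetical helper for illustration.
func stageBatches(shards []string, percents []int) [][]string {
	batches := make([][]string, 0, len(percents))
	done := 0
	for _, p := range percents {
		// Round up so that a small percentage still selects one shard.
		target := (len(shards)*p + 99) / 100
		if target > len(shards) {
			target = len(shards)
		}
		if target < done {
			target = done
		}
		batches = append(batches, shards[done:target])
		done = target
	}
	return batches
}

func main() {
	// Made-up shard names for a four-shard keyspace.
	shards := []string{"-40", "40-80", "80-c0", "c0-"}
	for i, batch := range stageBatches(shards, []int{5, 25, 50, 100}) {
		fmt.Printf("stage %d: %v\n", i+1, batch)
	}
}
```

With only a handful of shards, some stages come out empty because the cumulative target has already been met; an operator would simply skip those.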

@siddharth16396 siddharth16396 force-pushed the complete_ddl_shard_by_shard branch from ec39edc to f1f1e11 Compare June 10, 2025 16:16
@shlomi-noach shlomi-noach left a comment

Looks good! This is the exact same way I would have implemented this. A very minor suggestion for the endtoend test.

@shlomi-noach shlomi-noach Jul 1, 2025

Let's add an artificial sleep here just to ensure we're not missing a would-be completion.

Suggested change
- // Migration should still be in running state
+ time.Sleep(2 * time.Second)
+ // Migration should still be in running state

Contributor

I would drop this file's changes. The correct place is in onlineddl_scheduler_test.go. I see the test is also failing, but in all honesty, I think we should just remove these changes rather than debug this.

Contributor Author

Actually, this test was initially failing because I couldn't figure out how to run the endtoend tests locally.
With the help of the GitHub CI pipelines, though, I was able to debug and fix the test.
[Screenshot 2025-07-01 at 1:38:55 PM]

The current CI failure is not caused by this test, but by some other tests which I haven't touched.
[Screenshot 2025-07-01 at 1:36:55 PM]

Contributor Author

With respect to the correct place for this test, I can move it to onlineddl_scheduler_test.go.

 Signed-off-by: Siddharth Singh <siddharth16396@@gmail.com>

Signed-off-by: siddharth16396 <[email protected]>
Signed-off-by: siddharth16396 <[email protected]>
@siddharth16396
Contributor Author

Regarding the website documentation, I'll create a PR soon.

@siddharth16396
Contributor Author

@shlomi-noach: I have incorporated your review comments.

Also, I think that by mistake (or because I pushed another commit) the PR now requires a re-stamp from you.

@shlomi-noach
Contributor

Thank you, looking again. Please do not force-push into reviewed PRs as this means I'll need to re-review the entire PR without understanding what changes were made since my last review 🙏

siddharth16396 commented Jul 1, 2025

@shlomi-noach: Sorry for the force-push; I messed a few things up in my local branch and had no other option.

The only change I made since the last review was c87e90f,

which incorporates this comment of yours:

Let's add an artificial sleep here just to ensure we're not missing a would-be completion

Nothing else has changed since the previous stamp.

@shlomi-noach
Contributor

I can see why the test is failing: the onlineddl_vrepl test has a long successive list of tests which are not isolated, and you've injected one that changes the expected state. We can make that test successful, but I still argue that we shouldn't run the test in onlineddl_vrepl. The tests you've added in onlineddl_scheduler suffice, and that is the correct test suite for this particular test.

Signed-off-by: Shlomi Noach <[email protected]>
@shlomi-noach
Contributor

I pushed a fix and onlineddl_vrepl now passes. I'd still like to remove this entire test.

siddharth16396 commented Jul 1, 2025

Thanks a lot for the fix. As you suggested, I will remove the test from onlineddl_vrepl.

087bdbb

Updated with new commit to remove that test entirely.

Really appreciate all the help 🙇

@rohit-nayak-ps removed the NeedsWebsiteDocsUpdate label on Jul 1, 2025
@rohit-nayak-ps merged commit 5aefdb0 into vitessio:main on Jul 1, 2025
100 of 111 checks passed
morgo added a commit to morgo/vitess that referenced this pull request Jul 7, 2025
…tests

* origin/master: (32 commits)
  test: Fix race condition in TestStreamRowsHeartbeat (vitessio#18414)
  VReplication: Improve permission check logic on external tablets on SwitchTraffic (vitessio#18348)
  Perform post copy actions in atomic copy (vitessio#18411)
  Update `operator.yaml` (vitessio#18364)
  Feature(onlineddl): Add shard-specific completion to online ddl (vitessio#18331)
  Set parsed comments in operator for subqueries (vitessio#18369)
  `vtorc`: move shard primary timestamp to time type (vitessio#18401)
  `vtorc`: rename `isClusterWideRecovery` -> `isShardWideRecovery` (vitessio#18351)
  `vtorc`: remove dupe keyspace/shard in replication analysis (vitessio#18395)
  Topo: Add NamedLock test for zk2 and consul and get them passing (vitessio#18407)
  Handle MySQL 9.x as New Flavor in getFlavor() (vitessio#18399)
  Add support for sending grpc server backend metrics via ORCA (vitessio#18282)
  asthelpergen: add design documentation (vitessio#18403)
  `vtorc`: add keyspace/shard labels to recoveries stats (vitessio#18304)
  `vtorc`: cleanup `database_instance` location fields (vitessio#18339)
  avoid derived tables for UNION when possible (vitessio#18393)
  [Bugfix] Broken Heartbeat system in Row Streamer (vitessio#18390)
  Update MAINTAINERS.md (vitessio#18394)
  move vmg to emeritus (vitessio#18388)
  Make sure to check if the server is closed in etcd2topo (vitessio#18352)
  ...

Labels

Component: Online DDL Online DDL (vitess/native/gh-ost/pt-osc) Type: Feature


5 participants