
@shlomi-noach
Contributor

Description

Several enhancements to the Online DDL cut-over logic:

  • With forced cut-over, we now kill queries and transactions holding locks on the migrated table not only before the cut-over, but also while the RENAME is being applied. Without this, there is a race condition: a long-running query could start just after queries are killed and right before the RENAME starts running.
    When killing queries and transactions, we skip the connection IDs of the cut-over's own queries.
  • Sanitized and reduced query and lock timeouts during the cut-over. Some timeouts were excessive, notably an overlooked 5*onlineDDL.CutOverThreshold*4 value, which evaluates to 5 minutes when the cut-over threshold is 15s.
  • Improved logging to always include the migration UUID.
  • When the --force-cut-over-after value is <= 1ms, we consider it "immediate", even if the measured time since ready-to-complete is somehow less than that.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

@shlomi-noach shlomi-noach requested a review from a team July 7, 2025 06:49
@shlomi-noach shlomi-noach added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: Online DDL Online DDL (vitess/native/gh-ost/pt-osc) labels Jul 7, 2025
@vitess-bot
Contributor

vitess-bot bot commented Jul 7, 2025

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes). New features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test; enhancements and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator.
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Jul 7, 2025
@github-actions github-actions bot added this to the v23.0.0 milestone Jul 7, 2025
@shlomi-noach shlomi-noach removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request NeedsBackportReason If backport labels have been applied to a PR, a justification is required labels Jul 7, 2025
Signed-off-by: Shlomi Noach <[email protected]>
@codecov

codecov bot commented Jul 7, 2025

Codecov Report

Attention: Patch coverage is 11.36364% with 39 lines in your changes missing coverage. Please review.

Project coverage is 67.49%. Comparing base (826d78d) to head (b1f59c5).
Report is 3 commits behind head on main.

Files with missing lines                  Patch %   Lines
go/vt/vttablet/onlineddl/executor.go      11.36%    39 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #18423      +/-   ##
==========================================
- Coverage   67.51%   67.49%   -0.02%     
==========================================
  Files        1607     1607              
  Lines      262706   262768      +62     
==========================================
- Hits       177370   177360      -10     
- Misses      85336    85408      +72     


Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
@shlomi-noach shlomi-noach requested a review from Copilot July 8, 2025 05:36
Contributor

Copilot AI left a comment


Pull Request Overview

This PR enhances the Online DDL cut-over by killing queries during the RENAME phase, tightening cut-over timeouts, improving logging with migration UUIDs, and treating very small force-cut-over thresholds as immediate.

  • Extend killTableLockHoldersAndAccessors to accept excluded connection IDs and include UUID in logs
  • Reduce excessive lock-wait timeouts and simplify force-cut-over logic for “immediate” thresholds
  • Update tests to cover new force-cut-over behavior and fix test names

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File                Description
executor_test.go    Renamed and added test cases for force-cut-over threshold logic
executor.go         Added UUID to kill logic, skipped excluded connections, refined timeouts, and adjusted context cancellation during rename
Comments suppressed due to low confidence (1)

go/vt/vttablet/onlineddl/executor_test.go:137

  • [nitpick] The test name is ambiguous: it refers to "microsecond" but the threshold is 1ms. Consider renaming to clarify that force-cut-over at or below 1ms is immediate.
			name:                     "microsecond, ready irrespective of sinceReadyToComplete",

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
@shlomi-noach shlomi-noach merged commit 2df46c7 into vitessio:main Jul 10, 2025
104 of 106 checks passed
@shlomi-noach shlomi-noach deleted the onlineddl-cutover-enhancements branch July 10, 2025 07:11
morgo added a commit to morgo/vitess that referenced this pull request Jul 21, 2025
* origin/master:
  bugfix: Fix impossible query for UNION (vitessio#18463)
  fix topo use in local_example (vitessio#18357)
  fix: update go-upgrade tool to check patch number (vitessio#18252) (vitessio#18402)
  Update MAINTAINERS.md and CODEOWNERS (vitessio#18462)
  Add logging to binlog watcher actions (vitessio#18264)
  `schemadiff`: `RelatedForeignKeyTables()` (vitessio#18195)
  `vtorc`: allow recoveries to be disabled from startup (vitessio#18005)
  Fix `vttablet` not being marked as not serving when MySQL stalls (vitessio#17883)
  make xtrabackup ShouldDrainForBackup configurable (vitessio#18431)
  Reset in-memory sequence info on vttablet on UpdateSequenceTables request (vitessio#18415)
  Fix watcher storm during topo outages (vitessio#18434)
  Online DDL: resume vreplication after cut-over/RENAME failure (vitessio#18428)
  Online DDL cutover enhancements (vitessio#18423)
  VStreamer: change in filter logic (vitessio#18319)
  Online DDL metrics: `OnlineDDLStaleMigrationMinutes` (vitessio#18417)

Signed-off-by: Morgan Tocker <[email protected]>

3 participants