Online DDL: vreplication does not resume operation after RENAME failure #18427

@shlomi-noach

In a `vitess` (vreplication based) migration, and in the cut-over phase, we take some locks, set up buffering, bring the tables (original & shadow) in sync, stop vreplication, then finally attempt to swap the tables via a `RENAME` statement.

The cut-over may fail. This is expected to happen from time to time, and the migration should then resume running and attempt to cut-over again later.

When the cut-over fails, we undo the buffering, we release locks, remove artifacts, etc. However, if the RENAME itself fails, vreplication remains Stopped. This means any new entries written to the original table are not propagated to the shadow table, essentially making the migration lag. It is then unlikely to ever attempt to cut-over again (unless there's just no more traffic), and will end up failing after 3 hours due to lack of vreplication liveness.

Solution: restart vreplication upon cut-over failure.
