Description
In a `vitess` (vreplication based) migration, in the cut-over phase, we take some locks, set up buffering, bring the tables (original & shadow) in sync, stop vreplication, then finally attempt to swap the tables via a `RENAME` statement.
The cut-over may fail. This is expected to happen from time to time, and the migration should then resume running and try to cut-over again later.
When the cut-over fails, we undo the buffering, release locks, remove artifacts, etc. However, if the `RENAME` itself fails, vreplication remains Stopped. This means any new entries written to the original table are not propagated to the shadow table, so the migration lags further and further behind. It is then unlikely to ever attempt to cut-over again (unless there's just no more traffic), and will end up failing after 3 hours due to lack of vreplication liveness.
Solution: restart vreplication upon cut-over failure.