This PR (work in progress) introduces zero dependency online schema changes with gh-ost
/pt-online-schema-change
.
UPDATE: this comment edited to reflect support for pt-online-schema-change
. Originally this PR only supported gh-ost
. Mostly whenever you see gh-ost
, consider pt-online-schema-change
to apply, as well.
TL;DR
User will issue:
alter with 'gh-ost' table example modify id bigint not null;
alter with 'pt-osc' table example modify id bigint not null
or
$ vtctl -topo_implementation etcd2 -topo_global_server_address localhost:2379 -topo_global_root /vitess/global \
ApplySchema -sql "alter with 'gh-ost' table example modify id bigint unsigned not null" commerce
$ vtctl -topo_implementation etcd2 -topo_global_server_address localhost:2379 -topo_global_root /vitess/global \
ApplySchema -sql "alter with 'pt-osc' table example modify id bigint unsigned not null" commerce
and vitess will schedule an online schema change operation to run on all relevant shards, then proceed to apply the change via gh-ost
on all shards.
While this PR is WIP, this flow works. More breakdown to follow, indicating what's been done and what's still missing.
The ALTER TABLE problem
First, to iterate the problem: schema changes have always been a problem with MySQL; a straight ALTER
is a blocking operation; a ONLINE ALTER
is only "online" on the master/primary, but is effectively blocking on replicas. Online schema change tools like pt-online-schema-change
and gh-ost
overcome these limitations by emulating an ALTER
on a "ghost" table, which is populated from the original table, then swapped in its space.
For disclosure, I authored gh-ost
's code as part of the database infrastructure team at GitHub.
Traditionally, online schema changes are considered to be "risky". Trigger based migrations add significant load onto the master server, and their cut-over phase is known to be a dangerous point. gh-ost
was created at GitHub to address these concerns, and successfully eliminated concerns for operational risks: with gh-ost
the load on the master is low, and well controlled, and the cut-over phase is known to cause no locking issues. gh-ost
comes with different risks: it applies data changes programmatically, thus the issue of data integrity is of utmost importance. Another note of concern is data traffic: going out from MySQL into gh-ost
and back into MySQL (as opposed to all-in MySQL in pt-online-schema-change
).
This way or the other, running an online schema change is typically a manual operation. A human being will schedule the migration, kick it running, monitor it, possibly cut-over. In a sharded environment, a developer's request to ALTER TABLE
explodes to n
different migrations, each needs to be scheduled, kicked, monitored & tracked.
Sharded environments are obviously common for vitess
users and so these users feel the pain more than others.
Schema migration cycle & steps
Schema management is a process that begins with the user designing a schema change, and ends with the schema being applied in production. This is a breakdown of schema management steps as I know them:
- Design code
- Publish changes (pull request)
- Review
- Formalize migration command (the specific
ALTER TABLE
or pt-online-schema-change
or gh-ost
command)
- Locate: where in production should this migration run?
- Schedule
- Execute
- Audit/monitor
- Cut-over/complete
- Cleanup
- Notify user
- Deploy & merge
What we propose to address
Vitess's architecture uniquely positions it to be able to automate away much of the process. Specifically:
- Formalize migration command: turning an
ALTER TABLE
statement into a gh-ost
invocation is super useful if done by vitess, since vitess can not only validate schema/params, but also can provide credentials, identify a throttle-control replica, can instruct gh-ost
on how to communicate progress via hooks, etc.
- Locate: given schema/table,
vitess
just knows where the table is located. It knows if the schema is sharded. It knows who the shards are, who the shards masters are. It knows where to run gh-ost
. Last, vitess
can tell us which replicas we can use for throttling.
- Schedule: vitess is again in a unique position to schedule migrations. The fact someone asks for a migration to run does not mean the migration should start right away. For example, a shard may already be running an earlier migration. Running two migrations at a time is less than ideal, and it's best to wait out the first migration before beginning the second. A scheduling mechanism is both useful to running the migrations in optimal order/sequence, as well as providing feedback to the user ("your migration is on hold because this and that", or "your migration is 2nd in queue to run")
- Execute:
vttablet
is the ideal entity to run a migration; can read instructions from topo
server and can write progress to topo
server. vitess
is aware of possible master failovers and can request a re-execute is a migration is so interrupted mid process.
- Audit/monitor:
vtctld
API can offer endpoints to track status of a migration (e.g. "in progress on -80
, in queue on 80-
"). It may offer progress pct and ETA.
- cut-over/complete: in my experience with
gh-ost
, the cut-over phase is safe to automate away.
- cleanup: the old table needs to be dropped;
vttablet
is in an excellent position to automate that away.
What this PR does, and what we expect to achieve
The guideline for this PR is: zero added dependencies; everything must be automatically and implicitly available via a normal vitess
installation.
A breakdown:
User facing
This PR enables the user to run an online schema migration (aka online DDL) via:
vtgate
: the user connects to vitess
with their standard MySQL client, and issues a ALTER WITH 'gh-ost' TABLE ...
statement. Notice this isn't a valid MySQL syntax -- it's a hint for vitess
that we want to run this migration online. vitess
still supports synchronous, "normal" ALTER TABLE
statements, which IMO should be discouraged.
vtctl
: the user runs vtctl ApplySchema -sql "alter with _gh-ost' table ..."
.
The response, in both cases, is a migration ID, or a job ID, if you will. Consider the following examples.
via vtgate
:
mysql> create table example(id int auto_increment primary key, name tinytext);
mysql> show create table example \G
CREATE TABLE `example` (
`id` int NOT NULL AUTO_INCREMENT,
`name` tinytext,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
mysql> alter with 'gh-ost' table example modify id bigint not null, add column status int, add key status_dx(status);
+--------------------------------------+
| uuid |
+--------------------------------------+
| 211febfa-da2d-11ea-b490-f875a4d24e90 |
+--------------------------------------+
-- <wait...>
mysql> show create table example \G
CREATE TABLE `example` (
`id` bigint NOT NULL,
`name` tinytext,
`status` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `status_dx` (`status`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
via vtctl
:
$ mysql -e "show create table example\G"
CREATE TABLE `example` (
`id` bigint NOT NULL,
`name` tinytext,
`status` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `status_dx` (`status`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
$ vtctl -topo_implementation etcd2 -topo_global_server_address localhost:2379 -topo_global_root /vitess/global \
ApplySchema -sql "alter with 'gh-ost' table example modify id bigint unsigned not null" commerce
8ec347e1-da2e-11ea-892d-f875a4d24e90
$ mysql -e "show create table example\G"
CREATE TABLE `example` (
`id` bigint unsigned NOT NULL,
`name` tinytext,
`status` int DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `status_dx` (`status`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
In both cases, a UUID is returned, which can be used for tracking (WIP) the progress of the migration across shards.
Parser
Vitess' parser now accepts ALTER WITH 'gh-ost' TABLE
and ALTER WITH 'pt-osc' TABLE
syntax. We're still to determine if this is the exact syntax we want to go with.
Topo
Whether submitted by vtgate
or vtctl
, we don't immediately run the migration. As mentioned before, we may wish to postpone the migration. Perhaps the relevant servers are already running a migration.
Instead, we write the migration request into global topo
, e.g.:
- key:
/vitess/global/schema-migration/requests/90c5afd4-da38-11ea-a3ff-f875a4d24e90
- content:
{"keyspace":"commerce","table":"example","sql":"alter table example modify id bigint not null","uuid":"90c5afd4-da38-11ea-a3ff-f875a4d24e90","online":true,"time_created":1596701930662801294,"status":"requested"}
Once we create the request in topo
, we immediately return the generated UUID/migration ID (90c5afd4-da38-11ea-a3ff-f875a4d24e90
in the above example) to the user.
vtctld
vtctld
gets a conceptual "upgrade" with this PR. It is no longer a reactive service. vtctld
now actively monitors new schema-migration/requests
in topo
.
~~When it sees such a request, it evaluates what are the relevant n
shards.~~
~~With current implementaiton, it writes n
"job" entries, one per shard. e.g.~~
/vitess/global/schema-migration/jobs/commerce/-80/ce45b84a-da2d-11ea-b490-f875a4d24e90
and
/vitess/global/schema-migration/jobs/commerce/80-/ce45b84a-da2d-11ea-b490-f875a4d24e90
for a keyspace with two shards; or just
/vitess/global/schema-migration/jobs/commerce/0/1dd17132-da23-11ea-a3d2-f875a4d24e90
for a keyspace with one shard.
DONE: WIP: we will investigate use of new VExec
to actually distribute the jobs to vttablet
.
what vtctld
does now, is, once it sees a migration request, it pushes a VExec request for that migration. If the VExec request succeeds, that means all shards have been notified, and vtctld
can stow away the migration request (work is complete as far as vtctld
is concerned). If VExec returns with an error, that means at least one shard did not get the request, and vtctld
will keep retrying pushing this request.
vttablet
This is where most of the action takes place.
vttablet
runs a migration service which continuously probes for, schedules, and executes migrations.
DONE: ~~With current implementation, tablets which have tablet_type=MASTER
continuously probe for new entries. We look to replace this with VExec
.~~
migration requests are pushed via VExec
; the request includes the INSERT IGNORE
query that persists the migration in _vt.schema_migrations
. The tablet no longer reads from, nor writes to, Global Topo.
A new table is introduced: _vt.schema_migrations
, which is how vttablet
manages and tracks its own migrations.
vttablet
will only run a single migration at a time.
vttablet
will see if there's an unhandled migration requests. It will queue it.
vttablet
will make a migration ready
if there's no running migration and no other migration is marked as ready
.
vttablet
will run a ready
migration. This is really the interesting part, with lots of goodies:
vttablet
will evaluate the gh-ost ...
command to run. It will obviously populate --alter=... --database=...
.
vttablet
creates a temp directory where it generates a script to run gh-ost
.
vttablet
creates a hooks path and auto-generates hook files. The hooks will interact with vttablet
vttablet
has an API endpoint by which the hooks can communicate gh-ost
's status (started/running/success/failure) with vttablet
.
vttablet
provides gh-ost
with --hooks-hint
which is the migration's UUID.
vttablet
automatically generates a gh-ost
user on the MySQL server, with a random password. The password is never persisted and does not appear on ps
. It is written to, and loaded from, an environment variable.
vttablet
grants the proper privileges
on the newly created account
vttablet
will destroy the account once migration completes.
vitess
repo includes a gh-ost
binary. We require gh-ost
from openark/gh-ost
as opposed to github/gh-ost
because we've had to make some special adjustments to gh-ost
s oas to support this flow. I do not have direct ownership to github/gh-ost
and cannot enforce those changes upstream, though I have made the contribution requestss upstream.
make build
automatically appends gh-ost
binary, compressed, to vttablet
binary, via Ricebox.
vttablet
, upon startup, auto extracts gh-ost
binary into /tmp/vt-gh-ost
. Please note that the user does not need to install gh-ost.
- WIP:
vttablet
to report back the job as complete/failed. We look to use VExec
. TBD.
Tracking breakdown
- [x] New
OnlineDDL
struct, defines a migration request and its status
- [x] Parser supports
ALTER WITH 'gh-ost' TABLE
and ALTER WITH 'pt-osc' TABLE
syntax
- [x] builder and analyzer to create an Online DDL plan (write to
topo
)
- [x]
vtctl
to skip "big changes" check when -online_schema_change
is given
- [x]
tablet_executor
to submit an online DDL request to topo
as opposed to running it on tablets
- [x]
vtctld
runs a daemon to monitor for, and review migration requests
- [x]
vtctld
evaluates which shards are affected
- [x]
_vt.schema_migrations
backend table to support migration automation (on each shard))
- [x]
vttablet
validates MySQL connection and variables
- [x]
vttablet
creates migration command
- [x]
vttablet
creates hooks
- [x]
vttablet
provides HTTP API for hooks to report their status back
- [x]
vttablet
creates gh-ost
user with random password
- [x]
vttablet
destroys gh-ost
user upon completion
- [x]
gh-ost
embedded in vttablet
binary and auto-extracted by vttablet
- [x]
vttablet
runs a dry-run
execution
- [x]
vttablet
runs a --execute
(actual) execution
- [x]
vttablet
supports a Cancel
request (not used yet) to abort migration
- [x]
vttablet
as a state machine to work throught the migration steps
- [x] counters for
gh-ost
migration requests, suceessful and failed migrations
- [x] use of
VExec
to apply migrations onto tablets
- [x] use of
VExec
to control migrations (abort, retry)
- [ ] consider flow for retries
- [ ] identify a reparent operation that runs during a migration, probabaly auto-restart the migration
- [ ]
vttablet
to heuristically check for available disk space
- [x] tracking, auditing of all migrations
- [x] getting
gh-ost
logs if necessary
- [x] what's the best way to suggest we want an online migration? Does current
ALTER WITH 'gh-ost' TABLE...
and ALTER WITH 'pt-osc' TABLE
syntax make sense? Other?
- [ ] For first iteration, migrations and Reshard operations should be mutually exclusive. Can't run both at the same time. Next iterations will remove this constraint.
- [x] ~~throttle by replica~~
- [ ] ~~wait for replica to catch up with new credentials before starting the migration~~
- [x] Use vttablet throttler
- [x]
pt-online-schema-change
bundled inside vttablet
binary
- [x] support
pt-online-schema-change
- [ ] ~~define foreign key flags for
pt-online-schema-change
execution~~ - user can define as runtime flags
- [x] clenaup online-ddl directory after success
- [ ] control throttling
- [x] control termination (panic abort)
- [x] control termination (panic abort) even after
vttablet
itself crashes
- [x]
pt-online-schema-change
passwords are in cleartext. Can we avoid that?
- [x]
vtctl ApplySchema
use same WITH 'gh-ost'
and WITH 'pt-osc'
query hints as in vtgate
.
- [x] support override of
gh-ost
and pt-online-schema-change
paths
- [x] cleanup
pt-osc
triggers after migration failure
- [x] forcibly remove
pt-osc
triggers on migration cancellation (overlaps with previous bullet, but has stronger guarantee)
- [x] cleanup
pt-osc
triggers from stale/zombie pt-osc
migration
- [x]
vtctl OnlineDDL
command for simple visibility and manipulation. See https://github.com/vitessio/vitess/pull/6547#issuecomment-681879259
- [x] end to end tests
- [x] populate
artifacts
column, suggesting which tables need to be cleaned up after migration
Quite likely more entries to be added.
Further reading, resources, acknowledgements
We're obviously using gh-ost. I use my own openark/gh-ost
since I have no ownership of the original https://github.com/github/gh-ost. gh-ost
was/is developed by GitHub 2016-2020.
pt-online-schema-change
is part of the popular Percona Toolkit
The schema migratoin scheduling and tracking work is based on my previous work at GitHub. The implementation in this PR is new and rewritten, but based on concepts that have matured on my work on skeefree
. Consider these resources:
Also:
- An early presentation on gh-ost
Initial incarnation of this PR: https://github.com/planetscale/vitess/pull/67; some useful comments on that PR.
Call for feedback
We're looking for community's feedback on the above suggestions/flow. Thank you for taking the time to read and respond!