So, you thought ahead when setting up your MySQL database and set your
character encoding to utf8
to make it easier to store international characters
and actually read them back out again.
But then someone sets their name to an emoji, and it isn't being read back
properly from the database. A little bit of digging reveals that utf8
on MySQL
is really just a subset of the full UTF-8 character set. What you
really wanted was utf8mb4. At this point, you have a few choices.
- Switch to PostgreSQL, MongoDB, or pretty much anything else
- Fix the charset and encoding in your MySQL database
There are a number of resources for switching the character set and collation for your MySQL databases, tables and columns. The best write-up is by Mathais, but there's also useful info from Alon Diamant if you happen to be running in AWS.
Please read everything you can before proceeding!!! This script attempts to
safely and automagically convert utf8
(and latin1
, if you are daring), but
it may not work with your dataset. Backup before proceeding, run it in a test
environment if you can.
Before proceeding:
- Backup. These scripts worked for me, but may cause you to lose all your data.
- Run it a few times in a test environment. Be sure this test environment is running the same version of MySQL; I've seen slightly different behaviors with different versions.
- Update MySQL configuration prior to migrating data, so that new
tables/colums can be correctly encoded with
utf8mb4
.
This app requires Node.js.
$ npm install -g https://github.com/building5/utf8mb4-converter.git
$ utf8mb4-converter [OPTIONS...]
- You made a backup already, right?
- Run
utf8mb4-converter
and inspect the DDL it will execute to see what it is going to do to your database. - If all looks good, you can either execute that generated script, or you can
run
utf8mb4-converter --make-it-so
to execute the DDL on the server.
If you have some data encoded as latin1
, and you are really, really sure
that it only has ASCII characters in it, then you can provide --force-latin1
to convert those databases/tables/columns to utf8mb4
. Any international
characters in those columns will probably be corrupted.
If you'd like to see more of what the script is doing, pass in --verbose
.
If there are some databases on your MySQL server you'd rather not convert, you
can pass them to --skip
to, well, skip them.
InnoDB has an index length limit of 767 bytes per column. For utf8mb3
, this
conveniently works out to 255 characters. But for utf8mb4
, this is only 191
characters. If you have any columns longer than that, you will either have to
limit the index to 191 characters of the column, or narrow the column to 191
characters.
ISC license. PRs welcome.