Hello, everyone!
Here is how I converted UTF-8 data that MySQL had stored as latin1 to a utf8-based charset and collation.
First, a conversion shell script:
1 DBFROM=dp
2 DBTO=dp
3 LOGIN=dp
4 PASS=XXX
5 mysqldump --extended-insert=FALSE --default-character-set=latin1 -u "$LOGIN" -p"$PASS" "$DBFROM" >dp.sql
6 sed -e 's/DEFAULT CHARSET=latin1;/DEFAULT CHARSET=utf8 COLLATE utf8_bin;/' dp.sql >dp2.sql
7 sed -e 's/SET NAMES latin1/SET NAMES utf8/' dp2.sql >dp3.sql
8 echo "drop database if exists $DBTO; create database $DBTO character set utf8 collate utf8_bin;" | mysql -u "$LOGIN" -p"$PASS"
9 mysql -u "$LOGIN" -p"$PASS" "$DBTO" <dp3.sql
Let us go over the lines.
Lines 1-4 configure the connection options.
Line 5 dumps the contents of the database. The --default-character-set=latin1 option prevents the data from being recoded into another character set during the dump. Since MySQL considers your UTF-8 data to be latin1, such recoding would corrupt it. We also set --extended-insert=FALSE so that every row gets its own INSERT statement, which makes errors easy to locate.
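To see why an unwanted recoding is harmful, here is a minimal sketch that does not touch MySQL at all. It uses iconv as a stand-in for the conversion (an assumption for illustration; MySQL's latin1 is really cp1252, which agrees with ISO-8859-1 for these two bytes). The variable names original and recoded are made up for this example.

```shell
# The Cyrillic letter "а" is two bytes in UTF-8: 0xD0 0xB0 (octal 320 260).
original=$(printf '\320\260' | wc -c | tr -d ' ')

# Pretend those bytes are latin1 and recode them to UTF-8,
# which is what happens when the dump is recoded.
recoded=$(printf '\320\260' | iconv -f ISO-8859-1 -t UTF-8 | wc -c | tr -d ' ')

echo "original bytes: $original, after recoding: $recoded"
```

Each of the two bytes is turned into a two-byte sequence of its own, so the letter is no longer valid text in either charset.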
Line 6 passes over the dump and changes the default charset declaration to UTF-8. It also sets the collation to utf8_bin. This is the trickiest point, and it took me a couple of hours to figure out. Your fake latin1 data uses the latin1_swedish_ci collation by default, where ci stands for case-insensitive. If you dump your data with the latin1 charset and latin1_swedish_ci collation and try to insert it with the UTF-8 charset and its default utf8_general_ci collation, you will likely get unique key constraint errors: MySQL will start to consider the Cyrillic А equal to а and the Greek Ρ equal to ρ, just as it earlier considered the Latin A equal to a.
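The substitution from line 6 can be tried on a single line before running it over the whole dump. The sample table footer below is a hypothetical one (real dumps may carry extra options such as AUTO_INCREMENT), so treat this as a sketch of what the sed expression does rather than a guarantee for every dump.

```shell
# A made-up table footer in the shape mysqldump typically writes.
sample=') ENGINE=InnoDB DEFAULT CHARSET=latin1;'

# Same expression as in line 6 of the script.
converted=$(printf '%s\n' "$sample" | sed -e 's/DEFAULT CHARSET=latin1;/DEFAULT CHARSET=utf8 COLLATE utf8_bin;/')

echo "$converted"
```

Note that the pattern only matches when DEFAULT CHARSET=latin1 is immediately followed by a semicolon, so a quick grep for any remaining latin1 in dp2.sql is a cheap sanity check.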