- I wish I’d had the courage to live a life true to myself, not the life others expected of me.
- I wish I hadn’t worked so hard.
- I wish I’d had the courage to express my feelings.
- I wish I had stayed in touch with my friends.
- I wish that I had let myself be happier.
Also, read The Life Reports.
You are browsing my brows with your internet browser.
Why didn’t I think of this response for my web101 assignment?!
Thinking With Portals of the Day: Most creative use yet of Facebook’s new look. (And before you say anything, she’s aware of the arm thing and is going to fix it. So chell out.)
[reddit.]
Ow, that makes my head hurt!
Crazy Monster and his missing coffee.
Good is the enemy of Great
Latin-1 is the enemy of UTF-8
You write web apps. You understand the web is global, and want to support internationalization. You want UTF-8.
UTF-8 is extremely sane. Well, as sane as an encoding can be that features backwards-compatibility with ASCII.
Everything you care about supports UTF-8. Trust me: you want it everywhere.
Problem is, every last part of the web-application stack will fight you on your quest towards UTF-8 purity. What follows is a playbook to win your pervasive-UTF-8 battle.
First, you’re going to need diagnostic tools. There are two main weapons:
A hex editor and traffic dumper.
The programs you use to view text, be it dynamic from a tool’s output (Console.app) or a static file like a database dump (TextEdit, BBEdit, TextMate), have encoding logic. They will attempt to auto-detect encoding and paint you a pretty picture.
Avoid them. When debugging, you don’t want a pretty picture, you want The Truth. You need to be able to see raw byte-streams to debug this stuff.
A common problem is mixed encodings. That is, a file or stream that says it’s UTF-8 but has a chunk of Latin-1 in it. This is invisible corruption since most software won’t alert you when it hits mixed encodings (BBEdit is a notable exception).
Using a hex editor or viewing raw hex streams allows you to spot when a character that should be taking up three bytes (UTF-8) is only taking one (Latin-1).
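If no hex editor is at hand, a few lines of Python can stand in for one. This is only a sketch: the helper names (`hex_bytes`, `first_invalid_utf8_offset`) and the mixed-encoding sample are made up for illustration. It prints raw hex and reports the first byte that breaks UTF-8 decoding, which is often where a stray Latin-1 chunk begins.

```python
def hex_bytes(data: bytes) -> str:
    """Render raw bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def first_invalid_utf8_offset(data: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# The same character (n with tilde) takes a different number of bytes
# depending on the encoding.
utf8_hex = hex_bytes("\u00f1".encode("utf-8"))
latin1_hex = hex_bytes("\u00f1".encode("latin-1"))

# Hypothetical mixed stream: "cafe" with an acute accent encoded as UTF-8,
# followed by "naive" with a diaeresis encoded as Latin-1.
mixed = "caf\u00e9 ".encode("utf-8") + "na\u00efve".encode("latin-1")
bad_offset = first_invalid_utf8_offset(mixed)

print("UTF-8:  ", utf8_hex)
print("Latin-1:", latin1_hex)
print("Mixed:  ", hex_bytes(mixed))
print("First invalid UTF-8 byte at offset:", bad_offset)
```

A Latin-1 chunk made purely of ASCII characters is indistinguishable from UTF-8, so a clean result here does not prove the stream is unmixed.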
A Unicode Canary-in-a-Coal-Mine.
You need a chunk of data that exercises the Unicode system: a sentinel value that you can push through your stack and make sure it survives a round-trip intact.
Initially I went with something like “tésting”, but it turns out that’s not enough — it will losslessly survive undesired transcoding to Latin-1 and back again.
No, you need something hard-core: “Iñtërnâtiônàlizætiøn” (complete with curly quotes).
(If you can’t read that word in your browser, it looks like the word “Internationalization” that’s had an umlaut omelet thrown in its face, and you’ve discovered yet another encoding error somewhere between where I’m typing this and where you’re reading it.)
“Iñtërnâtiônàlizætiøn” is a great word to push through your systems because, with its curly quotes, it can’t be represented in Latin-1 and will catch all sorts of hidden failure scenarios. Coupled with viewing raw hex, there’s no place for encoding bugs to hide.
(For the record, “Iñtërnâtiônàlizætiøn” looks like E2 80 9C 49 C3 B1 74 C3 AB 72 6E C3 A2 74 69 C3 B4 6E C3 A0 6C 69 7A C3 A6 74 69 C3 B8 6E E2 80 9D in UTF-8 in hex.)
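Here is one way to check the canary programmatically, as a rough sketch. The names `WEAK`, `CANARY`, `EXPECTED_HEX` and `survives_latin1` are mine, not part of any library, and Python's `latin-1` codec is assumed to be a fair stand-in for whatever does the transcoding in your stack. The strings are normalized to NFC first so that a copy-pasted decomposed accent doesn't skew the byte count.

```python
import unicodedata

# Normalize to NFC so each accented letter is a single code point.
WEAK = unicodedata.normalize("NFC", "tésting")
CANARY = unicodedata.normalize("NFC", "“Iñtërnâtiônàlizætiøn”")

# The UTF-8 byte sequence quoted in the text above.
EXPECTED_HEX = (
    "E2 80 9C 49 C3 B1 74 C3 AB 72 6E C3 A2 74 69 C3 B4 6E C3 A0 "
    "6C 69 7A C3 A6 74 69 C3 B8 6E E2 80 9D"
)


def survives_latin1(text: str) -> bool:
    """True if text can be transcoded to Latin-1 and back without loss."""
    try:
        return text.encode("latin-1").decode("latin-1") == text
    except UnicodeEncodeError:
        # At least one character has no Latin-1 representation.
        return False


canary_hex = " ".join(f"{b:02X}" for b in CANARY.encode("utf-8"))

print("weak sentinel survives Latin-1:", survives_latin1(WEAK))
print("canary survives Latin-1:       ", survives_latin1(CANARY))
print("canary hex matches the text:   ", canary_hex == EXPECTED_HEX)
```

The weak sentinel passes through Latin-1 untouched, which is exactly why it makes a poor test value; the curly quotes are what force a failure.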
※ ※ ※
OK, those are your weapons. Now for some concrete tips, starting from the bottom-up:
MySQL DDL: MySQL uses Latin1 by default. You need to set default charset to utf8 and collate to utf8_unicode_ci:

    drop table if exists t_my_table;
    create table t_my_table (
        ...
    ) engine=innodb default charset=utf8 collate=utf8_unicode_ci;

The major gotcha here is that if you fail to specify default charset=utf8 in your DDL, it will default to Latin1 but simple storing and retrieval of UTF-8 will still work.

This is because there are no invalid characters in Latin-1 (well, except for NUL (0x00)). You can jam anything in there and MySQL will dutifully store it for you and give it back when asked.

No errors, no warnings.
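The "anything goes" property is easy to see outside MySQL. The sketch below uses Python's `latin-1` codec as an approximation of a Latin-1 column; that equivalence is an assumption (MySQL's latin1 is not byte-for-byte identical to ISO-8859-1), and the variable names are illustrative only.

```python
# Every possible byte value decodes under Latin-1, so nothing is ever rejected.
every_byte = bytes(range(256))
as_latin1 = every_byte.decode("latin-1")  # never raises

# UTF-8 bytes of the canary, taken from the hex dump quoted earlier.
canary_bytes = bytes.fromhex(
    "E2 80 9C 49 C3 B1 74 C3 AB 72 6E C3 A2 74 69 C3 B4 6E C3 A0 "
    "6C 69 7A C3 A6 74 69 C3 B8 6E E2 80 9D"
)

# What a Latin-1 column "believes" it holds: one character per byte.
stored = canary_bytes.decode("latin-1")
# What the client gets back: the identical bytes, so the round trip looks fine.
returned = stored.encode("latin-1")

print("round trip intact:   ", returned == canary_bytes)
print("characters intended: ", len(canary_bytes.decode("utf-8")))
print("characters in column:", len(stored))
```

The round trip is byte-identical, yet the column thinks it holds more characters than were typed. That mismatch is the kind of thing that later skews lengths, sorting and searching.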
MySQL Importing/Restoration: Consider the following file, myutf8.sql:

    drop table if exists myutf8_table;
    create table myutf8_table (
        demo varchar(255)
    ) engine=innodb default charset=utf8 collate=utf8_unicode_ci;
    insert into myutf8_table values ('“Iñtërnâtiônàlizætiøn”');

Now let’s load it up:

    mysql -e 'drop database if exists myutf8_db;create database myutf8_db;'
    mysql myutf8_db < myutf8.sql

You’ve already failed.

You see, myutf8.sql’s file encoding was UTF-8, but no one told mysql that. So mysql assumed Latin-1 and corrupted the data.

Looking at myutf8_table’s lonely single row in Querious, I see it has a value of “Iñtërnâtiônà lizætiøn‗, a far cry from the “Iñtërnâtiônàlizætiøn” value we intended.

Fortunately it’s easy to instruct mysql that an input file has a specific encoding:

    mysql -e 'drop database if exists myutf8_db;create database myutf8_db;'
    mysql --default-character-set=utf8 myutf8_db < myutf8.sql

That --default-character-set=utf8 makes all the difference. I recommend using it all the time: I’ve gotten to the point where I’m nervous if I spot an invocation of mysql that lacks an explicit --default-character-set=utf8.

MySQL Exporting/Backup: Use --default-character-set=utf8 like you do when importing:

    mysqldump --user=root --opt --default-character-set=utf8 myutf8_db

Relatedly, Mo McRoberts has a nice post on when MySQL encodings go bad.
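The import failure above can be imitated without a database. This is a rough model, not what mysql does internally: it assumes the client reads each UTF-8 byte as one Latin-1 character and the server then re-encodes those characters as UTF-8 for the utf8 column. Python's `latin-1` codec and the names below are stand-ins chosen for illustration, so the exact garbage characters may differ from what a real server shows.

```python
# UTF-8 bytes of the canary, as they sit in the hypothetical myutf8.sql file.
CANARY_UTF8 = bytes.fromhex(
    "E2 80 9C 49 C3 B1 74 C3 AB 72 6E C3 A2 74 69 C3 B4 6E C3 A0 "
    "6C 69 7A C3 A6 74 69 C3 B8 6E E2 80 9D"
)
canary = CANARY_UTF8.decode("utf-8")

# Step 1: the client is never told the file is UTF-8 and reads it as Latin-1,
# so every byte becomes its own character.
corrupted = CANARY_UTF8.decode("latin-1")

# Step 2: the utf8 column stores those characters, each re-encoded as UTF-8.
stored_bytes = corrupted.encode("utf-8")

# Repair: undo the steps in reverse order. This only works while the damage
# is a clean double encoding and nothing was truncated or replaced.
repaired = stored_bytes.decode("utf-8").encode("latin-1").decode("utf-8")

print("intended characters: ", len(canary))
print("corrupted characters:", len(corrupted))
print("bytes on disk:       ", len(stored_bytes))
print("repair succeeded:    ", repaired == canary)
```

Running the canary through your real import path and comparing the stored bytes against the known-good hex is still the more trustworthy check; this sketch only shows why the damage takes the shape it does.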
JDBC Connection URL: It’s been a while since I’ve used Java, but it looks like you want to set two options, useUnicode and characterEncoding:

    jdbc:mysql:///myutf8_db?useUnicode=true&characterEncoding=UTF-8

HTTP Headers: Your web server should vend a Content-Type of text/html;charset=utf-8.

HTML Documents: In theory your web server should be configured to declare all your HTML content as UTF-8 with its Content-Type HTTP header, but unfortunately that’s not always something you can control. You can also declare your UTF-8 conformance in the HTML document itself with a meta tag:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <html>
    <head>
        <meta http-equiv="content-type" content="text/html; charset=utf-8">
    </head>
    <body>
    </body>
    </html>

HTML Forms: Specify accept-charset in your <form> tag to tell the browser to submit user-entered data encoded in UTF-8:

    <form action="foo" accept-charset="UTF-8">...</form>

Ajax/XHR/XMLHTTPRequest: Don’t sweat it, the W3C XMLHTTPRequest standard specifies POST data will always be encoded with the UTF-8 charset.
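For the web-facing layers, two small checks can be scripted. The sketch below is illustrative only: `declares_utf8` and the `comment` field name are invented here, and it uses Python's standard library rather than any particular web framework. It verifies that a Content-Type header value declares UTF-8, and pushes the canary through a form-style percent-encoding round trip.

```python
from email.message import Message
from urllib.parse import parse_qs, urlencode


def declares_utf8(content_type: str) -> bool:
    """True if a Content-Type header value carries charset=utf-8."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lowercased charset, or None
    return charset is not None and charset.lower() == "utf-8"


# The canary, rebuilt from the hex dump quoted earlier in the article.
canary = bytes.fromhex(
    "E2 80 9C 49 C3 B1 74 C3 AB 72 6E C3 A2 74 69 C3 B4 6E C3 A0 "
    "6C 69 7A C3 A6 74 69 C3 B8 6E E2 80 9D"
).decode("utf-8")

# Roughly what a browser sends for a form with accept-charset="UTF-8":
# each UTF-8 byte of a non-ASCII character is percent-encoded.
body = urlencode({"comment": canary}, encoding="utf-8")

# Server side: decode the body as UTF-8 and fail loudly on anything else.
received = parse_qs(body, encoding="utf-8", errors="strict")["comment"][0]

print("header ok:        ", declares_utf8("text/html;charset=utf-8"))
print("form body:        ", body)
print("round trip intact:", received == canary)
```

Passing `errors="strict"` is a deliberate choice: a silent replacement character would hide exactly the kind of bug the canary is meant to expose.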