The UTF-8 BOM

I was looking into a bug with Premus and discovered that it was caused by the requested page starting with a BOM (Byte Order Mark). I’d seen BOMs many times in UTF-16 documents, but I had never actually seen a UTF-8 BOM, which I now find quite amazing, since they’re completely valid and have been around for a long time.

Now, the problem with the UTF-8 BOM in particular is that Python doesn’t automatically strip it when decoding as plain UTF-8, so it ends up as a stray U+FEFF character at the start of the text. It looks like other languages have the same problem. Python 2.5 does add a specific encoding for UTF-8 with a BOM (utf-8-sig), but you have to ask for it explicitly, which seems to imply that you should know from the start whether you’ll be getting a BOM. Another problem is of course that only one of my machines is running 2.5 so far.
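
To make the problem concrete, here is a minimal sketch of stripping the BOM by hand. It is written for a current Python 3 rather than the versions mentioned above, and the helper name decode_utf8 and the sample page bytes are made up for illustration; this is not what Premus actually does.

```python
import codecs

def decode_utf8(data):
    """Decode UTF-8 bytes, dropping one leading BOM if present.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")

# A made-up page body as it might arrive, with and without a BOM.
with_bom = codecs.BOM_UTF8 + b"<html>"
without_bom = b"<html>"

# A plain UTF-8 decode keeps the BOM as a leading U+FEFF character.
plain = with_bom.decode("utf-8")

# The helper gives the same text either way.
clean_a = decode_utf8(with_bom)
clean_b = decode_utf8(without_bom)
```

As far as I can tell, decoding with "utf-8-sig" behaves the same way as this helper on input, accepting data with or without a BOM, so where that codec is available it may be the simpler choice.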

I solved my problem by keeping better track of what input encoding I get, with the help of Evan Jones’ very helpful notes on UTF-8 and Python, but this really seems to me like a problem that shouldn’t have been there in the first place.
