@jwildeboer Which begs the question: why did they go with three bytes in the first place?

(No, I don't need an answer. MySQL has been breaking my brain for over two decades).

@clacke @jwildeboer The article goes into more depth. I just wish they had taken the plunge in a release transition to make a clean break and make utf8 actually utf8, instead of leaving the default not-entirely-correct.

@clacke @jwildeboer I mean, 5.0 was released in 2003, so it was too late for a major version release, but you'd think one of the point releases could make a breaking change.

That or make a 6.x release.

@craigmaloney @jwildeboer For many people UCS-2 was "Unicode" and the upper five bits were for sci-fi fans and archaeologists. Java, Windows, C "wide" encoding assumed this.

Four bytes seems so much and variable length seems so annoying.

@jwildeboer I remember breaking database access for some Windows machines, because I insisted on utf8mb4 😂
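
For context on why utf8mb4 matters here: MySQL's legacy utf8 (utf8mb3) stores at most three bytes per character, which only covers code points up to U+FFFF. A rough Python sketch of a check for whether a string would fit; the helper name `fits_in_utf8mb3` is made up for illustration and is not a MySQL API.

```python
def fits_in_utf8mb3(text: str) -> bool:
    """Return True if every character needs at most 3 bytes in UTF-8.

    Hypothetical helper: three-byte UTF-8 tops out at U+FFFF, so any
    code point above that needs a fourth byte (and hence utf8mb4).
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


print(fits_in_utf8mb3("naïve café"))  # True: all characters are below U+10000
print(fits_in_utf8mb3("😂"))          # False: U+1F602 needs four bytes
```

So the emoji in the post above is exactly the kind of character a three-byte column cannot hold.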

@jwildeboer Yes, for the last 12 years.
Even in open source, some products are good, some aren't.

@jwildeboer WTH...
Thanks for reminding me

*goes and checks DB

@jwildeboer Why "mb4" and not "mb6"? From RFC 3629 (or 'man utf-8'): «Another security issue occurs when encoding to UTF-8: the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore a risk of buffer overflow if the range of character numbers is not explicitly limited to U+10FFFF or if buffer sizing doesn't take into account the possibility of 5- and 6-byte sequences.»

@jwildeboer Interestingly, I just found out that RFC 3629 supersedes RFC 2279, and one of the changes is the reduction of the character range. Weird.

@smuglispweenie The 6-byte encoding comes with a hefty price: 2 bytes wasted, as in "always zero". 4 bytes are more than sufficient to even integrate coding systems from other planets and galaxies ;)

@jwildeboer @smuglispweenie It's still UTF-8, so it's variable length encoding which would be *up to* 6 bytes, but as noted, yeah, limiting Unicode to 21 bits limits UTF-8 max length to 4 bytes.
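
To make the "up to" part concrete, here is a small Python sketch of how the sequence length follows from the code point. The thresholds are the ranges from the old RFC 2279 table; the function name `utf8_length` and its `max_code_point` parameter are invented for this example.

```python
# Upper bound of each length class, 1 to 6 bytes (ranges as in RFC 2279).
UPPER_BOUNDS = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]


def utf8_length(code_point: int, max_code_point: int = 0x10FFFF) -> int:
    """Number of bytes needed to encode code_point.

    max_code_point defaults to the RFC 3629 limit; pass 0x7FFFFFFF to
    see the lengths the original 31-bit scheme would have produced.
    """
    if not 0 <= code_point <= max_code_point:
        raise ValueError(f"code point out of range: {code_point:#x}")
    for length, bound in enumerate(UPPER_BOUNDS, start=1):
        if code_point <= bound:
            return length
    raise ValueError(f"code point out of range: {code_point:#x}")


for cp in (0x41, 0xE9, 0x20AC, 0x1F602, 0x10FFFF):
    # Cross-check against Python's own encoder.
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s)")

print((0x10FFFF).bit_length())                      # 21
print(utf8_length(0x7FFFFFFF, max_code_point=0x7FFFFFFF))  # 6
```

With the default limit nothing ever needs more than four bytes; the five- and six-byte classes only show up when the limit is raised to the old 31-bit range.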

@smuglispweenie @jwildeboer I think that excerpt is from a time (2003) when ISO 10646 imagined that Unicode would use codepoints higher than U+10FFFF. There have been ~5 major releases of ISO 10646 since then and while I haven’t found any online I would venture that they have since limited UTF-8 to four bytes, in line with the RFC.

There are other limits on UTF-8 too, all mentioned in the RFC: encoding surrogate codepoints is illegal; overlong sequences to hide (say) \0 are illegal.
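
A quick way to see those limits in practice is to feed such byte sequences to a strict decoder. A minimal Python sketch, assuming Python 3's built-in `utf-8` codec in its default strict mode; `is_valid_utf8` is just a name chosen for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly with Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(b"\x00"))              # True: the plain one-byte NUL
print(is_valid_utf8(b"\xc0\x80"))          # False: overlong two-byte encoding of NUL
print(is_valid_utf8(b"\xed\xa0\x80"))      # False: would decode to surrogate U+D800
print(is_valid_utf8(b"\xf4\x90\x80\x80"))  # False: would decode to U+110000, past the limit
```

The encoding direction is guarded too: `"\ud800".encode("utf-8")` raises `UnicodeEncodeError` in Python 3.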
