@jwildeboer Which begs the question: why did they go with three bytes in the first place?
(No, I don't need an answer. MySQL has been breaking my brain for over two decades).
@jwildeboer Why "mb4" and not "mb6"? From RFC 3629 (or 'man utf-8'): «Another security issue occurs when encoding to UTF-8: the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore a risk of buffer overflow if the range of character numbers is not explicitly limited to U+10FFFF or if buffer sizing doesn't take into account the possibility of 5- and 6-byte sequences.»
@jwildeboer Interestingly, I just found out that RFC 3629 obsoletes RFC 2279, and one of the changes is the reduction of the character range. Weird.
@smuglispweenie The 6-byte encoding comes with a hefty price: 2 bytes wasted, as in "always zero". 4 bytes are more than sufficient, even to integrate coding systems from other planets and galaxies ;)
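For anyone following along, here is a rough sketch of the byte-length arithmetic being discussed. It assumes the sequence-length table from the older RFC 2279 definition (1 to 6 bytes); `utf8_length` and `MAX_CODE_POINT_BY_LENGTH` are made-up names for illustration, not anything from MySQL or a library.

```python
# Largest code point that fits in a sequence of each length.
# Lengths 1-4 are what RFC 3629 permits (and it additionally caps the
# range at U+10FFFF); lengths 5 and 6 only existed in the older
# RFC 2279 / ISO 10646 definition.
MAX_CODE_POINT_BY_LENGTH = {
    1: 0x7F,
    2: 0x7FF,
    3: 0xFFFF,       # the whole Basic Multilingual Plane
    4: 0x1FFFFF,     # bit pattern holds 21 bits; RFC 3629 stops at 0x10FFFF
    5: 0x3FFFFFF,
    6: 0x7FFFFFFF,
}


def utf8_length(code_point: int) -> int:
    """Hypothetical helper: bytes needed under the old 6-byte scheme."""
    if code_point < 0:
        raise ValueError("code points are non-negative")
    # Dicts keep insertion order, so this walks lengths 1, 2, ..., 6.
    for length, upper_bound in MAX_CODE_POINT_BY_LENGTH.items():
        if code_point <= upper_bound:
            return length
    raise ValueError("beyond U+7FFFFFFF, not encodable even in 6 bytes")


if __name__ == "__main__":
    for cp in (0x41, 0x20AC, 0x1F600, 0x10FFFF, 0x7FFFFFFF):
        print(f"U+{cp:04X} needs {utf8_length(cp)} byte(s)")
```

Three bytes stop at U+FFFF, so anything outside the Basic Multilingual Plane (emoji, for instance) needs the fourth byte, and U+10FFFF itself still fits in four.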
@smuglispweenie @jwildeboer I think that excerpt is from a time (2003) when ISO 10646 imagined that Unicode would use codepoints higher than U+10FFFF. There have been ~5 major releases of ISO 10646 since then, and while I haven’t found any online, I would venture that they have since limited UTF-8 to four bytes, in line with the RFC.
There are other limits on UTF-8 too, all mentioned in the RFC: encoding surrogate codepoints is illegal; overlong sequences to hide (say) \0 are illegal.
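A quick way to see those limits in practice, as a sketch: Python's built-in strict UTF-8 decoder rejects each of these cases. `is_valid_utf8` is just an illustrative wrapper name, and the byte strings are hand-built examples.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Illustrative wrapper around Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")  # error handling defaults to "strict"
    except UnicodeDecodeError:
        return False
    return True


# Hand-built byte sequences, one per rule mentioned above.
SAMPLES = {
    "U+1F600, ordinary 4-byte sequence": b"\xf0\x9f\x98\x80",
    "U+10FFFF, highest allowed code point": b"\xf4\x8f\xbf\xbf",
    "U+110000, just past the limit": b"\xf4\x90\x80\x80",
    "U+D800, a surrogate": b"\xed\xa0\x80",
    "overlong 2-byte encoding of \\0": b"\xc0\x80",
    "old-style 6-byte sequence": b"\xfc\x84\x80\x80\x80\x80",
}

if __name__ == "__main__":
    for label, raw in SAMPLES.items():
        verdict = "valid" if is_valid_utf8(raw) else "rejected"
        print(f"{label}: {verdict}")
```

Only the first two samples decode; the rest are refused, which matches the restrictions in the RFC.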