This will expand on a bit of data mining I did to help in evaluating the iconv
implementation from musl libc, in light of the challenges faced by an email server in the field. The trouble is that email was something of a ground zero of the text format wars, a weakly governed commons where the definition of "text" became the whole collection of random schemes hatched by anyone who got enough consumer-sheep bleating for support until the bad behavior became mandatory. "Was", I say, because the bright side is that it's been relatively settled for some time now; the sheep got the softforks they were after and now the most that seems to happen is new emoji getting added to unicode, a small manifestation of the larger cultural slide into the new illiteracy.
 _   ASCII ribbon campaign
( )  against HTML e-mail
 X
/ \
Never forget.
As I've run my own mail servers for a while, I had a decent-sized data set to work with, admittedly with the initial filter of me; or to be more precise, whatever was dumped in my inbox by marketers, ivy-covered professors and whatever other robots managed to find me, plus a smaller body of actual human communication. It breaks down into two eras: the first from an archive covering Sep 2008 to Nov 2018,(i) and the second from an active server covering Nov 2018 to present. The idea is to extract and tally the character sets/encodings,(ii) looking at message bodies and headers (To, From, Subject, etc.) separately, as the Great Internationalization was inflicted upon them in different ways and possibly at different times. The extraction method is approximate, using regular expressions rather than attempting to parse the formal message structure, such as it is.
Results
First, the "old" data set. Sample size in number of messages, as per find ~/Maildir -path '*/cur/*' | wc -l, is 68042.
Extraction:
$ find ~/Maildir -path '*/cur/*' -print0 | xargs -0 grep -ih '^content-type:' | grep -io 'charset=[^ >;/]*' > bodies1
$ find ~/Maildir -path '*/cur/*' -print0 | xargs -0 grep -ho '=?.*?.*?.*?=' > headers1
Filtering first for a Content-Type header cuts out a lot of charset declarations embedded within HTML, which I certainly hope won't be getting parsed at the IMAP server layer. It does however miss some cases where the header line was wrapped. The number of body charset declarations thus found:
$ wc -l bodies1
84838 bodies1
This can be greater than the total message count because it's given per part in multipart (MIME) messages.
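Both effects can be seen on toy inputs (the messages below are made up, but run through the same grep pipeline as above): a two-part MIME message yields two charset declarations, while a folded Content-Type header yields none, because the charset lands on the continuation line that the first grep drops.

```shell
# Hypothetical two-part MIME message: one message, two charset declarations.
printf '%s\n' \
    'Content-Type: multipart/alternative; boundary=sep' \
    '--sep' \
    'Content-Type: text/plain; charset=us-ascii' \
    '--sep' \
    'Content-Type: text/html; charset=utf-8' \
    '--sep--' \
    | grep -ih '^content-type:' | grep -io 'charset=[^ >;/]*'
# charset=us-ascii
# charset=utf-8

# Folded header: the continuation line doesn't start with "Content-Type:",
# so the charset goes uncounted.
printf 'Content-Type: text/plain;\n charset=utf-8\n' \
    | grep -ih '^content-type:' | grep -io 'charset=[^ >;/]*'
# (no output)
```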
Trimming, case folding and tallying:
$ cat bodies1 | tr -d '"' | tr A-Z a-z | sed 's/charset=//' | sort | uniq -c | sort -n
      1
      1 cp-850
      1 iso-8859-6
      1 latin1
      1 uft-8(iii)
      1 x-unknown(iv)
      2 euc-kr
      4 gbk
      4 iso-8859-2
      7 ansi_x3.4-1968
      7 unicode-1-1-utf-7
      7 windows-1250
     15 iso-2022-jp
     17 koi8-r
     18 big5
     22 iso-8859-7
     25 3dutf-8(v)
     26 windows-1256
     34 3d"
     39 gb2312
     39 windows-1251
     46 iso-8859-15
     47 ascii(vi)
     55 cp1252
   2048 windows-1252
  19802 us-ascii
  28586 iso-8859-1
  33982 utf-8(vii)
Number of charset-laden headers found (they snuck this in on top of the standards which clearly stated ASCII, using =? ?= bracketing because nobody would ever use those characters!!):
$ wc -l headers1
14961 headers1
Trimming, case folding and tallying:
$ cut -d? -f2 headers1 | tr A-Z a-z | sort | uniq -c | sort -n
      1 cp1252
      1 gbk
      3 windows-1256
      4 iso-8859-7
     13 windows-1251
     15 iso-2022-jp
     56 koi8-r
    128 iso-8859-1
    134 windows-1252
    613 us-ascii(viii)
   1077 gb2312
  12916 utf-8
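For reference, the cut -d? -f2 above relies on the field layout of these encoded-words: =?charset?encoding?payload?=, where the encoding is B (base64) or Q (a quoted-printable variant). A made-up example, pulled apart with the same cut:

```shell
# Hypothetical RFC 2047 encoded-word; the ?-delimited field 2 is the charset.
word='=?utf-8?B?SGVsbG8sIHdvcmxk?='
printf '%s\n' "$word" | cut -d? -f2              # utf-8
printf '%s\n' "$word" | cut -d? -f3              # B (the encoding)
printf '%s' "$word" | cut -d? -f4 | base64 -d; echo    # Hello, world
```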
Moving on to the "new" set, the sample is 40745 messages, 40676 body charset declarations and 30279 charset-laden headers. It would appear that the prevalence of deviant headers has greatly increased between the two time periods, though other explanations are possible such as an increased proportion of retained spam to human use in the new set.
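To put rough numbers on that: about 0.22 charset-laden headers per message in the old set versus 0.74 in the new, by simple division of the figures above (ignoring that one message can carry several such headers):

```shell
# Charset-laden headers per message, old vs. new set (figures from above).
awk 'BEGIN {
    printf "old: %.2f (14961/68042)\n", 14961/68042
    printf "new: %.2f (30279/40745)\n", 30279/40745
}'
# old: 0.22 (14961/68042)
# new: 0.74 (30279/40745)
```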
Bodies:
$ cat bodies2 | tr -d '"' | tr A-Z a-z | sed 's/charset=//' | sort | uniq -c | sort -n
      1 3dus-ascii(ix)
      1 3dut=46-8
      1 iso-8859-10
      1 unicode-1-1-utf-7
      1 utf-16le
      1 utf-8(x)
      1 utf8
      1 windows-1250
      2 euc-jp
      2 iso-8859-3
      2 iso-8859-7
      3 euc-kr
      3 iso-8859-5
      4 big5
      4 iso-8859-14
      9 ibm852
     17 ascii
     20 cp-850
     22 iso-2022-jp
     41 gbk
    154 gb2312
    547 windows-1251
    582 iso-8859-2
    631 iso-8859-15
   1189 iso-8859-1
   1349 windows-1252
   7080 us-ascii
  29007 utf-8
Headers:
$ cut -d? -f2 headers2 | tr A-Z a-z | sort | uniq -c | sort -n
      1 ¶¡c¼Ñ(xi)
      2 iso-8859-2
      3 shift_jis
      4 iso-8859-7
      7 gb18030
     12 gbk
     15 iso-8859-5
     16 iso-8859-15
     64 iso-2022-jp
    204 windows-1252
    280 gb2312
    366 windows-1251
   1217 iso-8859-1
  11294 us-ascii
  16794 utf-8
Charset support in musl iconv
For the other side of the comparison we'll need to see what out of this mess is supported by musl. As they describe it:
The iconv implementation musl is very small and oriented towards being unobtrusive to static link. Its character set/encoding coverage is very strong for its size, but not comprehensive like glibc’s.
and
Many legacy double-byte and multi-byte East Asian encodings are supported only as the source charset, not the destination charset. JIS-based ones are supported as the destination as of version 1.1.19.
I expect it's only the decoding side that matters here, i.e. converting from whatever source charset to unicode.
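One quick way to check decoding support without reading source is to probe iconv(1) directly: ask for a no-op conversion of empty input and look at the exit status. This sketch assumes a glibc-style iconv(1) on PATH (musl doesn't necessarily ship the command-line tool), so it's illustrative of the method rather than a test of musl itself:

```shell
# Probe which source charsets the local iconv accepts; names taken from
# the tallies above. An unknown charset makes iconv exit non-zero.
for cs in utf-8 iso-8859-1 windows-1252 gb2312 unicode-1-1-utf-7; do
    if iconv -f "$cs" -t UTF-8 </dev/null 2>/dev/null; then
        echo "$cs: accepted"
    else
        echo "$cs: rejected"
    fi
done
```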
However, I couldn't find the exact list of supported charsets anywhere, so I've extracted it from the source, which is the charmaps variable defined in src/locale/iconv.c and src/locale/codepages.h, as of the version in the current Gales tree. Here they all are, with aliases listed on the same line. I take it they normalize by first removing the optional hyphens.
utf8 char
wchart
ucs2be
ucs2le
utf16be
utf16le
ucs4be utf32be
ucs4le utf32le
ascii usascii iso646 iso646us
utf16
ucs4 utf32
ucs2
eucjp
shiftjis sjis
iso2022jp
gb18030
gbk
gb2312
big5 bigfive cp950 big5hkscs
euckr ksc5601 ksx1001 cp949
iso88591 latin1
iso88592
iso88593
iso88594
iso88595
iso88596
iso88597
iso88598
iso88599
iso885910
iso885911 tis620
iso885913
iso885914
iso885915 latin9
iso885916
cp1250 windows1250
cp1251 windows1251
cp1252 windows1252
cp1253 windows1253
cp1254 windows1254
cp1255 windows1255
cp1256 windows1256
cp1257 windows1257
cp1258 windows1258
koi8r
koi8u
cp437
cp850
cp866
ibm1047 cp1047
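My guess at the matching rule, as a rough shell rendition (this approximates my reading, not musl's actual comparison code): case-fold and strip everything non-alphanumeric before comparing against the list above.

```shell
# Approximate charset-name normalization: lower-case, drop punctuation.
normalize() { printf '%s' "$1" | tr 'A-Z' 'a-z' | tr -cd 'a-z0-9'; }

normalize 'ISO-8859-1';   echo    # iso88591
normalize 'US-ASCII';     echo    # usascii
normalize 'Windows_1252'; echo    # windows1252
```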
The utf7 business that prompted this in the first place shows up only in the very odd looking unicode-1-1-utf-7, which isn't in glibc iconv either. It shows up in "Delivery Status Notification (Failure)" messages (bounces), most from the Postfix MTA. The actual text of the relevant parts looks all ASCII to me.
The remaining plausible encodings not found in the list:
ibm852 - a Central European DOS codepage that didn't make it into the ISO-8859 list. Found only in a specific strain of spam (I am a hacker who has access to your account, send me Bitcoins!)
ANSI_X3.4-1968 - yet another name for ASCII, found in mail from a cron daemon and one corporate sender. This *is* recognized by glibc iconv so may be worth adding to musl as an alias.
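That glibc behavior is easy to verify where its iconv(1) is available (assumption: a glibc system; alias handling differs across implementations):

```shell
# glibc accepts ANSI_X3.4-1968 as an alias for ASCII; musl's list above
# has no such entry.
printf 'plain text\n' | iconv -f ANSI_X3.4-1968 -t UTF-8
```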
(i) I have older stuff too, but a bit too buried in different places and formats to bother digging up for this. [^]
(ii) Supposedly unicode is an abstract character set while utf8 and friends are the concrete byte-encodings thereof, but they all seem to end up labeled "charset". [^]
(iii) Lolz. [^]
(iv) How very helpful of them. [^]
(v) From "=3D", the quoted-printable encoding of "=" (ASCII 0x3D); perhaps this snuck in from some HTML. [^]
(vi) Using ASCII to tell me it's ASCII, gee thanks. Probably it's because something had to be declared in order to include a further transfer-encoding field. [^]
(vii) Ain't it nice that unicode is dominating and maybe eventually someday we can be rid of all those legacy encodings? Perhaps only if you don't look inside that box, because the froth never went away, just got hidden under a new name with somewhat different complexities to deal with. [^]
(viii) Again, they had to put something in order to base64 or quoted-printable encode it, perhaps to escape some character that would otherwise terminate the header. Which itself is pretty suspicious. [^]
(ix) This and the following are artifacts of the sloppy parsing; they're from a message that was quoted in full raw form in another message, so the client figured it better escape stuff. [^]
(x) This got separated from the main utf-8 pile due to a Ctrl-M (carriage return) character at the end. [^]
(xi) Definitely spam, perhaps h4xx0rz, with .doc payload; the binary garbage comes from a pseudo-header in a nominally plain text part of the message body. [^]
[...] for character set conversion). To be on the safe side, I looked into how iconv was used in dovecot, did some data analysis regarding what character encodings are found in practice in the wild, and checked which ones are [...]
Pingback by The Dovecot reports: how we came to forking a major email server « Fixpoint — 2023-04-07 @ 00:00