Numbered SGML entities in header addresses
felixs
besteck455 at gmail.com
Fri Apr 12 09:25:20 UTC 2019
On Thu, Apr 11, 2019 at 07:12:57PM -0500, Derek Martin wrote:
> On Sun, Apr 07, 2019 at 11:13:53PM +0200, felixs wrote:
> > On Fri, Apr 05, 2019 at 11:24:26AM -0700, Ian Zimmerman wrote:
> > > I think this is the first time I got hit by the next stage of
> > > browserisation: on a mailing list, a From: line that looks like
> > >
> > > From: "Foo Bari&#236;" <foo-baric at gmail.com>
> >
> > > where the entity refers to the character U0107 in Unicode code point
>
> FWIW, the quoted entity is the latin-1 character 'ì', not the
> character 'ć'. The latter would be &#263;, not
> &#236;... Seems the last two digits were transposed somehow.
>
> > And if you add
> >
> > set charset="utf-8"
>
> You should simply never set charset. Ever. If you need to, it's
> either because your system is misconfigured (so fix that instead), or
> your multi-lingual text input configuration is sufficiently
> complicated that you already know plenty enough about it to ignore
> what I just said. =8^) Setting charset is vastly more likely to
> cause problems, because setting it almost guarantees that you don't
> know what you're doing, and you're Doing It Wrong™.
(...)
Thanks. I had already posted a follow-up on my first message.
> But clearly it won't help at all in this case. The problematic string
> isn't a binary representation of a unicode character. It's an HTML
> entity, and HTML entities in recipient headers are not supported by any
> of the RFCs, AFAIK (although new ones are added all the time, so it's
> hard to be sure)... So the fact that it's there is because some
> misguided web-based e-mail software thinks ignoring e-mail RFCs is
> cool (or more likely, just does not understand i18n).
Even so, call them HTML entities or call them something else, they are
made up of ASCII characters, and ASCII is a subset of utf-8. That is the
very reason why mutt displays them the way it does. Who said that they
are binary representations? I talked about a hexadecimal representation
being converted into an integer, to make use of chr() in my python
function example. Maybe I am not following you here...
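
To make the arithmetic concrete, here is a minimal sketch of the
hex-to-integer-to-chr() idea. It is not the function from my earlier
message; the name char_from_hex is made up for this illustration, and
html.unescape() from the standard library is used only to show what a
decoder would do with the entity text.

```python
import html


def char_from_hex(hex_digits):
    """Return the character for a code point written in hexadecimal.

    Hypothetical helper for illustration: "0107" -> chr(0x107).
    """
    return chr(int(hex_digits, 16))


# U+0107 is 0x107 in hexadecimal, which is 263 in decimal, not 236.
assert int("0107", 16) == 263
print(char_from_hex("0107"))   # LATIN SMALL LETTER C WITH ACUTE
print(chr(236))                # U+00EC, what the transposed digits give

# The entity text itself is plain ASCII until something decodes it.
entity = "&#236;"
print(all(ord(c) < 128 for c in entity))
print(html.unescape(entity))
```

So the entity string passes through mutt untouched as ASCII, and only an
explicit decoding step turns it into a single character.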
>
> At any rate, nothing will fix this short of Mutt providing explicit
> support for it, which IMO it should not do, or writing a script that
> can convert it, to be used as a display filter. This is bound to be
> more trouble than it's worth... I'm guessing the least obnoxious
> approach would be to find a script that converts plain text into
> minimally formatted HTML, and then view the resulting thing in w3m or
> some such. But such a thing would likely escape the HTML entities it
> found in the text, in some fashion, since it's assuming that it's
> plain text... Alternatively you'd have to parse the whole file
> looking for HTML entities, and then convert them to the appropriate
> character for the locale you're using. Blech.
Sure, a waste of time.
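
For the record, the "parse everything and convert" variant could look
roughly like the sketch below. It assumes that only numeric references
(decimal like &#263; or hexadecimal like &#x107;) should be decoded and
that named entities such as &amp; are left alone; the function name
decode_numeric_refs and the pattern are my own choices, not anything
mutt ships.

```python
import re
import sys

# Decimal (&#263;) or hexadecimal (&#x107;) numeric character references.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]{1,6})|([0-9]{1,7}));")


def decode_numeric_refs(text):
    """Replace numeric character references with the characters they name.

    References that do not map to a usable code point are left as they are.
    """
    def replace(match):
        hex_digits, dec_digits = match.groups()
        value = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Skip surrogates and values beyond the Unicode range.
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            return match.group(0)
        return chr(value)

    return NUMERIC_REF.sub(replace, text)


if __name__ == "__main__":
    # Act as a filter: read the message on stdin, write it to stdout.
    for line in sys.stdin:
        sys.stdout.write(decode_numeric_refs(line))
```

Saved to a file, this could be hooked up through mutt's display_filter
setting, e.g. set display_filter="python3 /path/to/decode_refs.py" (the
path and file name are placeholders). It would also rewrite any literal
entity text in message bodies, which is one more reason not to bother.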
(...)
Cheers,
felixs
More information about the Mutt-users
mailing list