Post · bonfire.cafe

Three small announcements:
1. RFC 9839, a guide to which Unicode characters you should never use: https://www.rfc-editor.org/rfc/rfc9839.html
2. Blog piece with background and context, “RFC 9839 and Bad Unicode”: https://www.tbray.org/ongoing/When/202x/2025/08/14/RFC9839
3. A little Go library that implements 9839’s exclusion subsets: https://github.com/timbray/RFC9839

#Unicode

d@nny disc@ mc²

@hipsterelectron@circumstances.run replied · 2 months ago

@timbray would be curious as to the rationale for the choice of the "problematic" terminology, as that adjective is famously considered to be so vague as to constitute a sort of "red flag" when deployed in discussions of online propriety. the precise distinction of "never useful text" and "can lead to misbehavior" seems like a useful one, although i'd argue that private-use characters should be included precisely because they can sometimes be valid, so are more likely to show up.

were any alternatives considered for terminology to designate such invalid text characters? "non-assignable" would seem to be much more specific with respect to the "unicode assignables" subset defined in the rfc document.

mathew

@mathew@universeodon.com replied · 2 months ago

@hipsterelectron @timbray Yeah, I'm really unclear on what makes the C0 codes (other than U+0000) "problematic".

I mean, application/json-seq uses ASCII record separators, and I think in general it would be good if *more* data formats used proper separator characters rather than comma, space, tab, tilde, and so on.
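
For context, a minimal Go sketch of the json-seq framing mentioned above, assuming the RFC 7464 layout in which each JSON text is preceded by the record separator (U+001E) and followed by a line feed. The names `encodeSeq` and `splitSeq` are hypothetical and not taken from any library.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// rs is the ASCII record separator, U+001E, one of the C0 control codes.
const rs = 0x1E

// encodeSeq (hypothetical helper) frames each value as RS + JSON text + LF.
func encodeSeq(values ...any) ([]byte, error) {
	var buf bytes.Buffer
	for _, v := range values {
		b, err := json.Marshal(v)
		if err != nil {
			return nil, err
		}
		buf.WriteByte(rs)
		buf.Write(b)
		buf.WriteByte('\n')
	}
	return buf.Bytes(), nil
}

// splitSeq (hypothetical helper) splits a sequence on RS and drops empty
// chunks. It does not validate that each chunk is well-formed JSON.
func splitSeq(data []byte) [][]byte {
	var out [][]byte
	for _, chunk := range bytes.Split(data, []byte{rs}) {
		chunk = bytes.TrimSpace(chunk)
		if len(chunk) > 0 {
			out = append(out, chunk)
		}
	}
	return out
}

func main() {
	data, err := encodeSeq(map[string]int{"a": 1}, []int{1, 2})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", data)
	for _, rec := range splitSeq(data) {
		fmt.Println(string(rec))
	}
}
```

Because `json.Marshal` escapes control characters inside strings, a raw U+001E should only ever appear as the framing byte, which is what makes it workable as a separator.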

Tim Bray

@timbray@cosocial.ca replied · 2 months ago

@mathew @hipsterelectron I look at https://www.unicode.org/charts/PDF/U0000.pdf and aside from \n, \r, and \t, there is a distinct smell of last century. If we were designing json-seq now, people would be asking why not just \n for a separator?

sayrer

@sayrer@mastodon.social replied · 2 months ago

@hipsterelectron @timbray Nah, "problematic" is used correctly here: "constituting or presenting a problem or difficulty."

It does not mean improper or rude.

d@nny disc@ mc²

@hipsterelectron@circumstances.run replied · 2 months ago

@sayrer @timbray agreed on private-use characters since i would not want generic software to exclude them

d@nny disc@ mc²

@hipsterelectron@circumstances.run replied · 2 months ago

@sayrer @timbray the precedent of "noncharacter" from unicode would seem to motivate something closer to the "non-assignable" terminology. if the "problematic" classification is not intended to be referenced from other RFCs, i don't see a problem with it

sayrer

@sayrer@mastodon.social replied · 2 months ago

@hipsterelectron @timbray Yeah, you're just supposed to say "we're using Unicode Assignables" https://www.rfc-editor.org/rfc/rfc9839.html#name-subsets
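
For reference, a rough Go sketch of what a "Unicode Assignables" check could look like. The ranges are written from my reading of the subset definition in RFC 9839 and should be checked against the RFC itself. The names `isAssignable` and `stringIsAssignable` are hypothetical and are not the API of the timbray/RFC9839 library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeRange is an inclusive range of code points.
type runeRange struct{ lo, hi rune }

// assignableRanges lists the ranges assumed to make up the Unicode
// Assignables subset: tab, LF, CR, printable ASCII, then everything from
// U+00A0 up, minus surrogates and noncharacters.
func assignableRanges() []runeRange {
	r := []runeRange{
		{0x09, 0x0A},     // tab, line feed
		{0x0D, 0x0D},     // carriage return
		{0x20, 0x7E},     // printable ASCII (no DEL)
		{0xA0, 0xD7FF},   // skips C1 controls, stops before surrogates
		{0xE000, 0xFDCF}, // resumes after surrogates, includes private use
		{0xFDF0, 0xFFFD}, // skips noncharacters U+FDD0..U+FDEF
	}
	// Planes 1 through 16, each minus its last two code points.
	for plane := rune(1); plane <= 0x10; plane++ {
		base := plane << 16
		r = append(r, runeRange{base, base + 0xFFFD})
	}
	return r
}

var assignables = assignableRanges()

// isAssignable reports whether c falls in one of the ranges above.
func isAssignable(c rune) bool {
	for _, rg := range assignables {
		if c >= rg.lo && c <= rg.hi {
			return true
		}
	}
	return false
}

// stringIsAssignable checks every rune in s. Invalid UTF-8 is rejected up
// front, since ranging over it would yield U+FFFD, which is in the subset.
func stringIsAssignable(s string) bool {
	if !utf8.ValidString(s) {
		return false
	}
	for _, c := range s {
		if !isAssignable(c) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(stringIsAssignable("hello\tworld\n"))
	fmt.Println(stringIsAssignable("record\x1eseparator"))
	fmt.Println(isAssignable(0xE000)) // a private-use code point
}
```

Under these assumed ranges, private-use characters pass the check, while the C0 record separator used by json-seq does not.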
