Discussion
Tim Bray
@timbray@cosocial.ca  ·  last month

Three small announcements:
1. RFC 9839, a guide to which Unicode characters you should never use: https://www.rfc-editor.org/rfc/rfc9839.html
2. Blog piece with background and context, “RFC 9839 and Bad Unicode”: https://www.tbray.org/ongoing/When/202x/2025/08/14/RFC9839
3. A little Go library that implements 9839’s exclusion subsets: https://github.com/timbray/RFC9839

#Unicode

d@nny disc@ mc²
@hipsterelectron@circumstances.run replied  ·  last month
@timbray would be curious as to the rationale for the choice of the "problematic" terminology, as that adjective is famously considered to be so vague as to constitute a sort of "red flag" when deployed in discussions of online propriety. the precise distinction of "never useful text" and "can lead to misbehavior" seems like a useful one, although i'd argue that private-use characters should be included precisely because they can sometimes be valid, so are more likely to show up.

were any alternatives considered for terminology to designate such invalid text characters? "non-assignable" would seem to be much more specific with respect to the "unicode assignables" subset defined in the rfc document.

mathew
@mathew@universeodon.com replied  ·  last month
@hipsterelectron @timbray Yeah, I'm really unclear on what makes the C0 codes (other than U+0000) "problematic".

I mean, application/json-seq uses ASCII record separators, and I think in general it would be good if *more* data formats used proper separator characters rather than comma, space, tab, tilde, and so on.
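
For readers unfamiliar with the format mentioned above: application/json-seq (RFC 7464) frames each JSON text with a leading ASCII Record Separator (U+001E) and a trailing line feed. The following is only a minimal sketch of that framing, not a conforming parser. The names `encodeSeq` and `splitSeq` are hypothetical, and the splitter does no JSON validation or recovery from truncated records.

```go
package main

import (
	"fmt"
	"strings"
)

// rs is the ASCII Record Separator (U+001E), one of the C0 controls under discussion.
const rs = "\x1e"

// encodeSeq (hypothetical helper) frames each JSON text as RS + text + LF,
// which is the framing RFC 7464 describes.
func encodeSeq(texts []string) string {
	var b strings.Builder
	for _, t := range texts {
		b.WriteString(rs)
		b.WriteString(t)
		b.WriteString("\n")
	}
	return b.String()
}

// splitSeq (hypothetical helper) splits on RS and trims surrounding whitespace.
// Assumption: inputs are well formed; no JSON validation is attempted.
func splitSeq(s string) []string {
	var out []string
	for _, part := range strings.Split(s, rs) {
		part = strings.TrimSpace(part)
		if part != "" {
			out = append(out, part)
		}
	}
	return out
}

func main() {
	seq := encodeSeq([]string{`{"a":1}`, `[2,3]`})
	fmt.Printf("%q\n", seq)   // shows the RS and LF framing bytes
	fmt.Println(splitSeq(seq)) // the recovered JSON texts
}
```

Because a JSON text may itself contain unescaped line feeds between tokens, the RS byte is what makes record boundaries unambiguous in this sketch.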

Tim Bray
@timbray@cosocial.ca replied  ·  last month
@mathew @hipsterelectron I look at https://www.unicode.org/charts/PDF/U0000.pdf and aside from \n, \r, and \t, there is a distinct smell of last century. If we were designing json-seq now, people would be asking why not just \n for a separator?
sayrer
@sayrer@mastodon.social replied  ·  last month
@hipsterelectron @timbray Nah, "problematic" is used correctly here: "constituting or presenting a problem or difficulty."

It does not mean improper or rude.

d@nny disc@ mc²
@hipsterelectron@circumstances.run replied  ·  last month
@sayrer @timbray agreed on private-use characters since i would not want generic software to exclude them
d@nny disc@ mc²
@hipsterelectron@circumstances.run replied  ·  last month
@sayrer @timbray the precedent of "noncharacter" from unicode would seem to motivate something closer to the "non-assignable" terminology. if the "problematic" classification is not intended to be referenced from other RFCs, i don't see a problem with it
sayrer
@sayrer@mastodon.social replied  ·  last month
@hipsterelectron @timbray Yeah, you're just supposed to say "we're using Unicode Assignables" https://www.rfc-editor.org/rfc/rfc9839.html#name-subsets
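
As a rough illustration of what "using Unicode Assignables" could mean in code, here is a minimal Go sketch of a membership check. It is not the API of the Go library linked in the original post; `isAssignable` is a hypothetical name, and the ranges reflect one reading of the RFC 9839 subset (tab, LF, and CR allowed; other C0 controls, DEL, C1 controls, surrogates, and noncharacters excluded), so they should be checked against the RFC before being relied on.

```go
package main

import "fmt"

// isAssignable is a hypothetical helper, not part of any published library.
// Assumption: it mirrors the "Unicode Assignables" subset as read from RFC 9839.
func isAssignable(r rune) bool {
	switch {
	case r == '\t' || r == '\n' || r == '\r':
		return true // the three C0 controls that stay allowed
	case r < 0x20:
		return false // remaining C0 controls (and negative values)
	case r >= 0x7F && r <= 0x9F:
		return false // DEL and the C1 controls
	case r >= 0xD800 && r <= 0xDFFF:
		return false // surrogates
	case r >= 0xFDD0 && r <= 0xFDEF:
		return false // noncharacter block
	case r&0xFFFE == 0xFFFE:
		return false // U+xFFFE and U+xFFFF at the end of every plane
	case r > 0x10FFFF:
		return false // beyond the Unicode code space
	}
	return true
}

func main() {
	for _, r := range []rune{'a', '\n', 0x1E, 0xE000, 0xFFFE} {
		fmt.Printf("U+%04X assignable: %v\n", r, isAssignable(r))
	}
}
```

Note that private-use code points such as U+E000 pass this check, which matches the point made earlier in the thread that generic software should not exclude them, while the record separator U+001E does not.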
bonfire.cafe

A space for Bonfire maintainers and contributors to communicate
