commented: I like this information, thank you. I'd suggest that each section could do with a real-world example.

commented: Does anyone know if there is a standard like Punycode, but for encoding arbitrary bytes in Unicode, where some of the bytes may be valid Unicode and some not? I'm thinking of pickled values in Python.

commented: Well, you have URL encoding, where "weird" bytes are encoded as a percent sign followed by two hex digits. That's probably the most widespread. For things which need to be valid symbol names (basically [a-zA-Z0-9_]), I like to use a variation that escapes with underscores: any byte outside of that range, plus the underscore itself, gets converted to an underscore followed by two hex digits. Always add some prefix (maybe just an underscore, maybe something to "namespace" symbols using your scheme) to ensure the output never starts with a digit. For higher efficiency on non-ASCII bytes, where human readability doesn't matter, you have base64.

commented: There isn't a standard for encoding arbitrary bytes in Unicode, but there are hacks that implement the idea in various ways. One I know of is base2048, used to pack programs into toots for the BBC Micro bot and its companion Owlet editor. base2048 has an informative rationale in its readme. It is designed around the weird Twitter/Mastodon per-character cost metrics. In other situations a different metric might make sense, e.g. the number of bytes in a character's UTF-8 or UTF-16 encoding.
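To make the underscore-escaping idea from the thread concrete, here is a minimal Python sketch. The names `escape_symbol`, `unescape_symbol`, and `PREFIX` are made up for illustration, and the choice of lowercase hex digits and a bare underscore as the prefix are assumptions, not part of any standard.

```python
import string

# Characters passed through unchanged. The underscore is deliberately
# excluded because it serves as the escape character.
SAFE = frozenset(string.ascii_letters + string.digits)
HEX = frozenset(string.hexdigits)

# Hypothetical prefix: guarantees the result never starts with a digit.
PREFIX = "_"


def escape_symbol(data: bytes, prefix: str = PREFIX) -> str:
    """Encode arbitrary bytes as a string matching [a-zA-Z0-9_]+."""
    out = [prefix]
    for b in data:
        ch = chr(b)
        # Unsafe bytes (including "_" itself) become "_" plus two hex digits.
        out.append(ch if ch in SAFE else f"_{b:02x}")
    return "".join(out)


def unescape_symbol(name: str, prefix: str = PREFIX) -> bytes:
    """Invert escape_symbol; raise ValueError on malformed input."""
    if not name.startswith(prefix):
        raise ValueError("missing prefix")
    body = name[len(prefix):]
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "_":
            pair = body[i + 1:i + 3]
            if len(pair) != 2 or not all(c in HEX for c in pair):
                raise ValueError("bad escape sequence")
            out.append(int(pair, 16))
            i += 3
        elif ch in SAFE:
            out.append(ord(ch))
            i += 1
        else:
            raise ValueError("unexpected character")
    return bytes(out)
```

For example, `escape_symbol(b"a_b c")` gives `"_a_5fb_20c"`, and `unescape_symbol` turns that back into the original bytes. Since every byte value round-trips, this also works on the output of `pickle.dumps`, at the cost of tripling the size of each escaped byte.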