bodyfile: extend character escaping for special Unicode and non-Unicode characters #77
Comments
Other values observed to be non-printable: U+2028, U+2029, U+E000, U+F8FF, U+F0000, U+FFFFD, U+100000, U+10FFFD
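A quick way to re-check the list above is Python's `str.isprintable()`, which treats code points in the Unicode "Other" and "Separator" categories (except the ASCII space) as non-printable. This is only a sketch: whether `isprintable()` matches the criterion used for the observation above is an assumption, and the names `OBSERVED_CODE_POINTS` and `non_printable` are hypothetical.

```python
import unicodedata

# Code points listed in the comment above.
OBSERVED_CODE_POINTS = [
    0x2028, 0x2029, 0xE000, 0xF8FF, 0xF0000, 0xFFFFD, 0x100000, 0x10FFFD]


def non_printable(code_points):
  """Returns the code points that str.isprintable() reports as non-printable."""
  return [cp for cp in code_points if not chr(cp).isprintable()]


for code_point in non_printable(OBSERVED_CODE_POINTS):
  # Zl/Zp are line/paragraph separators, Co is private use.
  print(f'U+{code_point:04X}', unicodedata.category(chr(code_point)))
```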
To prevent issues on case-insensitive file systems, it might be better to use only the long form.
This might be better done at the dfVFS layer, since many operations on a Python Unicode object containing an unpaired surrogate character will raise an exception. This approach has complicating factors as well, though; for example, \ would then also need to be escaped.
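As an illustration of the kind of exception meant here: a Python 3 `str` can hold an unpaired surrogate, but encoding it with the default strict UTF-8 codec fails. The helper name `can_encode_utf8` is hypothetical and only serves this sketch.

```python
def can_encode_utf8(text):
  """Returns True if text can be encoded as strict UTF-8."""
  try:
    text.encode('utf-8')
  except UnicodeEncodeError:
    # Raised for unpaired surrogates such as U+D800.
    return False
  return True


print(can_encode_utf8('file'))
print(can_encode_utf8('file\ud800'))
```

The same error surfaces in operations that encode implicitly, such as writing the string to a text file opened with a strict UTF-8 encoding.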
Going with this approach for now #78
It looks like the restrictions in Python 3.3 (and later) mostly involve surrogates.
Some additional thoughts: https://osdfir.blogspot.com/2023/07/whats-in-file-path.html
Certain file systems allow characters that have a special meaning in Unicode, such as U+D800, and/or non-Unicode characters.
The extended bodyfile 3 format currently does not specify how to handle these characters. The proposal is to escape such characters as "\u####" and "\U########", preferring the short form over the long form where possible.
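A minimal sketch of what the proposed escaping could look like in Python, not the actual implementation. It makes several assumptions: `str.isprintable()` decides which characters need escaping, the hexadecimal digits are lower case, and the backslash itself is escaped as `\\` so the output stays unambiguous. The function name `escape_path_segment` and the `long_form_only` parameter are hypothetical; the latter reflects the suggestion in the comments to use only the long form.

```python
def escape_path_segment(segment, long_form_only=False):
  """Escapes backslashes and non-printable code points in a path segment.

  Assumption: str.isprintable() defines what needs escaping. Surrogates
  (category Cs) and private use characters (category Co) are non-printable.
  """
  parts = []
  for character in segment:
    code_point = ord(character)
    if character == '\\':
      # Escape the escape character itself.
      parts.append('\\\\')
    elif character.isprintable():
      parts.append(character)
    elif code_point <= 0xffff and not long_form_only:
      # Short form: \u followed by 4 hexadecimal digits.
      parts.append(f'\\u{code_point:04x}')
    else:
      # Long form: \U followed by 8 hexadecimal digits.
      parts.append(f'\\U{code_point:08x}')
  return ''.join(parts)


print(escape_path_segment('a\ud800b'))
print(escape_path_segment('\U000f0000'))
print(escape_path_segment('\u2028', long_form_only=True))
```

Other bodyfile-specific characters, such as the field delimiter, are outside the scope of this sketch.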
Open questions
A related discussion: dfxml-working-group/dfxml_schema#34
Should the format also be extended with a header that specifies its encoding?