bodyfile: extend character escaping for special Unicode and non-Unicode characters #77

Open
4 tasks done
joachimmetz opened this issue Jul 4, 2023 · 6 comments

@joachimmetz
Member

joachimmetz commented Jul 4, 2023

Certain file systems allow file names to contain characters that have a special meaning in Unicode, such as U+d800 (a surrogate code point), and/or non-Unicode characters.

The extended bodyfile 3 format currently does not specify how to handle these characters. The proposal is to escape such characters as "\u####" and "\U########", preferring the short form over the long form where possible.
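A minimal sketch of what the proposed escaping could look like. The function name `escape_path`, the choice to escape everything that `str.isprintable()` rejects, and the lowercase hexadecimal digits are assumptions made for illustration; the issue does not define which characters must be escaped.

```python
def escape_path(path: str) -> str:
    """Escape non-printable code points (including surrogates) in a path.

    Hypothetical helper: uses "\\u####" when the code point fits in 16 bits
    and "\\U########" otherwise, as described in the proposal.
    """
    parts = []
    for character in path:
        code_point = ord(character)
        if character.isprintable():
            # Printable characters are kept as-is.
            parts.append(character)
        elif code_point <= 0xFFFF:
            # Short form for code points in the Basic Multilingual Plane.
            parts.append(f'\\u{code_point:04x}')
        else:
            # Long form for everything above U+FFFF.
            parts.append(f'\\U{code_point:08x}')
    return ''.join(parts)


print(escape_path('a\ud800b'))
print(escape_path('\U000f0000'))
```

Note that this sketch does not escape the backslash itself, so its output is not unambiguously reversible.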

Open questions

  • What about "Unicode compatibility characters"?
  • What about U+110000-U+ffffffff?
  • What if the original path uses a specific codepage (encoding) and is converted to Unicode, but the resulting Unicode string can be encoded back into multiple variations in the original encoding, e.g. encoding U+2252 to cp932? What if there are 2 paths that decode to the same string? How can the original path best be preserved?
  • What if a file name contains a path segment separator (e.g. \ or /)? If not escaped this leads to ambiguity, e.g. if / is the path segment separator, is 'test/1234' a single file name or a path?
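To illustrate the codepage question, the following sketch brute-forces which double-byte cp932 sequences decode to a given character using Python's built-in `cp932` codec. The helper name `find_cp932_encodings` and the byte ranges scanned are assumptions for illustration; the result depends on the mapping table Python ships.

```python
def find_cp932_encodings(character: str) -> list:
    """Return all double-byte cp932 sequences that decode to character."""
    matches = []
    for lead_byte in range(0x81, 0xFD):
        for trail_byte in range(0x40, 0xFD):
            sequence = bytes([lead_byte, trail_byte])
            try:
                decoded = sequence.decode('cp932')
            except UnicodeDecodeError:
                # Not a valid cp932 sequence, skip it.
                continue
            if decoded == character:
                matches.append(sequence)
    return matches


candidates = find_cp932_encodings('\u2252')
print([sequence.hex() for sequence in candidates])
# Encoding back to cp932 can only pick one of the candidates.
print('\u2252'.encode('cp932').hex())
```

If more than one sequence is reported, decoding to Unicode and encoding back cannot tell which byte sequence the original path used.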

A related discussion: dfxml-working-group/dfxml_schema#34

Also consider whether the format should be extended with a header to specify its encoding.

@joachimmetz joachimmetz self-assigned this Jul 4, 2023
@joachimmetz joachimmetz added the enhancement New feature or request label Jul 4, 2023
@joachimmetz
Member Author

Other values observed to be not printable

U+2028, U+2029, U+E000, U+F8FF, U+F0000, U+FFFFD, U+100000, U+10FFFD
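A quick way to inspect these values, sketched with Python's standard `unicodedata` module. The helper name `describe` is hypothetical; it reports the Unicode general category and whether `str.isprintable()` considers the character printable.

```python
import unicodedata

# Code points listed in the comment above.
OBSERVED_CODE_POINTS = [
    0x2028, 0x2029, 0xE000, 0xF8FF, 0xF0000, 0xFFFFD, 0x100000, 0x10FFFD]


def describe(code_point: int) -> tuple:
    """Return (label, general category, is printable) for a code point."""
    character = chr(code_point)
    return (
        f'U+{code_point:04X}',
        unicodedata.category(character),
        character.isprintable())


for code_point in OBSERVED_CODE_POINTS:
    print(describe(code_point))
```

U+2028 and U+2029 are the line and paragraph separators (categories Zl and Zp); the remaining values are private use code points (category Co).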

@joachimmetz
Member Author

joachimmetz commented Jul 6, 2023

> Proposal is to escape such characters as "\u####" and "\U########", preferring the short form over the long form where possible.

To prevent issues on case-insensitive file systems it might be better to only use the long form.

@joachimmetz
Member Author

joachimmetz commented Jul 7, 2023

This might be better to do at the dfVFS layer, since a lot of operations on a Python Unicode object with an unpaired surrogate character will raise an exception.
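A small sketch of the kind of exception meant here: encoding a string that contains an unpaired surrogate to UTF-8 fails unless an error handler such as `surrogatepass` is used. The helper name `can_encode_utf8` and the sample path are made up for illustration.

```python
def can_encode_utf8(path: str) -> bool:
    """Return True if the path can be encoded as strict UTF-8."""
    try:
        path.encode('utf-8')
    except UnicodeEncodeError:
        # Raised for unpaired surrogates such as U+D800.
        return False
    return True


path = 'file\ud800name'
print(can_encode_utf8(path))

# The 'surrogatepass' error handler allows the surrogate to round trip.
encoded = path.encode('utf-8', errors='surrogatepass')
print(encoded.decode('utf-8', errors='surrogatepass') == path)
```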

log2timeline/dfvfs#744

Though this approach has complicating factors as well, for example that \ should now also be escaped.
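A rough sketch of why the backslash needs escaping too: without it, a file name that literally contains the text \u0041 could not be told apart from an escaped "A". The function names, the regular expression and the long-form-only output are assumptions for illustration, not the dfVFS implementation.

```python
import re

# Matches an escaped backslash, a short form or a long form escape.
_ESCAPE_RE = re.compile(r'\\(\\|u[0-9a-fA-F]{4}|U[0-9a-fA-F]{8})')


def escape(path: str) -> str:
    """Escape backslashes and non-printable code points (long form only)."""
    parts = []
    for character in path:
        if character == '\\':
            # Double the backslash so it cannot start a fake escape sequence.
            parts.append('\\\\')
        elif character.isprintable():
            parts.append(character)
        else:
            parts.append(f'\\U{ord(character):08x}')
    return ''.join(parts)


def unescape(escaped: str) -> str:
    """Reverse escape(), also accepting the short "\\u####" form."""
    def replace(match):
        token = match.group(1)
        if token == '\\':
            return '\\'
        return chr(int(token[1:], 16))

    return _ESCAPE_RE.sub(replace, escaped)


for name in ('a\\u0041', 'x\ud800y', 'plain'):
    print(repr(escape(name)), unescape(escape(name)) == name)
```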

@joachimmetz
Member Author

Going with this approach for now: #78

@joachimmetz
Member Author

Looks like the restrictions in Python 3.3 (and later) mostly involve surrogates.

@joachimmetz
Member Author

Some additional thoughts: https://osdfir.blogspot.com/2023/07/whats-in-file-path.html
