bodyfile: extend character escaping for special Unicode and non-Unicode characters #77

joachimmetz · 2023-07-04T04:04:09Z

Certain file systems allow characters that either have a special meaning in Unicode, such as U+d800, or are not Unicode characters at all.

The extended bodyfile 3 format currently does not specify how to handle these characters. Proposal is to escape such characters as "\u####" and "\U########", preferring the short form over the long form where possible.

Control characters U+1-U+8, U+B-U+C, U+E-U+1F, U+7F-U+84, U+86-U+9F (already covered)
Unicode surrogate characters U+d800-U+dfff - Changes to escape Unicode surrogate codes #78
Undefined Unicode characters - Changes to handle special and non-Unicode characters #77 #95
- U+FDD0-U+FDDF
- U+fffe-U+ffff
- U+1FFFE-U+1FFFF
- U+2FFFE-U+2FFFF
- U+3FFFE-U+3FFFF
- U+4FFFE-U+4FFFF
- U+5FFFE-U+5FFFF
- U+6FFFE-U+6FFFF
- U+7FFFE-U+7FFFF
- U+8FFFE-U+8FFFF
- U+9FFFE-U+9FFFF
- U+AFFFE-U+AFFFF
- U+BFFFE-U+BFFFF
- U+CFFFE-U+CFFFF
- U+DFFFE-U+DFFFF
- U+EFFFE-U+EFFFF
- U+FFFFE-U+FFFFF
- U+10FFFE-U+10FFFF
Other values observed to be not printable - Changes to handle special and non-Unicode characters #77 #95
- U+2028, U+2029, U+E000, U+F8FF, U+F0000, U+FFFFD, U+100000, U+10FFFD
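
The proposed escaping of the ranges listed above could be sketched in Python roughly as follows. This is an illustration only, not the dfimagetools implementation: the names `ESCAPE_RANGES`, `needs_escape` and `escape_string` are hypothetical, the lower-case hexadecimal digits are an assumption, and both the control characters (already covered) and the backslash itself are deliberately left out.

```python
# Inclusive code point ranges copied from the lists in this issue.
ESCAPE_RANGES = [
    (0xD800, 0xDFFF),      # Unicode surrogate characters
    (0xFDD0, 0xFDDF),      # undefined range as listed in this issue
    (0x2028, 0x2029),      # observed to be not printable
    (0xE000, 0xE000),
    (0xF8FF, 0xF8FF),
    (0xF0000, 0xF0000),
    (0xFFFFD, 0xFFFFD),
    (0x100000, 0x100000),
    (0x10FFFD, 0x10FFFD),
]
# U+FFFE-U+FFFF and the last two code points of planes 1 through 16.
ESCAPE_RANGES.extend(
    ((plane << 16) | 0xFFFE, (plane << 16) | 0xFFFF) for plane in range(0x11))


def needs_escape(code_point):
  """Determines if a code point falls in one of the listed ranges."""
  return any(first <= code_point <= last for first, last in ESCAPE_RANGES)


def escape_string(string):
  """Escapes listed characters, preferring the short form where possible."""
  result = []
  for character in string:
    code_point = ord(character)
    if not needs_escape(code_point):
      result.append(character)
    elif code_point <= 0xFFFF:
      # Short form: \u followed by 4 hexadecimal digits.
      result.append('\\u{0:04x}'.format(code_point))
    else:
      # Long form: \U followed by 8 hexadecimal digits.
      result.append('\\U{0:08x}'.format(code_point))
  return ''.join(result)
```

A string without any of the listed characters is returned unchanged, so the function can be applied to every path segment.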

Open questions

What about "Unicode compatibility characters"?
What about U+110000-U+ffffffff?
What if the original path uses a specific codepage (encoding) that is converted to Unicode, but the resulting string can be encoded back into multiple variations in the original encoding, e.g. encoding U+2252 to cp932? What if there are 2 paths that decode to the same string? How can the original path best be preserved?
What if a filename contains a path segment separator (e.g. \ or /)? If not escaped this leads to ambiguity, e.g. if / is the path segment separator, is 'test/1234' a single file name or a path?
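
The codepage question can be made concrete with a small Python sketch. It assumes Python's built-in `cp932` codec is representative of the codepage in question, and the helper name `find_cp932_encodings` is hypothetical. It searches all 2-byte sequences for those that decode to U+2252, which shows that decoding to Unicode can lose which byte sequence the original path used.

```python
def find_cp932_encodings(character):
  """Finds all 2-byte cp932 sequences that decode to a single character."""
  candidates = []
  for lead_byte in range(0x81, 0xFD):
    for trail_byte in range(0x40, 0xFD):
      byte_sequence = bytes([lead_byte, trail_byte])
      try:
        decoded = byte_sequence.decode('cp932')
      except UnicodeDecodeError:
        # Not a valid cp932 sequence.
        continue
      if decoded == character:
        candidates.append(byte_sequence)
  return candidates


candidates = find_cp932_encodings('\u2252')
# Encoding picks only one of the candidates, so the others cannot be
# recovered from the Unicode string alone.
preferred = '\u2252'.encode('cp932')
```

If `candidates` contains more than one entry, two distinct on-disk names can map to the same bodyfile path unless the original bytes are preserved in some other way.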

A related discussion dfxml-working-group/dfxml_schema#34

Also consider whether the format should be extended with a header that specifies its encoding.

joachimmetz · 2023-07-06T05:19:35Z

Other values observed to be not printable

U+2028, U+2029, U+E000, U+F8FF, U+F0000, U+FFFFD, U+100000, U+10FFFD

joachimmetz · 2023-07-06T06:15:44Z

Proposal is to escape such characters as "\u####" and "\U########", preferring the short form over the long form where possible.

To prevent issues on case-insensitive file systems it might be better to only use the long form.

joachimmetz · 2023-07-07T16:10:31Z

This might be better to do at the dfVFS layer, since many operations on a Python Unicode object with an unpaired surrogate character will raise an exception.
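
A short Python 3 illustration of that behavior, using only standard codec error handlers; the file name is made up for the example:

```python
# A string that contains an unpaired (lone) surrogate character.
name = 'file_\ud800.txt'

# Strict UTF-8 encoding refuses the unpaired surrogate.
try:
  name.encode('utf-8')
  strict_error = None
except UnicodeEncodeError as exception:
  strict_error = exception

# The surrogatepass error handler encodes the surrogate as-is and allows
# the string to be round-tripped.
passed = name.encode('utf-8', errors='surrogatepass')
restored = passed.decode('utf-8', errors='surrogatepass')

# The backslashreplace error handler produces an escaped form instead.
escaped = name.encode('utf-8', errors='backslashreplace')
```

Here `strict_error` holds a `UnicodeEncodeError`, while the other two calls succeed. Writing such a string to a text stream with the default strict error handling fails in the same way.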

log2timeline/dfvfs#744

Though this approach has complicating factors as well, for example that \ should now also be escaped.
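
Why the backslash needs escaping can be shown with a minimal sketch. The function names are hypothetical and escaping `\` as `\\` is an assumption for illustration, not something this issue specifies.

```python
def escape_surrogates_only(name):
  """Escapes surrogates as \\u#### but leaves backslashes untouched."""
  return ''.join(
      '\\u{0:04x}'.format(ord(character))
      if 0xD800 <= ord(character) <= 0xDFFF else character
      for character in name)


def escape_with_backslash(name):
  """Escapes backslashes first (assumed as \\\\), then surrogates."""
  result = []
  for character in name:
    if character == '\\':
      result.append('\\\\')
    elif 0xD800 <= ord(character) <= 0xDFFF:
      result.append('\\u{0:04x}'.format(ord(character)))
    else:
      result.append(character)
  return ''.join(result)


# A name consisting of the 6 literal characters \ u d 8 0 0.
literal_name = '\\ud800'
# A name consisting of a single unpaired surrogate.
surrogate_name = '\ud800'
```

Without backslash escaping both names produce the same output, so the escaped form cannot be reversed unambiguously. With it, the two outputs differ.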

joachimmetz · 2023-07-07T17:16:01Z

Going with this approach for now #78

joachimmetz · 2023-07-08T09:48:45Z

It looks like the restrictions in Python 3.3 (and later) mostly involve surrogates.

joachimmetz · 2023-07-20T05:48:48Z

Some additional thoughts https://osdfir.blogspot.com/2023/07/whats-in-file-path.html

joachimmetz self-assigned this Jul 4, 2023

joachimmetz added the enhancement New feature or request label Jul 4, 2023

joachimmetz mentioned this issue Aug 1, 2023

enhance bodyfile output #55

Open

5 tasks

joachimmetz added a commit to joachimmetz/dfimagetools that referenced this issue Jan 28, 2024

Changes to handle special and non-Unicode characters log2timeline#77

5edde69

joachimmetz added a commit that referenced this issue Jan 28, 2024

Changes to handle special and non-Unicode characters #77 (#95)

1a10a07
