Reimplement path functions around ADT #53

ndmitchell · 2015-12-23T13:54:14Z

Pulling out from #12 (comment), I think most path/drive based functions (but not extension functions) should first parse their path, then modify, then render it. That gives a much smaller "trusted core" of FilePath.

data Lexeme = Drive Char | UNC String | Separator Char | Path String

parse :: String -> [Lexeme]

display :: [Lexeme] -> String
display = concatMap $ \x -> case x of
    Drive x -> [x,':'] -- x will be an ASCII character
    UNC x -> '\\':'\\':x -- x will be a UNC name, not containing any pathSeparators
    Separator x -> x -- x will be a member of pathSeparators
    Path x -> x -- x will not contain any pathSeparators

We would rely on the property display . parse == id, so likely UNC would need extending to say which type of separators it was, whether it had ? etc.

The text was updated successfully, but these errors were encountered:

SwiftsNamesake · 2017-08-31T05:40:34Z

@ndmitchell Any news on this front? Is the intention to have strongly typed paths eventually?

ndmitchell · 2017-08-31T21:50:55Z

No progress on this front at all, not looked at it at all. There is no intention to change the API with this work - it's about cleaning up the core and ensuring consistency.

hasufell · 2021-12-03T23:17:46Z

@ndmitchell I came up with the following. This is for windows only for now, doesn't use a real parser since it's rather small and passes the display (parse x) === x property. Thoughts?

data Lexeme = NS NameSpace
            | Disk Char
            | Device String
            | Share String
            | Separators [Char]
            | FileName String
  deriving Show

data NameSpace = FileNameSpace
               | DeviceNameSpace
               | NTNameSpace
  deriving Show



-- | Parse a filepath into lexemes.
--
-- > display (parse x) === x
parse :: String -> [Lexeme]
parse fp'
  -- '\\?\UNC\share\path'
  | Just (lx, rest) <- parseExtendedUNC fp' = lx         ++ parsePath rest
  -- '\\?\C:\path', '\\.\C:\path', '\??\C:\path'
  | Just (lx1, r1)  <- parseNameSpace fp'
  , Just (lx2, r2)  <- parseDisk r1         = lx1 ++ lx2 ++ parsePath r2
  -- '\\.\COM1'
  | Just (lx1@(NS DeviceNameSpace:_), r1) <- parseNameSpace fp'
  , Just (lx2, r2)  <- parseDevice r1       = lx1 ++ lx2 ++ parsePath r2
  -- '\\?\some\other', '\\.\some\other', '\??\some\other'
  -- detects no device/disk
  | Just (lx, rest) <- parseNameSpace fp'   = lx         ++ parsePath rest
  -- 'C:\path'
  | Just (lx, rest) <- parseDisk fp'        = lx         ++ parsePath rest
  -- '\\share\path'
  | Just (lx, rest) <- parseDriveShare fp'  = lx         ++ parsePath rest
  -- 'relative\path' and everything else
  | otherwise = parsePath fp'
 where
  parsePath :: String -> [Lexeme]
  parsePath fp = 
    case parseFileName fp <|> parseSeparators fp of
      Nothing         -> []
      Just (lx, [])   -> lx
      Just (lx, rest) -> lx ++ parsePath rest

  parseFileName :: String -> Maybe ([Lexeme], String)
  parseFileName fp =
    case break isPathSeparator fp of
      ([], _) -> Nothing
      (a, r)  -> Just ([FileName a], r)

  parseSeparators :: String -> Maybe ([Lexeme], String)
  parseSeparators fp
   | null a = Nothing
   | otherwise = Just ([Separators a], b)
      where (a, b) = span isPathSeparator fp

  -- prefix only
  parseExtendedUNC :: String -> Maybe ([Lexeme], String)
  parseExtendedUNC ('\\':'\\' :'?':'\\':'U':'N':'C':s4:r1) | isPathSeparator s4 = do
    (lx, r2) <- parseSeparators (s4:r1)
    (share, r3) <- parseDriveShareName r2
    pure ([NS FileNameSpace] ++ lx ++ share, r3)
  parseExtendedUNC _ = Nothing
    
  -- prefix only
  parseNameSpace :: String -> Maybe ([Lexeme], String)
  parseNameSpace ('\\':'?' :'?':'\\':rest) = Just ([NS NTNameSpace    ], rest)
  parseNameSpace ('\\':'\\':'?':'\\':rest) = Just ([NS FileNameSpace  ], rest)
  parseNameSpace ('\\':'\\':'.':'\\':rest) = Just ([NS DeviceNameSpace], rest)
  parseNameSpace _                         = Nothing

  -- prefix or after any NameSpace
  parseDisk :: String -> Maybe ([Lexeme], String)
  parseDisk (x:':':rest) | isLetter x = Just ([Disk x], rest)
  parseDisk _ = Nothing

  parseDevice :: String -> Maybe ([Lexeme], String)
  parseDevice fp =
    case break isPathSeparator fp of
      ([], _) -> Nothing
      (a, r)  -> Just ([Device a], r)

  -- \\sharename\
  parseDriveShare :: String -> Maybe ([Lexeme], String)
  parseDriveShare (s1:s2:xs) | isPathSeparator s1 && isPathSeparator s2 = do
    (lx, rest) <- parseDriveShareName xs
    pure ([Separators [s1, s2]] ++ lx, rest)
  parseDriveShare _ = Nothing

  -- assume you have already seen \\
  -- share\bob -> "share\", "bob"
  parseDriveShareName :: String -> Maybe ([Lexeme], String)
  parseDriveShareName name = do
    case break isPathSeparator name of
      ([], _) -> Nothing
      (a, b)  -> let (lx, r) = optional b parseSeparators
                 in Just ([Share a] ++ lx, r)

  optional :: String -> (String -> Maybe ([Lexeme], String)) -> ([Lexeme], String)
  optional input parser = maybe ([], input) id (parser input)




display :: [Lexeme] -> String
display lxs = case lxs of
                (ns@(NS FileNameSpace):sep@(Separators _):share@(Share _):rest) -> 
                  d ns ++ "UNC" ++ d sep ++ d share ++ display rest
                other -> concatMap d other
 where
  d :: Lexeme -> String
  d (NS FileNameSpace)   = "\\\\?\\"
  d (NS DeviceNameSpace) = "\\\\.\\"
  d (NS NTNameSpace)     = "\\??\\"
  d (Disk c)             = c:":"
  d (Device dev)         = dev
  d (Share share)        = share
  d (Separators sep)     = sep
  d (FileName fn)        = fn

Mistuke · 2021-12-04T00:29:27Z

@hasufell that looks great and handles all the paths I'd expected. The only question I have is how does it handle / as a separator. / is a valid path separator in a on-namespaced path but isn't in a namespace. i.e \\?\C:\foo/bar refers to a file called foo/bar.

Also /foo\bar is a valid relative path, leading / are relative on Windows.

This next one is arcane so feel free not to support it, but on a non-hierarchical path the \ after the : for the drive letter is optional. So c:foo.txt and c:foo are valid but c:foo\bar.txt and c:foo\bar aren't.

Lastly I'll leave it up to you whether to support file streams or not, they a they aren't very common https://docs.microsoft.com/en-us/windows/win32/fileio/file-streams

hasufell · 2021-12-04T12:50:48Z

@Mistuke

Lastly I'll leave it up to you whether to support file streams or not, they a they aren't very common https://docs.microsoft.com/en-us/windows/win32/fileio/file-streams

That actually makes sense in the light of a security vulnerability, which broke the handling of file extensions.

E.g. current filepath behavior is this (which is incorrect):

> splitExtension "lol.txt::$DATA"
("lol",".txt::$DATA")

There'll be some questions on how to preserve the uncurry addExtension (splitExtension x) == x' property.

This next one is arcane so feel free not to support it, but on a non-hierarchical path the \ after the : for the drive letter is optional. So c:foo.txt and c:foo are valid but c:foo\bar.txt and c:foo\bar aren't.

Right... but the lexer should probably parse it anyway. And then we reject it in e.g. isValid?

EDIT: I tested it and c:foo\bar.txt is the same as foo\bar.txt on my windows 10.

/ is a valid path separator in a on-namespaced path but isn't in a namespace. i.e \?\C:\foo/bar refers to a file called foo/bar.

Right... but it seems both separators are valid in \\.\C:\foo/bar, so it seems the only difference between \\?\ and \\.\ is that the latter does do expansion etc. I'm not sure about \??\ (see below).

Other things I'm not sure about/still testing:

NT object namespace doesn't work at all for me, e.g. \??\C: throws invalid argument (tried readFile/writeFile/listDirectory, ...), although the WinObj manager shows under GLOBAL?? that this symlink exists (also see this SU answer: https://superuser.com/a/1096784/854039)... I can access all these symlinks via \\?\ and \\.\, though, which is rather confusing
The use of Device lexeme, which is supposed to parse COM1 in '\\.\COM1'... seems to also work with file namespace \\?\COM1. Generally, I can access all the symlinks from the WinObj manager with these two. So it seems we can expect either a disk char or a device string after a namespace prefix.
Silly paths like '\\.\COM1\lol\bar' still seem to correctly access the COM1 port...
Generally our notion of drive is this (correct?):
- on linux, there's only one drive, which is the root folder /
- on windows, a drive may be either:
  - a Disk char (e.g. C:)
  - a device string (e.g. COM1)
  - a share name (e.g. \\localhost)
Do any absolute paths exist that don't have a drive as described in 4.?

hasufell · 2021-12-04T16:19:17Z

NT object namespace doesn't work at all for me, e.g. \??\C: throws invalid argument (tried readFile/writeFile/listDirectory, ...), although the WinObj manager shows under GLOBAL?? that this symlink exists (also see this SU answer: https://superuser.com/a/1096784/854039)... I can access all these symlinks via \\?\ and \\.\, though, which is rather confusing

The use of Device lexeme, which is supposed to parse COM1 in '\\.\COM1'... seems to also work with file namespace \\?\COM1. Generally, I can access all the symlinks from the WinObj manager with these two. So it seems we can expect either a disk char or a device string after a namespace prefix.

I have digged deeper into this and it appears that:

\\?\, \\.\ and \??\ all access the same NT object namespace! (here's a picture showing the store https://imgur.com/a/l5KlE38)
\??\ is broken under the directory package, but works with the Win32 API
There are only subtle differences between the three forms, e.g. described in https://stackoverflow.com/a/25099634
\\?\UNC is in no way special syntax, it's just another symlink under the GLOBAL?? namespace
you can escape the GLOBAL?? namespace and write something like \\?\GLOBALROOT\GLOBAL??\UNC\localhost\c$\foo\bar... which is equivalent to \\localhost\c$\foo\bar. Or even "\\\\?\\GLOBALROOT\\Device\\Harddisk0\\Partition2\\temp\\lol"

As such... I'm not sure there's a reasonable way to define what a Drive is in light of these possible expressions.

Also relevant: https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-getvolumepathnamew

hasufell · 2021-12-04T18:13:22Z

Split-drive behavior comparison to python:

python:
  "\\localhost\c$\foo\bar"       -> ("\\localhost\c$", "\foo\bar")
  "\\?\C:\foo\bar"               -> ("\\?\C:", "\foo\bar")
  "\\.\C:\foo\bar"               -> ("\\.\C:", "\foo\bar")
  "\\?\UNC\localhost\c$\foo\bar" -> ("\\?\UNC", "\localhost\c$\foo\bar")
  "\\.\UNC\localhost\c$\foo\bar" -> ("\\.\UNC", "\localhost\c$\foo\bar")

Haskell:
  "\\localhost\c$\foo\bar"       -> ("\\localhost\", "c$\foo\bar")
  "\\?\C:\foo\bar"               -> ("\\?\C:\", "foo\bar")
  "\\.\C:\foo\bar"               -> ("\\.\C:\", "foo\bar")
  "\\?\UNC\localhost\c$\foo\bar" -> ("\\?\\UNC\localhost\", "c$\foo\bar")
  "\\.\UNC\localhost\c$\foo\bar" -> ("\\.\","UNC\localhost\c$\foo\bar")

Mistuke · 2021-12-04T18:48:29Z

There'll be some questions on how to preserve the uncurry addExtension (splitExtension x) == x' property.

Probably can't be maintained for non-default streams.

EDIT: I tested it and c:foo\bar.txt is the same as foo\bar.txt on my windows 10.

Indeed, I was wrong, forgot that it becomes a relative path. So they're all valid. Just not absolute paths.

Right... but it seems both separators are valid in \.\C:\foo/bar, so it seems the only difference between \?\ and \.\ is that the latter does do expansion etc.

Yes with valid I didn't mean you get an error, I meant it doesn't return what a user might expect. It also returns something the user can't remove using any normal OS tools like Explorer.

Hmm I don't think the \\?\ namespace has no obligation to respect reserved names, while \\.\ does. So devices should really only be used with \\.\ https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file#namespaces

\?, \.\ and ??\ all access the same NT object namespace! (here's a picture showing the store https://imgur.com/a/l5KlE38)

Yes, eventually they all access the same kernel objects, but the way there differs.

\?\UNC is in no way special syntax, it's just another symlink under the GLOBAL?? namespace

It's only special in that it's to replace the definition of the original \\ syntax for network access. While the implementation is just a symlink, there's no other way to access a UNC path without it. So it is still in affect special. You probably want to treat it special such that you'd be able to distinguish between a local path or a network one.

you can escape the GLOBAL?? namespace and write something like \?\GLOBALROOT\GLOBAL??\UNC\localhost\c$\foo\bar... which is equivalent to \localhost\c$\foo\bar. Or even "\\?\GLOBALROOT\Device\Harddisk0\Partition2\temp\lol"

Or Volume{..}, etc, but this doesn't change the behavior of the lexer does it?

As such... I'm not sure there's a reasonable way to define what a Drive is in light of these possible expressions.

I think a drive should stay what 99% of users expect, namely just the user level object mappings for local paths. e.g. C:\ for C:\fooo and perhaps for \\?\C:\.

I would expect most users to use the well known PATH syntax and these differences are only important for GHC or specialized program where I assume the programmer knows what they're doing and so wouldn't call

Split-drive behavior comparison to python:

python:
  "\\localhost\c$\foo\bar"       -> ("\\localhost\c$", "\foo\bar")
  "\\?\C:\foo\bar"               -> ("\\?\C:", "\foo\bar")
  "\\.\C:\foo\bar"               -> ("\\.\C:", "\foo\bar")

The one makes more sense to me than the Haskell one,

"\?\UNC\localhost\c$\foo\bar" -> ("\?\UNC", "\localhost\c$\foo\bar")
"\.\UNC\localhost\c$\foo\bar" -> ("\.\UNC", "\localhost\c$\foo\bar")

These don't make sense to me, I would have expected \\?\\UNC\localhost\c$ as that's the actual "root" of the object you need to access.

Haskell:
"\localhost\c$\foo\bar" -> ("\localhost", "c$\foo\bar")
"\?\C:\foo\bar" -> ("\?\C:", "foo\bar")
"\.\C:\foo\bar" -> ("\.\C:", "foo\bar")
"\?\UNC\localhost\c$\foo\bar" -> ("\?\UNC\localhost", "c$\foo\bar")
"\.\UNC\localhost\c$\foo\bar" -> ("\.","UNC\localhost\c$\foo\bar")

hmm somewhat inconsistent

hasufell · 2021-12-04T20:03:37Z

Probably can't be maintained for non-default streams.

What if we do this:

> splitExtension "lol.txt::$DATA"
("lol::$DATA",".txt")
> addExtension "lol::$DATA" ".txt"
"lol.txt::$DATA"

I think that's somewhat consistent. So everything after the first : is considered a Data stream identifier.

Yes with valid I didn't mean you get an error, I meant it doesn't return what a user might expect. It also returns something the user can't remove using any normal OS tools like Explorer.

Hmm I don't think the \?\ namespace has no obligation to respect reserved names, while \.\ does. So devices should really only be used with \.\ https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file#namespaces

To clarify: createDirectory "\\\\.\\C:\\foo/bar" creates the directory C:\foo\bar here... that's what I meant. So \\.\ seems to allow path interpretation by the windows API. As a result, I'd only ignore unix / path separators for file namespace \\?\.

The rest of the intricacies can be left to the user, who chose to go down this path.

Or Volume{..}, etc, but this doesn't change the behavior of the lexer does it?

Well, yes potentially... because the question is what we consider a Drive.

I think there are a couple of choices:

only ever consider the first path component as the drive (the simplest, will be a breaking change):
- \\.\C:\foo\bar -> \\.\C:\
- \\.\UNC\localhost\c$\foo\bar -> \\.\UNC\
- \\.\GLOBALROOT\Device\Harddisk0\Partition2\foo\bar -> \\.\GLOBALROOT\
same as 1., except that we treat UNC specially (probably the most backwards compatible way):
- \\.\C:\foo\bar -> \\.\C:\
- \\.\UNC\localhost\c$\foo\bar -> \\.\UNC\localhost\
- \\.\GLOBALROOT\Device\Harddisk0\Partition2\foo\bar -> \\.\GLOBALROOT\
try to interpret as much as possible (we'll probably never be exhaustive):
- \\.\C:\foo\bar -> \\.\C:\
- \\.\UNC\localhost\c$\foo\bar -> \\.\UNC\localhost\
- \\.\GLOBALROOT\Device\Harddisk0\Partition2\foo\bar -> \\.\GLOBALROOT\Device\Harddisk0\Partition2\

The one makes more sense to me than the Haskell one,

Yes, I think that our current "\\localhost\c$\foo\bar" -> ("\\localhost\", "c$\foo\bar") is somewhat confusing, but fixing that would be a breaking change.

Mistuke · 2021-12-05T07:32:47Z

What if we do this:
> splitExtension "lol.txt::$DATA"
("lol::$DATA",".txt")
> addExtension "lol::$DATA" ".txt"
"lol.txt::$DATA"
I think that's somewhat consistent. So everything after the first : is considered a Data stream identifier.

That makes some sense, but needs to be documented, it may not be what one intuitively expects.

I think there are a couple of choices:

1. only ever consider the first path component as the drive (the simplest, will be a breaking change):
   
   * `\\.\C:\foo\bar` -> `\\.\C:\`
   * `\\.\UNC\localhost\c$\foo\bar` -> `\\.\UNC\`
   * `\\.\GLOBALROOT\Device\Harddisk0\Partition2\foo\bar` -> `\\.\GLOBALROOT\`

2. same as 1., except that we treat `UNC` specially (probably the most backwards compatible way):
   
   * `\\.\C:\foo\bar` -> `\\.\C:\`
   * `\\.\UNC\localhost\c$\foo\bar` -> `\\.\UNC\localhost\`
   * `\\.\GLOBALROOT\Device\Harddisk0\Partition2\foo\bar` -> `\\.\GLOBALROOT\`

3. try to interpret as much as possible (we'll probably never be exhaustive):
   
   * `\\.\C:\foo\bar` -> `\\.\C:\`
   * `\\.\UNC\localhost\c$\foo\bar -> \\.\UNC\localhost\`
   * `\\.\GLOBALROOT\Device\Harddisk0\Partition2\foo\bar` -> `\\.\GLOBALROOT\Device\Harddisk0\Partition2\`

I find 3. the most logical. Handling the common cases should get the most bang for bucks. I would expect very little, if any programs to use \\.\GLOBALROOT\ etc.

But I personally think that it should treat the first $ the same as the first :. That seems conceptually more correct, even if a
non-backwards compatible change.

>        * `\\.\UNC\localhost\c$\foo\bar -> \\.\UNC\localhost\`

To me says Store in the foo\bar folder on the C drive on localhost. So logically I expect the "drive" to contain c$ since
drive + newpath should still make sense and more consistent with \\.\C:\foo\bar. I really consider the old behavior a bug here.

With this inconsistency a user would need to know whether the patch is a local or UNC one when they call something that returns the drive, in order to get the c$ as well.

The one makes more sense to me than the Haskell one,

Yes, I think that our current "\\localhost\c$\foo\bar" -> ("\\localhost\", "c$\foo\bar") is somewhat confusing, but fixing that would be a breaking change.

Wouldn't it make more sense to release it as a major version and take the breaking change? or a new name for the new behavior? Could always ask what people think about it.

hasufell · 2021-12-05T15:43:19Z

Alright... I tried to come up with a lax grammar around windows filepaths, based on the UNC grammar from MS docs: https://gist.github.com/hasufell/ce7c18ef3ad44d1d59b548206cdcb59d

ndmitchell · 2021-12-12T20:46:25Z

For info, @hasufell, I never expected this to use a real parser (and for dependency reasons it probably won't) - your notes seem plausible along the lines of what I was thinking, but if the ADT parser works better that seems reasonable too. This was more a direction of travel I thought would be good than precise thoughts.

hasufell · 2021-12-13T09:10:42Z

For info, @hasufell, I never expected this to use a real parser (and for dependency reasons it probably won't) - your notes seem plausible along the lines of what I was thinking, but if the ADT parser works better that seems reasonable too. This was more a direction of travel I thought would be good than precise thoughts.

I just handrolled a parser that will work for both [Char] and [Word8]/[Word16] (the latter will be relevant for the AFPP), see https://github.com/haskell/filepath/pull/99/files

ndmitchell added the ghc-8.2 label Dec 23, 2015

This was referenced Dec 23, 2015

Improve normalise #12

Open

Make Windows.combine take account of partial-relative paths #8

Open

hasufell added the discussion label Nov 25, 2021

hasufell mentioned this issue Dec 4, 2021

Improve makeValid/isValid further (on windows) #92

Open

hasufell linked a pull request Dec 8, 2021 that will close this issue

Firs try of ADT parser #99

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reimplement path functions around ADT #53

Reimplement path functions around ADT #53

ndmitchell commented Dec 23, 2015

SwiftsNamesake commented Aug 31, 2017

ndmitchell commented Aug 31, 2017

hasufell commented Dec 3, 2021 •

edited

Loading

Mistuke commented Dec 4, 2021 •

edited

Loading

hasufell commented Dec 4, 2021 •

edited

Loading

hasufell commented Dec 4, 2021 •

edited

Loading

hasufell commented Dec 4, 2021 •

edited

Loading

Mistuke commented Dec 4, 2021

hasufell commented Dec 4, 2021

Mistuke commented Dec 5, 2021 •

edited

Loading

hasufell commented Dec 5, 2021 •

edited

Loading

ndmitchell commented Dec 12, 2021

hasufell commented Dec 13, 2021

Reimplement path functions around ADT #53

Reimplement path functions around ADT #53

Comments

ndmitchell commented Dec 23, 2015

SwiftsNamesake commented Aug 31, 2017

ndmitchell commented Aug 31, 2017

hasufell commented Dec 3, 2021 • edited Loading

Mistuke commented Dec 4, 2021 • edited Loading

hasufell commented Dec 4, 2021 • edited Loading

hasufell commented Dec 4, 2021 • edited Loading

hasufell commented Dec 4, 2021 • edited Loading

Mistuke commented Dec 4, 2021

hasufell commented Dec 4, 2021

Mistuke commented Dec 5, 2021 • edited Loading

hasufell commented Dec 5, 2021 • edited Loading

ndmitchell commented Dec 12, 2021

hasufell commented Dec 13, 2021

hasufell commented Dec 3, 2021 •

edited

Loading

Mistuke commented Dec 4, 2021 •

edited

Loading

hasufell commented Dec 4, 2021 •

edited

Loading

hasufell commented Dec 4, 2021 •

edited

Loading

hasufell commented Dec 4, 2021 •

edited

Loading

Mistuke commented Dec 5, 2021 •

edited

Loading

hasufell commented Dec 5, 2021 •

edited

Loading