Proposal for dbfs file access api with mocked test suite #3236
+263
−0
Changes
scan-tables-in-mounts
Why this PR?
Refactoring scan-tables-in-mounts to enhance performance will require parallelized crawling of dbfs:/mnt/ locations. @nfx suggested doing this with the backend Hadoop libraries, reached through py4j via the SparkSession. Since this is a fairly big change, I wanted to make sure we are aligned on the backend file lister before I proceed to modify any existing files.
Enable dbfs file listing
This PR proposes the DbfsFiles class, which uses the SparkSession's Java backend with py4j to leverage the following Hadoop libraries for efficient dbfs:/ (and mount) file access:
org.apache.hadoop.fs.FileSystem
org.apache.hadoop.fs.Path
As of now, the only useful method is list_dir, but if the approach is confirmed I plan to add crawling capabilities leveraging databricks.labs.blueprint.parallel.
Test strategy
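As a point of reference for the testing discussion, the list_dir approach proposed for DbfsFiles might look roughly like the sketch below. It is not the code in this PR: FileEntry, the function signature, and the choice to pass the session in as an argument are assumptions made for illustration, and it relies on the private PySpark attributes _jvm and _jsc being available on the cluster. Passing the session in is what would let a test replace the JVM side with a fake object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical record for one listed entry (name not taken from the PR)."""

    path: str
    is_dir: bool
    size: int


def list_dir(spark, path: str) -> list[FileEntry]:
    """List the direct children of `path` through the Hadoop FileSystem API.

    `spark` is assumed to be a SparkSession whose private `_jvm` and `_jsc`
    attributes expose the py4j gateway to the driver JVM.
    """
    jvm = spark._jvm
    hadoop_conf = spark._jsc.hadoopConfiguration()
    # org.apache.hadoop.fs.Path resolves the matching FileSystem for its scheme.
    jpath = jvm.org.apache.hadoop.fs.Path(path)
    fs = jpath.getFileSystem(hadoop_conf)
    # listStatus returns FileStatus objects for the immediate children only.
    return [
        FileEntry(
            path=status.getPath().toString(),
            is_dir=status.isDirectory(),
            size=status.getLen(),
        )
        for status in fs.listStatus(jpath)
    ]


# Usage on a cluster, assuming an existing session named `spark`:
# entries = list_dir(spark, "dbfs:/mnt/")
```

Because list_dir only returns one level, a crawler can hand each returned directory to a separate task, which is where a parallel helper would fit.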
A testing strategy was painstakingly created to:
scan-tables-in-mounts.
I created a simple trie-based mock file system which can be leveraged for scan-tables-in-mounts functionality after the refactor. This mock file system also has its own unit tests.
Functionality
databricks labs ucx ...
...
...
Tests
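The trie-based mock file system mentioned under Test strategy could be structured along these lines. This is a minimal sketch, not the implementation in this PR: TrieNode, MockFileSystem, add_file, and the (path, is_dir) return shape of list_dir are hypothetical names chosen for illustration. Each path segment is a trie node, so listing a directory is a walk to the node followed by a read of its children.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    """One path segment; a node that is not a file acts as a directory."""

    name: str
    children: dict[str, "TrieNode"] = field(default_factory=dict)
    is_file: bool = False


class MockFileSystem:
    """In-memory stand-in for a dbfs:/ tree (hypothetical helper)."""

    def __init__(self, scheme: str = "dbfs:/") -> None:
        self._scheme = scheme
        self._root = TrieNode(name="")

    def _segments(self, path: str) -> list[str]:
        # Strip the scheme and ignore empty segments from repeated slashes.
        return [s for s in path.removeprefix(self._scheme).split("/") if s]

    def add_file(self, path: str) -> None:
        """Register a file, creating intermediate directory nodes as needed."""
        node = self._root
        for segment in self._segments(path):
            node = node.children.setdefault(segment, TrieNode(name=segment))
        node.is_file = True

    def list_dir(self, path: str) -> list[tuple[str, bool]]:
        """Return (full_path, is_dir) for each direct child, sorted by name."""
        node = self._root
        segments = self._segments(path)
        for segment in segments:
            if segment not in node.children:
                raise FileNotFoundError(path)
            node = node.children[segment]
        if node.is_file:
            raise NotADirectoryError(path)
        base = self._scheme + "/".join(segments)
        prefix = base if base.endswith("/") else base + "/"
        return [
            (prefix + child.name, not child.is_file)
            for child in sorted(node.children.values(), key=lambda n: n.name)
        ]


# Usage: build a small tree, then list one level at a time.
fs = MockFileSystem()
fs.add_file("dbfs:/mnt/a/part-0.parquet")
fs.add_file("dbfs:/mnt/b.csv")
print(fs.list_dir("dbfs:/mnt"))
```

A mock of this shape keeps unit tests independent of a running cluster, since the crawler only needs something that answers list_dir.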