Reading a single file from s3 with the filesystem resource #2188

Closed
guillaumecherel opened this issue Jan 6, 2025 · 2 comments
Labels
bug Something isn't working

Comments

@guillaumecherel
dlt version

1.5.0

Describe the problem

Reading a single file with the filesystem resource doesn't work.

Targeting the file directly in the bucket_url yields an empty list:

In [2]: import dlt
   ...: from dlt.sources.filesystem import filesystem
   ...: list(filesystem(bucket_url=f"s3://{bucket}/data/folder/my_file.csv"))
Out[2]: []

Passing the filename to the file_glob parameter crashes:

In [1]: import dlt
   ...: from dlt.sources.filesystem import filesystem
   ...: list(filesystem(bucket_url=f"s3://{bucket}/data/folder/", file_glob="my_file.csv"))
...
ResourceExtractionError: In processing pipe filesystem: extraction of resource filesystem in generator filesystem caused an exception: [Errno 22] Bad Request

Adding a * anywhere in the file glob works:

In [1]: import dlt
   ...: from dlt.sources.filesystem import filesystem
   ...: list(filesystem(bucket_url=f"s3://{bucket}/data/folder/", file_glob="my_file.csv*"))
Out[2]: 
[{'file_name': 'my_file.csv',
  'relative_path': 'my_file.csv',
  ...}]
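
Because a trailing `*` can also match other files that share the prefix, the wildcard workaround may need an exact-name filter on top. The following is a minimal sketch of that idea, not part of dlt: `select_exact` is a hypothetical helper, the `listed` data stands in for the result of the `filesystem(...)` call above, and the `my_file.csv.bak` entry is a made-up example of an unwanted match. Only the `file_name` and `relative_path` keys come from the output shown above.

```python
from typing import Any, Dict, Iterable, List


def select_exact(items: Iterable[Dict[str, Any]], file_name: str) -> List[Dict[str, Any]]:
    """Keep only the items whose 'file_name' equals file_name exactly."""
    return [item for item in items if item.get("file_name") == file_name]


# Stand-in for list(filesystem(..., file_glob="my_file.csv*")).
# The second entry is an invented example of an extra match.
listed = [
    {"file_name": "my_file.csv", "relative_path": "my_file.csv"},
    {"file_name": "my_file.csv.bak", "relative_path": "my_file.csv.bak"},
]

matches = select_exact(listed, "my_file.csv")
print(matches)
```

In a real session the `listed` variable would be replaced by the list returned from the wildcard call.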

Expected behavior

At least one of the first two methods above, and ideally both, should return a list whose only element is the expected file metadata.

Steps to reproduce

Start a fresh Python interpreter and copy and paste the code above, changing the paths accordingly. Restart the interpreter before each attempt.

Operating system

Linux

Runtime environment

Local

Python version

3.11

dlt data source

filesystem

dlt destination

DuckDB

Other deployment details

No response

Additional information

No response

@mohamedmeqlad99
The issue might be a bug in the filesystem resource where it doesn't correctly handle direct file paths or specific glob patterns.

A fix could involve updating the resource to correctly parse and handle direct file paths and glob patterns without requiring wildcards.

@rudolfix
Collaborator

@guillaumecherel we have a test that globs single files on all filesystems. Could you paste the full exception trace? Maybe it contains more details about the failure.
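
One way to collect a full trace like the one requested here is to wrap the failing call and format the exception with the standard `traceback` module, which also prints chained exceptions. This is only a sketch under assumptions: `capture_trace` and `failing_call` are hypothetical names, and `failing_call` merely imitates a wrapped `[Errno 22] Bad Request` error rather than calling dlt or S3.

```python
import traceback
from typing import Callable, Optional


def capture_trace(call: Callable[[], object]) -> Optional[str]:
    """Run call and return the formatted traceback if it raises, else None."""
    try:
        call()
    except Exception:
        # format_exc() includes chained exceptions (__cause__ / __context__)
        return traceback.format_exc()
    return None


def failing_call() -> None:
    # Hypothetical stand-in for the failing list(filesystem(...)) call
    try:
        raise OSError(22, "Bad Request")
    except OSError as exc:
        raise RuntimeError("extraction failed") from exc


trace = capture_trace(failing_call)
print(trace)
```

To use it against the real failure, `failing_call` would be replaced by a function that runs the `file_glob="my_file.csv"` call from the report.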

@rudolfix rudolfix added the bug Something isn't working label Jan 14, 2025
@rudolfix rudolfix self-assigned this Jan 14, 2025
@rudolfix rudolfix moved this from Todo to In Progress in dlt core library Jan 14, 2025
@github-project-automation github-project-automation bot moved this from In Progress to Done in dlt core library Jan 27, 2025
Projects
Status: Done
3 participants