Storing, retrieving and using files in S3 is a regular activity, so it should be easy. It should also:
- stream the data
- have an api that is python file-io like
- handle some of the deserialization and compression stuff, because why not
```bash
pip install s3-streaming
```
Opening and reading S3 objects is similar to regular Python IO. The only difference is that you need to provide a `boto3.session.Session` instance to handle the bucket access.
```python
import boto3
from s3streaming import s3_open

with s3_open('s3://bucket/key', boto_session=boto3.session.Session()) as f:
    for next_line in f:
        print(next_line)
```
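Any configured session works, so credentials, profiles and regions are handled the usual boto3 way. A minimal sketch, assuming a profile named `analytics` exists in your AWS config (the profile and region names here are placeholders):

```python
import boto3
from s3streaming import s3_open

# 'analytics' and 'eu-west-1' are placeholders; use whatever profile/region
# actually has access to the bucket.
session = boto3.session.Session(profile_name='analytics', region_name='eu-west-1')

with s3_open('s3://bucket/key', boto_session=session) as f:
    for next_line in f:
        print(next_line)
```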
Consider a file that is `gzip`-compressed and contains lines of `json`. There's some boilerplate in dealing with that, but why bother? Just handle it in the stream.
```python
import boto3
from s3streaming import s3_open, deserialize, compression

reader_settings = dict(
    boto_session=boto3.session.Session(),
    deserializer=deserialize.json_lines,
    compression=compression.gzip,
)

with s3_open('s3://bucket/key.gzip', **reader_settings) as f:
    for next_line in f:
        print(next_line.keys())    # because the file was decompressed ...
        print(next_line.values())  # ... and the json is now a loaded dict!
```
Other `deserialize` options include:
- `csv`
- `csv_as_dict`
- `tsv`
- `tsv_as_dict`
- `string`
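These plug into `s3_open` the same way as `json_lines`. A minimal sketch for a gzip-compressed TSV, assuming the `*_as_dict` readers take their keys from the file's header row (the path and column names below are placeholders):

```python
import boto3
from s3streaming import s3_open, deserialize, compression

# Placeholder path and columns; assumes the *_as_dict deserializers behave
# like csv.DictReader and use the header row for keys.
reader_settings = dict(
    boto_session=boto3.session.Session(),
    deserializer=deserialize.tsv_as_dict,
    compression=compression.gzip,
)

with s3_open('s3://bucket/table.tsv.gz', **reader_settings) as f:
    for row in f:
        print(row['user_id'], row['event'])  # each line arrives as a dict
```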