Add support for pymongoarrow schema and projection parameters in the mongo CollectionLoaders #577
Comments
Careful about using We tried to write |
@esciara thanks for the heads up! We have indeed noticed similar type-related issues in the past (e.g. mongodb-labs/mongo-arrow#236 (comment)). For ObjectId columns like _id, we are able to use dlt to move data from mongo to duckdb as follows:
I have not tried to see what happens with arrays of ObjectIds, but as you noticed this might be tricky.
Fab @marcofraccaro. Does it also handle ObjectIds within structures or lists well, and translate them to strings?
@esciara we have not tried this, as it's not needed for our current use case. However, based on the limitations of pymongoarrow we both encountered, it might not work out of the box.
@esciara @marcofraccaro we spent a lot of time trying to deal with ObjectIds without looping in Python; that included abusing Arrow functions for string decoding and trying to convert that column to pandas, etc. Indeed, @marcofraccaro, projections can definitely be added. I hope you were able to hack together some kind of solution in the meantime...
TODO summary:
Source name
mongodb
Describe the data you'd like to see
There are two parameters that would be useful to make user-configurable in the mongo CollectionLoaders:
- projection: the projection parameter for find_raw_batches/find, which optionally limits which data is exported from mongo (e.g. to drop columns with sensitive data at the source, or to reduce data size when not all columns are needed).
- pymongoarrow_schema: a schema to be used in PyMongoArrowContext to enforce a schema in process_bson_stream when needed. Instead of the current call with the schema set to None, as done in context = PyMongoArrowContext.from_schema(None, codec_options=self.collection.codec_options), one would be able to pass a pymongoarrow schema.
With the inferred schema, data_item_format = "arrow" fails with the error "extraction of resource transaction in generator collection_documents caused an exception: value too large to convert to int32_t". This error occurs because the schema is wrongly inferred to be int32; when the pyarrow type pa.float64() is set in the pymongoarrow_schema, things work as expected.
Are you a dlt user?
I'm considering using dlt, but this bug is preventing this.
Are you ready to contribute this extension?
Yes, I'm ready.
dlt destination
duckdb/s3
Additional information
No response
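For illustration only, here is a rough sketch of how the two requested settings could be carried by a loader and turned into arguments for find/find_raw_batches. LoaderOptions, find_kwargs, and the field names "ssn" and "status" are hypothetical and not part of dlt; the only assumption about pymongo is that find and find_raw_batches accept filter and projection keyword arguments.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class LoaderOptions:
    """Hypothetical container for the two requested settings (not part of dlt)."""

    # Passed through to pymongo's find / find_raw_batches, e.g. {"ssn": 0}
    # to exclude a sensitive field at the source.
    projection: Optional[Dict[str, Any]] = None
    # Would hold a pymongoarrow schema; None keeps the current inferred schema.
    pymongoarrow_schema: Optional[Any] = None

    def find_kwargs(self, filter_: Dict[str, Any]) -> Dict[str, Any]:
        """Build keyword arguments for collection.find(...) or find_raw_batches(...)."""
        kwargs: Dict[str, Any] = {"filter": filter_}
        # Only add the projection when the user configured one, so the
        # default behaviour (export all fields) is unchanged.
        if self.projection is not None:
            kwargs["projection"] = self.projection
        return kwargs


options = LoaderOptions(projection={"ssn": 0})
kwargs = options.find_kwargs({"status": "active"})
# A loader could then call collection.find_raw_batches(**kwargs) and pass
# options.pymongoarrow_schema to PyMongoArrowContext.from_schema in place of None.
```

Keeping both settings optional means existing pipelines would behave as they do today unless a user opts in.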