
Add support for pymongoarrow schema and projection parameters in the mongo CollectionLoaders #577

Open
marcofraccaro opened this issue Oct 16, 2024 · 6 comments · May be fixed by #592
Labels: source new feature (adds new feature to existing source)

Source name

mongodb

Describe the data you'd like to see

Two parameters would be useful to expose as user-configurable options in the mongo CollectionLoaders:

  1. A projection parameter for find_raw_batches/find, which optionally limits the data exported from mongo (e.g. to drop columns with sensitive data at the source, or to reduce data size when not all columns are needed).
  2. A pymongoarrow_schema to be used in PyMongoArrowContext to enforce a schema in process_bson_stream when needed. Instead of the current call with the schema set to None,

     ```python
     context = PyMongoArrowContext.from_schema(None, codec_options=self.collection.codec_options)
     ```

     one would be able to pass a pymongoarrow schema:

     ```python
     pymongoarrow_schema = pymongoarrow.api.Schema(arrow_schema)
     context = PyMongoArrowContext.from_schema(pymongoarrow_schema, codec_options=self.collection.codec_options)
     ```

     Without this schema, in one of our use cases data_item_format = "arrow" fails with the error `extraction of resource transaction in generator collection_documents caused an exception: value too large to convert to int32_t`. This happens because the column type is wrongly inferred as int32; setting the pyarrow type pa.float64() in the pymongoarrow_schema makes things work as expected (see the combined sketch after this list).
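For illustration, a minimal sketch of how a loader could combine both parameters. It assumes `collection` is a pymongo Collection, the `amount` field is illustrative, and the `from_schema` call mirrors the one the source already uses:

```python
import pyarrow as pa
from pymongoarrow.api import Schema
from pymongoarrow.context import PyMongoArrowContext

# Pin `amount` to float64 instead of letting pymongoarrow infer int32.
pymongoarrow_schema = Schema({"amount": pa.float64()})
context = PyMongoArrowContext.from_schema(
    pymongoarrow_schema, codec_options=collection.codec_options
)

# The proposed projection parameter: only `amount` leaves mongo,
# _id is explicitly excluded at the source.
for batch in collection.find_raw_batches({}, projection={"_id": 0, "amount": 1}):
    context.process_bson_stream(batch)

table = context.finish()  # an Arrow table with `amount` typed as float64
```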

Are you a dlt user?

I'm considering using dlt, but this issue is preventing me from adopting it.

Are you ready to contribute this extension?

Yes, I'm ready.

dlt destination

duckdb/s3

Additional information

No response

esciara commented Oct 27, 2024

Careful about using pymongoarrow: we have had a few problems trying to use it, particularly with the translation of ObjectId and arrays of ObjectIds. There is at least the documented problem with nested extension types.

We tried to write pyarrow tables using DuckDB's import from Apache Arrow, but it threw an error saying the type was not supported. We ended up writing the tables straight to parquet files with the pyarrow.parquet.write_table() function, which translates ObjectId to blob (and arrays of ObjectIds to arrays of blobs). We then cast those blobs to strings using DuckDB's hex function for BLOBs (currently missing from the documentation) to get the id as a string.
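For concreteness, a small sketch of that workaround; the binary values stand in for ObjectIds, and the file path is a placeholder:

```python
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# Stand-in for a pymongoarrow result: ObjectIds arrive as 12-byte blobs.
table = pa.table({"_id": pa.array([b"\x01" * 12, b"\x02" * 12], type=pa.binary())})
pq.write_table(table, "collection.parquet")

# hex() on a BLOB yields the hexadecimal string; lower() matches the
# lowercase form Mongo uses for ObjectId strings.
con = duckdb.connect()
print(con.execute(
    "SELECT lower(hex(_id)) AS id FROM read_parquet('collection.parquet')"
).fetchall())
```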

marcofraccaro (Author) commented

@esciara thanks for the heads up! We have indeed noticed similar type-related issues in the past (e.g. mongodb-labs/mongo-arrow#236 (comment)).
However, pymongoarrow is still very beneficial to us in terms of performance in several use cases.

For ObjectId columns like _id, we are able to use dlt to move data from mongo to DuckDB as follows:

  1. We define a pymongoarrow_schema (as explained in the issue description) where ObjectId columns have type pymongoarrow.types.ObjectIdType() (see the sketch below).
  2. dlt then transforms these columns to string columns with convert_arrow_columns.
  3. DuckDB loads these string columns.

I have not tried to see what happens with arrays of ObjectIds, but as you noticed this might be tricky.
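A minimal sketch of the schema from step 1; the `amount` field is illustrative, not part of our actual setup:

```python
import pyarrow as pa
from pymongoarrow.api import Schema
from pymongoarrow.types import ObjectIdType

# _id keeps the ObjectId extension type; dlt later turns it into a string
# column via convert_arrow_columns. `amount` is pinned to float64 to avoid
# the wrong int32 inference described in the issue.
pymongoarrow_schema = Schema({
    "_id": ObjectIdType(),
    "amount": pa.float64(),
})
```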

esciara commented Oct 29, 2024

Fab @marcofraccaro. Does it also handle ObjectId within structs or lists well, and translate them to strings?

marcofraccaro (Author) commented

@esciara we have not tried this, as it's not needed for our current use case. However, based on the pymongoarrow limitations we both encountered, it might not work out of the box.

rudolfix (Contributor) commented Dec 4, 2024

@esciara @marcofraccaro we spent a lot of time trying to deal with ObjectIds without looping in Python; that included abusing Arrow functions for string decoding, trying to convert the column to pandas, etc. Indeed, the DuckDB method looks pretty good. Polars can apparently also convert a binary series to hex, so maybe we can try that.
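A sketch of what the polars route could look like; the column name and data are illustrative:

```python
import polars as pl

# Stand-in data: ObjectIds as 12-byte binary values.
df = pl.DataFrame({"_id": [b"\x01" * 12, b"\x02" * 12]})

# Vectorized blob -> hex string conversion, no Python-level looping.
df = df.with_columns(pl.col("_id").bin.encode("hex").alias("id"))
print(df)
```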

@marcofraccaro projections can definitely be added. I hope you were able to hack together some kind of solution in the meantime...

rudolfix (Contributor) commented

TODO summary:

  1. Figure out how to convert the ObjectId logical type that Mongo uses into strings that can be easily loaded, without looping in Python. Any string representation will do, ideally the hexadecimal one used by Mongo (https://www.mongodb.com/docs/manual/reference/method/ObjectId/#return-a-hexadecimal-string).
  2. Add a projection argument to mongodb_collection as described in this ticket. Please check how that works for nested documents!
  3. Add a pymongoarrow_schema argument and pass it to the regular and parallel Arrow collection loaders. A sketch of the resulting interface follows this list.
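A sketch of what that interface could look like; the projection and pymongoarrow_schema parameters are the proposal rather than an existing API, and connection details and field names are placeholders:

```python
import pyarrow as pa
from pymongoarrow.api import Schema
from pymongoarrow.types import ObjectIdType
# assuming the mongodb verified source is importable locally:
# from mongodb import mongodb_collection

transactions = mongodb_collection(
    connection_url="mongodb://localhost:27017",  # placeholder
    database="mydb",                             # placeholder
    collection="transaction",
    data_item_format="arrow",
    # 2. only export the listed fields; drop everything else at the source
    projection={"_id": 1, "amount": 1},
    # 3. enforce types during BSON -> Arrow conversion instead of inferring them
    pymongoarrow_schema=Schema({"_id": ObjectIdType(), "amount": pa.float64()}),
)
```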
