Support additional formats (e.g. fst, feather, etc.?) #3
Comments
Relevant issue for fst: fstpackage/fst#9
Can also consider supporting non-DBI-based backends, e.g. LMDB; see example in https://github.com/cboettig/taxalight/
+arrow!
@1beb can you clarify your use case here? The arrow R package can already stream between many text-based formats (csv, tsv, json, etc) and 'native' binary formats (feather, parquet, etc) with minimal memory use. Are you trying to go from an arrow data format to a DBI-compliant database format or vice-versa?
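A minimal sketch of the kind of streaming described above, assuming the arrow package is installed and a hypothetical flights.csv; open_dataset() scans the CSV lazily and write_dataset() streams it out as parquet without loading the whole table into memory:

```r
library(arrow)

# Open the CSV as a lazy dataset; nothing is pulled into RAM yet
ds <- open_dataset("flights.csv", format = "csv")

# Stream the dataset out to a directory of parquet files
write_dataset(ds, "flights_parquet", format = "parquet")
```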
Database <> Parquet would be nice.
Just a note: if you have parquet already, the most recent version of … If you have data stuck in some other database (postgres, etc.) and want to export it to parquet, I hear you. Unfortunately, parquet is a format that doesn't take well to processing in chunks (I believe a bunch of metadata is always placed at the end of the parquet file, so you can't stream / append to it, as far as I understand).
The latter is one of my use cases (db to parquet). My current strategy is breaking the data into variable-based chunks and then writing, but when you chunk, field types often get dropped, so you have to be explicit about them. It would be nice to be able to stream to chunk_X.parquet and to specify a row count; I'm trying to avoid writing offset/fetch SQL with my bare hands. I was thinking the `ark` function might work if the write method could be parameterized as a function argument, but that doesn't look like how it's designed. Another problem is that this targets an entire table, whereas at my scale I'd need to target subsets of a table partition to fit in RAM, and I didn't notice a way to add a filter.
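A rough sketch of the chunked db-to-parquet export described above, using plain DBI fetching plus arrow; the connection, table name, and chunk size are placeholders, and each chunk is written to its own numbered file since parquet cannot be appended to:

```r
library(DBI)
library(arrow)

export_chunked <- function(con, table, dir, chunk_size = 1e6) {
  dir.create(dir, showWarnings = FALSE)
  res <- dbSendQuery(con, paste("SELECT * FROM", table))
  on.exit(dbClearResult(res))
  i <- 0
  while (!dbHasCompleted(res)) {
    chunk <- dbFetch(res, n = chunk_size)   # pull chunk_size rows at a time
    if (nrow(chunk) == 0) break
    i <- i + 1
    write_parquet(chunk, file.path(dir, sprintf("chunk_%03d.parquet", i)))
  }
  invisible(i)
}
```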
What's the target size of DB that this package is generally used for?
PS: Great pointer to duckdb, I'm using it for some production cases and it's really great with parquet!
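For what it's worth, a small illustration of the duckdb + parquet combination mentioned in the PS; duckdb can query a folder of parquet files directly with SQL (the path here is a placeholder):

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())
# Aggregate over every parquet chunk in the folder without loading them into R
dbGetQuery(con, "SELECT COUNT(*) FROM parquet_scan('archive/*.parquet')")
dbDisconnect(con, shutdown = TRUE)
```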
Looking at the code, I'm thinking that here: https://github.com/ropensci/arkdb/blob/master/R/ark.R#L224 you could probably add an optional write to chunked parquet. You'd need to do some accounting for chunk naming if it mattered to you. I have a need, so I'll take a swing at it; if I get it working I'll post it as a PR.
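One possible shape for that change, assuming arkdb's streamable_table() constructor accepts read/write/extension functions roughly as documented (the exact write signature here is a guess), and keeping in mind that parquet can't be appended to, so this only makes sense if each chunk lands in its own file:

```r
library(arkdb)
library(arrow)

# Hypothetical parquet plugin for ark(); not part of arkdb at the time of writing
streamable_parquet_sketch <- streamable_table(
  read  = function(file, ...) arrow::read_parquet(file, ...),
  write = function(x, path, omit_header) arrow::write_parquet(x, path),
  extension = "parquet"
)
```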
@1beb sweet. Yeah, definitely agree that a type-safe export format would be super nice, and parquet seems the logical choice. In principle, you can …
Ah! I see it now. Great. This looks much more straightforward than I was expecting. Are you aware of a package or script that does a good job of maintaining reasonable defaults between a typical SQL schema and R's available types? Read a table, look at the schema, convert to colClasses. Also, a type-safe format isn't just nice, it's required. Parquet files in a folder can be read with …
👍
Not entirely sure I follow this. It should be built into the database interface, right? e.g.:
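A hedged illustration of what "built into the database interface" could mean here: when a DBI driver fetches a table, it already maps SQL column types to R classes, so no separate colClasses step is needed (the connection and table name are placeholders):

```r
library(DBI)

df <- dbReadTable(con, "my_table")
# Inspect the R classes the driver chose for each column
sapply(df, function(x) class(x)[1])
```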
I totally hear you on this. Still, the UNIX philosophy is "plain text is the universal interface," and the permanent-archive folks are still all in on it, since binary formats that encode types come and go. Date/time encodings in particular have had a rough time of it. The classicist would say there's no conflict between logical vs integer if you parse it all as text. ✋ But I hear you, I hear you, I live in the real world too and we need types. I think if we stick with DBI <-> parquet, though, this is all handled pretty well already, right?
Numbering output files instead of random string naming
write_tsv / read_tsv for I/O. This should be pluggable. (Done: now supports readr and base utils table I/O.)
Would be nice to take advantage of things like the faster speed of fst::read_fst / fst::write_fst, but it looks like fst does not support the ability to append, making it impossible to stream in chunks.
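A short usage sketch of the pluggable I/O mentioned above, assuming arkdb's documented ark()/unark() interface; the SQLite file and archive directory are placeholders:

```r
library(arkdb)
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "local.sqlite")

# Archive every table as compressed tsv chunks using the readr-based plugin
ark(con, dir = "archive", streamable_table = streamable_readr_tsv(), lines = 50000L)

# Restore the archive into a (possibly different) database
unark(list.files("archive", full.names = TRUE), con,
      streamable_table = streamable_readr_tsv())
```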