From 9a702aef345680022b4b1f03b7928e1eb936b657 Mon Sep 17 00:00:00 2001
From: Suvayu Ali
Date: Wed, 29 May 2024 13:22:27 +0200
Subject: [PATCH 1/7] Remove trivial database section from Python guide

---
 best_practices/language_guides/python.md | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/best_practices/language_guides/python.md b/best_practices/language_guides/python.md
index dd1f5af0..adecc0a3 100644
--- a/best_practices/language_guides/python.md
+++ b/best_practices/language_guides/python.md
@@ -300,15 +300,6 @@ It is good practice to restart the kernel and run the notebook from start to fin
 - [altair](https://github.com/ellisonbg/altair) is a _grammar of graphics_ style declarative statistical visualization library. It does not render visualizations itself, but rather outputs Vega-Lite JSON data. This can lead to a simplified workflow.
 - [ggplot](https://github.com/yhat/ggpy) is a plotting library imported from R.
 
-### Database Interface
-
-* [psycopg](https://www.psycopg.org/) is a [PostgreSQL](http://www.postgresql.org) adapter
-* [cx_Oracle](http://cx-oracle.sourceforge.net) enables access to [Oracle](https://www.oracle.com/database/index.html) databases
-* [monetdb.sql](https://www.monetdb.org/Documentation/SQLreference/Programming/Python)
-is [monetdb](https://www.monetdb.org) Python client
-* [pymongo](https://pymongo.readthedocs.io) and [motor](https://motor.readthedocs.io) allow for work with [MongoDB](http://www.mongodb.com) database
-* [py-leveldb](https://code.google.com/p/py-leveldb/) are thread-safe Python bindings for [LevelDb](https://github.com/google/leveldb)
-
 ### Parallelisation
 
 CPython (the official and mainstream Python implementation) is not built for parallel processing due to the [global interpreter lock](https://wiki.python.org/moin/GlobalInterpreterLock). Note that the GIL only applies to actual Python code, so compiled modules like e.g. `numpy` do not suffer from it.

From 7dbd016728565630fd85794489a343cf242771ca Mon Sep 17 00:00:00 2001
From: Suvayu Ali
Date: Wed, 29 May 2024 17:59:59 +0200
Subject: [PATCH 2/7] Add a chapter on datasets

discuss trade-offs between:
- local databases like SQLite & DuckDB
- data processing libraries like Pandas, Vaex, & Polars

Co-authored-by: Flavio Hafner
---
 best_practices/datasets.md | 52 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)
 create mode 100644 best_practices/datasets.md

diff --git a/best_practices/datasets.md b/best_practices/datasets.md
new file mode 100644
index 00000000..b56fc6af
--- /dev/null
+++ b/best_practices/datasets.md
@@ -0,0 +1,52 @@
+# Working with large datasets
+
+There are several solutions available to you as an RSE, each with its own pros and cons. You should evaluate which one works best for your project and project partners, and pick one. Sometimes you may need to combine two different types of technologies. Here are some examples from our experience.
+
+You will encounter datasets in various file formats like:
+- CSV/Excel
+- Parquet
+- HDF5/NetCDF
+- JSON/JSON-LD
+
+You may also encounter local database files like SQLite. It is important to note the various trade-offs between these formats. For instance, random seeks into a large dataset are difficult with non-binary formats like CSV, Excel, or JSON. In such cases you should consider binary formats like Parquet or HDF5/NetCDF. Non-binary files can also be imported into local databases like SQLite or DuckDB. Below we compare some options for working with datasets in these formats.
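+
+For a concrete illustration of this trade-off, converting a large CSV file to Parquet once makes later column-wise reads much cheaper. A minimal sketch using pandas (assuming `pandas` and `pyarrow` are installed; the file and column names are hypothetical):
+
+```python
+import pandas as pd
+
+# One-off, row-oriented read of the full CSV
+# ("measurements.csv" is a hypothetical example file)
+df = pd.read_csv("measurements.csv")
+
+# Write a compressed, columnar copy to disk
+df.to_parquet("measurements.parquet")
+
+# Later analyses can load only the columns they need
+subset = pd.read_parquet("measurements.parquet", columns=["station", "temperature"])
+```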
+
+## Local database
+
+When you have a relational dataset, it is recommended that you use a database. Local databases like SQLite and DuckDB are very easy to use because they require no setup. But they come with some limitations; for instance, multiple users cannot write to the database simultaneously.
+
+SQLite is a transactional database, so it is the more appropriate choice if your dataset changes over time (e.g. you are adding new rows). However, in research we often work with static datasets and are mostly interested in analytical tasks. For such cases, DuckDB is a more appropriate alternative. Between the two:
+- DuckDB can also create views (virtual tables) over other sources like files or other databases, whereas with SQLite you always have to import the data before running any queries.
+- DuckDB is multi-threaded. This can be an advantage for large databases, where aggregation queries tend to be faster than in SQLite.
+  - However, if you have a really large dataset, say hundreds of millions of rows, and want to perform a deeply nested query, it will require a substantial amount of memory, making it infeasible to run on personal laptops.
+  - There are options to customise memory handling and push what is possible on a single machine.
+
+    You need to limit memory usage to prevent the operating system or shell from preemptively killing the process. A value of about 50% of your system's RAM is a reasonable choice.
+    ```sql
+    SET memory_limit = '5GB';
+    ```
+    By default, DuckDB spills over to disk when memory usage grows beyond this limit. You can verify the temporary directory it uses by running:
+    ```sql
+    SELECT current_setting('temp_directory') AS temp_directory;
+    ```
+    Note that if your query is deeply nested, you should have sufficient disk space for DuckDB to use; e.g. for 4 nested levels of `INNER JOIN` combined with a `GROUP BY`, we observed a disk spill-over of 30x the original dataset size. However, we found that this mechanism was not always reliable.
+
+    In such borderline cases, it might be possible to work around the limitation by splitting the workload into chunks and aggregating later, or by considering one of the alternatives mentioned below.
+  - You can also optimise your queries for DuckDB, but that requires a deeper dive into the documentation and an understanding of how DuckDB query optimisation works.
+- Both databases support setting (unique) indexes. Indexes are useful and sometimes necessary:
+  - For both DuckDB and SQLite, unique indexes allow you to ensure data integrity.
+  - For SQLite, indexes are crucial for query performance. However, having more indexes makes writing new records to the database slower, so it is again a trade-off between query and write speed.
+
+## Data processing libraries on a single machine
+- Pandas
+  - The standard tool for working with dataframes, widely used in analytics and machine learning workflows. Note, however, how Pandas uses memory: certain APIs create copies while others do not. So if you are chaining multiple operations, prefer the APIs that avoid copies (see the sketch after this list).
+- Vaex
+  - Vaex is an alternative that focuses on out-of-core processing (larger than memory) and has some lazy evaluation capabilities.
+- Polars
+  - This is a much newer library, mostly written in Rust. Compared to Pandas, it is multi-threaded and does lazy evaluation with query optimisation, making it much more performant. However, since it is newer, its documentation is not as complete. It also allows you to write your own custom extensions in Rust.
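+
+As a minimal sketch of the point about Pandas copies (assuming `pandas` is installed; the data and column names are hypothetical):
+
+```python
+import pandas as pd
+
+# Hypothetical example data
+df = pd.DataFrame({"station": ["A", "A", "B"], "temperature": [1.0, 2.0, 3.0]})
+
+# Each intermediate assignment below may materialise a full copy:
+subset = df[df["temperature"] > 1.0]
+renamed = subset.rename(columns={"temperature": "temp"})
+result = renamed.groupby("station").mean()
+
+# Method chaining expresses the same pipeline without keeping references
+# to the intermediate copies, so they can be freed immediately:
+result = (
+    df[df["temperature"] > 1.0]
+    .rename(columns={"temperature": "temp"})
+    .groupby("station")
+    .mean()
+)
+```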
+
+## Distributed/multi-node data processing libraries
+- Dask
+  - `dask.dataframe` and `dask.array` provide the same APIs as pandas and numpy respectively, making it easy to switch.
+  - When working with multiple nodes, it requires communication across nodes (which is network bound).
+- Ray
+  - A general framework for distributing Python workloads, commonly used to scale machine-learning pipelines across many nodes.
+- Apache Spark
+  - A mature, JVM-based engine for distributed data processing, accessible from Python through PySpark.

From 0501b632582ee003914f388e1dbc3d772fbcaedd Mon Sep 17 00:00:00 2001
From: Suvayu Ali
Date: Tue, 10 Sep 2024 15:24:08 +0000
Subject: [PATCH 3/7] best_practices/datasets.md: include Apache Arrow

Co-authored-by: Patrick Bos
---
 best_practices/datasets.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/best_practices/datasets.md b/best_practices/datasets.md
index b56fc6af..1bf72e72 100644
--- a/best_practices/datasets.md
+++ b/best_practices/datasets.md
@@ -10,6 +10,10 @@ You will encounter datasets in various file formats like:
 
 You may also encounter local database files like SQLite. It is important to note the various trade-offs between these formats. For instance, random seeks into a large dataset are difficult with non-binary formats like CSV, Excel, or JSON. In such cases you should consider binary formats like Parquet or HDF5/NetCDF. Non-binary files can also be imported into local databases like SQLite or DuckDB. Below we compare some options for working with datasets in these formats.
 
+It's also good to know about [Apache Arrow](https://arrow.apache.org), which is not itself a file format, but a specification for an in-memory layout of (binary) data.
+There is an ecosystem of libraries for all major languages to handle data in this format.
+It is used as the back-end of [many data handling projects](https://arrow.apache.org/powered_by/), including several of the tools mentioned in this chapter.
+
 ## Local database
 
 When you have a relational dataset, it is recommended that you use a database. Local databases like SQLite and DuckDB are very easy to use because they require no setup. But they come with some limitations; for instance, multiple users cannot write to the database simultaneously.

From 5da4240004ce744aea43e9c3109b13c1eec5c443 Mon Sep 17 00:00:00 2001
From: Suvayu Ali
Date: Sat, 14 Sep 2024 22:08:40 +0000
Subject: [PATCH 4/7] Update best_practices/datasets.md: Polars description

---
 best_practices/datasets.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/best_practices/datasets.md b/best_practices/datasets.md
index 1bf72e72..747ad492 100644
--- a/best_practices/datasets.md
+++ b/best_practices/datasets.md
@@ -46,7 +46,7 @@ SQLite is a transactional database, so if you have a dataset that is changing wi
 - Vaex
   - Vaex is an alternative that focuses on out-of-core processing (larger than memory) and has some lazy evaluation capabilities.
 - Polars
-  - This is a much newer library, mostly written in Rust. Compared to Pandas, it is multi-threaded and does lazy evaluation with query optimisation, making it much more performant. However, since it is newer, its documentation is not as complete. It also allows you to write your own custom extensions in Rust.
+  - An alternative to Pandas (started in 2020), primarily written in Rust. Compared to Pandas, it is multi-threaded and does lazy evaluation with query optimisation, making it much more performant. However, since it is newer, its documentation is not as complete. It also allows you to write your own custom extensions in Rust.
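+
+    A minimal sketch of the lazy API (assuming `polars` is installed; the file and column names are hypothetical):
+
+    ```python
+    import polars as pl
+
+    # Nothing is executed here: the optimiser can prune unused columns and
+    # push the filter down into the CSV scan.
+    # ("measurements.csv" is a hypothetical example file)
+    lazy = (
+        pl.scan_csv("measurements.csv")
+        .filter(pl.col("temperature") > 1.0)
+        .group_by("station")
+        .agg(pl.col("temperature").mean())
+    )
+
+    # The query only runs on collect(), multi-threaded by default.
+    df = lazy.collect()
+    ```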
 
 ## Distributed/multi-node data processing libraries
 - Dask

From 97e3f935eaf4e114496f7a8e0be13dc2f4af2580 Mon Sep 17 00:00:00 2001
From: Patrick Bos
Date: Tue, 24 Sep 2024 11:22:29 +0200
Subject: [PATCH 5/7] make dataset title more specific

Co-authored-by: Bouwe Andela
---
 best_practices/datasets.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/best_practices/datasets.md b/best_practices/datasets.md
index 747ad492..93d4a9dd 100644
--- a/best_practices/datasets.md
+++ b/best_practices/datasets.md
@@ -1,4 +1,4 @@
-# Working with large datasets
+# Working with tabular data
 
 There are several solutions available to you as an RSE, each with its own pros and cons. You should evaluate which one works best for your project and project partners, and pick one. Sometimes you may need to combine two different types of technologies. Here are some examples from our experience.

From 151e248a4c2d3fdcaba7c1c93eee5808b9766263 Mon Sep 17 00:00:00 2001
From: Patrick Bos
Date: Wed, 25 Sep 2024 10:09:33 +0200
Subject: [PATCH 6/7] add DataFusion to dataset chapter

Co-authored-by: recap
---
 best_practices/datasets.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/best_practices/datasets.md b/best_practices/datasets.md
index 93d4a9dd..0bb37797 100644
--- a/best_practices/datasets.md
+++ b/best_practices/datasets.md
@@ -47,7 +47,7 @@ SQLite is a transactional database, so if you have a dataset that is changing wi
   - Vaex is an alternative that focuses on out-of-core processing (larger than memory) and has some lazy evaluation capabilities.
 - Polars
   - An alternative to Pandas (started in 2020), primarily written in Rust. Compared to Pandas, it is multi-threaded and does lazy evaluation with query optimisation, making it much more performant. However, since it is newer, its documentation is not as complete. It also allows you to write your own custom extensions in Rust.
-
+- DataFusion is a very fast, extensible query engine for building data-centric systems in [Rust](https://www.rust-lang.org/), using the [Apache Arrow](https://arrow.apache.org/) in-memory format. It offers SQL and DataFrame APIs, excellent [performance](https://benchmark.clickhouse.com/), built-in support for CSV, Parquet, JSON, and Avro, and extensive customisation. For more information, see [Apache DataFusion](https://datafusion.apache.org/).
 ## Distributed/multi-node data processing libraries
 - Dask
   - `dask.dataframe` and `dask.array` provide the same APIs as pandas and numpy respectively, making it easy to switch.

From 898ebbb96cdb6392788c4ed0081dc0f566bcb9c2 Mon Sep 17 00:00:00 2001
From: "E. G. Patrick Bos"
Date: Wed, 25 Sep 2024 10:17:37 +0200
Subject: [PATCH 7/7] add datasets chapter to the sidebar

---
 _sidebar.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/_sidebar.md b/_sidebar.md
index 05b46234..6487b56f 100644
--- a/_sidebar.md
+++ b/_sidebar.md
@@ -7,6 +7,7 @@
   * [Documentation](/best_practices/documentation.md)
   * [Standards](/best_practices/standards.md)
   * [UX - User Experience](/best_practices/user_experience.md)
+  * [Datasets](/best_practices/datasets.md)
 * [Language Guides](/best_practices/language_guides/languages_overview.md)
   * [Bash](/best_practices/language_guides/bash.md)
   * [JavaScript and TypeScript](/best_practices/language_guides/javascript.md)