Replace sequential PubMed load with a parallel process #379

gaurav · 2024-12-12T17:40:05Z

I originally went with a sequential PubMed load because it loads publications in update order, so that e.g. a later, retitled record of a publication overwrites the previously titled one. However, loading all 1,628 PubMed files sequentially takes a long time!

Some better ideas:

1. Convert all files into individual DuckDB databases, then run one query over all of them to ascertain the most recent title/publication status for each publication.
2. Convert all files into e.g. TSV files in parallel, then run through the smaller TSV files sequentially to figure out the most recent title/publication status.
3. ???
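
A rough sketch of the second idea, to show that parallel parsing does not have to lose the "later update wins" behaviour. Everything here is an assumption for illustration: `extract_records` and `load_latest_titles` are hypothetical names, and the input is simplified `pmid<TAB>title` text rather than real PubMed XML. It relies on `Executor.map` returning results in input order, so the merge step still sees files in update order even though they were parsed concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_records(file_text):
    """Hypothetical stand-in for parsing one PubMed file.

    Assumes simplified 'pmid<TAB>title' lines, not real PubMed XML.
    """
    records = []
    for line in file_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        pmid, title = line.split("\t", 1)
        records.append((pmid, title))
    return records


def load_latest_titles(files_in_update_order, max_workers=4):
    """Parse files concurrently, then merge sequentially in update order."""
    latest = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, whatever order the
        # individual parses happen to finish in.
        for records in pool.map(extract_records, files_in_update_order):
            for pmid, title in records:
                latest[pmid] = title  # later files overwrite earlier ones
    return latest


# Two made-up "files": the second retitles publication 1.
files = [
    "1\tOriginal title\n2\tAnother paper\n",
    "1\tRevised title\n",
]
result = load_latest_titles(files)
```

If the real parsing turns out to be CPU-bound, `ProcessPoolExecutor` could probably be swapped in for the thread pool, since its `map` preserves input order in the same way.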
