diff --git a/docs/ricgraph_future_work.md b/docs/ricgraph_future_work.md
index fd77e80..1893e60 100644
--- a/docs/ricgraph_future_work.md
+++ b/docs/ricgraph_future_work.md
@@ -1,15 +1,23 @@
 ## Future work
 
 * Create an end user web interface. This interface should allow
-  easy and faceted browsing of Ricgraph.
+  easy and faceted browsing of Ricgraph. 
+  You can use *Ricgraph explorer*, but it is very basic now.
 * Modify Ricgraph to allow the use of another, preferably open source graph database engine.
   It should be possible by changing minor bits of the code in file *ricgraph.py*.
 * Make a web service of Ricgraph.
 * Write harvesting scripts to get information from e.g. [Zenodo](https://zenodo.org),
-  [ORCID](https://orcid.org), [OpenAlex](https://openalex.org),
+  [ORCID](https://orcid.org), ~~[OpenAlex](https://openalex.org)~~,
   [Scopus](https://www.scopus.com), [Lens](https://www.lens.org),
   [OpenAIRE](https://explore.openaire.eu),
   [DataCite Commons](https://commons.datacite.org),
   [GitHub](https://github.com) (and other Gits), etc.
+* Function `merge_two_personroot_nodes()` in *ricgraph.py* now uses `_graph.delete()`
+  from *py2neo*, but that call has the side effect of removing nodes with more than one edge,
+  e.g. the organization nodes in *harvest_uustaffpages_to_ricgraph.py*
+  (after the call to `rcg.merge_personroots_of_two_nodes()`
+  and then `merge_two_personroot_nodes()`
+  there is only one organization node left).
+  It should use `_graph.separate()`, but the author did not get it working.
 
 [Return to main README.md file](../README.md).
diff --git a/docs/ricgraph_programming_examples.md b/docs/ricgraph_programming_examples.md
index 5670ad7..20b0c48 100644
--- a/docs/ricgraph_programming_examples.md
+++ b/docs/ricgraph_programming_examples.md
@@ -35,6 +35,40 @@
 E.g., for research outputs you can adjust the years to harvest with the
 parameter *PURE_RESOUT_YEARS* and the maximum number of records to harvest with *PURE_RESOUT_MAX_RECS_TO_HARVEST*.
 
+### Harvest of Utrecht University staff pages
+
+There is also a script for harvesting
+the [Utrecht University staff pages](https://www.uu.nl/medewerkers),
+*harvest_uustaffpages_to_ricgraph.py*.
+This script needs the parameter *uustaff_url* to be set in the
+[Ricgraph initialization file](ricgraph_install_configure.md#ricgraph-initialization-file).
+
+### Harvest of OpenAlex
+
+There is also a script for harvesting
+the [OpenAlex](https://openalex.org), *harvest_openalex_to_ricgraph.py*.
+It harvests OpenAlex Works, and by harvesting these
+Works, it also harvests OpenAlex Authors.
+This script needs the parameters *organization_name*, *organization_ror*
+and *openalex_polite_pool_email* to be set in the
+[Ricgraph initialization file](ricgraph_install_configure.md#ricgraph-initialization-file).
+
+There is a lot of data in OpenAlex, so your harvest may take a long time. You may
+reduce this by adjusting parameters at the start of the script. Look in the section
+"Parameters for harvesting persons and research outputs from OpenAlex":
+*OPENALEX_RESOUT_YEARS* and *OPENALEX_MAX_RECS_TO_HARVEST*.
+
+### Order of running the harvesting scripts
+The order of running the harvesting scripts does not really matter. The author harvests
+only records for Utrecht University and uses this order:
+1. *harvest_pure_to_ricgraph.py* (since it has a lot of data which is mostly correct);
+1. *harvest_yoda_datacite_to_ricgraph.py* (not too much data, so harvest is fast, but it
+   contains several data entry errors);
+1. *harvest_rsd_to_ricgraph.py* (not too much data);
+1. *harvest_uustaffpages_to_ricgraph.py*;
+1. *harvest_openalex_to_ricgraph.py* (a lot of data from a [number of
+   sources](https://docs.openalex.org/additional-help/faq#where-does-your-data-come-from)).
+
 ### General program structure of a Python program using Ricgraph
 
 ```python