Commit

Initial version
Rik D.T. Janssen committed Jan 10, 2023
0 parents commit 6c09737
Showing 28 changed files with 3,801 additions and 0 deletions.
142 changes: 142 additions & 0 deletions .gitignore
@@ -0,0 +1,142 @@
# ###########################
# .gitignore for Ricgraph - Research in context graph.
# January 4, 2022.
# ###########################
ricgraph.ini
*.json
*.csv
*.xml
.idea

# ###########################
# Default .gitignore from GitHub on January 4, 2022.
# ###########################
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/
34 changes: 34 additions & 0 deletions CITATION.cff
@@ -0,0 +1,34 @@
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Ricgraph - Research in context graph
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Rik D.T.
    family-names: Janssen
    orcid: 'https://orcid.org/0000-0001-9510-0802'
    affiliation: Utrecht University
identifiers:
  - type: doi
    value: '[to follow]'
abstract: >-
  Ricgraph (Research in context graph) is a graph
  with nodes (sometimes called vertices) and edges
  (sometimes called links) to represent objects and
  their relations. It can be used to store,
  manipulate and read metadata of any object that has
  a relation to another object, as long as every
  object can be "represented" by at least a *name*
  and a *value*. In Ricgraph, one node represents one
  object, and an edge represents the relation between
  two objects. Metadata of an object are stored as
  "properties" in a node, i.e. as information
  associated with a node.
license: MIT
commit: commit id
version: '0.8'
date-released: '2023-01-10'
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 Rik D.T. Janssen

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
142 changes: 142 additions & 0 deletions README.md
@@ -0,0 +1,142 @@
# Ricgraph - Research in context graph

## What is Ricgraph?

Ricgraph (Research in context graph) is a
[graph](https://en.wikipedia.org/wiki/Graph_theory) with
nodes (sometimes called vertices)
and edges (sometimes called links) to represent objects and their relations.
It can be used to store, manipulate and read metadata of any object that
has a relation to another object,
as long as every object can be "represented" by at least a *name* and a *value*.
In Ricgraph, one node represents one object, and an edge represents the
relation between two objects.
It is written in Python and uses [Neo4j](https://neo4j.com)
as [graph database engine](https://en.wikipedia.org/wiki/Graph_database).

Metadata of an object are stored as "properties"
in a node, i.e. as information associated with a node.
For example, a node may store two properties, *name = PET* and
*value = cat*. Another node may store *name = FULL_NAME* and *value = John Doe*.
Then the edge between those two nodes means that the person with FULL_NAME John Doe
has a PET which is a cat.
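
The example above can be sketched in plain Python. This is only an illustration of the idea, not Ricgraph's actual API: the `Node` class, the `edges` set and the `are_connected` function are hypothetical names, and treating an edge as an unordered pair of nodes is an assumption made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One object, represented by at least a name and a value."""
    name: str
    value: str


# The two nodes from the example above.
person = Node(name="FULL_NAME", value="John Doe")
pet = Node(name="PET", value="cat")

# Assumption: an edge is stored as an unordered pair of nodes.
edges = {frozenset({person, pet})}


def are_connected(a: Node, b: Node) -> bool:
    """Return True if there is an edge between nodes a and b."""
    return frozenset({a, b}) in edges
```

Here `are_connected(person, pet)` is true, which expresses that the person with FULL_NAME John Doe has a PET which is a cat.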

The philosophy of Ricgraph is that it stores metadata, not the objects the metadata
refer to. To access an object, a node has a link to that object in
the system it was obtained from. The objective is to get metadata from
objects from a source system in a process called "harvesting".
All information harvested from several source systems will be combined into one graph.
To modify the metadata of an object,
change it in the source system the object was
harvested from, and then reharvest that source system.

## Why Ricgraph?

Ricgraph has been developed because a university needed to show
people, organizations and research outputs
(e.g. books, journal articles, datasets, software, etc.)
in relation to each other. This information is stored in different systems.
That university needed to show research in context in a
graph (hence the name).
Ricgraph is able to answer questions like:

* Which person has contributed to which book, journal article, dataset,
software package, etc.?
* Given e.g. a dataset or software package, who has contributed to it?
* What identifiers does a person have (there are a lot in use at universities)?
* Which persons have worked together (shown as a network)?
* For which organization does a person work, and hence which organizations have worked together?

Ricgraph provides example code to do this. We have chosen a
graph as a data structure, since it is a logical and efficient
way to access objects
that are close to the objects they are related to. For example,
starting with a person, their research outputs are only one
step away by following one edge, and other contributors to that research output are
again one step (edge) away.
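
The "one step away" reasoning can be sketched with a small adjacency map in plain Python. This is a toy illustration, not Ricgraph code: the node labels, the `neighbours` map and both functions are hypothetical, and a real Ricgraph graph lives in Neo4j rather than in a Python dictionary.

```python
from collections import defaultdict

# Hypothetical toy graph: each edge links a person to a research output.
edges = [
    ("person:A", "dataset:D1"),
    ("person:A", "software:S1"),
    ("person:B", "software:S1"),
    ("person:C", "dataset:D1"),
]

# Build an undirected adjacency map so every neighbour is one lookup away.
neighbours = defaultdict(set)
for left, right in edges:
    neighbours[left].add(right)
    neighbours[right].add(left)


def research_outputs(person):
    """Research outputs are one step (edge) away from a person."""
    return sorted(neighbours.get(person, set()))


def co_contributors(person):
    """Other contributors are two steps away: person -> output -> person."""
    found = set()
    for output in neighbours.get(person, set()):
        found |= neighbours[output]
    found.discard(person)
    return sorted(found)
```

In this toy graph, `research_outputs("person:A")` yields the dataset and the software package, and `co_contributors("person:A")` yields persons B and C.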

In the remainder of this text, Ricgraph is described in the use case of
showing people, organizations and research outputs in relation to each other
in a university context.

### Example

In the figures below, nodes in green are datasets, nodes in yellow journal articles,
nodes in red software and nodes in blue person identifiers. Small nodes are harvested from
the data repository [Yoda](https://search.datacite.org/repositories/delft.uu),
medium-sized nodes from
the [Research Information System Pure](https://www.elsevier.com/solutions/pure),
and large-sized nodes from the
[Research Software Directory](https://research-software-directory.org).
Click the figures to enlarge.

| one person with several research outputs | several persons with several research outputs |
|---------------------------------------------------|------------------------------------------------------|
| <img src="docs/images/rcg-all1.jpg" height="170"> | <img src="docs/images/rcg-all2-ab.jpg" height="200"> |

The left figure shows that a person has 5 identifiers (blue) and 3 journal articles (yellow)
from Pure,
2 datasets from Yoda (green) and 1 software package from the Research Software Directory (red).
*Person-root* is the central node to which everything related to a person is connected.
Information from several sources is combined in one graph.
The right figure shows a more extensive example. Two persons, A and B, have worked together on
a software package (red), a dataset (green), and something else (grey).
More examples can be found in [Ricgraph details](docs/ricgraph_details.md).

## What can Ricgraph do?

Some of Ricgraph's features are:

* Ricgraph stores metadata of objects.
The objective is to get metadata from
objects from a source system in a process called "harvesting".
That means that e.g. persons and publications
can be harvested from one system, datasets from another system, and software from a third system.
Everything found will be combined into one graph.
* Ricgraph can harvest from many sources, and you can write your own
harvesting scripts. Example scripts are included to
harvest from the [Research Information System Pure](https://www.elsevier.com/solutions/pure),
the data repository [Yoda](https://search.datacite.org/repositories/delft.uu),
and the [Research Software Directory](https://research-software-directory.org).
* Ricgraph can be used as an ID resolver. Given an identifier of a person, it can
easily find other identifiers of that person. New identifiers found while
harvesting from new systems
are added automatically. It can form the core engine for the Dutch
[National Roadmap for Persistent
Identifiers](https://www.surf.nl/en/national-roadmap-for-persistent-identifiers).
* Since Ricgraph combines information from different sources in one graph, it
can be used as a discoverer (an aggregated search engine), such as the
[UU-discoverer](https://itforresearch.uu.nl/wiki/UU-discoverer).
Also, it can be used as a core engine for the
[Dutch Open Knowledge
Base](https://communities.surf.nl/en/open-research-information/article/building-an-open-knowledge-base).
* Ricgraph can check the consistency of information harvested. For example, ORCIDs and ISNIs
are supposed to refer to one person, so every node representing such an identifier should have
only one edge. This can be checked easily.
An example script is included.
* Ricgraph can enrich information. For example,
if a person has an ORCID, but not a Scopus Author ID,
[OpenAlex](https://openalex.org) can be used
to find the missing ID. If something is found, it is added to the person record.
An example script is included.
* Ricgraph can store any number of properties in a node.
It has function calls to
create, read (find), update and delete (CRUD) nodes and to connect two nodes.
* Ricgraph does not have an end user web interface yet. This is future work.
The graph can be explored using Bloom,
see [Execute queries and visualize the result using Bloom](docs/ricgraph_neo4j_bloom_use.md).
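
The consistency check mentioned in the list above can be sketched in plain Python. This is not the example script included with Ricgraph, only an illustration under stated assumptions: the edge list, the `UNIQUE_ID_NAMES` set and the `find_inconsistent_ids` function are hypothetical, and all identifier values are made up.

```python
from collections import Counter

# Hypothetical harvested edges: (identifier node, node it is connected to).
# Each identifier node is written as a (name, value) pair; values are made up.
edges = [
    (("ORCID", "0000-0001-2345-6789"), "person-root-1"),
    (("ISNI", "0000000123456789"), "person-root-1"),
    (("ORCID", "0000-0002-0000-0001"), "person-root-2"),
    (("ORCID", "0000-0002-0000-0001"), "person-root-3"),  # inconsistent
    (("FULL_NAME", "John Doe"), "person-root-1"),
    (("FULL_NAME", "John Doe"), "person-root-2"),  # allowed: names are not unique
]

# Identifier types that are supposed to refer to exactly one person.
UNIQUE_ID_NAMES = {"ORCID", "ISNI"}


def find_inconsistent_ids(edges):
    """Return identifier nodes that should have one edge but have more."""
    edge_count = Counter(identifier for identifier, _ in edges)
    return sorted(
        identifier
        for identifier, count in edge_count.items()
        if identifier[0] in UNIQUE_ID_NAMES and count > 1
    )
```

Only the ORCID connected to two different nodes is reported. The duplicated FULL_NAME is ignored, because full names are not expected to be unique.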

## How can you use Ricgraph?

* Read more about [Ricgraph details](docs/ricgraph_details.md),
such as example graphs, person identifiers and the *person-root* node.
* [Install and configure Ricgraph](docs/ricgraph_install_configure.md).
* Write code, or start reusing code,
see the [Ricgraph programming examples](docs/ricgraph_programming_examples.md).
* Or [do a harvest for Utrecht University datasets and
software](docs/ricgraph_programming_examples.md#harvest-of-utrecht-university-datasets-and-software).
You will observe that the information from two sources is neatly combined into one graph.
* [Execute queries and visualize the result using Bloom](docs/ricgraph_neo4j_bloom_use.md).
* Of course, there is [future work to do](docs/ricgraph_future_work.md). Please let me know
if you'd like to help.

Binary file added docs/images/neo4j1.jpg
Binary file added docs/images/neo4j2.jpg
Binary file added docs/images/neo4j3.jpg
Binary file added docs/images/rcg-all1.jpg
Binary file added docs/images/rcg-all2-ab.jpg
Binary file added docs/images/rcg-all2.jpg
Binary file added docs/images/rcg-ids1.jpg
Binary file added docs/images/rcg-ids2.jpg
Binary file added docs/images/rcg-resout1.jpg
Binary file added docs/images/rcg-resout2.jpg
55 changes: 55 additions & 0 deletions docs/ricgraph_details.md
@@ -0,0 +1,55 @@
## Implementation details

[Return to main README.md file](../README.md).

### Person identifiers

In the research world, persons can have any number of different identifiers.
Some of these are standard, generally accepted and more-or-less unique identifiers
over the lifetime of a person. These are called
[persistent identifiers](https://en.wikipedia.org/wiki/Persistent_identifier).
Others are non-unique, some are specific to an organization and some are specific to a company.
Examples are:

* persistent identifiers: [ORCID](https://en.wikipedia.org/wiki/ORCID),
[ISNI](https://en.wikipedia.org/wiki/International_Standard_Name_Identifier);
* non-unique identifiers: full name (there are persons with the same name);
* organization identifiers: employee ID, email address (will change when a person leaves
an organization);
* company identifiers:
[Scopus Author ID](https://www.scopus.com/freelookup/form/author.uri).

### Person-root node in Ricgraph

Ricgraph uses a special node *person-root*. This node is connected to all the different
person identifiers which have been harvested.
*Person-root* "represents" a person. Research outputs from a person
will also be connected to this *person-root* node.
The following figure shows two examples (click the figure to enlarge).

| a person with a few identifiers | a person with a lot of identifiers |
|-----------------------------------------------|-----------------------------------------------|
| <img src="images/rcg-ids1.jpg" height="130"/> | <img src="images/rcg-ids2.jpg" height="200"/> |

A person can have any number of identifiers.
The person in the left figure has one *ORCID*, one *ISNI* and one *FULL_NAME*.
The person in the right figure has a lot more identifiers, and some identifiers appear more than once.
E.g. this person has 2 SCOPUS_AUTHOR_IDs and 2 ISNIs.
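
Because every identifier of a person is connected to the same *person-root* node, finding the other identifiers of a person amounts to a lookup via that node. The sketch below shows this in plain Python. It is not Ricgraph's API: the `person_roots` mapping and the `resolve` function are hypothetical, and all identifier values are made up.

```python
# Hypothetical data: each person-root node maps to the identifier nodes
# connected to it, written as (name, value) pairs. All values are made up.
person_roots = {
    "person-root-1": {
        ("ORCID", "0000-0001-2345-6789"),
        ("ISNI", "0000000123456789"),
        ("FULL_NAME", "John Doe"),
    },
    "person-root-2": {
        ("FULL_NAME", "Jane Roe"),
        ("SCOPUS_AUTHOR_ID", "1111"),
        ("SCOPUS_AUTHOR_ID", "2222"),
    },
}


def resolve(name, value):
    """Given one identifier, return the other identifiers of that person."""
    known = (name, value)
    for identifiers in person_roots.values():
        if known in identifiers:
            return sorted(identifiers - {known})
    return []
```

For example, `resolve("SCOPUS_AUTHOR_ID", "1111")` returns the full name and the second SCOPUS_AUTHOR_ID of the same person. For a non-unique identifier such as FULL_NAME, this sketch simply returns the first match it finds.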

### Research outputs connected to persons

| one person with three research outputs | three persons with one research output |
|-------------------------------------------------|-------------------------------------------------|
| <img src="images/rcg-resout1.jpg" height="200"> | <img src="images/rcg-resout2.jpg" height="130"> |

In both figures, nodes in blue are related to a person and nodes in yellow to journal articles.
The person in the left figure is identified by *FULL_NAME*, *ISNI* and *ORCID*,
which are connected to the *person-root* node (as in the previous section). This person
has three journal articles, identified by *DOI*. These are also connected to the *person-root* node.
In the right figure, there are three *person-root* nodes, representing three different persons
(other nodes with person identifiers are not shown for readability).
All these persons have contributed to the same research output, identified by *DOI*.

### Return to main README.md file

[Return to main README.md file](../README.md).