LLMSemanticAnnotator: Advanced Semantic Annotation for Plant Biology Research

LLMSemanticAnnotator employs Semantic Textual Similarity (STS) to annotate scientific articles with controlled vocabulary, based on precise term definitions. This implementation primarily leverages ontologies from the Planteome project, while also incorporating other relevant sources, to address the lack of detailed annotations in scientific articles, particularly regarding experimental conditions and plant developmental stages.

Detailed Functioning

LLM Utilization: The annotator employs Large Language Models (LLMs) to deeply understand the context and content of scientific articles.
STS Application: The system compares the semantic similarity between ontological term definitions and article text, using advanced natural language processing techniques.
Ontology Sources: In addition to Planteome, the annotator integrates controlled vocabularies from other recognized sources in the field of plant biology, ensuring comprehensive coverage of relevant terms.
Multi-level Annotation: The annotation process specifically targets:
- Experimental conditions
- Plant developmental stages
- Molecules of interest under study
Semantic Association: Ultimately, the annotator establishes links between annotated terms, enabling the association of experimental conditions and developmental stages with the molecules of interest studied.
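
The STS step described above can be sketched as follows. This is a minimal illustration and not the project's implementation: the `embed` function is a toy bag-of-words stand-in for a real LLM encoder, and the term names, definitions, text chunks, and `threshold` value are all hypothetical placeholders rather than actual Planteome content.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an LLM encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def annotate(chunks, term_definitions, threshold):
    """Return (chunk_index, term, score) for every pair scoring >= threshold."""
    matches = []
    for index, chunk in enumerate(chunks):
        chunk_vector = embed(chunk)
        for term, definition in term_definitions.items():
            score = cosine(chunk_vector, embed(definition))
            if score >= threshold:
                matches.append((index, term, score))
    return matches


# Hypothetical term definitions and text chunks, for illustration only.
terms = {
    "drought stress": "exposure of a plant to water deficit",
    "flowering stage": "stage when flowers open",
}
chunks = [
    "Plants were grown under water deficit for ten days.",
    "Gene expression was measured when flowers open.",
]
matches = annotate(chunks, terms, threshold=0.2)
for index, term, score in matches:
    print(index, term, round(score, 3))
```

Swapping `embed` for a real sentence encoder would leave the comparison loop unchanged, which is why the sketch keeps the two concerns separate.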

This approach aims to significantly enrich the metadata of scientific articles, thereby facilitating experimental reproducibility, comparative analysis of studies, and large-scale knowledge extraction in the field of plant biology.

Run

Usage: ./exec.sh <config_file> <int_commande>

Commands:
  1. Pseudo workflow [2,4,5,6,7]
  2. Populate OWL tag embeddings
  3. Populate abstract embeddings
  4. Compute similarities between tags and abstract chunks
  5. Display similarities information
  6. Build turtle knowledge graph
  7. Build dataset abstracts annotations CSV file
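
As a rough sketch of how the command numbers relate to each other, the snippet below expands command 1 into the steps listed above and builds the argument list for `./exec.sh`. The helper names (`expand`, `build_argv`, `run`) and the config path in the usage line are hypothetical; only the step numbers, labels, and the `./exec.sh <config_file> <int_commande>` call shape come from this README.

```python
import subprocess

# Step numbers and labels copied from the command list above.
STEPS = {
    2: "Populate OWL tag embeddings",
    3: "Populate abstract embeddings",
    4: "Compute similarities between tags and abstract chunks",
    5: "Display similarities information",
    6: "Build turtle knowledge graph",
    7: "Build dataset abstracts annotations CSV file",
}

# Steps that command 1 (pseudo workflow) is documented to chain.
PSEUDO_WORKFLOW = [2, 4, 5, 6, 7]


def expand(command):
    """Return the list of individual steps a command number stands for."""
    if command == 1:
        return list(PSEUDO_WORKFLOW)
    if command in STEPS:
        return [command]
    raise ValueError(f"unknown command: {command}")


def build_argv(config_file, command):
    """Build the argument list for one exec.sh invocation."""
    return ["./exec.sh", config_file, str(command)]


def run(config_file, command):
    """Invoke exec.sh once; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_argv(config_file, command), check=True)


# Hypothetical config path, shown only to illustrate the call shape.
for step in expand(1):
    print(step, STEPS[step], build_argv("config/my_config.json", step))
```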

Details:
  2: Compute TAG embeddings for all ontologies defined in the populate_owl_tag_embeddings section
  3: Compute ABSTRACT embeddings (title + sentences) for all abstracts in the dataset
  4: Compute similarities between TAGS and ABSTRACTS
  5: Display similarities information on the console
  6: Generate turtle file with information {score, tag} for each DOI
  7: Generate CSV file with [doi, tag, pmid, reference_id]
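
To make the shape of the step 7 output concrete, here is a small sketch that writes rows with the four documented columns. The function name `annotations_to_csv` and every value in the sample row are placeholders invented for illustration; the actual file produced by the tool may differ in delimiter, quoting, or ordering.

```python
import csv
import io

# Column names taken from the description of step 7.
FIELDS = ["doi", "tag", "pmid", "reference_id"]


def annotations_to_csv(rows):
    """Serialize annotation rows (dicts keyed by FIELDS) to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder row; none of these identifiers refer to real records.
sample_rows = [
    {
        "doi": "10.0000/example",
        "tag": "EX:0000001",
        "pmid": "00000000",
        "reference_id": "ref-1",
    }
]
print(annotations_to_csv(sample_rows))
```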

Configuration file (json)

example can be found :

Configuration main keys

General Structure

{
    "encoder": string,
    "threshold_similarity_tag_chunk": number,
    "threshold_similarity_tag": number,
    "batch_size": number,
    "populate_owl_tag_embeddings": object,
    "populate_abstract_embeddings": object
}

Main Parameters

encoder: (string) Specifies the encoding model to use.
threshold_similarity_tag_chunk: (number) Similarity threshold applied when matching OWL tags against abstract chunks.
threshold_similarity_tag: (number) Similarity threshold between tags (keeps the best above this value).
batch_size: (number) Batch size for processing.
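
A quick way to sanity-check a configuration file against the main keys above is sketched below. It assumes all six top-level keys are required and that `batch_size` is an integer, neither of which this README states explicitly. The encoder name and numeric values in `EXAMPLE` are placeholders, not recommended settings.

```python
import json

# Expected Python type(s) for each documented top-level key.
MAIN_KEYS = {
    "encoder": str,
    "threshold_similarity_tag_chunk": (int, float),
    "threshold_similarity_tag": (int, float),
    "batch_size": int,
    "populate_owl_tag_embeddings": dict,
    "populate_abstract_embeddings": dict,
}


def check_config(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for key, expected in MAIN_KEYS.items():
        if key not in config:
            problems.append(f"missing key: {key}")
            continue
        value = config[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {key}: {type(value).__name__}")
    return problems


# Placeholder values for illustration only.
EXAMPLE = """
{
    "encoder": "example-encoder-name",
    "threshold_similarity_tag_chunk": 0.6,
    "threshold_similarity_tag": 0.8,
    "batch_size": 32,
    "populate_owl_tag_embeddings": {},
    "populate_abstract_embeddings": {}
}
"""
config = json.loads(EXAMPLE)
print(check_config(config))
```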

populate_owl_tag_embeddings

This section configures the ontologies to be used for populating OWL tag embeddings.

"populate_owl_tag_embeddings": {
    "ontologies": {
        "group_link": {
            "ontology_name": {
                "url": string,
                "prefix": string,
                "format": string,
                "label": string,
                "properties": [string],
                "constraints": object
            }
        }
    }
}

Ontology Parameters

url: (string) URL of the ontology.
prefix: (string) Prefix of the ontology.
format: (string) Format of the ontology (e.g., "xml").
label: (string) Property used as the label (used to build embeddings).
properties: (array of strings) Additional properties to include (also used to build embeddings).
constraints: (object) Constraints to apply on the ontology.
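
The nested group / ontology layout can be walked as in the sketch below, which also reports any of the six documented parameters that are absent. The helper names, the group and ontology names, and all values in `section` are hypothetical; in particular, the strings used for `prefix`, `label`, and `properties` are guesses at a plausible shape, not values taken from the project.

```python
# The six per-ontology parameters documented above.
ONTOLOGY_FIELDS = ["url", "prefix", "format", "label", "properties", "constraints"]


def iter_ontologies(section):
    """Yield (group, name, settings) for every ontology in the nested mapping."""
    for group, members in section.get("ontologies", {}).items():
        for name, settings in members.items():
            yield group, name, settings


def missing_fields(settings):
    """Return the documented parameters that are absent from one ontology entry."""
    return [field for field in ONTOLOGY_FIELDS if field not in settings]


# Hypothetical entry for illustration only.
section = {
    "ontologies": {
        "example_group": {
            "example_ontology": {
                "url": "https://example.org/ontology.owl",
                "prefix": "EX",
                "format": "xml",
                "label": "rdfs:label",
                "properties": ["rdfs:comment"],
                "constraints": {},
            }
        }
    }
}
for group, name, settings in iter_ontologies(section):
    print(group, name, missing_fields(settings))
```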

populate_abstract_embeddings

This section configures the population of abstract embeddings.

"populate_abstract_embeddings": {
    "abstracts_per_file": number,
    "from_ncbi_api": object,
    "from_file": object
}

from_ncbi_api

Configures fetching abstracts from the NCBI API.

ncbi_api_chunk_size: (number) Chunk size for NCBI requests.
debug_nb_ncbi_request: (number) Number of requests for debugging (-1 for unlimited).
retmax: (number) Maximum number of results to return.
selected_term: (array of strings) Selected search terms.
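
One plausible reading of `ncbi_api_chunk_size` and `debug_nb_ncbi_request` is sketched below: identifiers are split into fixed-size batches, and the number of batches is capped unless the cap is -1. This interpretation is an assumption, the function name `plan_requests` is hypothetical, and no NCBI call is made here.

```python
def plan_requests(ids, chunk_size, max_requests=-1):
    """Split ids into batches of chunk_size, keeping at most max_requests batches.

    A negative max_requests means no limit, mirroring the documented -1 value.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    batches = [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]
    return batches if max_requests < 0 else batches[:max_requests]


# Placeholder identifiers, not real PubMed IDs.
example_ids = list(range(7))
print(plan_requests(example_ids, chunk_size=3))
print(plan_requests(example_ids, chunk_size=3, max_requests=2))
```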

from_file

Configures fetching abstracts from local files.

json_files: (array of strings) List of JSON files to use.
json_dir: (string) Directory containing JSON files.
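
The two `from_file` keys could be resolved into a single list of paths as sketched here. It assumes that `json_files` and `json_dir` may be combined and that the directory is scanned non-recursively for `*.json` files; the README does not confirm either point, and `collect_json_files` is a hypothetical helper name.

```python
from pathlib import Path


def collect_json_files(from_file):
    """Return explicit json_files first, then sorted *.json files from json_dir."""
    paths = [Path(p) for p in from_file.get("json_files", [])]
    json_dir = from_file.get("json_dir")
    if json_dir:
        # Non-recursive scan; sorted for a reproducible order.
        paths.extend(sorted(Path(json_dir).glob("*.json")))
    return paths
```

Calling `collect_json_files({"json_files": ["a.json"]})` returns a one-element list without touching the filesystem, since explicit paths are not checked for existence in this sketch.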

Running Tests

To execute the test suite, you can use the following commands:

python3 -m venv llm_semantic_annotator_env
source llm_semantic_annotator_env/bin/activate
pip install -r requirements.txt
python -m unittest discover

Run a specific test file

python3 -m venv llm_semantic_annotator_env
source llm_semantic_annotator_env/bin/activate
pip install -r requirements.txt
python -m unittest tests/similarity/test_model_embedding_manager.py

Run the similarity evaluator module

python3 -m venv llm_semantic_annotator_env
source llm_semantic_annotator_env/bin/activate
pip install -r requirements.txt
python -m llm_semantic_annotator.similarity_evaluator

